* fix(failover): classify HTTP 402 as rate_limit when payload indicates usage limit (#30484)
Some providers (notably Anthropic Claude Max plan) surface temporary
usage/rate-limit failures as HTTP 402 instead of 429. Before this change,
all 402s were unconditionally mapped to 'billing', which produced a
misleading 'run out of credits' warning for Max plan users who simply
hit their usage window.
This follows the same pattern introduced for HTTP 400 in #36783: check
the error message for an explicit rate-limit signal before falling back
to the default status-code classification.
- classifyFailoverReasonFromHttpStatus now returns 'rate_limit' for 402
when isRateLimitErrorMessage matches the payload text
- Added regression tests covering both the rate-limit and billing paths
on 402
* fix: narrow 402 rate-limit matcher to prevent billing misclassification
The original implementation used isRateLimitErrorMessage(), which matches
phrases like 'quota exceeded' that legitimately appear in billing errors.
This commit replaces it with a narrow, 402-specific matcher that requires
BOTH retry language (try again/retry/temporary/cooldown) AND limit
terminology (usage limit/rate limit/organization usage).
Prevents misclassification of errors like:
'HTTP 402: exceeded quota, please add credits' -> billing (not rate_limit)
Added regression test for the ambiguous case.
---------
Co-authored-by: Val Alexander <bunsthedev@gmail.com>
Synthesize runtime state transition fixes for compaction tool-use integrity and long-running handler backpressure.
Sources: #33630, #33583
Co-authored-by: Kevin Shenghui <shenghuikevin@gmail.com>
Co-authored-by: Theo Tarr <theodore@tarr.com>
* fix(agents): recognize connection errors as retryable timeout failures
## Problem
When a model endpoint becomes unreachable (e.g., local proxy down,
relay server offline), the failover system fails to switch to the
next candidate model. Errors like "Connection error." are not
classified as retryable, causing the session to hang on a broken
endpoint instead of falling back to healthy alternatives.
## Root Cause
Connection/network errors are not recognized by the current failover
classifier:
- Text patterns like "Connection error.", "fetch failed", "network error"
- Error codes like ECONNREFUSED, ENOTFOUND, EAI_AGAIN (in message text)
While `failover-error.ts` handles these as error codes (err.code),
it misses them when they appear as plain text in error messages.
## Solution
Extend timeout error patterns to include connection/network failures:
**In `errors.ts` (ERROR_PATTERNS.timeout):**
- Text: "connection error", "network error", "fetch failed", etc.
- Regex: /\beconn(?:refused|reset|aborted)\b/i, /\benotfound\b/i, /\beai_again\b/i
**In `failover-error.ts` (TIMEOUT_HINT_RE):**
- Same patterns for non-assistant error paths
## Testing
Added test cases covering:
- "Connection error."
- "fetch failed"
- "network error: ECONNREFUSED"
- "ENOTFOUND" / "EAI_AGAIN" in message text
## Impact
- **Compatibility:** High - only expands retryable error detection
- **Behavior:** Connection failures now trigger automatic fallback
- **Risk:** Low - changes are additive and well-tested
* style: fix code formatting for test file
When an OAuth auth profile returns HTTP 403 with permission_error
(e.g. expired plan), the error was not matched by the authPermanent
patterns. This caused the profile to receive only a short cooldown
instead of being disabled, so the gateway kept retrying the same
broken profile indefinitely.
Add "permission_error" and "not allowed for this organization" to
the authPermanent error patterns so these errors trigger the longer
billing/auth_permanent disable window and proper profile rotation.
Closes#31306
Made-with: Cursor
Co-authored-by: Vincent Koc <vincentkoc@ieee.org>
* fix: handle CLI session expired errors gracefully
- Add session_expired to FailoverReason type
- Add isCliSessionExpiredErrorMessage to detect expired CLI sessions
- Modify runCliAgent to retry with new session when session expires
- Update agentCommand to clear expired session IDs from session store
- Add proper error handling to prevent gateway crashes on expired sessions
Fixes#30986
* fix: add session_expired to AuthProfileFailureReason and missing log import
* fix: type cli-runner usage field to match EmbeddedPiAgentMeta
* fix: harden CLI session-expiry recovery handling
* build: regenerate host env security policy swift
---------
Co-authored-by: Peter Steinberger <steipete@gmail.com>
Proxy providers returning Chinese error messages (e.g. Chinese LLM
gateways) use patterns like '上下文过长' or '上下文超出' that are not
matched by the existing English-only patterns in isContextOverflowError.
This prevents auto-compaction from triggering, leaving the session stuck.
Add the most common Chinese proxy patterns:
- 上下文过长 (context too long)
- 上下文超出 (context exceeded)
- 上下文长度超 (context length exceeds)
- 超出最大上下文 (exceeds maximum context)
- 请压缩上下文 (please compress context)
Chinese characters are unaffected by toLowerCase() so check the
original message directly.
Closes#22849
* fix: detect additional context overflow error patterns to prevent leak to user
Fixes#9951
The error 'input length and max_tokens exceed context limit: 170636 +
34048 > 200000' was not caught by isContextOverflowError() and leaked
to users via formatAssistantErrorText()'s invalidRequest fallback.
Add three new patterns to isContextOverflowError():
- 'exceed context limit' (direct match)
- 'exceeds the model\'s maximum context'
- max_tokens/input length + exceed + context (compound match)
These are now rewritten to the friendly context overflow message.
* Overflow: add regression tests and changelog credits
* Update CHANGELOG.md
* Update pi-embedded-helpers.isbillingerrormessage.test.ts
---------
Co-authored-by: echoVic <AkiraVic@outlook.com>
Co-authored-by: Vincent Koc <vincentkoc@ieee.org>
* fix: treat HTTP 502/503/504 as failover-eligible (timeout reason)
When a model API returns 502 Bad Gateway, 503 Service Unavailable, or
504 Gateway Timeout, the error object carries the status code directly.
resolveFailoverReasonFromError() only checked 402/429/401/403/408/400,
so 5xx server errors fell through to message-based classification which
requires the status code to appear at the start of the error message.
Many API SDKs (Google, Anthropic) set err.status = 503 without prefixing
the message with '503', so the message classifier never matched and
failover never triggered — the run retried the same broken model.
Add 502/503/504 to the status-code branch, returning 'timeout' (matching
the existing behavior of isTransientHttpError in the message classifier).
Fixes#20999
* Changelog: add failover 502/503/504 note with credits
* Failover: classify HTTP 504 as transient in message parser
* Changelog: credit taw0002 and vincentkoc for failover fix
---------
Co-authored-by: Vincent Koc <vincentkoc@ieee.org>
* fix: sanitize tool call IDs in agent loop for Mistral strict9 format (#23595)
Mistral requires tool call IDs to be exactly 9 alphanumeric characters
([a-zA-Z0-9]{9}). The existing sanitizeToolCallIdsForCloudCodeAssist
mechanism only ran on historical messages at attempt start via
sanitizeSessionHistory, but the pi-agent-core agent loop's internal
tool call → tool result cycles bypassed that path entirely.
Changes:
- Wrap streamFn (like dropThinkingBlocks) so every outbound request
sees sanitized tool call IDs when the transcript policy requires it
- Replace call_${Date.now()} in pendingToolCalls with a 9-char hex ID
generated from crypto.randomBytes
- Add Mistral tool call ID error pattern to ERROR_PATTERNS.format so
the error is correctly classified for retry/rotation
* Changelog: document Mistral strict9 tool-call ID fix
---------
Co-authored-by: echoVic <AkiraVic@outlook.com>
Co-authored-by: Vincent Koc <vincentkoc@ieee.org>
* fix(agents): skip bootstrap files with undefined path
buildBootstrapContextFiles() called file.path.replace() without checking
that path was defined. If a hook pushed a bootstrap file using 'filePath'
instead of 'path', the function threw TypeError and crashed every agent
session — not just the misconfigured hook.
Fix: add a null-guard before the path.replace() call. Files with undefined
path are skipped with a warning so one bad hook can't take down all agents.
Also adds a test covering the undefined-path case.
Fixes#22693
* fix: harden bootstrap path validation and report guards (#22698) (thanks @arosstale)
---------
Co-authored-by: Peter Steinberger <steipete@gmail.com>
* fix: treat HTTP 503 as failover-eligible for LLM provider errors
When LLM SDKs wrap 503 responses, the leading "503" prefix is lost
(e.g. Google Gemini returns "high demand" / "UNAVAILABLE" without a
numeric prefix). The existing isTransientHttpError only matches
messages starting with "503 ...", so these wrapped errors silently
skip failover — no profile rotation, no model fallback.
This patch closes that gap:
- resolveFailoverReasonFromError: map HTTP status 503 → rate_limit
(covers structured error objects with a status field)
- ERROR_PATTERNS.overloaded: add /\b503\b/, "service unavailable",
"high demand" (covers message-only classification when the leading
status prefix is absent)
Existing isTransientHttpError behavior is unchanged; these additions
are complementary and only fire for errors that previously fell
through unclassified.
* fix: address review feedback — drop /\b503\b/ pattern, add test coverage
- Remove `/\b503\b/` from ERROR_PATTERNS.overloaded to resolve the
semantic inconsistency noted by reviewers: `isTransientHttpError`
already handles messages prefixed with "503" (→ "timeout"), so a
redundant overloaded pattern would classify the same class of errors
differently depending on message formatting.
- Keep "service unavailable" and "high demand" patterns — these are the
real gap-fillers for SDK-rewritten messages that lack a numeric prefix.
- Add test case for JSON-wrapped 503 error body containing "overloaded"
to strengthen coverage.
* fix: unify 503 classification — status 503 → timeout (consistent with isTransientHttpError)
resolveFailoverReasonFromError previously mapped status 503 → "rate_limit",
while the string-based isTransientHttpError mapped "503 ..." → "timeout".
Align both paths: structured {status: 503} now also returns "timeout",
matching the existing transient-error convention. Both reasons are
failover-eligible, so runtime behavior is unchanged.
---------
Co-authored-by: Vincent Koc <vincentkoc@ieee.org>