Commit Graph

118 Commits

Author SHA1 Message Date
JiangNan
e9e8b81939 fix(failover): classify Gemini MALFORMED_RESPONSE as retryable timeout (#42292)
Merged via squash.

Prepared head SHA: 68f106ff49fc7a28a806601bc8ca1e5e77c6e8c6
Co-authored-by: jnMetaCode <12096460+jnMetaCode@users.noreply.github.com>
Co-authored-by: altaywtf <9790196+altaywtf@users.noreply.github.com>
Reviewed-by: @altaywtf
2026-03-10 20:34:32 +03:00
CryUshio
8bf64f219a fix: recognize Poe 402 'used up your points' as billing for fallback (#42278)
Merged via squash.

Prepared head SHA: f3cdfa76dd9afcb023504eef723036e826e6ebc5
Co-authored-by: CryUshio <30655354+CryUshio@users.noreply.github.com>
Co-authored-by: altaywtf <9790196+altaywtf@users.noreply.github.com>
Reviewed-by: @altaywtf
2026-03-10 20:17:36 +03:00
alan blount
c9a6c542ef Add HTTP 499 to transient error codes for model fallback (#41468)
Merged via squash.

Prepared head SHA: 0053bae14038e6df9264df364d1c9aa83d5b698e
Co-authored-by: zeroasterisk <23422+zeroasterisk@users.noreply.github.com>
Co-authored-by: altaywtf <9790196+altaywtf@users.noreply.github.com>
Reviewed-by: @altaywtf
2026-03-10 01:55:10 +03:00
gambletan
8a20f51460 fix: add rate limit patterns for 'too many tokens' and 'tokens per day' (#39377)
Merged via squash.

Prepared head SHA: 132a45728694053c0e3220e7d861508524f17244
Co-authored-by: gambletan <266203672+gambletan@users.noreply.github.com>
Co-authored-by: altaywtf <9790196+altaywtf@users.noreply.github.com>
Reviewed-by: @altaywtf
2026-03-08 13:03:33 +03:00
Peter Lee
92648f9ba9 fix(agents): broaden 402 temporary-limit detection and allow billing cooldown probe (#38533)
Merged via squash.

Prepared head SHA: 282b9186c6f48fcdbf0c81c49f739e5e9ed2df23
Co-authored-by: xialonglee <22994703+xialonglee@users.noreply.github.com>
Co-authored-by: altaywtf <9790196+altaywtf@users.noreply.github.com>
Reviewed-by: @altaywtf
2026-03-08 10:27:01 +03:00
Altay
6e962d8b9e fix(agents): handle overloaded failover separately (#38301)
* fix(agents): skip auth-profile failure on overload

* fix(agents): note overload auth-profile fallback fix

* fix(agents): classify overloaded failures separately

* fix(agents): back off before overload failover

* fix(agents): tighten overload probe and backoff state

* fix(agents): persist overloaded cooldown across runs

* fix(agents): tighten overloaded status handling

* test(agents): add overload regression coverage

* fix(agents): restore runner imports after rebase

* test(agents): add overload fallback integration coverage

* fix(agents): harden overloaded failover abort handling

* test(agents): tighten overload classifier coverage

* test(agents): cover all-overloaded fallback exhaustion

* fix(cron): retry overloaded fallback summaries

* fix(cron): treat HTTP 529 as overloaded retry
2026-03-07 01:42:11 +03:00
Xinhua Gu
01b20172b8 fix(failover): classify HTTP 402 as rate_limit when payload indicates usage limit (#30484) (#36802)
* fix(failover): classify HTTP 402 as rate_limit when payload indicates usage limit (#30484)

Some providers (notably Anthropic Claude Max plan) surface temporary
usage/rate-limit failures as HTTP 402 instead of 429. Before this change,
all 402s were unconditionally mapped to 'billing', which produced a
misleading 'run out of credits' warning for Max plan users who simply
hit their usage window.

This follows the same pattern introduced for HTTP 400 in #36783: check
the error message for an explicit rate-limit signal before falling back
to the default status-code classification.

- classifyFailoverReasonFromHttpStatus now returns 'rate_limit' for 402
  when isRateLimitErrorMessage matches the payload text
- Added regression tests covering both the rate-limit and billing paths
  on 402

* fix: narrow 402 rate-limit matcher to prevent billing misclassification

The original implementation used isRateLimitErrorMessage(), which matches
phrases like 'quota exceeded' that legitimately appear in billing errors.

This commit replaces it with a narrow, 402-specific matcher that requires
BOTH retry language (try again/retry/temporary/cooldown) AND limit
terminology (usage limit/rate limit/organization usage).

Prevents misclassification of errors like:
'HTTP 402: exceeded quota, please add credits' -> billing (not rate_limit)

Added regression test for the ambiguous case.

---------

Co-authored-by: Val Alexander <bunsthedev@gmail.com>
2026-03-06 03:45:36 -06:00
zhouhe-xydt
a65d70f84b Fix failover for zhipuai 1310 Weekly/Monthly Limit Exhausted (#33813)
Merged via squash.

Prepared head SHA: 3dc441e58de48913720cf7b6137fa761758d8344
Co-authored-by: zhouhe-xydt <265407618+zhouhe-xydt@users.noreply.github.com>
Co-authored-by: altaywtf <9790196+altaywtf@users.noreply.github.com>
Reviewed-by: @altaywtf
2026-03-06 12:04:09 +03:00
Altay
49acb07f9f fix(agents): classify insufficient_quota 400s as billing (#36783) 2026-03-06 01:17:48 +03:00
jiangnan
029c473727 fix(failover): narrow service-unavailable to require overload indicator (#32828) (#36646)
Merged via squash.

Prepared head SHA: 46fb4306127972d7635f371fd9029fbb9baff236
Co-authored-by: jnMetaCode <12096460+jnMetaCode@users.noreply.github.com>
Co-authored-by: altaywtf <9790196+altaywtf@users.noreply.github.com>
Reviewed-by: @altaywtf
2026-03-06 00:01:57 +03:00
Altay
f014e255df refactor(agents): share failover HTTP status classification (#36615)
* fix(agents): classify transient failover statuses consistently

* fix(agents): preserve legacy failover status mapping
2026-03-05 23:50:36 +03:00
不做了睡大觉
8ac7ce73b3 fix: avoid false global rate-limit classification from generic cooldown text (#32972)
Merged via squash.

Prepared head SHA: 813c16f5afce415da130a917d9ce9f968912b477
Co-authored-by: stakeswky <64798754+stakeswky@users.noreply.github.com>
Co-authored-by: altaywtf <9790196+altaywtf@users.noreply.github.com>
Reviewed-by: @altaywtf
2026-03-05 22:58:21 +03:00
Kai
60a6d11116 fix(embedded): classify model_context_window_exceeded as context overflow, trigger compaction (#35934)
Merged via squash.

Prepared head SHA: 20fa77289c80b2807a6779a3df70440242bc18ca
Co-authored-by: RealKai42 <44634134+RealKai42@users.noreply.github.com>
Co-authored-by: jalehman <550978+jalehman@users.noreply.github.com>
Reviewed-by: @jalehman
2026-03-05 11:30:24 -08:00
Tak Hoffman
9889c6da53 Runtime: stabilize tool/run state transitions under compaction and backpressure
Synthesize runtime state transition fixes for compaction tool-use integrity and long-running handler backpressure.

Sources: #33630, #33583

Co-authored-by: Kevin Shenghui <shenghuikevin@gmail.com>
Co-authored-by: Theo Tarr <theodore@tarr.com>
2026-03-03 21:25:32 -06:00
Gustavo Madeira Santana
e4b4486a96 Agent: unify bootstrap truncation warning handling (#32769)
Merged via squash.

Prepared head SHA: 5d6d4ddfa620011e267d892b402751847d5ac0c3
Co-authored-by: gumadeiras <5599352+gumadeiras@users.noreply.github.com>
Co-authored-by: gumadeiras <5599352+gumadeiras@users.noreply.github.com>
Reviewed-by: @gumadeiras
2026-03-03 16:28:38 -05:00
Peter Steinberger
6472e03412 refactor(agents): share failover error matchers 2026-03-03 02:51:00 +00:00
AI南柯(KingMo)
30ab9b2068 fix(agents): recognize connection errors as retryable timeout failures (#31697)
* fix(agents): recognize connection errors as retryable timeout failures

## Problem

When a model endpoint becomes unreachable (e.g., local proxy down,
relay server offline), the failover system fails to switch to the
next candidate model. Errors like "Connection error." are not
classified as retryable, causing the session to hang on a broken
endpoint instead of falling back to healthy alternatives.

## Root Cause

Connection/network errors are not recognized by the current failover
classifier:
- Text patterns like "Connection error.", "fetch failed", "network error"
- Error codes like ECONNREFUSED, ENOTFOUND, EAI_AGAIN (in message text)

While `failover-error.ts` handles these as error codes (err.code),
it misses them when they appear as plain text in error messages.

## Solution

Extend timeout error patterns to include connection/network failures:

**In `errors.ts` (ERROR_PATTERNS.timeout):**
- Text: "connection error", "network error", "fetch failed", etc.
- Regex: /\beconn(?:refused|reset|aborted)\b/i, /\benotfound\b/i, /\beai_again\b/i

**In `failover-error.ts` (TIMEOUT_HINT_RE):**
- Same patterns for non-assistant error paths

## Testing

Added test cases covering:
- "Connection error."
- "fetch failed"
- "network error: ECONNREFUSED"
- "ENOTFOUND" / "EAI_AGAIN" in message text

## Impact

- **Compatibility:** High - only expands retryable error detection
- **Behavior:** Connection failures now trigger automatic fallback
- **Risk:** Low - changes are additive and well-tested

* style: fix code formatting for test file
2026-03-03 02:37:23 +00:00
Peter Steinberger
1bd20dbdb6 fix(failover): treat stop reason error as timeout 2026-03-03 01:05:24 +00:00
Peter Steinberger
a2fdc3415f fix(failover): handle unhandled stop reason error 2026-03-03 01:05:24 +00:00
bmendonca3
a6489ab5e9 fix(agents): cap openai-completions tool call ids to provider-safe format (#31947)
Co-authored-by: bmendonca3 <bmendonca3@users.noreply.github.com>
2026-03-02 18:08:20 +00:00
Sid
40e078a567 fix(auth): classify permission_error as auth_permanent for profile fallback (#31324)
When an OAuth auth profile returns HTTP 403 with permission_error
(e.g. expired plan), the error was not matched by the authPermanent
patterns. This caused the profile to receive only a short cooldown
instead of being disabled, so the gateway kept retrying the same
broken profile indefinitely.

Add "permission_error" and "not allowed for this organization" to
the authPermanent error patterns so these errors trigger the longer
billing/auth_permanent disable window and proper profile rotation.

Closes #31306

Made-with: Cursor

Co-authored-by: Vincent Koc <vincentkoc@ieee.org>
2026-03-01 22:26:05 -08:00
Charles Dusek
92199ac129 fix(agents): unblock gpt-5.3-codex API-key routing and replay (#31083)
* fix(agents): unblock gpt-5.3-codex API-key replay path

* fix(agents): scope OpenAI replay ID rewrites per turn

* test: fix nodes-tool mock typing and reformat telegram accounts
2026-03-02 03:45:12 +00:00
Frank Yang
ed86252aa5 fix: handle CLI session expired errors gracefully instead of crashing gateway (#31090)
* fix: handle CLI session expired errors gracefully

- Add session_expired to FailoverReason type
- Add isCliSessionExpiredErrorMessage to detect expired CLI sessions
- Modify runCliAgent to retry with new session when session expires
- Update agentCommand to clear expired session IDs from session store
- Add proper error handling to prevent gateway crashes on expired sessions

Fixes #30986

* fix: add session_expired to AuthProfileFailureReason and missing log import

* fix: type cli-runner usage field to match EmbeddedPiAgentMeta

* fix: harden CLI session-expiry recovery handling

* build: regenerate host env security policy swift

---------

Co-authored-by: Peter Steinberger <steipete@gmail.com>
2026-03-02 01:11:05 +00:00
Peter Steinberger
81ca309ee6 fix(agents): land #31002 from @yfge
Co-authored-by: yfge <geyunfei@gmail.com>
2026-03-02 01:08:58 +00:00
Peter Steinberger
250f9e15f5 fix(agents): land #31007 from @HOYALIM
Co-authored-by: Ho Lim <subhoya@gmail.com>
2026-03-02 01:06:00 +00:00
Aleksandrs Tihenko
c0026274d9 fix(auth): distinguish revoked API keys from transient auth errors (#25754)
Merged via /review-pr -> /prepare-pr -> /merge-pr.

Prepared head SHA: 8f9c07a200644284e11adae76368adab40c5fa4e
Co-authored-by: rrenamed <87486610+rrenamed@users.noreply.github.com>
Co-authored-by: gumadeiras <5599352+gumadeiras@users.noreply.github.com>
Reviewed-by: @gumadeiras
2026-02-25 19:47:16 -05:00
Peter Steinberger
d2597d5ecf fix(agents): harden model fallback failover paths 2026-02-25 03:46:34 +00:00
Peter Steinberger
43f318cd9a fix(agents): reduce billing false positives on long text (#25680)
Land PR #25680 from @lairtonlelis.
Retain explicit status/code/http 402 detection for oversized structured payloads.

Co-authored-by: Ailton <lairton@telnyx.com>
2026-02-25 01:22:17 +00:00
Peter Machona
9ced64054f fix(auth): classify missing OAuth scopes as auth failures (#24761) 2026-02-24 03:33:44 +00:00
Clawborn
544809b6f6 Add Chinese context overflow patterns to isContextOverflowError (#22855)
Proxy providers returning Chinese error messages (e.g. Chinese LLM
gateways) use patterns like '上下文过长' or '上下文超出' that are not
matched by the existing English-only patterns in isContextOverflowError.
This prevents auto-compaction from triggering, leaving the session stuck.

Add the most common Chinese proxy patterns:
- 上下文过长 (context too long)
- 上下文超出 (context exceeded)
- 上下文长度超 (context length exceeds)
- 超出最大上下文 (exceeds maximum context)
- 请压缩上下文 (please compress context)

Chinese characters are unaffected by toLowerCase() so check the
original message directly.

Closes #22849
2026-02-23 10:54:24 -05:00
Vincent Koc
4f340b8812 fix(agents): avoid classifying reasoning-required errors as context overflow (#24593)
* Agents: exclude reasoning-required errors from overflow detection

* Tests: cover reasoning-required overflow classification guard

* Tests: format reasoning-required endpoint errors
2026-02-23 10:38:49 -05:00
Alice Losasso
652099cd5c fix: correctly identify Groq TPM limits as rate limits instead of context overflow (#16176)
Co-authored-by: Howard <dddabtc@users.noreply.github.com>
2026-02-23 10:32:53 -05:00
青雲
69692d0d3a fix: detect additional context overflow error patterns to prevent leak to user (#20539)
* fix: detect additional context overflow error patterns to prevent leak to user

Fixes #9951

The error 'input length and max_tokens exceed context limit: 170636 +
34048 > 200000' was not caught by isContextOverflowError() and leaked
to users via formatAssistantErrorText()'s invalidRequest fallback.

Add three new patterns to isContextOverflowError():
- 'exceed context limit' (direct match)
- 'exceeds the model\'s maximum context'
- max_tokens/input length + exceed + context (compound match)

These are now rewritten to the friendly context overflow message.

* Overflow: add regression tests and changelog credits

* Update CHANGELOG.md

* Update pi-embedded-helpers.isbillingerrormessage.test.ts

---------

Co-authored-by: echoVic <AkiraVic@outlook.com>
Co-authored-by: Vincent Koc <vincentkoc@ieee.org>
2026-02-23 10:03:56 -05:00
Peter Steinberger
9bd04849ed fix(agents): detect Kimi model-token-limit overflows
Co-authored-by: Danilo Falcão <danilo@falcao.org>
2026-02-23 12:44:23 +00:00
taw0002
3c57bf4c85 fix: treat HTTP 502/503/504 as failover-eligible (timeout reason) (#21017)
* fix: treat HTTP 502/503/504 as failover-eligible (timeout reason)

When a model API returns 502 Bad Gateway, 503 Service Unavailable, or
504 Gateway Timeout, the error object carries the status code directly.
resolveFailoverReasonFromError() only checked 402/429/401/403/408/400,
so 5xx server errors fell through to message-based classification which
requires the status code to appear at the start of the error message.

Many API SDKs (Google, Anthropic) set err.status = 503 without prefixing
the message with '503', so the message classifier never matched and
failover never triggered — the run retried the same broken model.

Add 502/503/504 to the status-code branch, returning 'timeout' (matching
the existing behavior of isTransientHttpError in the message classifier).

Fixes #20999

* Changelog: add failover 502/503/504 note with credits

* Failover: classify HTTP 504 as transient in message parser

* Changelog: credit taw0002 and vincentkoc for failover fix

---------

Co-authored-by: Vincent Koc <vincentkoc@ieee.org>
2026-02-23 03:01:57 -05:00
Peter Steinberger
382fe8009a refactor!: remove google-antigravity provider support 2026-02-23 05:20:14 +01:00
青雲
3dfee78d72 fix: sanitize tool call IDs in agent loop for Mistral strict9 format (#23595) (#23698)
* fix: sanitize tool call IDs in agent loop for Mistral strict9 format (#23595)

Mistral requires tool call IDs to be exactly 9 alphanumeric characters
([a-zA-Z0-9]{9}). The existing sanitizeToolCallIdsForCloudCodeAssist
mechanism only ran on historical messages at attempt start via
sanitizeSessionHistory, but the pi-agent-core agent loop's internal
tool call → tool result cycles bypassed that path entirely.

Changes:
- Wrap streamFn (like dropThinkingBlocks) so every outbound request
  sees sanitized tool call IDs when the transcript policy requires it
- Replace call_${Date.now()} in pendingToolCalls with a 9-char hex ID
  generated from crypto.randomBytes
- Add Mistral tool call ID error pattern to ERROR_PATTERNS.format so
  the error is correctly classified for retry/rotation

* Changelog: document Mistral strict9 tool-call ID fix

---------

Co-authored-by: echoVic <AkiraVic@outlook.com>
Co-authored-by: Vincent Koc <vincentkoc@ieee.org>
2026-02-22 13:37:12 -05:00
Artale
51e9c54f09 fix(agents): skip bootstrap files with undefined path (#22698)
* fix(agents): skip bootstrap files with undefined path

buildBootstrapContextFiles() called file.path.replace() without checking
that path was defined. If a hook pushed a bootstrap file using 'filePath'
instead of 'path', the function threw TypeError and crashed every agent
session — not just the misconfigured hook.

Fix: add a null-guard before the path.replace() call. Files with undefined
path are skipped with a warning so one bad hook can't take down all agents.

Also adds a test covering the undefined-path case.

Fixes #22693

* fix: harden bootstrap path validation and report guards (#22698) (thanks @arosstale)

---------

Co-authored-by: Peter Steinberger <steipete@gmail.com>
2026-02-22 13:17:07 +01:00
Vignesh Natarajan
35fe33aa90 Agents: classify Anthropic api_error internal server failures for fallback 2026-02-21 19:22:16 -08:00
Harry Cui Kepler
ffa63173e0 refactor(agents): migrate console.warn/error/info to subsystem logger (#22906)
Merged via /review-pr -> /prepare-pr -> /merge-pr.

Prepared head SHA: a806c4cb2700564096ce8980a8d7f839f8a0d388
Co-authored-by: Kepler2024 <166882517+Kepler2024@users.noreply.github.com>
Co-authored-by: gumadeiras <5599352+gumadeiras@users.noreply.github.com>
Reviewed-by: @gumadeiras
2026-02-21 17:11:47 -05:00
niceysam
5e423b596c fix: remove false-positive billing error rewrite on normal assistant text (openclaw#17834) thanks @niceysam
Verified:
- pnpm install --frozen-lockfile
- pnpm build
- pnpm check
- pnpm test:macmini

Co-authored-by: niceysam <256747835+niceysam@users.noreply.github.com>
Co-authored-by: Tak Hoffman <781889+Takhoffman@users.noreply.github.com>
2026-02-21 12:17:39 -06:00
mudrii
7ecfc1d93c fix(auth): bidirectional mode/type compat + sync OAuth to all agents (#12692)
Merged via /review-pr -> /prepare-pr -> /merge-pr.

Prepared head SHA: 2dee8e1174e637e50d10bf7020f1de2990b804dc
Co-authored-by: mudrii <220262+mudrii@users.noreply.github.com>
Co-authored-by: obviyus <22031114+obviyus@users.noreply.github.com>
Reviewed-by: @obviyus
2026-02-20 16:01:09 +05:30
Protocol Zero
2af3415fac fix: treat HTTP 503 as failover-eligible for LLM provider errors (#21086)
* fix: treat HTTP 503 as failover-eligible for LLM provider errors

When LLM SDKs wrap 503 responses, the leading "503" prefix is lost
(e.g. Google Gemini returns "high demand" / "UNAVAILABLE" without a
numeric prefix). The existing isTransientHttpError only matches
messages starting with "503 ...", so these wrapped errors silently
skip failover — no profile rotation, no model fallback.

This patch closes that gap:

- resolveFailoverReasonFromError: map HTTP status 503 → rate_limit
  (covers structured error objects with a status field)
- ERROR_PATTERNS.overloaded: add /\b503\b/, "service unavailable",
  "high demand" (covers message-only classification when the leading
  status prefix is absent)

Existing isTransientHttpError behavior is unchanged; these additions
are complementary and only fire for errors that previously fell
through unclassified.

* fix: address review feedback — drop /\b503\b/ pattern, add test coverage

- Remove `/\b503\b/` from ERROR_PATTERNS.overloaded to resolve the
  semantic inconsistency noted by reviewers: `isTransientHttpError`
  already handles messages prefixed with "503" (→ "timeout"), so a
  redundant overloaded pattern would classify the same class of errors
  differently depending on message formatting.

- Keep "service unavailable" and "high demand" patterns — these are the
  real gap-fillers for SDK-rewritten messages that lack a numeric prefix.

- Add test case for JSON-wrapped 503 error body containing "overloaded"
  to strengthen coverage.

* fix: unify 503 classification — status 503 → timeout (consistent with isTransientHttpError)

resolveFailoverReasonFromError previously mapped status 503 → "rate_limit",
while the string-based isTransientHttpError mapped "503 ..." → "timeout".

Align both paths: structured {status: 503} now also returns "timeout",
matching the existing transient-error convention. Both reasons are
failover-eligible, so runtime behavior is unchanged.

---------

Co-authored-by: Vincent Koc <vincentkoc@ieee.org>
2026-02-19 12:45:09 -08:00
青雲
3d4ef56044 fix: include provider and model name in billing error message (#20510)
Merged via /review-pr -> /prepare-pr -> /merge-pr.

Prepared head SHA: 40dbdf62e8952dd6c5afcb9ce2a73199f3f532a6
Co-authored-by: echoVic <16428813+echoVic@users.noreply.github.com>
Co-authored-by: gumadeiras <5599352+gumadeiras@users.noreply.github.com>
Reviewed-by: @gumadeiras
2026-02-18 21:56:00 -05:00
Peter Steinberger
1934eebbf0 refactor(agents): dedupe lifecycle send assertions and stable payload stringify 2026-02-18 14:15:14 +00:00
Peter Steinberger
b8b43175c5 style: align formatting with oxfmt 0.33 2026-02-18 01:34:35 +00:00
Peter Steinberger
31f9be126c style: run oxfmt and fix gate failures 2026-02-18 01:29:02 +00:00
Peter Steinberger
b05e89e5e6 fix(agents): make image sanitization dimension configurable 2026-02-18 00:54:20 +01:00
cpojer
d0cb8c19b2 chore: wtf. 2026-02-17 13:36:48 +09:00
Sebastian
ed11e93cf2 chore(format) 2026-02-16 23:20:16 -05:00