Classify Anthropic's 529 status code as "rate_limit" so model fallback
triggers reliably without depending on fragile message-based detection.
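A minimal sketch of the status-based check, for illustration only. The function name and the `SketchReason` type are hypothetical and not taken from this repository; only the 529 to "rate_limit" mapping comes from this change.

```typescript
// Hypothetical reason type; the real union in the codebase may have more members.
type SketchReason = "rate_limit";

// Classify by numeric status so detection does not depend on the error message text.
function classifyOverloadStatus(status: number | undefined): SketchReason | null {
  if (status === 529) {
    return "rate_limit";
  }
  // Other statuses are outside the scope of this sketch.
  return null;
}
```

Because the check reads the numeric status, a 529 response is classified the same way regardless of how the provider words its message.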
Closes #28502
When an OAuth auth profile returns HTTP 403 with permission_error
(e.g. expired plan), the error was not matched by the authPermanent
patterns. This caused the profile to receive only a short cooldown
instead of being disabled, so the gateway kept retrying the same
broken profile indefinitely.
Add "permission_error" and "not allowed for this organization" to
the authPermanent error patterns so these errors trigger the longer
billing/auth_permanent disable window and proper profile rotation.
Closes #31306
Made-with: Cursor
Co-authored-by: Vincent Koc <vincentkoc@ieee.org>
* fix: treat HTTP 502/503/504 as failover-eligible (timeout reason)
When a model API returns 502 Bad Gateway, 503 Service Unavailable, or
504 Gateway Timeout, the error object carries the status code directly.
resolveFailoverReasonFromError() only checked 402/429/401/403/408/400,
so 5xx server errors fell through to message-based classification, which
requires the status code to appear at the start of the error message.
Many API SDKs (Google, Anthropic) set err.status = 503 without prefixing
the message with '503', so the message classifier never matched and
failover never triggered; the run kept retrying the same broken model.
Add 502/503/504 to the status-code branch, returning 'timeout' (matching
the existing behavior of isTransientHttpError in the message classifier).
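A minimal sketch of the added branch, not the actual body of resolveFailoverReasonFromError(). The function name is hypothetical and the other status codes are left out; the point is that the numeric status is read from the error object, so the message text is irrelevant.

```typescript
// Hypothetical helper covering only the 502/503/504 branch described above.
function statusToFailoverReason(err: unknown): "timeout" | null {
  // Read err.status defensively; err may be null or a non-object.
  const status = (err as { status?: unknown } | null)?.status;
  if (status === 502 || status === 503 || status === 504) {
    return "timeout";
  }
  return null;
}

// An SDK-style error: status is set, but the message does not start with "503".
const sdkStyleError = Object.assign(new Error("Service Unavailable"), { status: 503 });
const reasonForSdkStyleError = statusToFailoverReason(sdkStyleError);
```

Here reasonForSdkStyleError is "timeout" even though the message carries no status prefix, which is the case the message classifier missed.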
Fixes #20999
* Changelog: add failover 502/503/504 note with credits
* Failover: classify HTTP 504 as transient in message parser
* Changelog: credit taw0002 and vincentkoc for failover fix
---------
Co-authored-by: Vincent Koc <vincentkoc@ieee.org>