fix(error_classifier): classify generic-typed timeout messages as transient (carve-out of #22664) #22857

Merged
teknium1 merged 1 commit into main from salvage/pr-22664-classifier-only
May 10, 2026
@teknium1 teknium1 commented May 9, 2026

Contributor

Summary

Salvage of #22664's classifier portion — _classify_by_message now recognizes timeout-shaped messages on generic exception types (e.g. RuntimeError("claude CLI turn timed out") from a local OpenAI-compatible shim).

Root cause

agent/error_classifier.py::_classify_by_message covered billing / rate_limit / auth / context_overflow / model_not_found patterns but had no timeout-message branch. The type-based check requires isinstance(error, (TimeoutError, ConnectionError, OSError)), so a plain RuntimeError("…timed out") didn't match. Result: timeouts from a local shim fell through to FailoverReason.unknown, surfaced as "Empty response from model", and burned 3 retry slots on the same dead endpoint.
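
The type-based miss described above can be reproduced in isolation. This is a minimal sketch, not the code from agent/error_classifier.py: `matches_transport_type` is a hypothetical helper that wraps only the `isinstance` check quoted in the root cause.

```python
# Hypothetical stand-in for the type-based check; the real code in
# agent/error_classifier.py may be structured differently.
def matches_transport_type(error: BaseException) -> bool:
    # Same tuple of types as quoted in the root-cause description.
    return isinstance(error, (TimeoutError, ConnectionError, OSError))


# A timeout raised as a generic RuntimeError, as a local shim might do.
shim_error = RuntimeError("claude CLI turn timed out")
# A timeout raised with the built-in timeout type.
native_error = TimeoutError("timed out")

shim_matches = matches_transport_type(shim_error)      # RuntimeError is not in the tuple
native_matches = matches_transport_type(native_error)  # TimeoutError is in the tuple
```

Only the exception type is inspected here, so the "timed out" text in the RuntimeError message has no effect on the result.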

Changes (carve-out from #22664)

  • agent/error_classifier.py: new _TIMEOUT_MESSAGE_PATTERNS list ("timed out", "deadline exceeded", "request timed out", "operation timed out", "upstream timed out", "turn timed out"). _classify_by_message returns FailoverReason.timeout (retryable=True) when any pattern matches.
  • 3 regression tests covering RuntimeError CLI-turn-timeout, request-timed-out, and deadline-exceeded.
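
A rough sketch of the message-based branch, under stated assumptions: the pattern strings are the ones listed above, but the `FailoverReason` subset, the `Classification` result shape, the function name `classify_by_message`, and the case-insensitive substring matching are illustrative guesses rather than the merged implementation.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FailoverReason(Enum):
    # Hypothetical subset; the real enum has more members.
    timeout = "timeout"
    unknown = "unknown"


@dataclass
class Classification:
    # Hypothetical result shape for illustration only.
    reason: FailoverReason
    retryable: bool


# Pattern strings as listed in this PR.
_TIMEOUT_MESSAGE_PATTERNS = [
    "timed out",
    "deadline exceeded",
    "request timed out",
    "operation timed out",
    "upstream timed out",
    "turn timed out",
]


def classify_by_message(error: BaseException) -> Optional[Classification]:
    """Return a timeout classification if the message looks like a timeout."""
    # Assumption: matching is a case-insensitive substring test.
    message = str(error).lower()
    if any(pattern in message for pattern in _TIMEOUT_MESSAGE_PATTERNS):
        return Classification(FailoverReason.timeout, retryable=True)
    return None  # let other branches (or FailoverReason.unknown) handle it


result = classify_by_message(RuntimeError("claude CLI turn timed out"))
```

If matching is a plain substring test as assumed here, "timed out" already covers the four longer "… timed out" entries, so those mainly serve as documentation of the shapes seen in practice.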

Carve-out rationale

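The original PR #22664 also bundled:

  • a fallback self-selection guard, now redundant (already on main via #22780)
  • DeepSeek thinking and session_search fixes, which are separate concerns

This carve-out keeps just the title-described classifier work.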

Validation

  • 13/13 timeout / timed_out tests pass on the salvage branch.

Follow-up to #22780 — fixes the still-broken classification of generic-typed provider-shim timeouts that #22780's dedup couldn't cover.

…nsient (carve-out of #22664)

RuntimeError('claude CLI turn timed out') from a local OpenAI-compatible
shim was falling through to FailoverReason.unknown, surfacing as 'Empty
response from model' and burning 3 retry slots on the same failing
endpoint. _classify_by_message had no timeout-message branch — only
billing/rate_limit/auth/context_overflow/model_not_found patterns. The
type-based check at line 565 also requires isinstance(error, (TimeoutError,
ConnectionError, OSError)) — a plain RuntimeError doesn't match.

Add _TIMEOUT_MESSAGE_PATTERNS for 'timed out', 'deadline exceeded',
'request timed out', 'operation timed out', 'upstream timed out', 'turn
timed out'. _classify_by_message returns FailoverReason.timeout (retryable=True)
when any pattern matches.

Salvage of #22664's classifier portion. The original PR also bundled a
fallback self-selection guard which is now redundant (already on main
via #22780) plus DeepSeek thinking and session_search fixes that are
their own separate concerns.

Follow-up to #22780 — fixes the still-broken classification of
generic-typed provider-shim timeouts that #22780's dedup didn't cover.
github-actions Bot commented May 9, 2026

Contributor

🔎 Lint report: salvage/pr-22664-classifier-only vs origin/main

ruff

Total: 0 on HEAD, 0 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 0 pre-existing issues carried over.

ty (type checker)

Total: 7953 on HEAD, 7953 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 4201 pre-existing issues carried over.

Diagnostics are surfaced as warnings — this check never fails the build.

@alt-glitch added the type/bug (Something isn't working), P2 (Medium — degraded but workaround exists), and comp/agent (Core agent loop, run_agent.py, prompt builder) labels May 9, 2026
@teknium1 teknium1 merged commit 4f8d8ad into main May 10, 2026
16 of 18 checks passed
@teknium1 teknium1 deleted the salvage/pr-22664-classifier-only branch May 10, 2026 00:54