
fix(agent): don't retry interrupt-induced transport errors (salvage #6600) #41952

Merged: teknium1 merged 1 commit into main from salvage/cascading-interrupt-6600 on Jun 8, 2026

teknium1 (Contributor) commented Jun 8, 2026

Summary

An interrupt during an active LLM call no longer triggers a cascade of doomed retries that can hang the agent for minutes. Salvage of #6600 (@kristianvast), adapted onto current main.

Root cause: when agent.interrupt() fires, the main poll loop force-closes the worker-local httpx client to stop token generation. That raises a transport error (RemoteProtocolError) on the worker — the expected consequence of our own close, not a network bug. The streaming retry loop misclassified it as transient and retried; each doomed retry stalled for the full stream-stale timeout (up to 300s). Because the gateway caches AIAgent instances per session, the stale worker outlived the interrupted turn and raced the next turn's request — the multi-minute cascading-interrupt hang.
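The misclassification can be illustrated with a toy retry loop. This is a minimal sketch, not the code in agent/chat_completion_helpers.py: `FakeTransportError` stands in for httpx's RemoteProtocolError, `STALE_TIMEOUT` is a scaled-down stand-in for the 300s stream-stale timeout, and `MAX_RETRIES` is an assumed retry budget that the PR does not state.

```python
import time


class FakeTransportError(Exception):
    """Stand-in for httpx.RemoteProtocolError (hypothetical, illustration only)."""


STALE_TIMEOUT = 0.05  # scaled-down stand-in for the 300s stream-stale timeout
MAX_RETRIES = 3       # assumed retry budget; the real value is not stated in the PR


def stream_once(attempt):
    """Simulate one streaming attempt after the client was force-closed."""
    if attempt > 0:
        # A doomed retry stalls until the stale timeout fires.
        time.sleep(STALE_TIMEOUT)
    raise FakeTransportError("connection closed by our own interrupt")


def naive_streaming_call():
    """Retry every transport error as if it were transient (the buggy behaviour)."""
    started = time.monotonic()
    retries = 0
    for attempt in range(MAX_RETRIES + 1):
        try:
            return stream_once(attempt)
        except FakeTransportError:
            if attempt < MAX_RETRIES:
                retries += 1  # misclassified as transient, so try again
    # Every attempt failed; report how many retries were burned and how long it took.
    return retries, time.monotonic() - started
```

Under these assumptions the loop burns one full stale timeout per retry, so with the real 300s value even a small retry budget adds up to minutes of hang for a request the user has already cancelled.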

Changes

  • agent/chat_completion_helpers.py: add a request-local _request_cancelled token in both interruptible_api_call (non-streaming) and interruptible_streaming_api_call.
    • Set to True by the poll loop immediately before the force-close.
    • The worker's exception handler checks it and exits cleanly — no retry, no fallback, no "reconnecting" status — instead of treating the forced error as transient.
    • The token is request-local (not agent._interrupt_requested, which is cleared at turn boundaries), so a stale worker outliving its turn still recognizes its own forced close.
  • tests/agent/test_cascading_interrupt_6600.py: regression suite for the non-streaming path, the request-local property (no cross-request leak), and the negative case (a genuine network error with no interrupt still surfaces).
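The request-local token described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the merged implementation: `FakeClient`, `FakeTransportError`, and this `interruptible_call` signature are hypothetical, and a `threading.Event` is used for the token where the PR describes a value set to True.

```python
import threading


class FakeTransportError(Exception):
    """Stand-in for httpx.RemoteProtocolError (hypothetical, illustration only)."""


class FakeClient:
    """Toy client: request() blocks until close() or a timeout, then fails."""

    def __init__(self, fail_after=5.0):
        self._closed = threading.Event()
        self._fail_after = fail_after

    def request(self):
        self._closed.wait(timeout=self._fail_after)
        raise FakeTransportError("connection dropped")

    def close(self):
        self._closed.set()


def interruptible_call(client, agent_interrupt):
    """Run client.request() on a worker; return the error the caller would see."""
    request_cancelled = threading.Event()  # request-local token, one per call
    outcome = {}

    def worker():
        try:
            outcome["value"] = client.request()
        except FakeTransportError as exc:
            if request_cancelled.is_set():
                # Our own force-close: exit cleanly, never retry.
                outcome["error"] = InterruptedError("request cancelled")
            else:
                # Genuine network failure: leave it for the normal retry path.
                outcome["error"] = exc

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    while thread.is_alive():
        if agent_interrupt.wait(timeout=0.01):
            request_cancelled.set()   # mark the request first
            agent_interrupt.clear()   # simulate the agent flag reset at a turn boundary
            client.close()            # then force-close the client
            break
    thread.join(timeout=5.0)
    return outcome.get("error")
```

In this sketch an interrupted call yields `InterruptedError` even though the agent-level flag has already been cleared, while `interruptible_call(FakeClient(fail_after=0.05), threading.Event())` still returns the transport error, mirroring the negative case covered by the regression suite.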

Adaptation note

#6600 was written against the then-inline _interruptible_api_call / _interruptible_streaming_api_call methods in run_agent.py. Those have since been extracted into agent/chat_completion_helpers.py, so a clean cherry-pick wasn't possible — the fix is reapplied at the new location with the same design. Contributor authorship preserved via Co-authored-by. Closes #6600.

Validation

| | Before | After |
| --- | --- | --- |
| interrupt mid non-streaming call | forced RemoteProtocolError could surface as network error | raises InterruptedError, near-instant |
| interrupt mid streaming call | retry loop burned up to 300s per doomed attempt | exits immediately on cancel |
| genuine network error (no interrupt) | retried | still retried (unchanged) |
| interrupt/stream test suites | 159 passing | 159 passing (+ 3 new) |

Note: this is the interrupt-path half of the community "agent stuck running" report. The other half — memory sync blocking the turn-completion path — is #41945.

[Infographic: cascading-interrupt-fix]

…-interrupt hang)

When agent.interrupt() fires during an active LLM call, the main poll loop
force-closes the worker-local httpx client to stop token generation. That
raises a transport error (RemoteProtocolError) on the worker thread — the
EXPECTED consequence of our own close, not a network bug.

The streaming retry loop misclassified it as a transient connection error
and retried; each doomed retry stalled for the full stream-stale timeout
(up to 300s). Because the gateway caches AIAgent instances per session, the
stale worker outlived the interrupted turn and raced the next turn's request
on shared client state — the root of the multi-minute cascading-interrupt
hang reported in the wild.

Fix: a request-local _request_cancelled token set by the poll loop right
before the force-close, in both interruptible_api_call (non-streaming) and
interruptible_streaming_api_call. The worker's exception handler checks the
token and exits cleanly — no retry, no fallback, no 'reconnecting' status —
instead of treating the forced error as transient. The token is request-
local (not agent._interrupt_requested, which is cleared at turn boundaries)
so a stale worker outliving its turn still recognizes its own forced close.

Original diagnosis and fix by @kristianvast (PR #6600), against the then-
inline methods in run_agent.py. Those were since extracted into
agent/chat_completion_helpers.py, so the fix is reapplied there.

Co-authored-by: Kristian Vastveit <kristianvast@users.noreply.github.com>

github-actions bot (Contributor) commented Jun 8, 2026

🔎 Lint report: salvage/cascading-interrupt-6600 vs origin/main

ruff

Total: 0 on HEAD, 0 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 0 pre-existing issues carried over.

ty (type checker)

Total: 10306 on HEAD, 10304 on base (🆕 +2)

🆕 New issues (2):

| Rule | Count |
| --- | --- |
| unresolved-import | 2 |

First entries
tests/agent/test_cascading_interrupt_6600.py:30: [unresolved-import] unresolved-import: Cannot resolve imported module `pytest`
tests/agent/test_cascading_interrupt_6600.py:29: [unresolved-import] unresolved-import: Cannot resolve imported module `httpx`

✅ Fixed issues: none

Unchanged: 5365 pre-existing issues carried over.

Diagnostics are surfaced as warnings — this check never fails the build.

alt-glitch added labels on Jun 8, 2026: type/bug (Something isn't working), P2 (Medium: degraded but workaround exists), comp/agent (Core agent loop, run_agent.py, prompt builder)
teknium1 merged commit dd0d122 into main on Jun 8, 2026
30 of 31 checks passed
teknium1 deleted the salvage/cascading-interrupt-6600 branch on June 8, 2026 09:19
