fix(agent): don't retry interrupt-induced transport errors (salvage #6600) #41952
Merged
…-interrupt hang)

When agent.interrupt() fires during an active LLM call, the main poll loop force-closes the worker-local httpx client to stop token generation. That raises a transport error (RemoteProtocolError) on the worker thread: the EXPECTED consequence of our own close, not a network bug. The streaming retry loop misclassified it as a transient connection error and retried; each doomed retry stalled for the full stream-stale timeout (up to 300s). Because the gateway caches AIAgent instances per session, the stale worker outlived the interrupted turn and raced the next turn's request on shared client state, which is the root of the multi-minute cascading-interrupt hang reported in the wild.

Fix: a request-local _request_cancelled token, set by the poll loop right before the force-close, in both interruptible_api_call (non-streaming) and interruptible_streaming_api_call. The worker's exception handler checks the token and exits cleanly (no retry, no fallback, no 'reconnecting' status) instead of treating the forced error as transient. The token is request-local (not agent._interrupt_requested, which is cleared at turn boundaries) so a stale worker outliving its turn still recognizes its own forced close.

Original diagnosis and fix by @kristianvast (PR #6600), against the then-inline methods in run_agent.py. Those were since extracted into agent/chat_completion_helpers.py, so the fix is reapplied there.

Co-authored-by: Kristian Vastveit <kristianvast@users.noreply.github.com>
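The ordering described above (set the token first, then force-close) can be illustrated with a minimal, stdlib-only sketch. It is not the code from agent/chat_completion_helpers.py: `FakeClient`, `ForcedCloseError`, and `interruptible_call` are hypothetical stand-ins for the worker-local httpx client, the transport error, and the real helpers, and a `threading.Event` stands in for the `_request_cancelled` token.

```python
import threading


class ForcedCloseError(Exception):
    """Hypothetical stand-in for the transport error raised by a closed client."""


class FakeClient:
    """Hypothetical stand-in for the worker-local HTTP client."""

    def __init__(self):
        self._closed = threading.Event()

    def close(self):
        self._closed.set()

    def stream(self):
        # Block until the client is force-closed, then fail like a torn-down connection.
        self._closed.wait()
        raise ForcedCloseError("connection closed mid-stream")


def interruptible_call(interrupt_event, max_retries=3):
    """Run one request on a worker thread; return (outcome, attempts)."""
    request_cancelled = threading.Event()  # request-local token, one per call
    client = FakeClient()
    result = {"outcome": None, "attempts": 0}

    def worker():
        for _ in range(max_retries):
            result["attempts"] += 1
            try:
                client.stream()
            except ForcedCloseError:
                if request_cancelled.is_set():
                    # Our own force-close caused this error: exit without retrying.
                    result["outcome"] = "interrupted"
                    return
                continue  # not cancelled: treat as transient and retry
        result["outcome"] = "failed"

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    while thread.is_alive():
        if interrupt_event.is_set() and not request_cancelled.is_set():
            request_cancelled.set()  # set the token first...
            client.close()           # ...then force-close the client
        thread.join(timeout=0.01)
    return result["outcome"], result["attempts"]


interrupt = threading.Event()
interrupt.set()  # simulate an interrupt arriving during the call
outcome, attempts = interruptible_call(interrupt)
```

Because the token is set before `close()` is called, the worker can never observe the forced error without also observing the token, so in this sketch it stops after a single attempt instead of burning through its retries.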
Contributor
🔎 Lint report:
| Rule | Count |
|---|---|
| unresolved-import | 2 |
First entries
tests/agent/test_cascading_interrupt_6600.py:30: [unresolved-import] unresolved-import: Cannot resolve imported module `pytest`
tests/agent/test_cascading_interrupt_6600.py:29: [unresolved-import] unresolved-import: Cannot resolve imported module `httpx`
✅ Fixed issues: none
Unchanged: 5365 pre-existing issues carried over.
Diagnostics are surfaced as warnings — this check never fails the build.
Summary
An interrupt during an active LLM call no longer triggers a cascade of doomed retries that can hang the agent for minutes. Salvage of #6600 (@kristianvast), adapted onto current main.

Root cause: when `agent.interrupt()` fires, the main poll loop force-closes the worker-local httpx client to stop token generation. That raises a transport error (`RemoteProtocolError`) on the worker: the expected consequence of our own close, not a network bug. The streaming retry loop misclassified it as transient and retried; each doomed retry stalled for the full stream-stale timeout (up to 300s). Because the gateway caches `AIAgent` instances per session, the stale worker outlived the interrupted turn and raced the next turn's request, producing the multi-minute cascading-interrupt hang.

Changes

- `agent/chat_completion_helpers.py`: add a request-local `_request_cancelled` token in both `interruptible_api_call` (non-streaming) and `interruptible_streaming_api_call`.
  - Set to `True` by the poll loop immediately before the force-close.
  - The worker's exception handler checks the token and exits cleanly: no retry, no fallback, no 'reconnecting' status.
  - Request-local (not `agent._interrupt_requested`, which is cleared at turn boundaries), so a stale worker outliving its turn still recognizes its own forced close.
- `tests/agent/test_cascading_interrupt_6600.py`: regression suite for the non-streaming path, the request-local property (no cross-request leak), and the negative case (a genuine network error with no interrupt still surfaces).

Adaptation note

#6600 was written against the then-inline `_interruptible_api_call` / `_interruptible_streaming_api_call` methods in `run_agent.py`. Those have since been extracted into `agent/chat_completion_helpers.py`, so a clean cherry-pick wasn't possible; the fix is reapplied at the new location with the same design. Contributor authorship preserved via `Co-authored-by`. Closes #6600.

Validation

| Before | After |
|---|---|
| `RemoteProtocolError` could surface as network error | `InterruptedError`, near-instant |

Note: this is the interrupt-path half of the community "agent stuck running" report. The other half (memory sync blocking the turn-completion path) is #41945.
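The two properties the regression suite is described as covering (the request-local behavior and the negative case) can be illustrated with a small sketch. This is not the contents of tests/agent/test_cascading_interrupt_6600.py: the `Agent` class, `start_turn`, and the two classifier functions are hypothetical, and a `threading.Event` stands in for the `_request_cancelled` token.

```python
import threading


class Agent:
    """Hypothetical cached agent: one instance serves many turns of a session."""

    def __init__(self):
        self.interrupt_requested = False  # agent-level flag

    def start_turn(self):
        # The agent-level flag is cleared at the turn boundary.
        self.interrupt_requested = False


def classify_with_agent_flag(agent):
    """Decide how to treat a transport error using only the agent-level flag."""
    return "interrupted" if agent.interrupt_requested else "retry"


def classify_with_token(token):
    """Decide how to treat a transport error using the request-local token."""
    return "interrupted" if token.is_set() else "retry"


agent = Agent()
turn1_token = threading.Event()  # created per request, never shared

# Turn 1 is interrupted: both signals are raised before the force-close.
agent.interrupt_requested = True
turn1_token.set()

# Turn 2 starts before the stale worker from turn 1 has handled its error.
agent.start_turn()

stale_by_flag = classify_with_agent_flag(agent)   # flag was already cleared
stale_by_token = classify_with_token(turn1_token)  # token still set

# Negative case: a new request with no interrupt gets a fresh, unset token,
# so a genuine network error is still treated as a network error.
turn2_token = threading.Event()
genuine_error = classify_with_token(turn2_token)
```

In this sketch the stale worker that consults the agent-level flag would retry, while the one that consults its own token recognizes the forced close; the fresh token for the next request is unaffected, so nothing leaks across requests.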