fix(agent): don't retry interrupt-induced transport errors (salvage #6600) by teknium1 · Pull Request #41952 · NousResearch/hermes-agent

teknium1 · 2026-06-08T08:58:21Z

Summary

An interrupt during an active LLM call no longer triggers a cascade of doomed retries that can hang the agent for minutes. Salvage of #6600 (@kristianvast), adapted onto current main.

Root cause: when agent.interrupt() fires, the main poll loop force-closes the worker-local httpx client to stop token generation. That raises a transport error (RemoteProtocolError) on the worker — the expected consequence of our own close, not a network bug. The streaming retry loop misclassified it as transient and retried; each doomed retry stalled for the full stream-stale timeout (up to 300s). Because the gateway caches AIAgent instances per session, the stale worker outlived the interrupted turn and raced the next turn's request — the multi-minute cascading-interrupt hang.
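The failure mode above can be reproduced in miniature. The sketch below is not the hermes-agent code: `FakeTransportError`, `FakeClient`, and `naive_call_with_retries` are hypothetical stand-ins, and a short sleep stands in for the stream-stale timeout. It only illustrates that a retry loop which treats every transport error as transient keeps retrying after a deliberate close.

```python
import time


class FakeTransportError(Exception):
    """Hypothetical stand-in for a transport error such as RemoteProtocolError."""


class FakeClient:
    """Hypothetical client: once closed, every request stalls and then fails."""

    def __init__(self, stall_seconds):
        self.closed = False
        self.stall_seconds = stall_seconds
        self.calls = 0

    def close(self):
        self.closed = True

    def request(self):
        self.calls += 1
        if self.closed:
            # Stands in for waiting out the stream-stale timeout.
            time.sleep(self.stall_seconds)
            raise FakeTransportError("connection closed")
        return "response"


def naive_call_with_retries(client, max_retries=3):
    """Treats every transport error as transient (the misclassification)."""
    attempts = 0
    while True:
        attempts += 1
        try:
            return client.request()
        except FakeTransportError:
            if attempts > max_retries:
                raise
            # Nothing asks "did we close this ourselves?", so it retries.


client = FakeClient(stall_seconds=0.01)
client.close()  # simulates the poll loop's force-close on interrupt
try:
    naive_call_with_retries(client)
    outcome = "returned"
except FakeTransportError:
    outcome = "raised after exhausting retries"
```

With a real stall of up to 300s per attempt instead of 0.01s, the same four doomed attempts would account for a multi-minute hang.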

Changes

- agent/chat_completion_helpers.py: add a request-local _request_cancelled token in both interruptible_api_call (non-streaming) and interruptible_streaming_api_call.
  - Set to True by the poll loop immediately before the force-close.
  - The worker's exception handler checks it and exits cleanly (no retry, no fallback, no "reconnecting" status) instead of treating the forced error as transient.
  - The token is request-local (not agent._interrupt_requested, which is cleared at turn boundaries), so a stale worker outliving its turn still recognizes its own forced close.
- tests/agent/test_cascading_interrupt_6600.py: regression suite for the non-streaming path, the request-local property (no cross-request leak), and the negative case (a genuine network error with no interrupt still surfaces).
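The cancel-token pattern described in the changes above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in agent/chat_completion_helpers.py: `interruptible_call`, its parameters, and `FakeTransportError` are hypothetical, and a `threading.Event` is used where the PR describes a token "set to True".

```python
import threading


class FakeTransportError(Exception):
    """Hypothetical stand-in for the error raised by a forced client close."""


def interruptible_call(do_request, interrupt_requested, force_close,
                       poll_interval=0.01):
    """Run do_request on a worker thread and poll for an interrupt.

    All names here are illustrative assumptions.
    """
    request_cancelled = threading.Event()  # request-local: one per call
    result = {}

    def worker():
        try:
            result["value"] = do_request()
        except FakeTransportError as exc:
            if request_cancelled.is_set():
                # Our own close caused this error: exit quietly, no retry.
                result["cancelled"] = True
            else:
                # Genuine network failure: let the caller's retry logic see it.
                result["error"] = exc

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    while thread.is_alive():
        if interrupt_requested.is_set() and not request_cancelled.is_set():
            request_cancelled.set()  # mark cancelled BEFORE closing
            force_close()
        thread.join(poll_interval)

    if result.get("cancelled"):
        raise InterruptedError("request cancelled by interrupt")
    if "error" in result:
        raise result["error"]
    return result["value"]


# Interrupted case: the request blocks until the "client" is force-closed.
closed = threading.Event()


def blocking_request():
    closed.wait(timeout=5)
    raise FakeTransportError("connection closed")


interrupt = threading.Event()
interrupt.set()
try:
    interruptible_call(blocking_request, interrupt, closed.set)
    outcome = "returned"
except InterruptedError:
    outcome = "interrupted"


# Negative case: a genuine failure with no interrupt still surfaces.
def failing_request():
    raise FakeTransportError("network down")


try:
    interruptible_call(failing_request, threading.Event(), lambda: None)
    negative = "returned"
except FakeTransportError:
    negative = "network error surfaced"
```

Because the token is created inside the call, clearing the shared interrupt flag at a turn boundary does not erase it, so a worker that outlives its turn can still tell that its error came from a deliberate close.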

Adaptation note

#6600 was written against the then-inline _interruptible_api_call / _interruptible_streaming_api_call methods in run_agent.py. Those have since been extracted into agent/chat_completion_helpers.py, so a clean cherry-pick wasn't possible — the fix is reapplied at the new location with the same design. Contributor authorship preserved via Co-authored-by. Closes #6600.

Validation

| Scenario | Before | After |
|---|---|---|
| interrupt mid non-streaming call | forced `RemoteProtocolError` could surface as network error | raises `InterruptedError`, near-instant |
| interrupt mid streaming call | retry loop burned up to 300s per doomed attempt | exits immediately on cancel |
| genuine network error (no interrupt) | retried | still retried (unchanged) |
| interrupt/stream test suites | 159 passing | 159 passing (+ 3 new) |

Note: this is the interrupt-path half of the community "agent stuck running" report. The other half — memory sync blocking the turn-completion path — is #41945.

@kristianvast

…-interrupt hang)

When agent.interrupt() fires during an active LLM call, the main poll loop force-closes the worker-local httpx client to stop token generation. That raises a transport error (RemoteProtocolError) on the worker thread: the EXPECTED consequence of our own close, not a network bug. The streaming retry loop misclassified it as a transient connection error and retried; each doomed retry stalled for the full stream-stale timeout (up to 300s). Because the gateway caches AIAgent instances per session, the stale worker outlived the interrupted turn and raced the next turn's request on shared client state, which is the root of the multi-minute cascading-interrupt hang reported in the wild.

Fix: a request-local _request_cancelled token set by the poll loop right before the force-close, in both interruptible_api_call (non-streaming) and interruptible_streaming_api_call. The worker's exception handler checks the token and exits cleanly (no retry, no fallback, no 'reconnecting' status) instead of treating the forced error as transient. The token is request-local (not agent._interrupt_requested, which is cleared at turn boundaries) so a stale worker outliving its turn still recognizes its own forced close.

Original diagnosis and fix by @kristianvast (PR #6600), against the then-inline methods in run_agent.py. Those were since extracted into agent/chat_completion_helpers.py, so the fix is reapplied there.

Co-authored-by: Kristian Vastveit <kristianvast@users.noreply.github.com>

github-actions · 2026-06-08T08:59:13Z

🔎 Lint report: `salvage/cascading-interrupt-6600` vs `origin/main`

ruff

Total: 0 on HEAD, 0 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 0 pre-existing issues carried over.

ty (type checker)

Total: 10306 on HEAD, 10304 on base (🆕 +2)

🆕 New issues (2):

| Rule | Count |
|---|---|
| `unresolved-import` | 2 |

First entries

tests/agent/test_cascading_interrupt_6600.py:30: [unresolved-import] unresolved-import: Cannot resolve imported module `pytest`
tests/agent/test_cascading_interrupt_6600.py:29: [unresolved-import] unresolved-import: Cannot resolve imported module `httpx`

✅ Fixed issues: none

Unchanged: 5365 pre-existing issues carried over.

Diagnostics are surfaced as warnings — this check never fails the build.

alt-glitch added labels on Jun 8, 2026: type/bug (Something isn't working), P2 (Medium: degraded but workaround exists), comp/agent (Core agent loop, run_agent.py, prompt builder)

teknium1 merged commit dd0d122 into main Jun 8, 2026
30 of 31 checks passed

teknium1 deleted the salvage/cascading-interrupt-6600 branch June 8, 2026 09:19

teknium1 mentioned this pull request Jun 8, 2026

fix(agent,gateway): voice interrupts + cascading interrupt hang #6600

Closed
