fix(interrupt): propagate to concurrent-tool workers + opt-in debug trace by teknium1 · Pull Request #11907 · NousResearch/hermes-agent

teknium1 · 2026-04-18T02:15:46Z

Summary

Already-running concurrent tools now see the user's interrupt instead of having to run to their own timeout.

Root cause

AIAgent.interrupt() called _set_interrupt(True, self._execution_thread_id) — which only flagged the agent's main thread. Tools running inside _execute_tool_calls_concurrent execute on ThreadPoolExecutor worker threads whose tids are distinct from the agent's, so is_interrupted() inside those tools returned False no matter how many times the gateway called .interrupt(). #10935 closed the gap for pending (not-yet-started) tools via Future.cancel(), but an already-running tool continued until it hit its own timeout.

Changes

run_agent.py: per-agent _tool_worker_threads set + lock. interrupt()/clear_interrupt() fan the per-thread signal out to every tracked worker tid. _run_tool registers on entry / discards on exit, plus an entry-race check for "interrupt arrived between fan-out snapshot and our registration." getattr fallback in interrupt()/clear_interrupt() so existing test stubs built via object.__new__ keep working.
tools/environments/base.py: opt-in _wait_for_process trace (ENTER, per-30s HEARTBEAT with interrupt-state + activity-callback presence, INTERRUPT DETECTED, TIMEOUT, natural EXIT with returncode) behind HERMES_DEBUG_INTERRUPT=1.
tools/interrupt.py: opt-in set_interrupt() trace (caller tid, target tid, _interrupted_threads snapshot) behind the same env flag.
tests/run_agent/test_concurrent_interrupt.py: new regression test runs a polling fake tool in a concurrent worker, calls agent.interrupt() from another thread, asserts is_interrupted() flips to True within 1s and concurrent execution returns in <3s. Second new test guards clear_interrupt() clearing tracked worker bits.

Validation

Target	Before	After
Concurrent worker sees interrupt	runs to tool's own timeout	`is_interrupted() == True` within ~1s
`tests/run_agent/`	762 pass	762 pass (4 relevant new/existing tests included)
`tests/tools/` interrupt+env subset	216 pass	216 pass
Debug trace	none	opt-in via `HERMES_DEBUG_INTERRUPT=1` — off by default

The debug trace is specifically to diagnose the follow-up report shape (a sequential terminal tool also reported stuck on ssh) without needing the user to reproduce locally — next report can ship a log excerpt that points at which branch of _wait_for_process actually fired.

…race interrupt() previously only flagged the agent's _execution_thread_id. Tools running inside _execute_tool_calls_concurrent execute on ThreadPoolExecutor worker threads whose tids are distinct from the agent's, so is_interrupted() inside those tools returned False no matter how many times the gateway called .interrupt() — hung ssh / curl / long make-builds ran to their own timeout. Changes: - run_agent.py: track concurrent-tool worker tids in a per-agent set, fan interrupt()/clear_interrupt() out to them, and handle the register-after-interrupt race at _run_tool entry. getattr fallback for the tracker so test stubs built via object.__new__ keep working. - tools/environments/base.py: opt-in _wait_for_process trace (ENTER, per-30s HEARTBEAT with interrupt+activity-cb state, INTERRUPT DETECTED, TIMEOUT, EXIT) behind HERMES_DEBUG_INTERRUPT=1. - tools/interrupt.py: opt-in set_interrupt() trace (caller tid, target tid, set snapshot) behind the same env flag. - tests: new regression test runs a polling tool on a concurrent worker and asserts is_interrupted() flips to True within ~1s of interrupt(). Second new test guards clear_interrupt() clearing tracked worker bits. Validation: tests/run_agent/ all 762 pass; tests/tools/ interrupt+env subset 216 pass.

github-actions · 2026-04-18T02:16:02Z

⚠️ Supply Chain Risk Detected

This PR contains patterns commonly associated with supply chain attacks. This does not mean the PR is malicious — but these patterns require careful human review before merging.

⚠️ WARNING: Outbound network calls (POST/PUT)

Outbound POST/PUT requests in new code could be data exfiltration. Verify the destination URLs are legitimate.

Matches (first 10):

2802:+        f"[urllib.request.urlopen(urllib.request.Request(u, method='DELETE', "

Automated scan triggered by supply-chain-audit. If this is a false positive, a maintainer can approve after manual review.

…s agent.log AIAgent.__init__ sets logging.getLogger('tools').setLevel(ERROR) when quiet_mode=True (the CLI default). This would silently swallow every INFO-level trace line from the HERMES_DEBUG_INTERRUPT=1 instrumentation added in the parent commit — confirmed by running hermes chat -q with the flag and finding zero trace lines in agent.log even though _wait_for_process was clearly executing (subprocess pid existed). Fix: when HERMES_DEBUG_INTERRUPT=1, each traced module explicitly sets its own logger level to INFO at import time, overriding the 'tools' parent-level filter. Scoped to the opt-in case only, so production (quiet_mode default) logs stay quiet as designed. Validation: hermes chat -q with HERMES_DEBUG_INTERRUPT=1 now writes '_wait_for_process ENTER/EXIT' lines to agent.log as expected.

github-actions · 2026-04-18T03:07:03Z

⚠️ Supply Chain Risk Detected

This PR contains patterns commonly associated with supply chain attacks. This does not mean the PR is malicious — but these patterns require careful human review before merging.

⚠️ WARNING: Outbound network calls (POST/PUT)

Outbound POST/PUT requests in new code could be data exfiltration. Verify the destination URLs are legitimate.

Matches (first 10):

2802:+        f"[urllib.request.urlopen(urllib.request.Request(u, method='DELETE', "

Automated scan triggered by supply-chain-audit. If this is a false positive, a maintainer can approve after manual review.

Tool subprocesses spawned by the local environment backend use os.setsid so they run in their own process group. Before this fix, SIGTERM/SIGHUP to the hermes CLI killed the main thread via KeyboardInterrupt but the worker thread running _wait_for_process never got a chance to call _kill_process — Python exited, the child was reparented to init (PPID=1), and the subprocess ran to its natural end (confirmed live: sleep 300 survived 4+ min after SIGTERM to the agent until manual cleanup). Changes: - cli.py _signal_handler (interactive) + _signal_handler_q (-q mode): route SIGTERM/SIGHUP through agent.interrupt() so the worker's poll loop sees the per-thread interrupt flag and calls _kill_process (os.killpg) on the subprocess group. HERMES_SIGTERM_GRACE (default 1.5s) gives the worker time to complete its SIGTERM+SIGKILL escalation before KeyboardInterrupt unwinds main. - tools/environments/base.py _wait_for_process: wrap the poll loop in try/except (KeyboardInterrupt, SystemExit) so the cleanup fires even on paths the signal handlers don't cover (direct sys.exit, unhandled KI from nested code, etc.). Emits EXCEPTION_EXIT trace line when HERMES_DEBUG_INTERRUPT=1. - New regression test: injects KeyboardInterrupt into a running _wait_for_process via PyThreadState_SetAsyncExc, verifies the subprocess process group is dead within 3s of the exception and that KeyboardInterrupt re-raises cleanly afterward. Validation: | Before | After | |---------------------------------------------------------|--------------------| | sleep 300 survives 4+ min as PPID=1 orphan after SIGTERM | dies within 2 s | | No INTERRUPT DETECTED in trace | INTERRUPT DETECTED fires + killing process group | | tests/tools/test_local_interrupt_cleanup | 1/1 pass | | tests/run_agent/test_concurrent_interrupt | 4/4 pass |

github-actions · 2026-04-18T03:35:11Z

⚠️ Supply Chain Risk Detected

This PR contains patterns commonly associated with supply chain attacks. This does not mean the PR is malicious — but these patterns require careful human review before merging.

⚠️ WARNING: Outbound network calls (POST/PUT)

Outbound POST/PUT requests in new code could be data exfiltration. Verify the destination URLs are legitimate.

Matches (first 10):

2890:+        f"[urllib.request.urlopen(urllib.request.Request(u, method='DELETE', "

Automated scan triggered by supply-chain-audit. If this is a false positive, a maintainer can approve after manual review.

…race (NousResearch#11907) * fix(interrupt): propagate to concurrent-tool workers + opt-in debug trace interrupt() previously only flagged the agent's _execution_thread_id. Tools running inside _execute_tool_calls_concurrent execute on ThreadPoolExecutor worker threads whose tids are distinct from the agent's, so is_interrupted() inside those tools returned False no matter how many times the gateway called .interrupt() — hung ssh / curl / long make-builds ran to their own timeout. Changes: - run_agent.py: track concurrent-tool worker tids in a per-agent set, fan interrupt()/clear_interrupt() out to them, and handle the register-after-interrupt race at _run_tool entry. getattr fallback for the tracker so test stubs built via object.__new__ keep working. - tools/environments/base.py: opt-in _wait_for_process trace (ENTER, per-30s HEARTBEAT with interrupt+activity-cb state, INTERRUPT DETECTED, TIMEOUT, EXIT) behind HERMES_DEBUG_INTERRUPT=1. - tools/interrupt.py: opt-in set_interrupt() trace (caller tid, target tid, set snapshot) behind the same env flag. - tests: new regression test runs a polling tool on a concurrent worker and asserts is_interrupted() flips to True within ~1s of interrupt(). Second new test guards clear_interrupt() clearing tracked worker bits. Validation: tests/run_agent/ all 762 pass; tests/tools/ interrupt+env subset 216 pass. * fix(interrupt-debug): bypass quiet_mode logger filter so trace reaches agent.log AIAgent.__init__ sets logging.getLogger('tools').setLevel(ERROR) when quiet_mode=True (the CLI default). This would silently swallow every INFO-level trace line from the HERMES_DEBUG_INTERRUPT=1 instrumentation added in the parent commit — confirmed by running hermes chat -q with the flag and finding zero trace lines in agent.log even though _wait_for_process was clearly executing (subprocess pid existed). Fix: when HERMES_DEBUG_INTERRUPT=1, each traced module explicitly sets its own logger level to INFO at import time, overriding the 'tools' parent-level filter. Scoped to the opt-in case only, so production (quiet_mode default) logs stay quiet as designed. Validation: hermes chat -q with HERMES_DEBUG_INTERRUPT=1 now writes '_wait_for_process ENTER/EXIT' lines to agent.log as expected. * fix(cli): SIGTERM/SIGHUP no longer orphans tool subprocesses Tool subprocesses spawned by the local environment backend use os.setsid so they run in their own process group. Before this fix, SIGTERM/SIGHUP to the hermes CLI killed the main thread via KeyboardInterrupt but the worker thread running _wait_for_process never got a chance to call _kill_process — Python exited, the child was reparented to init (PPID=1), and the subprocess ran to its natural end (confirmed live: sleep 300 survived 4+ min after SIGTERM to the agent until manual cleanup). Changes: - cli.py _signal_handler (interactive) + _signal_handler_q (-q mode): route SIGTERM/SIGHUP through agent.interrupt() so the worker's poll loop sees the per-thread interrupt flag and calls _kill_process (os.killpg) on the subprocess group. HERMES_SIGTERM_GRACE (default 1.5s) gives the worker time to complete its SIGTERM+SIGKILL escalation before KeyboardInterrupt unwinds main. - tools/environments/base.py _wait_for_process: wrap the poll loop in try/except (KeyboardInterrupt, SystemExit) so the cleanup fires even on paths the signal handlers don't cover (direct sys.exit, unhandled KI from nested code, etc.). Emits EXCEPTION_EXIT trace line when HERMES_DEBUG_INTERRUPT=1. - New regression test: injects KeyboardInterrupt into a running _wait_for_process via PyThreadState_SetAsyncExc, verifies the subprocess process group is dead within 3s of the exception and that KeyboardInterrupt re-raises cleanly afterward. Validation: | Before | After | |---------------------------------------------------------|--------------------| | sleep 300 survives 4+ min as PPID=1 orphan after SIGTERM | dies within 2 s | | No INTERRUPT DETECTED in trace | INTERRUPT DETECTED fires + killing process group | | tests/tools/test_local_interrupt_cleanup | 1/1 pass | | tests/run_agent/test_concurrent_interrupt | 4/4 pass |

teknium1 merged commit 20f2258 into main Apr 18, 2026
5 of 7 checks passed

teknium1 deleted the hermes/hermes-3a6370f8 branch April 18, 2026 03:39

handsdiff mentioned this pull request Apr 18, 2026

fix(terminal): rewrite A && B & to A && { B & } to prevent subshell leak #12207

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(interrupt): propagate to concurrent-tool workers + opt-in debug trace#11907

fix(interrupt): propagate to concurrent-tool workers + opt-in debug trace#11907
teknium1 merged 3 commits into
mainfrom
hermes/hermes-3a6370f8

teknium1 commented Apr 18, 2026

Uh oh!

github-actions Bot commented Apr 18, 2026

Uh oh!

github-actions Bot commented Apr 18, 2026

Uh oh!

github-actions Bot commented Apr 18, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

teknium1 commented Apr 18, 2026

Summary

Root cause

Changes

Validation

Uh oh!

github-actions Bot commented Apr 18, 2026

⚠️ Supply Chain Risk Detected

⚠️ WARNING: Outbound network calls (POST/PUT)

Uh oh!

github-actions Bot commented Apr 18, 2026

⚠️ Supply Chain Risk Detected

⚠️ WARNING: Outbound network calls (POST/PUT)

Uh oh!

github-actions Bot commented Apr 18, 2026

⚠️ Supply Chain Risk Detected

⚠️ WARNING: Outbound network calls (POST/PUT)

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant