fix(kanban): SIGTERM on worker must terminate the process (closes #28181)#34045
Merged
Conversation
The single-query signal handler in cli.py raises KeyboardInterrupt on SIGTERM/SIGHUP. For interactive 'hermes chat -q' that unwinds the main thread cleanly. For kanban workers spawned by the dispatcher, the worker process is likely to have a non-daemon thread alive (terminal _wait_for_process, custom plugins, etc.). With KeyboardInterrupt only the main thread unwinds; the non-daemon thread keeps the process alive, the gateway has already restarted, and the dispatcher's _pid_alive check returns True forever — task stuck in 'running' indefinitely. When HERMES_KANBAN_TASK is set (dispatcher-spawned worker), flush logging + stdout/stderr, then os._exit(0) instead of raising KeyboardInterrupt. The kernel reclaims the PID immediately, and the existing zombie-state detection in _pid_alive flips the task to crashed on the next dispatcher tick. detect_crashed_workers then re-spawns it on the following tick — no manual recovery needed. A SIGALRM(2s) deadman is armed before the flush so a pathological blocking-I/O flush can't wedge the worker forever. In practice the reporter measured flush in <1ms; the alarm is a failsafe, never the common path. Interactive (non-kanban) chat -q is unchanged — the env-gated branch only fires for dispatcher-spawned workers. Live verification on this machine: - Without HERMES_KANBAN_TASK + non-daemon thread alive: process hangs alive 4+ seconds after SIGTERM. Dispatcher's _pid_alive returns True → task stuck. - With HERMES_KANBAN_TASK + same non-daemon thread: process exits in 0.10s via os._exit(0). Dispatcher reclaims on next tick. Tests: - tests/hermes_cli/test_signal_handler_kanban_worker.py (3 cases): end-to-end subprocess test with a non-daemon thread, HERMES_KANBAN_TASK env, SIGTERM, dispatcher-style _pid_alive check. Plus a source-level invariant test catching future refactors that drop the env-gated exit. - 452/452 kanban tests pass. Co-authored-by: andrewhosf <andrewho.sf@gmail.com>
Contributor
🔎 Lint report:
|
| Rule | Count |
|---|---|
unresolved-import |
1 |
First entries
tests/hermes_cli/test_signal_handler_kanban_worker.py:31: [unresolved-import] unresolved-import: Cannot resolve imported module `pytest`
✅ Fixed issues: none
Unchanged: 5028 pre-existing issues carried over.
Diagnostics are surfaced as warnings — this check never fails the build.
1 task
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #28181 — kanban workers stuck in
runningstate forever after gateway SIGTERM.Summary
The single-query signal handler in
cli.py(_signal_handler_q) raisesKeyboardInterrupton SIGTERM/SIGHUP. For interactivehermes chat -qthat unwinds the main thread cleanly. For kanban workers spawned by the dispatcher, the worker process is likely to have a non-daemon thread alive (terminal_wait_for_processdrains, custom plugin threads, etc.). WithKeyboardInterruptonly the main thread unwinds; the non-daemon thread keeps the process alive after the gateway restarts, and_pid_alivereturns True forever → task staysrunning→ claim extended forever → manualhermes kanban block <id>required.The fix: when
HERMES_KANBAN_TASKis set (dispatcher-spawned worker), flush logging + stdout/stderr and callos._exit(0)instead of raisingKeyboardInterrupt. The kernel reclaims the PID immediately,_pid_alive's existing zombie-state check flips the task tocrashed, anddetect_crashed_workersre-spawns it on the next tick. No manual recovery needed.Live reproduction + verification
Verified end-to-end with a synthetic worker that holds a non-daemon thread blocked on
threading.Event().wait()(matches the shape of_wait_for_processblocking on a child subprocess):_pid_alive(pid)resultTrueforever (extended claim)False(zombie state detected by existing Linux probe)running(stuck)crashed→ reclaimed → re-spawnedThe repro and the fix-path test both live in
tests/hermes_cli/test_signal_handler_kanban_worker.pyas 3 regression cases.Design notes
os.environ.get("HERMES_KANBAN_TASK"). Interactivehermes chat -q(no env var) keeps the originalKeyboardInterruptbehavior — unchanged. The env var is the canonical "this process is a dispatcher-spawned worker" signal (set inhermes_cli/kanban_db.py:5921).signal.alarm(2)is armed with a handler that callsos._exit(0). Reporter measured flush in <1ms under normal conditions; the alarm exists as a failsafe in case logging's stream handlers have a pathological blocking close. Never fires in practice; pure belt-and-suspenders.logging.shutdown()→sys.stdout.flush()→sys.stderr.flush()→os._exit(0). Final telemetry trace preserved before the kernel reaps.Why not raise KeyboardInterrupt better
Reviewer feedback on the original PR draft (@hanselhansel) flagged operational concerns about
os._exit(0)bypassing Python cleanup. The reporter (@andrewhosf) verified flush + 24h stability before re-proposing. The version landing here matches the validated design: explicit flush + SIGALRM deadman + env-gated to dispatcher-spawned workers only.A "cleaner unwind" alternative (e.g., signal the non-daemon threads to exit, join them, then sys.exit) is genuinely harder because we don't own all the non-daemon thread surface (custom plugins can spawn anything they want). The signal-handler contract in production code has to terminate the process from outside the thread surface, which is what
os._exitdoes.Validation
tests/hermes_cli/test_signal_handler_kanban_worker.py: 3/3 passHERMES_KANBAN_TASKset → process dies in <2sHERMES_KANBAN_TASK→ original KeyboardInterrupt path, unchangedtests/tools/test_kanban_tools.py+tests/hermes_cli/test_kanban_db.py+tests/hermes_cli/test_kanban_core_functionality.py: 449/449 passCredit
Infographic