Summary
When a Kanban worker process receives SIGTERM (from gateway restart, launchd/systemd cgroup cleanup, enforce_max_runtime, or _terminate_reclaimed_worker), the single-query signal handler (_signal_handler_q in cli.py) calls _agent.interrupt() and raises KeyboardInterrupt — but the Python process does not exit cleanly. It remains in the process table as a zombie (<defunct> on macOS).
The dispatcher's detect_crashed_workers / release_stale_claims probe the worker with os.kill(pid, 0), which succeeds for zombie processes (they still have a PID table entry), so the liveness check returns True. The dispatcher thinks the worker is still alive and keeps extending the claim forever. The task remains running indefinitely and never gets re-dispatched.
Root Cause
In cli.py lines 14144–14158, _signal_handler_q is registered for SIGTERM and SIGHUP in single-query mode (chat -q). When a Kanban worker receives the signal:
- _signal_handler_q calls _agent.interrupt(...) and sleeps for the grace window
- Raises KeyboardInterrupt
- The agent loop dies but the process stays alive as a zombie
- _pid_alive() uses os.kill(pid, 0) which returns True even for zombies
- The dispatcher extends the claim forever, so the task is stuck permanently
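The liveness problem in the steps above can be illustrated with a small sketch. It is not the project's actual _pid_alive(); the helper names pid_exists, is_zombie, and pid_alive are hypothetical, and the zombie check assumes either a Linux-style /proc filesystem or a ps binary that accepts -o stat= (as on macOS).

```python
import os
import subprocess


def pid_exists(pid: int) -> bool:
    """Naive check: signal 0 succeeds for any PID table entry, zombies included."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def is_zombie(pid: int) -> bool:
    """Best-effort zombie detection via /proc (Linux) or ps (macOS/BSD)."""
    if os.path.exists("/proc/self/stat"):
        try:
            with open(f"/proc/{pid}/stat", "rb") as fh:
                data = fh.read()
        except OSError:
            return False  # no entry: the process is gone, not a zombie
        # The state field follows the last ')' that closes the command name.
        state = data[data.rindex(b")") + 2:][:1]
        return state == b"Z"
    try:
        result = subprocess.run(
            ["ps", "-o", "stat=", "-p", str(pid)],
            capture_output=True, text=True, check=False,
        )
    except OSError:
        return False  # ps unavailable: cannot tell, assume not a zombie
    return result.stdout.strip().startswith("Z")


def pid_alive(pid: int) -> bool:
    """Zombie-aware liveness: the PID must exist and must not be defunct."""
    return pid_exists(pid) and not is_zombie(pid)
```

For a child that has exited but has not yet been waited on by its parent, pid_exists() reports True while pid_alive() reports False, which is the distinction the dispatcher's check currently misses.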
Impact
- Kanban tasks stuck in running state forever
- Downstream dependent tasks never execute
- Manual recovery required
- Affects all gateway-managed kanban setups, especially macOS with launchd
Steps to Reproduce
- Set up kanban with dispatch_in_gateway: true
- Run hermes gateway restart while a kanban worker is running
- Observe: the worker process becomes <defunct>, task stays running forever
Proposed Fix
The signal handler should check for HERMES_KANBAN_TASK env var and call block_task() to release the claim before dying. Fix being tested locally.
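A rough sketch of what such a handler could look like, not the fix under test. The factory make_signal_handler and both injected callables are hypothetical; in particular, the (task_id, reason) signature for block_task is an assumption modeled on the CLI workaround, and the grace-window sleep is omitted for brevity.

```python
import os
import signal
from typing import Callable


def make_signal_handler(
    interrupt_agent: Callable[[], None],
    block_task: Callable[[str, str], None],  # assumed signature: (task_id, reason)
) -> Callable[[int, object], None]:
    """Build a SIGTERM/SIGHUP handler that releases the kanban claim before dying."""

    def handler(signum: int, frame: object) -> None:
        interrupt_agent()
        task_id = os.environ.get("HERMES_KANBAN_TASK")
        if task_id:
            reason = f"Worker interrupted by {signal.Signals(signum).name}"
            try:
                block_task(task_id, reason)
            except Exception:
                # Claim cleanup is best effort; it must not mask the shutdown.
                pass
        raise KeyboardInterrupt

    return handler
```

It would be registered with signal.signal(signal.SIGTERM, handler) and likewise for SIGHUP. Releasing the claim inside the handler means the task no longer depends on the dispatcher noticing that the worker is defunct.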
Workaround
hermes kanban block <task_id> "Worker was interrupted — manual recovery"