fix(kanban): SIGTERM on worker must terminate the process (closes #28181) by teknium1 · Pull Request #34045 · NousResearch/hermes-agent

teknium1 · 2026-05-28T18:35:47Z

Fixes #28181 — kanban workers stuck in running state forever after gateway SIGTERM.

Summary

The single-query signal handler in cli.py (_signal_handler_q) raises KeyboardInterrupt on SIGTERM/SIGHUP. For interactive hermes chat -q that unwinds the main thread cleanly. For kanban workers spawned by the dispatcher, the worker process is likely to have a non-daemon thread alive (terminal _wait_for_process drains, custom plugin threads, etc.). With KeyboardInterrupt only the main thread unwinds; the non-daemon thread keeps the process alive after the gateway restarts, and _pid_alive returns True forever → task stays running → claim extended forever → manual hermes kanban block <id> required.

The fix: when HERMES_KANBAN_TASK is set (dispatcher-spawned worker), flush logging + stdout/stderr and call os._exit(0) instead of raising KeyboardInterrupt. The kernel reclaims the PID immediately, _pid_alive's existing zombie-state check flips the task to crashed, and detect_crashed_workers re-spawns it on the next tick. No manual recovery needed.

Live reproduction + verification

Verified end-to-end with a synthetic worker that holds a non-daemon thread blocked on threading.Event().wait() (matches the shape of _wait_for_process blocking on a child subprocess):

Scenario	Without fix	With fix
SIGTERM + non-daemon thread alive	hangs 4+s (process alive, dispatcher polling)	dies in 0.10s (os._exit)
Dispatcher's `_pid_alive(pid)` result	`True` forever (extended claim)	`False` (zombie state detected by existing Linux probe)
Task state on next dispatcher tick	still `running` (stuck)	`crashed` → reclaimed → re-spawned
Manual recovery required	YES	NO

The repro and the fix-path test both live in tests/hermes_cli/test_signal_handler_kanban_worker.py as 3 regression cases.

Design notes

Env-gated: the os._exit path is gated specifically on os.environ.get("HERMES_KANBAN_TASK"). Interactive hermes chat -q (no env var) keeps the original KeyboardInterrupt behavior — unchanged. The env var is the canonical "this process is a dispatcher-spawned worker" signal (set in hermes_cli/kanban_db.py:5921).
SIGALRM deadman: before flushing, signal.alarm(2) is armed with a handler that calls os._exit(0). Reporter measured flush in <1ms under normal conditions; the alarm exists as a failsafe in case logging's stream handlers have a pathological blocking close. Never fires in practice; pure belt-and-suspenders.
Flush order: logging.shutdown() → sys.stdout.flush() → sys.stderr.flush() → os._exit(0). Final telemetry trace preserved before the kernel reaps.
Non-daemon-thread invariant preserved: we don't change anything about how kanban workers spawn threads. Workers can continue to use non-daemon threads freely; the signal handler's job is to ensure SIGTERM still kills the process even when those threads exist.

Why not raise KeyboardInterrupt better

Reviewer feedback on the original PR draft (@hanselhansel) flagged operational concerns about os._exit(0) bypassing Python cleanup. The reporter (@andrewhosf) verified flush + 24h stability before re-proposing. The version landing here matches the validated design: explicit flush + SIGALRM deadman + env-gated to dispatcher-spawned workers only.

A "cleaner unwind" alternative (e.g., signal the non-daemon threads to exit, join them, then sys.exit) is genuinely harder because we don't own all the non-daemon thread surface (custom plugins can spawn anything they want). The signal-handler contract in production code has to terminate the process from outside the thread surface, which is what os._exit does.

Validation

tests/hermes_cli/test_signal_handler_kanban_worker.py: 3/3 pass
- End-to-end SIGTERM with HERMES_KANBAN_TASK set → process dies in <2s
- End-to-end SIGTERM without HERMES_KANBAN_TASK → original KeyboardInterrupt path, unchanged
- Source-level invariant test catching future refactors that drop the env-gated exit
tests/tools/test_kanban_tools.py + tests/hermes_cli/test_kanban_db.py + tests/hermes_cli/test_kanban_core_functionality.py: 449/449 pass

Credit

@andrewhosf — identified the bug, diagnosed the KeyboardInterrupt-vs-non-daemon-thread root cause, designed the os._exit fix concept with flush + SIGALRM deadman, ran 24h stability validation. Co-authored trailer on the commit.
@hanselhansel — operational-risk review of the original draft (state corruption, telemetry loss, stability metrics) that drove the flush-before-exit + deadman design that's actually shipping here.

Infographic

The single-query signal handler in cli.py raises KeyboardInterrupt on SIGTERM/SIGHUP. For interactive 'hermes chat -q' that unwinds the main thread cleanly. For kanban workers spawned by the dispatcher, the worker process is likely to have a non-daemon thread alive (terminal _wait_for_process, custom plugins, etc.). With KeyboardInterrupt only the main thread unwinds; the non-daemon thread keeps the process alive, the gateway has already restarted, and the dispatcher's _pid_alive check returns True forever — task stuck in 'running' indefinitely. When HERMES_KANBAN_TASK is set (dispatcher-spawned worker), flush logging + stdout/stderr, then os._exit(0) instead of raising KeyboardInterrupt. The kernel reclaims the PID immediately, and the existing zombie-state detection in _pid_alive flips the task to crashed on the next dispatcher tick. detect_crashed_workers then re-spawns it on the following tick — no manual recovery needed. A SIGALRM(2s) deadman is armed before the flush so a pathological blocking-I/O flush can't wedge the worker forever. In practice the reporter measured flush in <1ms; the alarm is a failsafe, never the common path. Interactive (non-kanban) chat -q is unchanged — the env-gated branch only fires for dispatcher-spawned workers. Live verification on this machine: - Without HERMES_KANBAN_TASK + non-daemon thread alive: process hangs alive 4+ seconds after SIGTERM. Dispatcher's _pid_alive returns True → task stuck. - With HERMES_KANBAN_TASK + same non-daemon thread: process exits in 0.10s via os._exit(0). Dispatcher reclaims on next tick. Tests: - tests/hermes_cli/test_signal_handler_kanban_worker.py (3 cases): end-to-end subprocess test with a non-daemon thread, HERMES_KANBAN_TASK env, SIGTERM, dispatcher-style _pid_alive check. Plus a source-level invariant test catching future refactors that drop the env-gated exit. - 452/452 kanban tests pass. Co-authored-by: andrewhosf <andrewho.sf@gmail.com>

github-actions · 2026-05-28T18:36:37Z

🔎 Lint report: `hermes/hermes-4f9b6288` vs `origin/main`

ruff

Total: 0 on HEAD, 0 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 0 pre-existing issues carried over.

ty (type checker)

Total: 9545 on HEAD, 9544 on base (🆕 +1)

🆕 New issues (1):

Rule	Count
`unresolved-import`	1

First entries

tests/hermes_cli/test_signal_handler_kanban_worker.py:31: [unresolved-import] unresolved-import: Cannot resolve imported module `pytest`

✅ Fixed issues: none

Unchanged: 5028 pre-existing issues carried over.

Diagnostics are surfaced as warnings — this check never fails the build.

teknium1 merged commit f30db14 into main May 28, 2026
25 checks passed

teknium1 deleted the hermes/hermes-4f9b6288 branch May 28, 2026 19:00

alt-glitch added type/bug Something isn't working P3 Low — cosmetic, nice to have comp/cli CLI entry point, hermes_cli/, setup wizard labels May 28, 2026

teknium1 mentioned this pull request May 28, 2026

[Bug]: Kanban workers stuck in zombie state after SIGTERM — claim never released, task blocked forever #28181

Closed

BrewTestBot mentioned this pull request May 29, 2026

hermes-agent 2026.5.29 Homebrew/homebrew-core#285204

Merged

1 task

github-actions Bot mentioned this pull request May 29, 2026

chore: bump NousResearch/hermes-agent version from v2026.5.16 to v2026.5.29 Docker-Hub-sirmark/docker-hermes-agent#7

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(kanban): SIGTERM on worker must terminate the process (closes #28181)#34045

fix(kanban): SIGTERM on worker must terminate the process (closes #28181)#34045
teknium1 merged 1 commit into
mainfrom
hermes/hermes-4f9b6288

teknium1 commented May 28, 2026

Uh oh!

github-actions Bot commented May 28, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

teknium1 commented May 28, 2026

Summary

Live reproduction + verification

Design notes

Why not raise KeyboardInterrupt better

Validation

Credit

Infographic

Uh oh!

github-actions Bot commented May 28, 2026

🔎 Lint report: hermes/hermes-4f9b6288 vs origin/main

ruff

ty (type checker)

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

🔎 Lint report: `hermes/hermes-4f9b6288` vs `origin/main`