Skip to content

fix(kanban): SIGTERM on worker must terminate the process (closes #28181)#34045

Merged
teknium1 merged 1 commit into
mainfrom
hermes/hermes-4f9b6288
May 28, 2026
Merged

fix(kanban): SIGTERM on worker must terminate the process (closes #28181)#34045
teknium1 merged 1 commit into
mainfrom
hermes/hermes-4f9b6288

Conversation

@teknium1

Copy link
Copy Markdown
Contributor

Fixes #28181 — kanban workers stuck in running state forever after gateway SIGTERM.

Summary

The single-query signal handler in cli.py (_signal_handler_q) raises KeyboardInterrupt on SIGTERM/SIGHUP. For interactive hermes chat -q that unwinds the main thread cleanly. For kanban workers spawned by the dispatcher, the worker process is likely to have a non-daemon thread alive (terminal _wait_for_process drains, custom plugin threads, etc.). With KeyboardInterrupt only the main thread unwinds; the non-daemon thread keeps the process alive after the gateway restarts, and _pid_alive returns True forever → task stays running → claim extended forever → manual hermes kanban block <id> required.

The fix: when HERMES_KANBAN_TASK is set (dispatcher-spawned worker), flush logging + stdout/stderr and call os._exit(0) instead of raising KeyboardInterrupt. The kernel reclaims the PID immediately, _pid_alive's existing zombie-state check flips the task to crashed, and detect_crashed_workers re-spawns it on the next tick. No manual recovery needed.

Live reproduction + verification

Verified end-to-end with a synthetic worker that holds a non-daemon thread blocked on threading.Event().wait() (matches the shape of _wait_for_process blocking on a child subprocess):

Scenario Without fix With fix
SIGTERM + non-daemon thread alive hangs 4+s (process alive, dispatcher polling) dies in 0.10s (os._exit)
Dispatcher's _pid_alive(pid) result True forever (extended claim) False (zombie state detected by existing Linux probe)
Task state on next dispatcher tick still running (stuck) crashed → reclaimed → re-spawned
Manual recovery required YES NO

The repro and the fix-path test both live in tests/hermes_cli/test_signal_handler_kanban_worker.py as 3 regression cases.

Design notes

  • Env-gated: the os._exit path is gated specifically on os.environ.get("HERMES_KANBAN_TASK"). Interactive hermes chat -q (no env var) keeps the original KeyboardInterrupt behavior — unchanged. The env var is the canonical "this process is a dispatcher-spawned worker" signal (set in hermes_cli/kanban_db.py:5921).
  • SIGALRM deadman: before flushing, signal.alarm(2) is armed with a handler that calls os._exit(0). Reporter measured flush in <1ms under normal conditions; the alarm exists as a failsafe in case logging's stream handlers have a pathological blocking close. Never fires in practice; pure belt-and-suspenders.
  • Flush order: logging.shutdown()sys.stdout.flush()sys.stderr.flush()os._exit(0). Final telemetry trace preserved before the kernel reaps.
  • Non-daemon-thread invariant preserved: we don't change anything about how kanban workers spawn threads. Workers can continue to use non-daemon threads freely; the signal handler's job is to ensure SIGTERM still kills the process even when those threads exist.

Why not raise KeyboardInterrupt better

Reviewer feedback on the original PR draft (@hanselhansel) flagged operational concerns about os._exit(0) bypassing Python cleanup. The reporter (@andrewhosf) verified flush + 24h stability before re-proposing. The version landing here matches the validated design: explicit flush + SIGALRM deadman + env-gated to dispatcher-spawned workers only.

A "cleaner unwind" alternative (e.g., signal the non-daemon threads to exit, join them, then sys.exit) is genuinely harder because we don't own all the non-daemon thread surface (custom plugins can spawn anything they want). The signal-handler contract in production code has to terminate the process from outside the thread surface, which is what os._exit does.

Validation

  • tests/hermes_cli/test_signal_handler_kanban_worker.py: 3/3 pass
    • End-to-end SIGTERM with HERMES_KANBAN_TASK set → process dies in <2s
    • End-to-end SIGTERM without HERMES_KANBAN_TASK → original KeyboardInterrupt path, unchanged
    • Source-level invariant test catching future refactors that drop the env-gated exit
  • tests/tools/test_kanban_tools.py + tests/hermes_cli/test_kanban_db.py + tests/hermes_cli/test_kanban_core_functionality.py: 449/449 pass

Credit

  • @andrewhosf — identified the bug, diagnosed the KeyboardInterrupt-vs-non-daemon-thread root cause, designed the os._exit fix concept with flush + SIGALRM deadman, ran 24h stability validation. Co-authored trailer on the commit.
  • @hanselhansel — operational-risk review of the original draft (state corruption, telemetry loss, stability metrics) that drove the flush-before-exit + deadman design that's actually shipping here.

Infographic

kanban-sigterm-fix

The single-query signal handler in cli.py raises KeyboardInterrupt on
SIGTERM/SIGHUP. For interactive 'hermes chat -q' that unwinds the main
thread cleanly. For kanban workers spawned by the dispatcher, the
worker process is likely to have a non-daemon thread alive (terminal
_wait_for_process, custom plugins, etc.). With KeyboardInterrupt only
the main thread unwinds; the non-daemon thread keeps the process alive,
the gateway has already restarted, and the dispatcher's _pid_alive
check returns True forever — task stuck in 'running' indefinitely.

When HERMES_KANBAN_TASK is set (dispatcher-spawned worker), flush
logging + stdout/stderr, then os._exit(0) instead of raising
KeyboardInterrupt. The kernel reclaims the PID immediately, and the
existing zombie-state detection in _pid_alive flips the task to
crashed on the next dispatcher tick. detect_crashed_workers then
re-spawns it on the following tick — no manual recovery needed.

A SIGALRM(2s) deadman is armed before the flush so a pathological
blocking-I/O flush can't wedge the worker forever. In practice the
reporter measured flush in <1ms; the alarm is a failsafe, never
the common path.

Interactive (non-kanban) chat -q is unchanged — the env-gated branch
only fires for dispatcher-spawned workers.

Live verification on this machine:
- Without HERMES_KANBAN_TASK + non-daemon thread alive: process hangs
  alive 4+ seconds after SIGTERM. Dispatcher's _pid_alive returns
  True → task stuck.
- With HERMES_KANBAN_TASK + same non-daemon thread: process exits in
  0.10s via os._exit(0). Dispatcher reclaims on next tick.

Tests:
- tests/hermes_cli/test_signal_handler_kanban_worker.py (3 cases):
  end-to-end subprocess test with a non-daemon thread,
  HERMES_KANBAN_TASK env, SIGTERM, dispatcher-style _pid_alive check.
  Plus a source-level invariant test catching future refactors that
  drop the env-gated exit.
- 452/452 kanban tests pass.

Co-authored-by: andrewhosf <andrewho.sf@gmail.com>
@github-actions

Copy link
Copy Markdown
Contributor

🔎 Lint report: hermes/hermes-4f9b6288 vs origin/main

ruff

Total: 0 on HEAD, 0 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 0 pre-existing issues carried over.

ty (type checker)

Total: 9545 on HEAD, 9544 on base (🆕 +1)

🆕 New issues (1):

Rule Count
unresolved-import 1
First entries
tests/hermes_cli/test_signal_handler_kanban_worker.py:31: [unresolved-import] unresolved-import: Cannot resolve imported module `pytest`

✅ Fixed issues: none

Unchanged: 5028 pre-existing issues carried over.

Diagnostics are surfaced as warnings — this check never fails the build.

@teknium1 teknium1 merged commit f30db14 into main May 28, 2026
25 checks passed
@teknium1 teknium1 deleted the hermes/hermes-4f9b6288 branch May 28, 2026 19:00
@alt-glitch alt-glitch added type/bug Something isn't working P3 Low — cosmetic, nice to have comp/cli CLI entry point, hermes_cli/, setup wizard labels May 28, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/cli CLI entry point, hermes_cli/, setup wizard P3 Low — cosmetic, nice to have type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Kanban workers stuck in zombie state after SIGTERM — claim never released, task blocked forever

2 participants