[Bug]: Kanban workers stuck in zombie state after SIGTERM — claim never released, task blocked forever #28181

@andrewhosf

Summary

When a Kanban worker process receives SIGTERM (from a gateway restart, launchd/systemd cgroup cleanup, enforce_max_runtime, or _terminate_reclaimed_worker), the single-query signal handler (_signal_handler_q in cli.py) calls _agent.interrupt() and raises KeyboardInterrupt, but the Python process does not exit cleanly. It remains in the process table as a zombie (shown as <defunct> in ps output on macOS).

The dispatcher's detect_crashed_workers / release_stale_claims check liveness with os.kill(pid, 0), which succeeds (raises no exception) for zombie processes because they still have a process-table entry. The dispatcher therefore treats the worker as alive and keeps extending the claim. The task stays in the running state indefinitely and is never re-dispatched.
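The false positive described above can be reproduced outside Hermes. The following is a minimal POSIX-only sketch, not the project's code: pid_exists and demo_zombie_probe are hypothetical names used for illustration. It forks a child that exits immediately, probes it with signal 0 while it is still unreaped, then probes again after the parent reaps it.

```python
import os
import time


def pid_exists(pid: int) -> bool:
    """Signal-0 probe: True if the PID still has a process-table entry."""
    try:
        os.kill(pid, 0)  # sends no signal, only performs the existence check
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def demo_zombie_probe() -> tuple[bool, bool]:
    """Return (probe result while zombie, probe result after reaping)."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: exit immediately without cleanup
    time.sleep(0.2)  # parent: let the child exit; it is now an unreaped zombie
    before_reap = pid_exists(pid)
    os.waitpid(pid, 0)  # reap the child, removing its process-table entry
    after_reap = pid_exists(pid)
    return before_reap, after_reap
```

On Linux and macOS, demo_zombie_probe() is expected to return (True, False): the probe only reports the worker as gone once its parent has reaped it.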

Root Cause

In cli.py lines 14144–14158, _signal_handler_q is registered for SIGTERM and SIGHUP in single-query mode (chat -q). When a Kanban worker receives the signal:

  1. _signal_handler_q calls _agent.interrupt(...) and sleeps for the grace window
  2. Raises KeyboardInterrupt
  3. The agent loop dies, but the process lingers in the process table as a zombie
  4. _pid_alive() uses os.kill(pid, 0), which succeeds even for zombies, so the worker is reported as alive
  5. The dispatcher extends the claim forever — task stuck permanently
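One way to make the liveness check in step 4 zombie-aware is to reap the worker with a non-blocking waitpid before falling back to the signal-0 probe. The sketch below assumes the dispatcher is the parent of the worker process, which the report does not confirm. child_alive is a hypothetical helper, not the existing _pid_alive().

```python
import os


def child_alive(pid: int) -> bool:
    """Zombie-aware liveness check for a worker spawned by this process.

    Assumption: the caller is the parent of `pid`. If it is not, the
    function falls back to the plain signal-0 probe, which cannot tell a
    zombie from a running process.
    """
    try:
        reaped_pid, _status = os.waitpid(pid, os.WNOHANG)
    except ChildProcessError:
        # Not our child, or already reaped: fall back to the signal-0 probe.
        try:
            os.kill(pid, 0)
        except ProcessLookupError:
            return False
        except PermissionError:
            return True
        return True
    # waitpid returns 0 while the child is still running; a non-zero PID
    # means the child had exited and has just been reaped.
    return reaped_pid == 0
```

Because waitpid reaps the child as a side effect, a worker that has exited stops showing up as <defunct> after the first call, and later calls report it as dead through the fallback path.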

Impact

  • Kanban tasks stuck in running state forever
  • Downstream dependent tasks never execute
  • Manual recovery required
  • Affects all gateway-managed kanban setups, especially macOS with launchd

Steps to Reproduce

  1. Set up kanban with dispatch_in_gateway: true
  2. Run hermes gateway restart while a kanban worker is running
  3. Observe: the worker process becomes <defunct>, task stays running forever

Proposed Fix

The signal handler should check for the HERMES_KANBAN_TASK environment variable and call block_task() to release the claim before the process dies. A fix is being tested locally.
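A rough sketch of that idea follows; it is not the fix under test. The real signature of block_task() is not given in this report, so the sketch takes a release_claim(task_id, reason) callable as a stand-in. make_claim_releasing_handler and install_claim_releasing_handler are hypothetical names, and exiting via sys.exit instead of raising KeyboardInterrupt is also an assumption.

```python
import os
import signal
import sys
from typing import Callable


def make_claim_releasing_handler(release_claim: Callable[[str, str], None]):
    """Build a signal handler that releases the kanban claim, then exits.

    `release_claim` is a hypothetical stand-in for block_task().
    """

    def _handler(signum, frame):
        task_id = os.environ.get("HERMES_KANBAN_TASK")
        if task_id:
            name = signal.Signals(signum).name
            try:
                release_claim(task_id, f"Worker interrupted by {name}")
            except Exception as exc:
                # Never let a failed release prevent the process from exiting.
                print(f"claim release failed: {exc}", file=sys.stderr)
        # Exit with the conventional 128 + signal-number status.
        sys.exit(128 + int(signum))

    return _handler


def install_claim_releasing_handler(release_claim: Callable[[str, str], None]) -> None:
    """Register the handler for SIGTERM and SIGHUP (main thread only)."""
    handler = make_claim_releasing_handler(release_claim)
    for sig in (signal.SIGTERM, signal.SIGHUP):
        signal.signal(sig, handler)
```

Releasing the claim inside the handler means the task no longer depends on the dispatcher's liveness check noticing that the worker is gone.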

Workaround

hermes kanban block <task_id> "Worker was interrupted — manual recovery"

Metadata

Assignees: No one assigned

Labels:

  • P3: Low (cosmetic, nice to have)
  • comp/cli: CLI entry point, hermes_cli/, setup wizard
  • comp/plugins: Plugin system and bundled plugins
  • type/bug: Something isn't working
