fix(kanban): reap completed worker children in dispatch_once by sonic-netizen · Pull Request #21169 · NousResearch/hermes-agent

sonic-netizen · 2026-05-07T11:40:18Z

Problem

The gateway-embedded dispatcher (default since kanban.dispatch_in_gateway = true) is the parent of every spawned kanban worker. _default_spawn calls subprocess.Popen(..., start_new_session=True) and returns the pid — start_new_session detaches the controlling tty but does not reparent to init, so the gateway keeps each worker as a child until it wait()s for them.

Nothing in the dispatch loop ever calls waitpid. Result: every completed worker becomes a <defunct> zombie that lingers until the gateway exits.

We hit ~430 zombies on a single hermes-agent container after ~40 days of steady kanban traffic, approaching process-table exhaustion on the host. A docker restart is the only mitigation today.

Reproduction

Run hermes gateway run (gateway-embedded dispatch, default mode).
Dispatch any number of kanban tasks; let workers complete naturally.
ps -ef | grep '<defunct>' — every completed worker is a zombie owned by the gateway.

Fix

Add a non-blocking os.waitpid(-1, os.WNOHANG) reap loop at the top of dispatch_once, so every dispatcher tick (default 60s) drains any zombies that accumulated since the last tick.

try:
    while True:
        try:
            _pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break
        if _pid == 0:
            break
except Exception:
    pass

os is already imported in this module, so no new imports are needed.

Why here, not a `SIGCHLD` handler?

signal.signal requires the main thread; the gateway's threading model makes that placement non-trivial.
The dispatcher already runs dispatch_once on a fixed cadence; piggy-backing the reap on that cadence is the smallest, most local change.
Bounded staleness: at default interval=60s the maximum live zombie count is one tick's worth of worker completions — benign on any reasonable system.
No interaction with detect_crashed_workers: that function only inspects rows where status = 'running', and rows reach done (and stop being inspected) before their workers exit. The reap therefore can't accidentally race the crash-detection logic.

Tests

Local verification on this branch

$ scripts/run_tests.sh tests/hermes_cli/test_kanban_db.py
60 passed in 3.92s
$ scripts/run_tests.sh tests/hermes_cli/test_kanban_core_functionality.py \
                       tests/stress/test_concurrency_reclaim_race.py \
                       tests/stress/test_concurrency.py \
                       tests/stress/test_atypical_scenarios.py
139 passed, 1 skipped, 4 warnings

The 2 test_dashboard_direct_status_change_* failures and 4 test_subprocess_e2e.py::Path-not-defined errors I saw exist identically on main without this patch (verified by git stash && git checkout main && rerun), so they're pre-existing baseline issues unrelated to this change. Happy to file separate issues if useful.

Production smoke test on a long-lived container

After applying the patch and restarting the gateway, dispatched two echo tasks back-to-back. Without the patch two zombies persist after completion; with the patch the container's zombie count stays at 0 across both lifecycles.

Risk

Very low. WNOHANG makes the call non-blocking. ChildProcessError (no children to reap) is handled. The outer try/except Exception is defense-in-depth so a misconfigured environment (e.g. running tests under a process supervisor that already reaps) can never break a dispatcher tick.

Found while

Running v0.12.0 in long-lived gateway-embedded mode on a QNAP container — the <defunct> accumulation was visible in ps -ef long before it became a real problem, so flagging upstream rather than only patching locally.

(Filing this as a single PR; I considered a separate started_at-cleanup change but that one turned out to be a v0.12.0-specific issue — main has reframed tasks.started_at as lifetime-of-task with task_runs.started_at carrying per-attempt semantics, so the cleanup I would have proposed is no longer applicable.)

The gateway-embedded dispatcher (default since `kanban.dispatch_in_gateway = true`) is the parent of every spawned kanban worker. `_default_spawn` calls `subprocess.Popen(..., start_new_session=True)` and returns the pid — `start_new_session` detaches the controlling tty but does not reparent to init, so the gateway keeps each worker as a child until it `wait()`s for them. Nothing in the dispatch loop ever calls `waitpid`. Result: every completed worker becomes a `<defunct>` zombie that lingers until the gateway exits. We hit ~430 zombies on a single hermes-agent container after ~40 days of steady kanban traffic, approaching process-table exhaustion on the host. Fix: add a non-blocking reap loop at the top of `dispatch_once`, so every dispatcher tick (default 60s) drains zombies that accumulated since the last tick. WNOHANG keeps the call non-blocking; ChildProcessError means no children to reap. Why here, not a SIGCHLD handler: - signal.signal requires the main thread; gateway threading model makes that placement non-trivial. - Bounded staleness: at default interval=60s the maximum live zombie count is one tick's worth of worker completions. - No interaction with detect_crashed_workers: that function only inspects rows where status='running', and rows reach 'done' (and stop being inspected) before their workers exit.

teknium1 · 2026-05-07T12:05:43Z

Merged via #21183 with your commit cherry-picked onto current main — authorship preserved in git log via rebase merge. Thanks @sonic-netizen! The waitpid reap lands at the top of dispatch_once exactly as you proposed.

alt-glitch added type/bug Something isn't working P3 Low — cosmetic, nice to have comp/plugins Plugin system and bundled plugins comp/cli CLI entry point, hermes_cli/, setup wizard labels May 7, 2026

teknium1 mentioned this pull request May 7, 2026

fix(kanban): heartbeat + reclaim + zombie + retry-cap fixes (#21147, #21141, #21169, #20881) #21183

Merged

teknium1 closed this May 7, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(kanban): reap completed worker children in dispatch_once#21169

fix(kanban): reap completed worker children in dispatch_once#21169
sonic-netizen wants to merge 1 commit into
NousResearch:mainfrom
sonic-netizen:fix/kanban-reap-completed-workers

sonic-netizen commented May 7, 2026

Uh oh!

teknium1 commented May 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

sonic-netizen commented May 7, 2026

Problem

Reproduction

Fix

Why here, not a SIGCHLD handler?

Tests

Risk

Found while

Uh oh!

teknium1 commented May 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Why here, not a `SIGCHLD` handler?