Skip to content

fix(kanban): reap completed worker children in dispatch_once#21169

Closed
sonic-netizen wants to merge 1 commit into
NousResearch:mainfrom
sonic-netizen:fix/kanban-reap-completed-workers
Closed

fix(kanban): reap completed worker children in dispatch_once#21169
sonic-netizen wants to merge 1 commit into
NousResearch:mainfrom
sonic-netizen:fix/kanban-reap-completed-workers

Conversation

@sonic-netizen

Copy link
Copy Markdown
Contributor

Problem

The gateway-embedded dispatcher (default since kanban.dispatch_in_gateway = true) is the parent of every spawned kanban worker. _default_spawn calls subprocess.Popen(..., start_new_session=True) and returns the pid — start_new_session detaches the controlling tty but does not reparent to init, so the gateway keeps each worker as a child until it wait()s for them.

Nothing in the dispatch loop ever calls waitpid. Result: every completed worker becomes a <defunct> zombie that lingers until the gateway exits.

We hit ~430 zombies on a single hermes-agent container after ~40 days of steady kanban traffic, approaching process-table exhaustion on the host. A docker restart is the only mitigation today.

Reproduction

  1. Run hermes gateway run (gateway-embedded dispatch, default mode).
  2. Dispatch any number of kanban tasks; let workers complete naturally.
  3. ps -ef | grep '<defunct>' — every completed worker is a zombie owned by the gateway.

Fix

Add a non-blocking os.waitpid(-1, os.WNOHANG) reap loop at the top of dispatch_once, so every dispatcher tick (default 60s) drains any zombies that accumulated since the last tick.

try:
    while True:
        try:
            _pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break
        if _pid == 0:
            break
except Exception:
    pass

os is already imported in this module, so no new imports are needed.

Why here, not a SIGCHLD handler?

  • signal.signal requires the main thread; the gateway's threading model makes that placement non-trivial.
  • The dispatcher already runs dispatch_once on a fixed cadence; piggy-backing the reap on that cadence is the smallest, most local change.
  • Bounded staleness: at default interval=60s the maximum live zombie count is one tick's worth of worker completions — benign on any reasonable system.
  • No interaction with detect_crashed_workers: that function only inspects rows where status = 'running', and rows reach done (and stop being inspected) before their workers exit. The reap therefore can't accidentally race the crash-detection logic.

Tests

Local verification on this branch

$ scripts/run_tests.sh tests/hermes_cli/test_kanban_db.py
60 passed in 3.92s
$ scripts/run_tests.sh tests/hermes_cli/test_kanban_core_functionality.py \
                       tests/stress/test_concurrency_reclaim_race.py \
                       tests/stress/test_concurrency.py \
                       tests/stress/test_atypical_scenarios.py
139 passed, 1 skipped, 4 warnings

The 2 test_dashboard_direct_status_change_* failures and 4 test_subprocess_e2e.py::Path-not-defined errors I saw exist identically on main without this patch (verified by git stash && git checkout main && rerun), so they're pre-existing baseline issues unrelated to this change. Happy to file separate issues if useful.

Production smoke test on a long-lived container

After applying the patch and restarting the gateway, dispatched two echo tasks back-to-back. Without the patch two zombies persist after completion; with the patch the container's zombie count stays at 0 across both lifecycles.

Risk

Very low. WNOHANG makes the call non-blocking. ChildProcessError (no children to reap) is handled. The outer try/except Exception is defense-in-depth so a misconfigured environment (e.g. running tests under a process supervisor that already reaps) can never break a dispatcher tick.

Found while

Running v0.12.0 in long-lived gateway-embedded mode on a QNAP container — the <defunct> accumulation was visible in ps -ef long before it became a real problem, so flagging upstream rather than only patching locally.

(Filing this as a single PR; I considered a separate started_at-cleanup change but that one turned out to be a v0.12.0-specific issue — main has reframed tasks.started_at as lifetime-of-task with task_runs.started_at carrying per-attempt semantics, so the cleanup I would have proposed is no longer applicable.)

The gateway-embedded dispatcher (default since `kanban.dispatch_in_gateway
= true`) is the parent of every spawned kanban worker. `_default_spawn`
calls `subprocess.Popen(..., start_new_session=True)` and returns the
pid — `start_new_session` detaches the controlling tty but does not
reparent to init, so the gateway keeps each worker as a child until it
`wait()`s for them.

Nothing in the dispatch loop ever calls `waitpid`. Result: every
completed worker becomes a `<defunct>` zombie that lingers until the
gateway exits. We hit ~430 zombies on a single hermes-agent container
after ~40 days of steady kanban traffic, approaching process-table
exhaustion on the host.

Fix: add a non-blocking reap loop at the top of `dispatch_once`, so
every dispatcher tick (default 60s) drains zombies that accumulated
since the last tick. WNOHANG keeps the call non-blocking; ChildProcessError
means no children to reap.

Why here, not a SIGCHLD handler:
- signal.signal requires the main thread; gateway threading model makes
  that placement non-trivial.
- Bounded staleness: at default interval=60s the maximum live zombie
  count is one tick's worth of worker completions.
- No interaction with detect_crashed_workers: that function only inspects
  rows where status='running', and rows reach 'done' (and stop being
  inspected) before their workers exit.
@alt-glitch alt-glitch added type/bug Something isn't working P3 Low — cosmetic, nice to have comp/plugins Plugin system and bundled plugins comp/cli CLI entry point, hermes_cli/, setup wizard labels May 7, 2026
@teknium1

teknium1 commented May 7, 2026

Copy link
Copy Markdown
Contributor

Merged via #21183 with your commit cherry-picked onto current main — authorship preserved in git log via rebase merge. Thanks @sonic-netizen! The waitpid reap lands at the top of dispatch_once exactly as you proposed.

@teknium1 teknium1 closed this May 7, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/cli CLI entry point, hermes_cli/, setup wizard comp/plugins Plugin system and bundled plugins P3 Low — cosmetic, nice to have type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants