Skip to content

fix(kanban): hoist zombie reaper out of dispatch_once#32301

Closed
steveonjava wants to merge 4 commits into
NousResearch:mainfrom
steveonjava:fix/kanban-zombie-reaper-blocked-by-failed-connect
Closed

fix(kanban): hoist zombie reaper out of dispatch_once#32301
steveonjava wants to merge 4 commits into
NousResearch:mainfrom
steveonjava:fix/kanban-zombie-reaper-blocked-by-failed-connect

Conversation

@steveonjava

Copy link
Copy Markdown
Contributor

What does this PR do?

This PR hoists the worker-zombie reaper out of dispatch_once() so it runs even when the dispatcher tick fails before reaching the spawn loop. Previously, if dispatch_once() raised (corrupt DB, connect failure, schema migration error, network glitch on a remote backend), the call to reap_worker_zombies() was skipped, and zombie worker processes from earlier ticks accumulated until either the dispatcher recovered on its own or an operator restarted the gateway.

The fix extracts reap_worker_zombies() to a standalone callable, and moves its invocation from inside dispatch_once() to the per-tick wrapper in the gateway dispatcher loop. The reaper now runs unconditionally each tick, regardless of whether the rest of the tick succeeds. The function also returns the list of reaped PIDs so the caller can log them for telemetry; previously it returned only a count.

This is a small surface change (one function extracted, one call moved up one stack frame) with a high payoff: the failure mode it fixes is the "dispatcher tick raises, zombies accumulate, eventually the host runs out of PIDs in the worker namespace" cascade, which has shown up multiple times in production on this fork when transient EIO or a stuck _validate_sqlite_header causes the rest of the tick to abort.

Related Issues

Refs #21183 (origin of the in-dispatch_once reaper code being improved). No upstream issue tracks this exact failure mode.

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature
  • Breaking change
  • Documentation update

Changes Made

  • hermes_cli/kanban_db.py — Extracted reap_worker_zombies() from inside dispatch_once() into a module-level function. Changed return type from int (count) to list[int] (reaped pids). Internal callers updated. dispatch_once() still calls it at the start of its work for the common case; the new behavior is that the per-tick wrapper in gateway/run.py ALSO calls it independently, so the reaper runs even when dispatch_once() raises.
  • gateway/run.py — Per-tick wrapper now calls reap_worker_zombies() BEFORE entering the per-board dispatch loop and logs the pids with kanban dispatcher: reaped %d zombie worker(s), pids=%s. Exceptions in the reaper are caught and logged separately so they cannot break the dispatcher tick.
  • scripts/release.py — AUTHOR_MAP entry for the contributor.
  • tests/hermes_cli/test_kanban_db.py — 7 new tests: reaper called from per-tick wrapper even when dispatch_once raises; pid list returned and logged; dispatch_once still reaps via the extracted function on the common path; no double-reap when both paths fire; reaper exception isolation; empty-pids path; concurrent-tick safety.

How to Test

scripts/run_tests.sh                                                       # full suite (matches CI)
pytest tests/hermes_cli/test_kanban_db.py -q -k "reap or zombie or hoist"  # the 7 new + adjacent tests

Checklist

The reaper block ran only inside dispatch_once(), so a board DB failure
that prevented dispatch_once() from executing would leave zombie worker
processes unreaped.

Extract the logic into reap_worker_zombies() and call it at the top of
_kanban_dispatcher_watcher's tick loop, before per-board work begins. This
way zombie cleanup runs each tick regardless of whether any board is healthy.

dispatch_once() is updated to call reap_worker_zombies() so behavior on
the normal path is unchanged.
The function previously returned an int count, which prevented the watcher from logging which pids were reaped. Returning list[int] instead lets _kanban_dispatcher_watcher log 'reaped N zombie worker(s), pids=[...]' per AC4.

Updated tests to assert on the pid list rather than the count.
…a_extracted_fn

Extra indent level in the nested with-patch block was cosmetically wrong. Fix to standard 4-space nesting.
@steveonjava steveonjava force-pushed the fix/kanban-zombie-reaper-blocked-by-failed-connect branch from 4a07684 to 286b644 Compare May 25, 2026 23:25
@alt-glitch alt-glitch added type/bug Something isn't working P3 Low — cosmetic, nice to have comp/plugins Plugin system and bundled plugins comp/gateway Gateway runner, session dispatch, delivery labels May 25, 2026
@steveonjava

Copy link
Copy Markdown
Contributor Author

Bundled into #32857 for batch review. This draft remains open as a cherry-pick fallback if maintainers prefer surgical landing.

kshitijk4poor pushed a commit that referenced this pull request May 27, 2026
Reaper now runs at the top of every dispatcher tick regardless of per-board connect() failures. Previously the reaper sat inside dispatch_once after the kanban_db.connect() call — any EIO during connect would skip reaping for that tick, accumulating zombie workers and stale claim_lock rows.

Also: reap_worker_zombies now returns the list of reaped pids (the dispatcher logs them) and a test indentation fix.

Squashes three sibling commits from PR #32301 into one logical change for batch review.
@kshitijk4poor

Copy link
Copy Markdown
Collaborator

Merged via #33482 (commit ffdc937). Cherry-picked with authorship preserved as part of the @steveonjava batch salvage from #32857. Thanks!

mathias3 pushed a commit to mathias3/hermes-agent that referenced this pull request May 28, 2026
Reaper now runs at the top of every dispatcher tick regardless of per-board connect() failures. Previously the reaper sat inside dispatch_once after the kanban_db.connect() call — any EIO during connect would skip reaping for that tick, accumulating zombie workers and stale claim_lock rows.

Also: reap_worker_zombies now returns the list of reaped pids (the dispatcher logs them) and a test indentation fix.

Squashes three sibling commits from PR NousResearch#32301 into one logical change for batch review.
Bryce-huang pushed a commit to wbkunlun/hermes-agent that referenced this pull request May 29, 2026
Reaper now runs at the top of every dispatcher tick regardless of per-board connect() failures. Previously the reaper sat inside dispatch_once after the kanban_db.connect() call — any EIO during connect would skip reaping for that tick, accumulating zombie workers and stale claim_lock rows.

Also: reap_worker_zombies now returns the list of reaped pids (the dispatcher logs them) and a test indentation fix.

Squashes three sibling commits from PR NousResearch#32301 into one logical change for batch review.

#AI commit#
mosaiq-systems pushed a commit to mosaiq-systems/hermes-agent that referenced this pull request May 29, 2026
Reaper now runs at the top of every dispatcher tick regardless of per-board connect() failures. Previously the reaper sat inside dispatch_once after the kanban_db.connect() call — any EIO during connect would skip reaping for that tick, accumulating zombie workers and stale claim_lock rows.

Also: reap_worker_zombies now returns the list of reaped pids (the dispatcher logs them) and a test indentation fix.

Squashes three sibling commits from PR NousResearch#32301 into one logical change for batch review.
KKT-OPT pushed a commit to KKT-OPT/hermes-agent that referenced this pull request May 31, 2026
Reaper now runs at the top of every dispatcher tick regardless of per-board connect() failures. Previously the reaper sat inside dispatch_once after the kanban_db.connect() call — any EIO during connect would skip reaping for that tick, accumulating zombie workers and stale claim_lock rows.

Also: reap_worker_zombies now returns the list of reaped pids (the dispatcher logs them) and a test indentation fix.

Squashes three sibling commits from PR NousResearch#32301 into one logical change for batch review.
gweeteve pushed a commit to gweeteve/hermes-agent that referenced this pull request Jun 2, 2026
Reaper now runs at the top of every dispatcher tick regardless of per-board connect() failures. Previously the reaper sat inside dispatch_once after the kanban_db.connect() call — any EIO during connect would skip reaping for that tick, accumulating zombie workers and stale claim_lock rows.

Also: reap_worker_zombies now returns the list of reaped pids (the dispatcher logs them) and a test indentation fix.

Squashes three sibling commits from PR NousResearch#32301 into one logical change for batch review.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/gateway Gateway runner, session dispatch, delivery comp/plugins Plugin system and bundled plugins P3 Low — cosmetic, nice to have type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants