fix(kanban): hoist zombie reaper out of dispatch_once#32301
Closed
steveonjava wants to merge 4 commits into
The reaper block ran only inside dispatch_once(), so a board DB failure that prevented dispatch_once() from executing would leave zombie worker processes unreaped. Extract the logic into reap_worker_zombies() and call it at the top of _kanban_dispatcher_watcher's tick loop, before per-board work begins. This way zombie cleanup runs each tick regardless of whether any board is healthy. dispatch_once() is updated to call reap_worker_zombies() so behavior on the normal path is unchanged.
The function previously returned an int count, which prevented the watcher from logging which pids were reaped. Returning list[int] instead lets _kanban_dispatcher_watcher log 'reaped N zombie worker(s), pids=[...]' per AC4. Updated tests to assert on the pid list rather than the count.
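The change from a count to a pid list can be illustrated with a small sketch. This is not the actual reap_worker_zombies() body, which this page does not show. The name reap_tracked_children and its tracked_pids parameter are hypothetical, and the sketch assumes a POSIX host where the dispatcher knows the pids of the workers it spawned.

```python
import os


def reap_tracked_children(tracked_pids: set[int]) -> list[int]:
    """Reap exited children among tracked_pids without blocking.

    Hypothetical helper: returns the pids that were reaped so the caller
    can log them, and drops those pids from tracked_pids.
    """
    reaped: list[int] = []
    for pid in sorted(tracked_pids):
        try:
            # WNOHANG makes waitpid return (0, 0) if the child is still running.
            done_pid, _status = os.waitpid(pid, os.WNOHANG)
        except ChildProcessError:
            # Not our child, or already reaped elsewhere: stop tracking it.
            tracked_pids.discard(pid)
            continue
        if done_pid == pid:
            reaped.append(pid)
            tracked_pids.discard(pid)
    return reaped
```

Returning the list keeps the count available through len() while also giving the caller the pids for its log line.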
…a_extracted_fn Extra indent level in the nested with-patch block was cosmetically wrong. Fix to standard 4-space nesting.
4a07684 to
286b644
Compare
This was referenced May 26, 2026
Closed
Contributor Author
Bundled into #32857 for batch review. This draft remains open as a cherry-pick fallback if maintainers prefer surgical landing.
kshitijk4poor
pushed a commit
that referenced
this pull request
May 27, 2026
Reaper now runs at the top of every dispatcher tick regardless of per-board connect() failures. Previously the reaper sat inside dispatch_once after the kanban_db.connect() call — any EIO during connect would skip reaping for that tick, accumulating zombie workers and stale claim_lock rows. Also: reap_worker_zombies now returns the list of reaped pids (the dispatcher logs them) and a test indentation fix. Squashes three sibling commits from PR #32301 into one logical change for batch review.
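The tick ordering described in this commit message can be sketched as follows. This is a simplified illustration, not the code in gateway/run.py. The function run_tick and its injected reap and dispatch_once parameters are hypothetical. The reap log format is the one quoted in the PR description, while the other log messages and the choice to log only when something was reaped are assumptions.

```python
import logging
from typing import Callable, Iterable

logger = logging.getLogger("kanban.sketch")


def run_tick(
    boards: Iterable[str],
    reap: Callable[[], list[int]],
    dispatch_once: Callable[[str], None],
) -> None:
    """One dispatcher tick: reap first, then dispatch each board."""
    # Reap before any per-board work so a failing board cannot skip cleanup.
    try:
        pids = reap()
        if pids:
            logger.info(
                "kanban dispatcher: reaped %d zombie worker(s), pids=%s",
                len(pids),
                pids,
            )
    except Exception:
        # Isolate reaper failures so the rest of the tick still runs.
        logger.exception("kanban dispatcher: zombie reaper failed")

    for board in boards:
        try:
            dispatch_once(board)
        except Exception:
            # A broken board (for example EIO on connect) must not stop the others.
            logger.exception("kanban dispatcher: dispatch failed for board %s", board)
```

Passing the reaper and dispatcher in as callables is only a convenience for the sketch: it makes it easy to check with fakes that the reaper still runs when every board raises.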
Collaborator
Merged via #33482 (commit ffdc937). Cherry-picked with authorship preserved as part of the @steveonjava batch salvage from #32857. Thanks!
Commits with the same message were later pushed to forks and referenced this pull request:
mathias3 pushed a commit to mathias3/hermes-agent, May 28, 2026
Bryce-huang pushed a commit to wbkunlun/hermes-agent, May 29, 2026
mosaiq-systems pushed a commit to mosaiq-systems/hermes-agent, May 29, 2026
KKT-OPT pushed a commit to KKT-OPT/hermes-agent, May 31, 2026
gweeteve pushed a commit to gweeteve/hermes-agent, Jun 2, 2026
What does this PR do?
This PR hoists the worker-zombie reaper out of dispatch_once() so it runs even when the dispatcher tick fails before reaching the spawn loop. Previously, if dispatch_once() raised (corrupt DB, connect failure, schema migration error, network glitch on a remote backend), the reaper was skipped, and zombie worker processes from earlier ticks accumulated until either the dispatcher recovered on its own or an operator restarted the gateway.
The fix extracts reap_worker_zombies() into a standalone callable and adds a call to it in the per-tick wrapper in the gateway dispatcher loop; dispatch_once() still calls it on the common path. The reaper now runs unconditionally each tick, regardless of whether the rest of the tick succeeds. The function also returns the list of reaped PIDs so the caller can log them for telemetry; previously it returned only a count.
This is a small surface change (one function extracted, one call added one stack frame up) with a high payoff: it fixes the cascade in which a dispatcher tick raises, zombies accumulate, and the host eventually runs out of PIDs in the worker namespace. That cascade has shown up multiple times in production on this fork when a transient EIO or a stuck _validate_sqlite_header causes the rest of the tick to abort.
Related Issues
Refs #21183 (origin of the in-dispatch_once reaper code being improved). No upstream issue tracks this exact failure mode.
Type of Change
Changes Made
hermes_cli/kanban_db.py: Extracted reap_worker_zombies() from inside dispatch_once() into a module-level function. Changed the return type from int (count) to list[int] (reaped pids). Internal callers updated. dispatch_once() still calls it at the start of its work for the common case; the new behavior is that the per-tick wrapper in gateway/run.py ALSO calls it independently, so the reaper runs even when dispatch_once() raises.
gateway/run.py: Per-tick wrapper now calls reap_worker_zombies() BEFORE entering the per-board dispatch loop and logs the pids with "kanban dispatcher: reaped %d zombie worker(s), pids=%s". Exceptions in the reaper are caught and logged separately so they cannot break the dispatcher tick.
scripts/release.py: AUTHOR_MAP entry for the contributor.
tests/hermes_cli/test_kanban_db.py: 7 new tests: reaper called from per-tick wrapper even when dispatch_once raises; pid list returned and logged; dispatch_once still reaps via the extracted function on the common path; no double-reap when both paths fire; reaper exception isolation; empty-pids path; concurrent-tick safety.
How to Test
Checklist
fix(kanban): ...)
complete_task/recompute_ready and does NOT touch the reaper path)
scripts/run_tests.sh passes locally