fix(kanban): hoist zombie reaper out of dispatch_once by steveonjava · Pull Request #32301 · NousResearch/hermes-agent

steveonjava · 2026-05-25T23:10:35Z

What does this PR do?

This PR hoists the worker-zombie reaper out of dispatch_once() so it runs even when the dispatcher tick fails before reaching the spawn loop. Previously, if dispatch_once() raised (corrupt DB, connect failure, schema migration error, network glitch on a remote backend), the call to reap_worker_zombies() was skipped, and zombie worker processes from earlier ticks accumulated until either the dispatcher recovered on its own or an operator restarted the gateway.

The fix extracts reap_worker_zombies() to a standalone callable, and moves its invocation from inside dispatch_once() to the per-tick wrapper in the gateway dispatcher loop. The reaper now runs unconditionally each tick, regardless of whether the rest of the tick succeeds. The function also returns the list of reaped PIDs so the caller can log them for telemetry; previously it returned only a count.

This is a small surface change (one function extracted, one call moved up one stack frame) with a high payoff: the failure mode it fixes is the "dispatcher tick raises, zombies accumulate, eventually the host runs out of PIDs in the worker namespace" cascade, which has shown up multiple times in production on this fork when transient EIO or a stuck _validate_sqlite_header causes the rest of the tick to abort.

Related Issues

Refs #21183 (origin of the in-dispatch_once reaper code being improved). No upstream issue tracks this exact failure mode.

Type of Change

Bug fix (non-breaking change which fixes an issue)
New feature
Breaking change
Documentation update

Changes Made

hermes_cli/kanban_db.py — Extracted reap_worker_zombies() from inside dispatch_once() into a module-level function. Changed return type from int (count) to list[int] (reaped pids). Internal callers updated. dispatch_once() still calls it at the start of its work for the common case; the new behavior is that the per-tick wrapper in gateway/run.py ALSO calls it independently, so the reaper runs even when dispatch_once() raises.
gateway/run.py — Per-tick wrapper now calls reap_worker_zombies() BEFORE entering the per-board dispatch loop and logs the pids with kanban dispatcher: reaped %d zombie worker(s), pids=%s. Exceptions in the reaper are caught and logged separately so they cannot break the dispatcher tick.
scripts/release.py — AUTHOR_MAP entry for the contributor.
tests/hermes_cli/test_kanban_db.py — 7 new tests: reaper called from per-tick wrapper even when dispatch_once raises; pid list returned and logged; dispatch_once still reaps via the extracted function on the common path; no double-reap when both paths fire; reaper exception isolation; empty-pids path; concurrent-tick safety.

How to Test

scripts/run_tests.sh                                                       # full suite (matches CI)
pytest tests/hermes_cli/test_kanban_db.py -q -k "reap or zombie or hoist"  # the 7 new + adjacent tests

Checklist

I've read the Contributing Guide
My commit messages follow Conventional Commits (fix(kanban): ...)
I've checked there isn't already a PR for this change (PR fix(kanban): merge complete_task and recompute_ready into a single write txn #31891 is a separate refactor of complete_task / recompute_ready and does NOT touch the reaper path)
My PR includes only changes related to one bug/feature
scripts/run_tests.sh passes locally
I've added tests for my changes (required for bug fixes — 7 new tests)
I've tested on my platform (Linux/macOS/Windows) — human reviewer to confirm

The reaper block ran only inside dispatch_once(), so a board DB failure that prevented dispatch_once() from executing would leave zombie worker processes unreaped. Extract the logic into reap_worker_zombies() and call it at the top of _kanban_dispatcher_watcher's tick loop, before per-board work begins. This way zombie cleanup runs each tick regardless of whether any board is healthy. dispatch_once() is updated to call reap_worker_zombies() so behavior on the normal path is unchanged.

The function previously returned an int count, which prevented the watcher from logging which pids were reaped. Returning list[int] instead lets _kanban_dispatcher_watcher log 'reaped N zombie worker(s), pids=[...]' per AC4. Updated tests to assert on the pid list rather than the count.

…a_extracted_fn Extra indent level in the nested with-patch block was cosmetically wrong. Fix to standard 4-space nesting.

steveonjava · 2026-05-26T22:36:40Z

Bundled into #32857 for batch review. This draft remains open as a cherry-pick fallback if maintainers prefer surgical landing.

Reaper now runs at the top of every dispatcher tick regardless of per-board connect() failures. Previously the reaper sat inside dispatch_once after the kanban_db.connect() call — any EIO during connect would skip reaping for that tick, accumulating zombie workers and stale claim_lock rows. Also: reap_worker_zombies now returns the list of reaped pids (the dispatcher logs them) and a test indentation fix. Squashes three sibling commits from PR #32301 into one logical change for batch review.

kshitijk4poor · 2026-05-27T21:32:44Z

Merged via #33482 (commit ffdc937). Cherry-picked with authorship preserved as part of the @steveonjava batch salvage from #32857. Thanks!

Reaper now runs at the top of every dispatcher tick regardless of per-board connect() failures. Previously the reaper sat inside dispatch_once after the kanban_db.connect() call — any EIO during connect would skip reaping for that tick, accumulating zombie workers and stale claim_lock rows. Also: reap_worker_zombies now returns the list of reaped pids (the dispatcher logs them) and a test indentation fix. Squashes three sibling commits from PR NousResearch#32301 into one logical change for batch review.

Reaper now runs at the top of every dispatcher tick regardless of per-board connect() failures. Previously the reaper sat inside dispatch_once after the kanban_db.connect() call — any EIO during connect would skip reaping for that tick, accumulating zombie workers and stale claim_lock rows. Also: reap_worker_zombies now returns the list of reaped pids (the dispatcher logs them) and a test indentation fix. Squashes three sibling commits from PR NousResearch#32301 into one logical change for batch review. #AI commit#

Reaper now runs at the top of every dispatcher tick regardless of per-board connect() failures. Previously the reaper sat inside dispatch_once after the kanban_db.connect() call — any EIO during connect would skip reaping for that tick, accumulating zombie workers and stale claim_lock rows. Also: reap_worker_zombies now returns the list of reaped pids (the dispatcher logs them) and a test indentation fix. Squashes three sibling commits from PR NousResearch#32301 into one logical change for batch review.

steveonjava added 4 commits May 24, 2026 21:45

chore(release): add steveonjava@gmail.com alias to AUTHOR_MAP

303568a

fix(kanban): correct indentation in test_dispatch_once_still_reaps_vi…

286b644

…a_extracted_fn Extra indent level in the nested with-patch block was cosmetically wrong. Fix to standard 4-space nesting.

steveonjava force-pushed the fix/kanban-zombie-reaper-blocked-by-failed-connect branch from 4a07684 to 286b644 Compare May 25, 2026 23:25

alt-glitch added type/bug Something isn't working P3 Low — cosmetic, nice to have comp/plugins Plugin system and bundled plugins comp/gateway Gateway runner, session dispatch, delivery labels May 25, 2026

kshitijk4poor mentioned this pull request May 27, 2026

fix(kanban): batch-salvage 7 SQLite corruption hardening fixes from #32857 #33482

Merged

kshitijk4poor closed this May 27, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(kanban): hoist zombie reaper out of dispatch_once#32301

fix(kanban): hoist zombie reaper out of dispatch_once#32301
steveonjava wants to merge 4 commits into
NousResearch:mainfrom
steveonjava:fix/kanban-zombie-reaper-blocked-by-failed-connect

steveonjava commented May 25, 2026

Uh oh!

steveonjava commented May 26, 2026

Uh oh!

kshitijk4poor commented May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

steveonjava commented May 25, 2026

What does this PR do?

Related Issues

Type of Change

Changes Made

How to Test

Checklist

Uh oh!

steveonjava commented May 26, 2026

Uh oh!

kshitijk4poor commented May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants