fix(kanban): reap completed worker children in dispatch_once#21169
Closed
sonic-netizen wants to merge 1 commit into
Closed
fix(kanban): reap completed worker children in dispatch_once#21169sonic-netizen wants to merge 1 commit into
sonic-netizen wants to merge 1 commit into
Conversation
The gateway-embedded dispatcher (default since `kanban.dispatch_in_gateway = true`) is the parent of every spawned kanban worker. `_default_spawn` calls `subprocess.Popen(..., start_new_session=True)` and returns the pid — `start_new_session` detaches the controlling tty but does not reparent to init, so the gateway keeps each worker as a child until it `wait()`s for them. Nothing in the dispatch loop ever calls `waitpid`. Result: every completed worker becomes a `<defunct>` zombie that lingers until the gateway exits. We hit ~430 zombies on a single hermes-agent container after ~40 days of steady kanban traffic, approaching process-table exhaustion on the host. Fix: add a non-blocking reap loop at the top of `dispatch_once`, so every dispatcher tick (default 60s) drains zombies that accumulated since the last tick. WNOHANG keeps the call non-blocking; ChildProcessError means no children to reap. Why here, not a SIGCHLD handler: - signal.signal requires the main thread; gateway threading model makes that placement non-trivial. - Bounded staleness: at default interval=60s the maximum live zombie count is one tick's worth of worker completions. - No interaction with detect_crashed_workers: that function only inspects rows where status='running', and rows reach 'done' (and stop being inspected) before their workers exit.
Contributor
|
Merged via #21183 with your commit cherry-picked onto current main — authorship preserved in git log via rebase merge. Thanks @sonic-netizen! The waitpid reap lands at the top of dispatch_once exactly as you proposed. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
The gateway-embedded dispatcher (default since
kanban.dispatch_in_gateway = true) is the parent of every spawned kanban worker._default_spawncallssubprocess.Popen(..., start_new_session=True)and returns the pid —start_new_sessiondetaches the controlling tty but does not reparent to init, so the gateway keeps each worker as a child until itwait()s for them.Nothing in the dispatch loop ever calls
waitpid. Result: every completed worker becomes a<defunct>zombie that lingers until the gateway exits.We hit ~430 zombies on a single hermes-agent container after ~40 days of steady kanban traffic, approaching process-table exhaustion on the host. A
docker restartis the only mitigation today.Reproduction
hermes gateway run(gateway-embedded dispatch, default mode).ps -ef | grep '<defunct>'— every completed worker is a zombie owned by the gateway.Fix
Add a non-blocking
os.waitpid(-1, os.WNOHANG)reap loop at the top ofdispatch_once, so every dispatcher tick (default 60s) drains any zombies that accumulated since the last tick.osis already imported in this module, so no new imports are needed.Why here, not a
SIGCHLDhandler?signal.signalrequires the main thread; the gateway's threading model makes that placement non-trivial.dispatch_onceon a fixed cadence; piggy-backing the reap on that cadence is the smallest, most local change.interval=60sthe maximum live zombie count is one tick's worth of worker completions — benign on any reasonable system.detect_crashed_workers: that function only inspects rows wherestatus = 'running', and rows reachdone(and stop being inspected) before their workers exit. The reap therefore can't accidentally race the crash-detection logic.Tests
Local verification on this branch
The 2
test_dashboard_direct_status_change_*failures and 4test_subprocess_e2e.py::Path-not-definederrors I saw exist identically onmainwithout this patch (verified bygit stash && git checkout main && rerun), so they're pre-existing baseline issues unrelated to this change. Happy to file separate issues if useful.Production smoke test on a long-lived container
After applying the patch and restarting the gateway, dispatched two echo tasks back-to-back. Without the patch two zombies persist after completion; with the patch the container's zombie count stays at 0 across both lifecycles.
Risk
Very low.
WNOHANGmakes the call non-blocking.ChildProcessError(no children to reap) is handled. The outertry/except Exceptionis defense-in-depth so a misconfigured environment (e.g. running tests under a process supervisor that already reaps) can never break a dispatcher tick.Found while
Running v0.12.0 in long-lived gateway-embedded mode on a QNAP container — the
<defunct>accumulation was visible inps -eflong before it became a real problem, so flagging upstream rather than only patching locally.(Filing this as a single PR; I considered a separate
started_at-cleanup change but that one turned out to be a v0.12.0-specific issue —mainhas reframedtasks.started_atas lifetime-of-task withtask_runs.started_atcarrying per-attempt semantics, so the cleanup I would have proposed is no longer applicable.)