fix(kanban): add grace period to detect_crashed_workers to prevent multi-dispatcher race#30727
Closed
steveonjava wants to merge 2 commits into
Closed
Conversation
f6b404a to
6d49e2f
Compare
`detect_crashed_workers` calls `_pid_alive` on every `running` task whose claim is held by this host. The check can transiently return False for a freshly-spawned worker (fork → /proc-visibility lag, or reap-race between SIGCHLD and parent reaping). When a second dispatcher ticks inside that window it reclaims the task and spawns a duplicate worker. Add `DEFAULT_CRASH_GRACE_SECONDS = 30` and an `HERMES_KANBAN_CRASH_GRACE_SECONDS` env-var override. `detect_crashed_workers` skips the liveness check when `time.time() - started_at < grace`. The existing 15-minute claim TTL still reclaims genuinely-crashed workers; grace only suppresses the launch-window false positive. `HERMES_KANBAN_CRASH_GRACE_SECONDS=0` is set on the `kanban_home` fixture in `test_kanban_core_functionality.py` so existing tests that assert immediate reclaim retain pre-fix semantics. Companion to merged PR NousResearch#23442 (`release_stale_claims`, closes NousResearch#23025), which addressed the same multi-dispatcher race in the stale-claim path. Related: NousResearch#20015 (`_pid_alive` false-negative behaviour), NousResearch#22926 (stale-claim auto-cleanup).
6d49e2f to
88c4b8d
Compare
This was referenced May 26, 2026
Closed
Contributor
Author
|
Bundled into #32857 for batch review. This draft remains open as a cherry-pick fallback if maintainers prefer surgical landing. |
Collaborator
|
Merged via #33482 (commit c002668). Cherry-picked with authorship preserved as part of the @steveonjava batch salvage from #32857. Thanks! |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What does this PR do?
Adds a 30-second grace period to
detect_crashed_workers()to suppressfalse-positive reclaims of freshly-spawned worker PIDs. Under
dispatch_in_gateway: truewith two or more gateways,_pid_alive()can transiently return False during the fork →
/proc-visibilitywindow. The sibling dispatcher then reclaims the task and spawns a
duplicate worker. The existing 15-minute claim TTL still catches
genuinely-crashed workers; the grace only covers the launch window.
HERMES_KANBAN_CRASH_GRACE_SECONDSoverrides the default. Value0restores the prior immediate-reclaim behaviour and is set on the
kanban_hometest fixture so existing crash-detection tests continueto pass.
Related Issue
This PR is a companion to merged PR #23442 (
fix(kanban): extend stale claim instead of killing live worker), which addressed the samemulti-dispatcher race in
release_stale_claims()and closed#23025 (recurring stall-loop symptom).
Related context:
deployments; root issue closed by fix(kanban): extend stale claim instead of killing live worker (salvage #23071) #23442. This PR closes the
remaining
detect_crashed_workerspath._pid_alivefalse-negative behaviour on macOSthat this PR is also resilient to (grace period covers any
short-lived false negative, platform-independent).
for this PR.
No exact-match upstream issue exists for the
detect_crashed_workerslaunch-window race specifically. Open to filing one if maintainers
prefer.
Type of Change
Changes Made
hermes_cli/kanban_db.py:DEFAULT_CRASH_GRACE_SECONDS = 30constant._resolve_crash_grace_seconds()helper that readsHERMES_KANBAN_CRASH_GRACE_SECONDSwith validation/fallback.detect_crashed_workers()now selectsstarted_atand skips the_pid_alivecheck whentime.time() - started_at < grace.tests/hermes_cli/test_kanban_db.py: 3 regression tests(
test_detect_crashed_workers_skips_freshly_claimed_tasks,test_detect_crashed_workers_grace_period_env_override,test_resolve_crash_grace_seconds_handles_bad_env).tests/hermes_cli/test_kanban_core_functionality.py: fixture patchHERMES_KANBAN_CRASH_GRACE_SECONDS=0onkanban_homeso existingcrash-detection tests retain pre-fix semantics.
scripts/release.py: addsteveonjava@gmail.com → steveonjavatoAUTHOR_MAP(required by.github/workflows/contributor-check.ymlfor first-time contributors).
Diff: 4 files, +115 / -1.
How to Test
Checklist
Code
(closest is fix(kanban): extend stale claim instead of killing live worker (salvage #23071) #23442; this is the companion fix to a different code
path in the same race class)
Documentation & Housekeeping
HERMES_KANBAN_CRASH_GRACE_SECONDSis intentionally env-only.timeonly._pid_alive()already usespsutil.pid_exists().