Skip to content

fix(kanban): add grace period to detect_crashed_workers to prevent multi-dispatcher race#30727

Closed
steveonjava wants to merge 2 commits into
NousResearch:mainfrom
steveonjava:feat/kanban-detect-crashed-workers-grace-period
Closed

fix(kanban): add grace period to detect_crashed_workers to prevent multi-dispatcher race#30727
steveonjava wants to merge 2 commits into
NousResearch:mainfrom
steveonjava:feat/kanban-detect-crashed-workers-grace-period

Conversation

@steveonjava

@steveonjava steveonjava commented May 23, 2026

Copy link
Copy Markdown
Contributor

⚠️ This PR adds a grace period to detect_crashed_workers() to prevent multi-dispatcher race conditions. It is a correctness fix for an internal dispatcher heuristic — not a change to any security boundary (isolation, auth, allowlists, or credential handling). It has been classified as public_pr per the upstream SECURITY.md policy which states in-process heuristics are not security boundaries.

What does this PR do?

Adds a 30-second grace period to detect_crashed_workers() to suppress
false-positive reclaims of freshly-spawned worker PIDs. Under
dispatch_in_gateway: true with two or more gateways, _pid_alive()
can transiently return False during the fork → /proc-visibility
window. The sibling dispatcher then reclaims the task and spawns a
duplicate worker. The existing 15-minute claim TTL still catches
genuinely-crashed workers; the grace only covers the launch window.

HERMES_KANBAN_CRASH_GRACE_SECONDS overrides the default. Value 0
restores the prior immediate-reclaim behaviour and is set on the
kanban_home test fixture so existing crash-detection tests continue
to pass.

Related Issue

This PR is a companion to merged PR #23442 (fix(kanban): extend stale claim instead of killing live worker), which addressed the same
multi-dispatcher race in release_stale_claims() and closed
#23025 (recurring stall-loop symptom).

Related context:

No exact-match upstream issue exists for the detect_crashed_workers
launch-window race specifically. Open to filing one if maintainers
prefer.

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)

Changes Made

  • hermes_cli/kanban_db.py:
    • Add DEFAULT_CRASH_GRACE_SECONDS = 30 constant.
    • Add _resolve_crash_grace_seconds() helper that reads
      HERMES_KANBAN_CRASH_GRACE_SECONDS with validation/fallback.
    • detect_crashed_workers() now selects started_at and skips the
      _pid_alive check when time.time() - started_at < grace.
  • tests/hermes_cli/test_kanban_db.py: 3 regression tests
    (test_detect_crashed_workers_skips_freshly_claimed_tasks,
    test_detect_crashed_workers_grace_period_env_override,
    test_resolve_crash_grace_seconds_handles_bad_env).
  • tests/hermes_cli/test_kanban_core_functionality.py: fixture patch
    HERMES_KANBAN_CRASH_GRACE_SECONDS=0 on kanban_home so existing
    crash-detection tests retain pre-fix semantics.
  • scripts/release.py: add steveonjava@gmail.com → steveonjava to
    AUTHOR_MAP (required by .github/workflows/contributor-check.yml
    for first-time contributors).

Diff: 4 files, +115 / -1.

How to Test

# Grace-period regression tests (3 new)
scripts/run_tests.sh tests/hermes_cli/test_kanban_db.py -- -k "crash_grace or detect_crashed_workers_skips or detect_crashed_workers_grace or resolve_crash_grace" -q

# Existing crash-detection tests continue to pass with fixture patch
scripts/run_tests.sh tests/hermes_cli/test_kanban_core_functionality.py -- -q

# Optional: reproduce pre-fix behaviour by disabling the grace
HERMES_KANBAN_CRASH_GRACE_SECONDS=0 scripts/run_tests.sh tests/hermes_cli/test_kanban_db.py -- -q

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits
  • I searched for existing PRs to make sure this isn't a duplicate
    (closest is fix(kanban): extend stale claim instead of killing live worker (salvage #23071) #23442; this is the companion fix to a different code
    path in the same race class)
  • My PR contains only changes related to this fix/feature
  • I've run the affected tests; the 3 new regression tests pass
  • I've added tests for my changes
  • I've tested on my platform
    • Flag for review: verified on Linux only.

Documentation & Housekeeping

  • I've updated relevant documentation — or N/A
    • N/A — env-var-only override, no doc surface changes.
  • I've updated cli-config.yaml.example if I added/changed config keys — or N/A
    • N/A — HERMES_KANBAN_CRASH_GRACE_SECONDS is intentionally env-only.
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — or N/A
    • N/A — internal heuristic only.
  • I've considered cross-platform impact — or N/A
    • Yes — SQLite + stdlib time only. _pid_alive() already uses
      psutil.pid_exists().
  • I've updated tool descriptions/schemas if I changed tool behavior — or N/A
    • N/A — no tool changes.

@alt-glitch alt-glitch added type/bug Something isn't working P3 Low — cosmetic, nice to have comp/cli CLI entry point, hermes_cli/, setup wizard labels May 23, 2026
@steveonjava steveonjava force-pushed the feat/kanban-detect-crashed-workers-grace-period branch from f6b404a to 6d49e2f Compare May 24, 2026 02:24
`detect_crashed_workers` calls `_pid_alive` on every `running` task whose
claim is held by this host. The check can transiently return False for a
freshly-spawned worker (fork → /proc-visibility lag, or reap-race
between SIGCHLD and parent reaping). When a second dispatcher ticks
inside that window it reclaims the task and spawns a duplicate worker.

Add `DEFAULT_CRASH_GRACE_SECONDS = 30` and an
`HERMES_KANBAN_CRASH_GRACE_SECONDS` env-var override.
`detect_crashed_workers` skips the liveness check when
`time.time() - started_at < grace`. The existing 15-minute claim TTL
still reclaims genuinely-crashed workers; grace only suppresses the
launch-window false positive.

`HERMES_KANBAN_CRASH_GRACE_SECONDS=0` is set on the `kanban_home`
fixture in `test_kanban_core_functionality.py` so existing tests that
assert immediate reclaim retain pre-fix semantics.

Companion to merged PR NousResearch#23442 (`release_stale_claims`, closes NousResearch#23025),
which addressed the same multi-dispatcher race in the stale-claim path.
Related: NousResearch#20015 (`_pid_alive` false-negative behaviour),
NousResearch#22926 (stale-claim auto-cleanup).
@steveonjava steveonjava force-pushed the feat/kanban-detect-crashed-workers-grace-period branch from 6d49e2f to 88c4b8d Compare May 24, 2026 03:28
@steveonjava

Copy link
Copy Markdown
Contributor Author

Bundled into #32857 for batch review. This draft remains open as a cherry-pick fallback if maintainers prefer surgical landing.

@kshitijk4poor

Copy link
Copy Markdown
Collaborator

Merged via #33482 (commit c002668). Cherry-picked with authorship preserved as part of the @steveonjava batch salvage from #32857. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/cli CLI entry point, hermes_cli/, setup wizard P3 Low — cosmetic, nice to have type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants