fix(kanban): add grace period to detect_crashed_workers to prevent multi-dispatcher race by steveonjava · Pull Request #30727 · NousResearch/hermes-agent

steveonjava · 2026-05-23T03:33:14Z

⚠️ This PR adds a grace period to detect_crashed_workers() to prevent multi-dispatcher race conditions. It is a correctness fix for an internal dispatcher heuristic — not a change to any security boundary (isolation, auth, allowlists, or credential handling). It has been classified as public_pr per the upstream SECURITY.md policy which states in-process heuristics are not security boundaries.

What does this PR do?

Adds a 30-second grace period to detect_crashed_workers() to suppress
false-positive reclaims of freshly-spawned worker PIDs. Under
dispatch_in_gateway: true with two or more gateways, _pid_alive()
can transiently return False during the fork → /proc-visibility
window. The sibling dispatcher then reclaims the task and spawns a
duplicate worker. The existing 15-minute claim TTL still catches
genuinely-crashed workers; the grace only covers the launch window.

HERMES_KANBAN_CRASH_GRACE_SECONDS overrides the default. Value 0
restores the prior immediate-reclaim behaviour and is set on the
kanban_home test fixture so existing crash-detection tests continue
to pass.

Related Issue

This PR is a companion to merged PR #23442 (fix(kanban): extend stale claim instead of killing live worker), which addressed the same
multi-dispatcher race in release_stale_claims() and closed
#23025 (recurring stall-loop symptom).

Related context:

recurring stall loop: kanban workers repeatedly reclaimed with stale_lock, zero output across respawns #23025 (closed) — recurring stall loop in multi-gateway
deployments; root issue closed by fix(kanban): extend stale claim instead of killing live worker (salvage #23071) #23442. This PR closes the
remaining detect_crashed_workers path.
kanban dispatcher: macOS zombie detection is a no-op — _pid_alive returns True for defunct workers #20015 (closed) — _pid_alive false-negative behaviour on macOS
that this PR is also resilient to (grace period covers any
short-lived false negative, platform-independent).
[Bug]: Kanban stale claim locks from dead workers have no auto-cleanup — tasks permanently stuck until manual intervention #22926 (open) — broader stale-claim auto-cleanup; out of scope
for this PR.

No exact-match upstream issue exists for the detect_crashed_workers
launch-window race specifically. Open to filing one if maintainers
prefer.

Type of Change

🐛 Bug fix (non-breaking change that fixes an issue)

Changes Made

hermes_cli/kanban_db.py:
- Add DEFAULT_CRASH_GRACE_SECONDS = 30 constant.
- Add _resolve_crash_grace_seconds() helper that reads
  HERMES_KANBAN_CRASH_GRACE_SECONDS with validation/fallback.
- detect_crashed_workers() now selects started_at and skips the
  _pid_alive check when time.time() - started_at < grace.
tests/hermes_cli/test_kanban_db.py: 3 regression tests
(test_detect_crashed_workers_skips_freshly_claimed_tasks,
test_detect_crashed_workers_grace_period_env_override,
test_resolve_crash_grace_seconds_handles_bad_env).
tests/hermes_cli/test_kanban_core_functionality.py: fixture patch
HERMES_KANBAN_CRASH_GRACE_SECONDS=0 on kanban_home so existing
crash-detection tests retain pre-fix semantics.
scripts/release.py: add steveonjava@gmail.com → steveonjava to
AUTHOR_MAP (required by .github/workflows/contributor-check.yml
for first-time contributors).

Diff: 4 files, +115 / -1.

How to Test

# Grace-period regression tests (3 new)
scripts/run_tests.sh tests/hermes_cli/test_kanban_db.py -- -k "crash_grace or detect_crashed_workers_skips or detect_crashed_workers_grace or resolve_crash_grace" -q

# Existing crash-detection tests continue to pass with fixture patch
scripts/run_tests.sh tests/hermes_cli/test_kanban_core_functionality.py -- -q

# Optional: reproduce pre-fix behaviour by disabling the grace
HERMES_KANBAN_CRASH_GRACE_SECONDS=0 scripts/run_tests.sh tests/hermes_cli/test_kanban_db.py -- -q

Checklist

Code

I've read the Contributing Guide
My commit messages follow Conventional Commits
I searched for existing PRs to make sure this isn't a duplicate
(closest is fix(kanban): extend stale claim instead of killing live worker (salvage #23071) #23442; this is the companion fix to a different code
path in the same race class)
My PR contains only changes related to this fix/feature
I've run the affected tests; the 3 new regression tests pass
I've added tests for my changes
I've tested on my platform
- Flag for review: verified on Linux only.

Documentation & Housekeeping

I've updated relevant documentation — or N/A
- N/A — env-var-only override, no doc surface changes.
I've updated cli-config.yaml.example if I added/changed config keys — or N/A
- N/A — HERMES_KANBAN_CRASH_GRACE_SECONDS is intentionally env-only.
I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — or N/A
- N/A — internal heuristic only.
I've considered cross-platform impact — or N/A
- Yes — SQLite + stdlib time only. _pid_alive() already uses
  psutil.pid_exists().
I've updated tool descriptions/schemas if I changed tool behavior — or N/A
- N/A — no tool changes.

`detect_crashed_workers` calls `_pid_alive` on every `running` task whose claim is held by this host. The check can transiently return False for a freshly-spawned worker (fork → /proc-visibility lag, or reap-race between SIGCHLD and parent reaping). When a second dispatcher ticks inside that window it reclaims the task and spawns a duplicate worker. Add `DEFAULT_CRASH_GRACE_SECONDS = 30` and an `HERMES_KANBAN_CRASH_GRACE_SECONDS` env-var override. `detect_crashed_workers` skips the liveness check when `time.time() - started_at < grace`. The existing 15-minute claim TTL still reclaims genuinely-crashed workers; grace only suppresses the launch-window false positive. `HERMES_KANBAN_CRASH_GRACE_SECONDS=0` is set on the `kanban_home` fixture in `test_kanban_core_functionality.py` so existing tests that assert immediate reclaim retain pre-fix semantics. Companion to merged PR NousResearch#23442 (`release_stale_claims`, closes NousResearch#23025), which addressed the same multi-dispatcher race in the stale-claim path. Related: NousResearch#20015 (`_pid_alive` false-negative behaviour), NousResearch#22926 (stale-claim auto-cleanup).

steveonjava · 2026-05-26T22:36:32Z

Bundled into #32857 for batch review. This draft remains open as a cherry-pick fallback if maintainers prefer surgical landing.

kshitijk4poor · 2026-05-27T21:32:50Z

Merged via #33482 (commit c002668). Cherry-picked with authorship preserved as part of the @steveonjava batch salvage from #32857. Thanks!

alt-glitch added type/bug Something isn't working P3 Low — cosmetic, nice to have comp/cli CLI entry point, hermes_cli/, setup wizard labels May 23, 2026

steveonjava force-pushed the feat/kanban-detect-crashed-workers-grace-period branch from f6b404a to 6d49e2f Compare May 24, 2026 02:24

steveonjava force-pushed the feat/kanban-detect-crashed-workers-grace-period branch from 6d49e2f to 88c4b8d Compare May 24, 2026 03:28

Merge branch 'main' into feat/kanban-detect-crashed-workers-grace-period

6ff4ee4

steveonjava marked this pull request as ready for review May 24, 2026 03:32

This was referenced May 26, 2026

fix(kanban): batch-salvage 8 SQLite corruption hardening fixes (closes #31158, refs #29610) #32857

Closed

Bug: embedded Kanban dispatcher still leaks sqlite/WAL file descriptors after #28301 #29610

Closed

kshitijk4poor mentioned this pull request May 27, 2026

fix(kanban): batch-salvage 7 SQLite corruption hardening fixes from #32857 #33482

Merged

kshitijk4poor closed this May 27, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(kanban): add grace period to detect_crashed_workers to prevent multi-dispatcher race#30727

fix(kanban): add grace period to detect_crashed_workers to prevent multi-dispatcher race#30727
steveonjava wants to merge 2 commits into
NousResearch:mainfrom
steveonjava:feat/kanban-detect-crashed-workers-grace-period

steveonjava commented May 23, 2026 •

edited

Loading

Uh oh!

steveonjava commented May 26, 2026

Uh oh!

kshitijk4poor commented May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

steveonjava commented May 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

Related Issue

Type of Change

Changes Made

How to Test

Checklist

Code

Documentation & Housekeeping

Uh oh!

steveonjava commented May 26, 2026

Uh oh!

kshitijk4poor commented May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

steveonjava commented May 23, 2026 •

edited

Loading