fix(kanban): extend stale claim instead of killing live worker (salvage #23071) by teknium1 · Pull Request #23442 · NousResearch/hermes-agent

teknium1 · 2026-05-10T22:22:53Z

Summary

Stops the kanban dispatcher from killing healthy workers that are slow. Workers running slow models (kimi-k2.6 was the reported case) can spend longer than the 15-min DEFAULT_CLAIM_TTL_SECONDS inside a single tool-free LLM call — they make no tool calls, so they don't heartbeat, so the dispatcher used to mark the claim stale and SIGTERM the worker mid-flight. The respawned worker hit the same trap, producing the empty-workspace stall loop reported in #23025.

How

release_stale_claims() now checks if the worker's host-local PID is alive before reclaiming. If alive: extend the claim by another DEFAULT_CLAIM_TTL_SECONDS and emit a claim_extended event. If dead (or non-host-local): reclaim as before.
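
The extend-or-reclaim branch described above can be sketched roughly as follows. This is an illustration, not the code in hermes_cli/kanban_db.py: `pid_alive` and `handle_stale_claim` are hypothetical names, the probe uses `os.kill(pid, 0)` (POSIX only), and computing the new expiry from the current time rather than from the old expiry is an assumption.

```python
import os
import time

DEFAULT_CLAIM_TTL_SECONDS = 15 * 60  # 15 minutes, per the PR description


def pid_alive(pid: int) -> bool:
    """Probe a PID with signal 0 (POSIX). Stand-in for the real _pid_alive."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 performs the existence check without signalling
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True  # process exists but is owned by another user
    return True


def handle_stale_claim(claim_expires, worker_pid, host_local, now=None):
    """Return (action, new_claim_expires) for a single claim.

    Assumption: the extension is measured from `now`.
    """
    now = time.time() if now is None else now
    if claim_expires > now:
        return "keep", claim_expires  # claim has not expired yet
    if host_local and worker_pid is not None and pid_alive(worker_pid):
        # Worker is slow but alive: extend instead of killing it.
        return "claim_extended", now + DEFAULT_CLAIM_TTL_SECONDS
    # Dead PID, or a PID we cannot interpret: reclaim as before.
    return "reclaimed", None
```

A caller would map `"claim_extended"` to the new event type and `"reclaimed"` to the existing reclaim path.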

Upper bounds are unchanged:

enforce_max_runtime still hard-caps task runtime per the max_runtime_seconds column (catches genuinely-stuck-but-PID-alive workers — deadlocks, infinite loops).
detect_crashed_workers still reaps workers whose PID has vanished.

The host-local check (lock.startswith(host_prefix) from _claimer_id().split(":", 1)[0]) means we only trust _pid_alive when the lock was set by THIS host. Cross-host claims (rare; happens if you migrate the kanban DB between machines) fall through to the normal reclaim path because we can't safely interpret a PID number from a different host.
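
A minimal sketch of the host-local test, under stated assumptions: the claimer ID is taken to have the form `<host>:<pid>` (the PR only shows that the host is the part before the first colon), and `claimer_id` / `is_host_local` are hypothetical stand-ins for the real helpers. Unlike the `lock.startswith(host_prefix)` expression quoted above, this version appends the colon before comparing so that a host named `box1` does not match a lock written by `box10`; that stricter match is a choice made here, not something the PR states.

```python
import os
import socket


def claimer_id() -> str:
    """Hypothetical "<host>:<pid>" identifier; the real _claimer_id() may differ."""
    return f"{socket.gethostname()}:{os.getpid()}"


def is_host_local(lock, this_claimer=None):
    """True when `lock` was written by a claimer on this host."""
    if not lock:
        return False
    # Host is everything before the first colon of our own claimer ID.
    host_prefix = (this_claimer or claimer_id()).split(":", 1)[0]
    # Include the separator so "box1" does not match "box10:...".
    return lock.startswith(host_prefix + ":")
```

Only when this returns True is the PID in the claim meaningful to a local liveness probe.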

Changes

hermes_cli/kanban_db.py::release_stale_claims: add the live-PID extension branch with a CAS-guarded UPDATE, run-row sync, and claim_extended event. The reclaim path's payload is also enriched with claim_expires, last_heartbeat_at, worker_pid, host_local, and now so operators can tell from task_events whether a kill was timing-driven or a genuinely-dead worker.
tests/hermes_cli/test_kanban_db.py: 2 new tests (test_stale_claim_with_live_pid_extends_instead_of_reclaiming + test_stale_claim_reclaim_event_records_diagnostic_payload) and the existing test_stale_claim_reclaimed flipped to _pid_alive=False so it exercises the still-correct dead-PID reclaim path.
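
To illustrate what a CAS-guarded UPDATE means here, the following sketch uses `sqlite3` with a hypothetical `tasks` table; the table name, column names, and the `extend_claim` helper are assumptions for illustration and are not taken from kanban_db.py. The idea is that the UPDATE only matches if the row still holds the lock and expiry the dispatcher read earlier, so a concurrent dispatcher that already changed the claim causes the write to affect zero rows.

```python
import sqlite3


def extend_claim(conn, task_id, lock, old_expires, new_expires):
    """Compare-and-swap: extend only if the claim is unchanged since it was read."""
    cur = conn.execute(
        "UPDATE tasks SET claim_expires = ? "
        "WHERE id = ? AND status = 'running' "
        "AND claim_lock = ? AND claim_expires = ?",
        (new_expires, task_id, lock, old_expires),
    )
    # rowcount is 1 when we won the race, 0 when another writer got there first.
    return cur.rowcount == 1


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tasks (id TEXT PRIMARY KEY, status TEXT, "
        "claim_lock TEXT, claim_expires INTEGER)"
    )
    conn.execute("INSERT INTO tasks VALUES ('t1', 'running', 'box1:42', 100)")
    print(extend_claim(conn, "t1", "box1:42", 100, 1000))  # first attempt wins
    print(extend_claim(conn, "t1", "box1:42", 100, 2000))  # stale read loses
```

A caller would emit the `claim_extended` event only when the helper returns True.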

Validation

| Test scope | Before | After |
| --- | --- | --- |
| `tests/hermes_cli/test_kanban_db.py` (stale-claim/reclaim subset) | 2/2 | 4/4 |
| `tests/hermes_cli/test_kanban_db.py` + `test_kanban_core_functionality.py` + `tests/tools/test_kanban_tools.py` | 288/288 | 290/290 |

Closes #23025 via salvage. Salvage of #23071. Original commit by @konsisumer cherry-picked with authorship preserved (re-attributed during salvage from der@konsi.org to the GitHub-noreply form for release-notes credit). AUTHOR_MAP entry added.

Workers running slow models (e.g. kimi-k2.6) can spend longer than DEFAULT_CLAIM_TTL_SECONDS inside a single tool-free LLM call, making no tool calls and therefore not heartbeating. release_stale_claims previously reclaimed these healthy workers, producing the spawn-then-immediately-reclaim loop reported in #23025. When a stale-by-TTL claim's host-local worker PID is still alive, extend the claim (emit a claim_extended event) rather than killing it. enforce_max_runtime / detect_crashed_workers remain the upper bounds for genuinely wedged or dead workers. Reclaim events now also record claim_expires, last_heartbeat_at, worker_pid, and host_local so operators can see why a worker was killed.

github-actions · 2026-05-10T22:23:59Z

🔎 Lint report: `salvage/pr-23071-extend-live-claim` vs `origin/main`

ruff

Total: 0 on HEAD, 0 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 0 pre-existing issues carried over.

ty (type checker)

Total: 8078 on HEAD, 8075 on base (🆕 +3)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 4255 pre-existing issues carried over.

Diagnostics are surfaced as warnings — this check never fails the build.

`detect_crashed_workers` calls `_pid_alive` on every `running` task whose claim is held by this host. The check can transiently return False for a freshly-spawned worker (fork → /proc-visibility lag, or reap-race between SIGCHLD and parent reaping). When a second dispatcher ticks inside that window it reclaims the task and spawns a duplicate worker.

Add `DEFAULT_CRASH_GRACE_SECONDS = 30` and an `HERMES_KANBAN_CRASH_GRACE_SECONDS` env-var override. `detect_crashed_workers` skips the liveness check when `time.time() - started_at < grace`. The existing 15-minute claim TTL still reclaims genuinely-crashed workers; grace only suppresses the launch-window false positive.

`HERMES_KANBAN_CRASH_GRACE_SECONDS=0` is set on the `kanban_home` fixture in `test_kanban_core_functionality.py` so existing tests that assert immediate reclaim retain pre-fix semantics.

Companion to merged PR NousResearch#23442 (`release_stale_claims`, closes NousResearch#23025), which addressed the same multi-dispatcher race in the stale-claim path. Related: NousResearch#20015 (`_pid_alive` false-negative behaviour), NousResearch#22926 (stale-claim auto-cleanup).
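
The grace-period check described above might look roughly like this. The constant and environment variable names come from the commit message; the helper names `crash_grace_seconds` and `should_check_liveness`, and the fallback to the default on an unparsable value, are assumptions made for this sketch.

```python
import os
import time

DEFAULT_CRASH_GRACE_SECONDS = 30


def crash_grace_seconds(environ=None):
    """Read the grace period, honouring the env-var override."""
    env = os.environ if environ is None else environ
    raw = env.get("HERMES_KANBAN_CRASH_GRACE_SECONDS")
    if raw is None:
        return float(DEFAULT_CRASH_GRACE_SECONDS)
    try:
        return max(0.0, float(raw))  # negative values are clamped to zero
    except ValueError:
        # Assumption: an unparsable override falls back to the default.
        return float(DEFAULT_CRASH_GRACE_SECONDS)


def should_check_liveness(started_at, now=None, environ=None):
    """False while the worker is still inside its launch grace window."""
    now = time.time() if now is None else now
    return (now - started_at) >= crash_grace_seconds(environ)
```

Setting the override to `0` makes `should_check_liveness` return True immediately, which matches the pre-fix behaviour the test fixture relies on.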


konsisumer and others added 2 commits May 10, 2026 15:22

chore: AUTHOR_MAP entry for konsisumer noreply (#23071)

5807f6f

teknium1 merged commit 59d3f24 into main May 10, 2026
12 of 15 checks passed

teknium1 deleted the salvage/pr-23071-extend-live-claim branch May 10, 2026 22:23

teknium1 mentioned this pull request May 10, 2026

fix(kanban): extend stale claim instead of killing live worker #23071

Closed

steveonjava mentioned this pull request May 23, 2026

fix(kanban): add grace period to detect_crashed_workers to prevent multi-dispatcher race #30727

Closed

13 tasks

teknium1 merged 2 commits into main from salvage/pr-23071-extend-live-claim
