
fix(kanban): extend stale claim instead of killing live worker (salvage #23071)#23442

Merged
teknium1 merged 2 commits into main from salvage/pr-23071-extend-live-claim
May 10, 2026


@teknium1


Summary

Stops the kanban dispatcher from killing workers that are healthy but slow. Workers running slow models (kimi-k2.6 was the reported case) can spend longer than the 15-minute DEFAULT_CLAIM_TTL_SECONDS inside a single tool-free LLM call. Because they make no tool calls during that time, they don't heartbeat, so the dispatcher used to mark the claim stale and SIGTERM the worker mid-flight. The respawned worker hit the same trap, producing the empty-workspace stall loop reported in #23025.

How

release_stale_claims() now checks if the worker's host-local PID is alive before reclaiming. If alive: extend the claim by another DEFAULT_CLAIM_TTL_SECONDS and emit a claim_extended event. If dead (or non-host-local): reclaim as before.

Upper bounds are unchanged:

  • enforce_max_runtime still hard-caps task runtime per the max_runtime_seconds column (catches genuinely-stuck-but-PID-alive workers — deadlocks, infinite loops).
  • detect_crashed_workers still reaps workers whose PID has vanished.

The host-local check (lock.startswith(host_prefix) from _claimer_id().split(":", 1)[0]) means we only trust _pid_alive when the lock was set by THIS host. Cross-host claims (rare; happens if you migrate the kanban DB between machines) fall through to the normal reclaim path because we can't safely interpret a PID number from a different host.
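
The decision described above can be sketched roughly as follows. This is a minimal illustration, not the code in hermes_cli/kanban_db.py: the function name decide_stale_claim, its parameters, the "host:pid" lock format, and the pid_alive stand-in for _pid_alive are all assumptions made for the example.

```python
import os

DEFAULT_CLAIM_TTL_SECONDS = 15 * 60  # 15 minutes, as stated in the summary


def pid_alive(pid):
    """Hypothetical stand-in for _pid_alive (POSIX): signal 0 probes without signalling."""
    if pid is None or pid <= 0:
        return False
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def decide_stale_claim(lock, worker_pid, claim_expires, now, claimer_id, is_alive=pid_alive):
    """Return (action, new_expiry) for one claim; action is keep, extend, or reclaim."""
    if claim_expires > now:
        return ("keep", claim_expires)  # not stale yet, nothing to do

    # Assumed lock format "host:pid"; only a lock set by this host has a PID we can trust.
    host_prefix = claimer_id.split(":", 1)[0]
    host_local = bool(lock) and lock.startswith(host_prefix)

    if host_local and is_alive(worker_pid):
        # Live worker on this host: grant another TTL instead of killing it.
        return ("extend", now + DEFAULT_CLAIM_TTL_SECONDS)

    # Dead PID, or a claim from another host: reclaim as before.
    return ("reclaim", None)
```

Passing is_alive as a parameter is only a convenience for testing the branches without real processes; the hard cap from enforce_max_runtime is deliberately outside this sketch.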

Changes

  • hermes_cli/kanban_db.py::release_stale_claims: add the live-PID extension branch with a CAS-guarded UPDATE, run-row sync, and claim_extended event. The reclaim path's payload is also enriched with claim_expires, last_heartbeat_at, worker_pid, host_local, and now so operators can tell from task_events whether a kill was timing-driven or a genuinely-dead worker.
  • tests/hermes_cli/test_kanban_db.py: 2 new tests (test_stale_claim_with_live_pid_extends_instead_of_reclaiming + test_stale_claim_reclaim_event_records_diagnostic_payload) and the existing test_stale_claim_reclaimed flipped to _pid_alive=False so it exercises the still-correct dead-PID reclaim path.
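
For readers unfamiliar with the term, a CAS-guarded UPDATE only writes if the row still holds the values the dispatcher read earlier, so a concurrent dispatcher cannot extend or reclaim the same claim twice. The sketch below shows the general pattern with the standard sqlite3 module. The schema (tasks, claim_lock, status), the extend_claim and make_demo_db helpers, and the event payload fields are assumptions for illustration; the run-row sync mentioned above is omitted.

```python
import json
import sqlite3

DEFAULT_CLAIM_TTL_SECONDS = 15 * 60


def make_demo_db():
    """Build a throwaway in-memory database with an assumed, simplified schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE tasks (
            id INTEGER PRIMARY KEY, status TEXT, claim_lock TEXT, claim_expires INTEGER
        );
        CREATE TABLE task_events (
            task_id INTEGER, kind TEXT, payload TEXT, created_at INTEGER
        );
        """
    )
    conn.execute("INSERT INTO tasks VALUES (1, 'running', 'hostA:123', 900)")
    conn.commit()
    return conn


def extend_claim(conn, task_id, seen_lock, seen_expires, now):
    """Extend the claim only if the row still matches what we read (compare-and-swap)."""
    new_expires = now + DEFAULT_CLAIM_TTL_SECONDS
    cur = conn.execute(
        "UPDATE tasks SET claim_expires = ? "
        "WHERE id = ? AND status = 'running' AND claim_lock = ? AND claim_expires = ?",
        (new_expires, task_id, seen_lock, seen_expires),
    )
    if cur.rowcount != 1:
        conn.rollback()  # another dispatcher changed the row first; do nothing
        return False
    payload = json.dumps({"claim_expires": new_expires, "now": now})
    conn.execute(
        "INSERT INTO task_events (task_id, kind, payload, created_at) VALUES (?, ?, ?, ?)",
        (task_id, "claim_extended", payload, now),
    )
    conn.commit()  # the UPDATE and the event land in the same transaction
    return True
```

Calling extend_claim(make_demo_db(), 1, "hostA:123", 900, 1000) succeeds once; a second call with the same stale seen_expires matches no row and returns False without emitting an event.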

Validation

| Test scope | Before | After |
|---|---|---|
| tests/hermes_cli/test_kanban_db.py (stale-claim/reclaim subset) | 2/2 | 4/4 |
| tests/hermes_cli/test_kanban_db.py + test_kanban_core_functionality.py + tests/tools/test_kanban_tools.py | 288/288 | 290/290 |

Closes #23025 via salvage. Salvage of #23071. Original commit by @konsisumer cherry-picked with authorship preserved (re-attributed during salvage from der@konsi.org to the GitHub-noreply form for release-notes credit). AUTHOR_MAP entry added.

konsisumer and others added 2 commits May 10, 2026 15:22
Workers running slow models (e.g. kimi-k2.6) can spend longer than
DEFAULT_CLAIM_TTL_SECONDS inside a single tool-free LLM call, making
no tool calls and therefore not heartbeating. release_stale_claims
previously reclaimed these healthy workers, producing the
spawn-then-immediately-reclaim loop reported in #23025.

When a stale-by-TTL claim's host-local worker PID is still alive,
extend the claim (emit a claim_extended event) rather than killing
it. enforce_max_runtime / detect_crashed_workers remain the upper
bounds for genuinely wedged or dead workers. Reclaim events now also
record claim_expires, last_heartbeat_at, worker_pid, and host_local
so operators can see why a worker was killed.
@teknium1 teknium1 merged commit 59d3f24 into main May 10, 2026
12 of 15 checks passed
@teknium1 teknium1 deleted the salvage/pr-23071-extend-live-claim branch May 10, 2026 22:23
@github-actions


🔎 Lint report: salvage/pr-23071-extend-live-claim vs origin/main

ruff

Total: 0 on HEAD, 0 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 0 pre-existing issues carried over.

ty (type checker)

Total: 8078 on HEAD, 8075 on base (🆕 +3)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 4255 pre-existing issues carried over.

Diagnostics are surfaced as warnings — this check never fails the build.

steveonjava added a commit to steveonjava/hermes-agent that referenced this pull request May 24, 2026
`detect_crashed_workers` calls `_pid_alive` on every `running` task whose
claim is held by this host. The check can transiently return False for a
freshly-spawned worker (fork → /proc-visibility lag, or reap-race
between SIGCHLD and parent reaping). When a second dispatcher ticks
inside that window it reclaims the task and spawns a duplicate worker.

Add `DEFAULT_CRASH_GRACE_SECONDS = 30` and an
`HERMES_KANBAN_CRASH_GRACE_SECONDS` env-var override.
`detect_crashed_workers` skips the liveness check when
`time.time() - started_at < grace`. The existing 15-minute claim TTL
still reclaims genuinely-crashed workers; grace only suppresses the
launch-window false positive.

`HERMES_KANBAN_CRASH_GRACE_SECONDS=0` is set on the `kanban_home`
fixture in `test_kanban_core_functionality.py` so existing tests that
assert immediate reclaim retain pre-fix semantics.

Companion to merged PR NousResearch#23442 (`release_stale_claims`, closes NousResearch#23025),
which addressed the same multi-dispatcher race in the stale-claim path.
Related: NousResearch#20015 (`_pid_alive` false-negative behaviour),
NousResearch#22926 (stale-claim auto-cleanup).
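
The grace window in the commit message above could look something like the sketch below. Only DEFAULT_CRASH_GRACE_SECONDS, the HERMES_KANBAN_CRASH_GRACE_SECONDS variable, and the `time.time() - started_at < grace` comparison come from the message; the helper names and the fallback to the default on an unparsable value are assumptions.

```python
import os
import time

DEFAULT_CRASH_GRACE_SECONDS = 30


def crash_grace_seconds(environ=os.environ):
    """Read the grace period, honouring the env-var override."""
    raw = environ.get("HERMES_KANBAN_CRASH_GRACE_SECONDS")
    if raw is None:
        return float(DEFAULT_CRASH_GRACE_SECONDS)
    try:
        return max(0.0, float(raw))
    except ValueError:
        # Assumption: an unparsable override falls back to the default.
        return float(DEFAULT_CRASH_GRACE_SECONDS)


def should_check_liveness(started_at, now=None, environ=os.environ):
    """Skip the PID liveness probe while the worker is inside its launch window."""
    now = time.time() if now is None else now
    return (now - started_at) >= crash_grace_seconds(environ)
```

With the override set to 0, as on the `kanban_home` fixture, should_check_liveness is true immediately, which is how existing tests keep their immediate-reclaim semantics.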
kshitijk4poor pushed a commit that referenced this pull request May 27, 2026
mathias3 pushed a commit to mathias3/hermes-agent that referenced this pull request May 28, 2026
Bryce-huang pushed a commit to wbkunlun/hermes-agent that referenced this pull request May 29, 2026
mosaiq-systems pushed a commit to mosaiq-systems/hermes-agent that referenced this pull request May 29, 2026
KKT-OPT pushed a commit to KKT-OPT/hermes-agent that referenced this pull request May 31, 2026
gweeteve pushed a commit to gweeteve/hermes-agent that referenced this pull request Jun 2, 2026
alt-glitch pushed a commit that referenced this pull request Jun 14, 2026

Development

Successfully merging this pull request may close these issues.

recurring stall loop: kanban workers repeatedly reclaimed with stale_lock, zero output across respawns
