fix(kanban): extend stale claim instead of killing live worker#23071

Closed
konsisumer wants to merge 1 commit into
NousResearch:mainfrom
konsisumer:fix/kanban-skip-stale-reclaim-when-pid-alive

@konsisumer

Contributor

Stop reclaiming kanban tasks whose worker subprocess is still alive (#23025).

What changed and why

  • release_stale_claims now skips the reclaim when the host-local worker_pid is still alive; instead it extends claim_expires by DEFAULT_CLAIM_TTL_SECONDS and emits a new claim_extended event. Slow models (kimi-k2.6 in the report) can spend longer than the 15-min TTL inside a single tool-free LLM call, so kanban_heartbeat never fires. The previous behavior killed those healthy workers and respawned new ones that hit the same trap, producing the empty-workspace stall loop the reporter described.
  • enforce_max_runtime and detect_crashed_workers remain the upper bounds for genuinely wedged or dead workers — neither is touched here.
  • reclaimed events now carry claim_expires, last_heartbeat_at, worker_pid, host_local, and a now timestamp, so operators can tell at a glance whether a kill was driven by timing or by a worker that had genuinely gone away.
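
The decision described in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the Claim dataclass, the pid_alive and decide helpers, the 900-second TTL value (taken from the 15-min figure above), and the choice to extend from the current time rather than from the old expiry are all assumptions. Only the field and event names (claim_expires, last_heartbeat_at, worker_pid, host_local, now, claim_extended, reclaimed) come from the description.

```python
import os
from dataclasses import dataclass
from typing import Optional

# Assumed value, matching the 15-min TTL mentioned in the description.
DEFAULT_CLAIM_TTL_SECONDS = 15 * 60


@dataclass
class Claim:
    """Hypothetical in-memory stand-in for a claimed kanban task."""
    claim_expires: float
    worker_pid: Optional[int]
    host_local: bool
    last_heartbeat_at: Optional[float] = None


def pid_alive(pid: int) -> bool:
    """Return True if a process with this PID exists on this host (POSIX only)."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 checks for existence without delivering a signal
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def decide(claim: Claim, now: float) -> tuple[str, dict]:
    """Return (event_name, payload) for one claim; extends the claim in place."""
    if claim.claim_expires > now:
        return "none", {}  # claim has not expired yet, nothing to do

    if claim.host_local and claim.worker_pid is not None and pid_alive(claim.worker_pid):
        # Worker is slow but alive: extend the claim instead of killing it.
        # Extending from `now` is an assumption of this sketch.
        claim.claim_expires = now + DEFAULT_CLAIM_TTL_SECONDS
        return "claim_extended", {
            "worker_pid": claim.worker_pid,
            "claim_expires": claim.claim_expires,
        }

    # Dead or non-local worker: reclaim, recording the diagnostic fields.
    return "reclaimed", {
        "claim_expires": claim.claim_expires,
        "last_heartbeat_at": claim.last_heartbeat_at,
        "worker_pid": claim.worker_pid,
        "host_local": claim.host_local,
        "now": now,
    }
```

In this sketch a claim that is not host-local always falls through to the reclaim branch, since a PID from another host cannot be checked locally; whether the real implementation treats that case the same way is not stated here.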

How to test

  • pytest tests/hermes_cli/test_kanban_db.py tests/tools/test_kanban_tools.py tests/stress/test_concurrency_reclaim_race.py -q --timeout=60 (112 passed locally).
  • New tests: test_stale_claim_with_live_pid_extends_instead_of_reclaiming (live PID → claim extended, no SIGTERM, claim_extended event emitted) and test_stale_claim_reclaim_event_records_diagnostic_payload (dead PID → reclaim event records expiry + heartbeat).
  • Existing test_stale_claim_reclaimed updated to simulate a dead PID, exercising the path that should still kill + reclaim.
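
One way to obtain the live and dead PIDs that tests like these need, sketched without the project's fixtures: the current process serves as a live PID, and a child that has already exited and been reaped serves as a dead one. The helper names below are hypothetical, and the actual tests may simulate PIDs differently (for example by monkeypatching the liveness check).

```python
import os
import subprocess
import sys


def live_pid() -> int:
    """PID of the running test process, which is alive by definition."""
    return os.getpid()


def dead_pid() -> int:
    """PID of a child that has exited and been reaped.

    The OS could in principle reuse the number for a new process,
    but that is very unlikely within the lifetime of a short test.
    """
    proc = subprocess.Popen([sys.executable, "-c", "pass"])
    proc.wait()  # reap the child so the PID no longer names a process
    return proc.pid


def pid_alive(pid: int) -> bool:
    """POSIX-only existence check using signal 0 (hypothetical helper)."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, but owned by another user
    return True
```

Using a real reaped child avoids hard-coding a PID that might happen to exist on the test machine.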

What platforms tested on

  • macOS on darwin-arm64 (local)

Fixes #23025

Workers running slow models (e.g. kimi-k2.6) can spend longer than
DEFAULT_CLAIM_TTL_SECONDS inside a single tool-free LLM call, making
no tool calls and therefore not heartbeating. release_stale_claims
previously reclaimed these healthy workers, producing the
spawn-then-immediately-reclaim loop reported in NousResearch#23025.

When a stale-by-TTL claim's host-local worker PID is still alive,
extend the claim (emit a claim_extended event) rather than killing
it. enforce_max_runtime / detect_crashed_workers remain the upper
bounds for genuinely wedged or dead workers. Reclaim events now also
record claim_expires, last_heartbeat_at, worker_pid, and host_local
so operators can see why a worker was killed.
@teknium1

Contributor

Salvage merged via PR #23442 (rebase) — your fix shipped on main with your authorship preserved (re-attributed during salvage from der@konsi.org to the GitHub-noreply form so release notes credit your account). AUTHOR_MAP entry added.

Clean fix shape — host-local + alive-PID gate before reclaim, deferring to enforce_max_runtime and detect_crashed_workers as the upper bounds for genuinely-wedged or dead workers. The enriched reclaimed payload (claim_expires, last_heartbeat_at, worker_pid, host_local, now) is a nice operator-debugging touch.

Thanks @konsisumer!
#23442

@teknium1 teknium1 closed this May 10, 2026
JZKK720 pushed a commit to JZKK720/hermes-agent that referenced this pull request May 11, 2026
rmulligan pushed a commit to rmulligan/hermes-agent that referenced this pull request May 11, 2026
JinyuID pushed a commit to JinyuID/hermes-agent that referenced this pull request May 11, 2026
02356abc pushed a commit to 02356abc/hermes-agent that referenced this pull request May 14, 2026
jsboige pushed a commit to jsboige/hermes-agent that referenced this pull request May 14, 2026
AlexFoxD pushed a commit to AlexFoxD/hermes-agent that referenced this pull request May 21, 2026
gweeteve pushed a commit to gweeteve/hermes-agent that referenced this pull request Jun 2, 2026
Seven74AI pushed a commit to Seven74AI/hermes-agent that referenced this pull request Jun 13, 2026

Labels

comp/plugins: Plugin system and bundled plugins
P3: Low - cosmetic, nice to have
type/bug: Something isn't working

Development

Successfully merging this pull request may close these issues.

recurring stall loop: kanban workers repeatedly reclaimed with stale_lock, zero output across respawns

3 participants