kanban: three gaps in blocked / iteration-exhausted handling that permit infinite "unblock → re-stuck" loops
Summary
We hit a real task that cycled through blocked three times over ~3 hours, each cycle requiring manual unblock to resume — and the system never escalated, never tripped failure_limit, never auto-failed. After tracing the code paths we found three independent gaps that together make this pattern silent:
_rule_stuck_in_blocked only counts single-blocked age, and any commented / unblocked event resets the timer → a task that gets re-blocked every few minutes is invisible to it, regardless of how many cycles.
Iteration budget exhausted maps to kanban_block (status=blocked), but _rule_consecutive_failures explicitly excludes blocked outcome (see kanban_diagnostics.py line ~696: "Other outcomes (timed_out, blocked, spawn_failed, gave_up)" — they're skipped). So budget-exhausted runs never increment consecutive_failures and the kanban.failure_limit=5 (DEFAULT_FAILURE_LIMIT) breaker is bypassed.
release_stale_claims uses _pid_alive(worker_pid) only and ignores the last_heartbeat_at it reads from the row (see kanban_db.py ~L2384). This is deliberate per issue #23025, "recurring stall loop: kanban workers repeatedly reclaimed with stale_lock, zero output across respawns" (don't kill slow-but-healthy LLMs in long tool-free calls), and the documented backstop is enforce_max_runtime. But enforce_max_runtime is opt-in per task (max_runtime_seconds defaults to NULL), so a task created without that field has no upper bound at all on wall-clock runtime as long as the PID stays alive. We observed a single run hold its claim for 91 minutes with last_heartbeat_at frozen at t+10min because the worker entered a logic loop with no tool calls.
Reproduction (real task, summarized)
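| run | duration | terminating event | how it ended |
|---|---|---|---|
| 1 | 26 min | worker called kanban_block with review-required handoff | status=blocked |
| 2 | | Iteration budget exhausted (80/80) → worker emitted kanban_block | status=blocked |
| 3 | | worker self-detected logic loop, killed its child PID, emitted kanban_block with partial-progress note | status=blocked |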
Throughout, consecutive_failures stayed at 0 (none of the three outcomes counted). _rule_stuck_in_blocked never fired because each unblock reset its timer well under the 24h default. release_stale_claims extended the claim every 15 min during run 3 because _pid_alive was true; last_heartbeat_at had been stale for over an hour but was only recorded into the event payload, not consulted for the decision.
Net effect: the system has zero automated stop signal for "this task has been bouncing in/out of blocked repeatedly" — only a tired human noticing.
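To make the blind spot concrete, here is a minimal sketch of an age-based check as described above. It is not the real _rule_stuck_in_blocked: the function names, the event tuples, and the timeline are all hypothetical (the timestamps are illustrative, not the real task's), and the only details taken from this issue are the 24h default and the fact that blocked / unblocked / commented events reset the timer.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

STUCK_IN_BLOCKED_THRESHOLD = timedelta(hours=24)  # default quoted in this issue
TIMER_RESET_KINDS = {"blocked", "unblocked", "commented"}

Event = Tuple[datetime, str]  # (timestamp, kind), assumed sorted by timestamp


def stuck_in_blocked_by_age(
    events: List[Event],
    now: datetime,
    threshold: timedelta = STUCK_IN_BLOCKED_THRESHOLD,
) -> bool:
    """Hypothetical age-based rule: fire only if the task is currently
    blocked and its most recent timer-resetting event is older than threshold."""
    status_kinds = [kind for _, kind in events if kind in ("blocked", "unblocked")]
    if not status_kinds or status_kinds[-1] != "blocked":
        return False  # task is not currently blocked
    last_reset = max(ts for ts, kind in events if kind in TIMER_RESET_KINDS)
    return now - last_reset >= threshold


def blocked_count(events: List[Event]) -> int:
    """Total number of times the task entered blocked, ignoring recency."""
    return sum(1 for _, kind in events if kind == "blocked")


# Illustrative timeline: three blocked cycles inside roughly three hours.
start = datetime(2024, 1, 1, 9, 0)
bouncing = [
    (start + timedelta(minutes=26), "blocked"),
    (start + timedelta(minutes=40), "unblocked"),
    (start + timedelta(minutes=80), "blocked"),
    (start + timedelta(minutes=90), "unblocked"),
    (start + timedelta(minutes=181), "blocked"),
]
```

With this timeline the age-based check stays silent even though the task has been blocked three times, which is the pattern a count-based signal would pick up.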
Proposed fixes
Gap 1 — add a count-based sibling to _rule_stuck_in_blocked
A new rule, e.g. _rule_blocked_thrashing, that fires when count(events.kind='blocked' for this task) >= N regardless of recency. Suggested:
kanban:
  blocked_count_limit: 3  # warning at 3, error at 5
Or alternatively, count blocked outcomes into consecutive_failures when the reason is a known auto-block (iteration_exhausted, worker-self-stuck-detected, etc.) rather than a review-required handoff. See Gap 2.
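A count-based rule along these lines might look like the sketch below. The Event dataclass, the function signature, and the return values are assumptions made for illustration; the real diagnostics rules in kanban_diagnostics.py may take different arguments and return a different result type. Only the thresholds (warning at 3, error at 5) come from the suggestion above.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Thresholds mirroring the "warning at 3, error at 5" suggestion.
BLOCKED_COUNT_WARNING = 3
BLOCKED_COUNT_ERROR = 5


@dataclass(frozen=True)
class Event:
    """Simplified, hypothetical stand-in for a row in the task event log."""
    task_id: str
    kind: str  # e.g. "blocked", "unblocked", "commented"


def rule_blocked_thrashing(
    task_id: str,
    events: Sequence[Event],
    warning_at: int = BLOCKED_COUNT_WARNING,
    error_at: int = BLOCKED_COUNT_ERROR,
) -> Optional[str]:
    """Return "warning", "error", or None based on how often the task was blocked.

    Unlike an age-based check, intervening "unblocked" or "commented"
    events reset nothing: only the total number of "blocked" events counts.
    """
    blocked_count = sum(
        1 for e in events if e.task_id == task_id and e.kind == "blocked"
    )
    if blocked_count >= error_at:
        return "error"
    if blocked_count >= warning_at:
        return "warning"
    return None
```

For example, a task with three blocked / unblocked cycles would return "warning" here no matter how recently it was last unblocked.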
Gap 2 — taxonomize blocked reasons and feed auto-block outcomes into consecutive_failures
Today kanban_block is one channel for both:
Intentional review-required handoffs (the worker is healthy and waiting on a human)
Defensive self-reports of failure (budget exhausted, self-detected loop, stuck-too-long)
_rule_consecutive_failures shouldn't treat these the same. Suggestion: add a block_kind field (review_required | auto_failure) to the kanban_block payload, and have _record_run_outcome map auto_failure blocks to consecutive_failures += 1. The failure_limit breaker then catches Gap 2 naturally.
Minimal version: hardcode "Iteration budget exhausted" as auto-failure for now; add block_kind later.
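One way the block_kind plumbing could be sketched is shown below. Everything here is hypothetical: BlockKind, TaskState, classify_block, and record_blocked_outcome are illustrative names, not the existing _record_run_outcome signature, and moving the task to "failed" once the limit is reached is an assumption about what the breaker does. The sketch also leaves consecutive_failures untouched on a review-required block; whether it should reset is a design question this issue does not settle.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

DEFAULT_FAILURE_LIMIT = 5  # value quoted in this issue


class BlockKind(str, Enum):
    REVIEW_REQUIRED = "review_required"
    AUTO_FAILURE = "auto_failure"


@dataclass
class TaskState:
    """Hypothetical, minimal view of a task row."""
    consecutive_failures: int = 0
    status: str = "running"


def classify_block(reason: str, block_kind: Optional[str] = None) -> BlockKind:
    """Prefer an explicit block_kind; otherwise fall back to the
    'minimal version' string match on the block reason."""
    if block_kind is not None:
        return BlockKind(block_kind)
    if "Iteration budget exhausted" in reason:
        return BlockKind.AUTO_FAILURE
    return BlockKind.REVIEW_REQUIRED


def record_blocked_outcome(
    task: TaskState,
    reason: str,
    block_kind: Optional[str] = None,
    failure_limit: int = DEFAULT_FAILURE_LIMIT,
) -> TaskState:
    """Count auto-failure blocks toward the breaker; leave handoffs alone."""
    if classify_block(reason, block_kind) is BlockKind.AUTO_FAILURE:
        task.consecutive_failures += 1
        if task.consecutive_failures >= failure_limit:
            task.status = "failed"  # assumption: the breaker auto-fails the task
            return task
    task.status = "blocked"
    return task
```

Under these assumptions, five budget-exhausted runs in a row would trip the breaker, while any number of review-required handoffs would not.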
Gap 3 — make last_heartbeat_at an upper bound on claim_extended
Currently release_stale_claims only checks _pid_alive. The fix is small: extend the claim only if both _pid_alive(pid) and now - last_heartbeat_at < HEARTBEAT_STALE_SECONDS hold (suggested default 1800s, configurable). If the heartbeat is older than the threshold, fall through to the normal reclaim path; that branch already handles a "live but unresponsive" worker (SIGTERM/SIGKILL).
This preserves the #23025 design intent (don't kill on TTL-alone for slow LLMs that are mid-call) while still bounding "PID alive but agent looped". As a second-order benefit it makes enforce_max_runtime non-essential for stuck-detection — it can remain an explicit per-task SLA cap.
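The gating condition could be sketched as a small pure function, shown below. should_extend_claim is a hypothetical helper, not existing code; the liveness probe is passed in as a callable so the sketch does not depend on how _pid_alive is actually implemented, and timestamps are assumed to be epoch seconds.

```python
import time
from typing import Callable, Optional

HEARTBEAT_STALE_SECONDS = 1800  # suggested default from this proposal


def should_extend_claim(
    worker_pid: int,
    last_heartbeat_at: float,
    pid_alive: Callable[[int], bool],
    now: Optional[float] = None,
    stale_after: float = HEARTBEAT_STALE_SECONDS,
) -> bool:
    """Extend only when the PID is alive AND the heartbeat is fresh.

    Returning False means the caller falls through to its normal
    reclaim path instead of extending the claim.
    """
    if not pid_alive(worker_pid):
        return False
    current = time.time() if now is None else now
    return (current - last_heartbeat_at) < stale_after
```

With the 1800s default, a worker whose heartbeat is 10 minutes old keeps its claim, while one whose heartbeat froze 81 minutes ago (as in the 91-minute run above) would be reclaimed even though its PID is still alive.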
Affected files
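hermes_cli/kanban_diagnostics.py: _rule_stuck_in_blocked (count-based sibling), _rule_consecutive_failures (treat auto-block as failure)
hermes_cli/kanban_db.py: release_stale_claims (~L2384, gate claim_extended on heartbeat freshness), _record_run_outcome (block_kind plumbing)
gateway/run.py: iteration_budget exhaustion path (emit block_kind=auto_failure)
Why we're filing this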
In our case the workaround was "human notices and takes over manually". For users running unattended kanban swarms (which is the design intent per the README), this pattern silently burns budget and stalls dependents. The three gaps compound — fixing any one of them would have stopped our scenario, but the combination is what makes it invisible.
Happy to send a PR for Gap 3 (smallest, most contained) if the design direction looks right.