
kanban: three gaps in blocked / iteration-exhausted handling that permit infinite "unblock → re-stuck" loops #29747

@baofuen


Summary

We hit a real task that cycled through blocked three times over ~3 hours, each cycle requiring a manual unblock to resume. The system never escalated, never tripped failure_limit, and never auto-failed. After tracing the code paths, we found three independent gaps that together make this pattern silent:

  1. _rule_stuck_in_blocked only measures the age of a single blocked spell, and any commented / unblocked event resets the timer. A task that gets re-blocked every few minutes is therefore invisible to it, no matter how many cycles it goes through.
  2. An exhausted iteration budget maps to kanban_block (status=blocked), but _rule_consecutive_failures explicitly excludes the blocked outcome (see kanban_diagnostics.py line ~696: "Other outcomes (timed_out, blocked, spawn_failed, gave_up)"; these are skipped). Budget-exhausted runs therefore never increment consecutive_failures, and the kanban.failure_limit=5 (DEFAULT_FAILURE_LIMIT) breaker is bypassed.
  3. release_stale_claims relies on _pid_alive(worker_pid) alone and ignores the last_heartbeat_at it reads from the row (see kanban_db.py ~L2384). This is deliberate per issue #23025 ("recurring stall loop: kanban workers repeatedly reclaimed with stale_lock, zero output across respawns"): the goal is not to kill slow-but-healthy LLMs in long tool-free calls, and the documented backstop is enforce_max_runtime. But enforce_max_runtime is opt-in per task (max_runtime_seconds defaults to NULL), so a task created without that field has no upper bound on wall-clock runtime as long as the PID stays alive. We observed a single run hold its claim for 91 minutes with last_heartbeat_at frozen at t+10min, because the worker had entered a logic loop with no tool calls.

Reproduction (real task, summarized)

| run | duration | terminating event | how it ended |
|-----|----------|-------------------|--------------|
| 1 | 26 min | worker called kanban_block with review-required handoff | status=blocked |
| 2 | 11 min | Iteration budget exhausted (80/80) → worker emitted kanban_block | status=blocked |
| 3 | 91 min | worker self-detected logic loop, killed its child PID, emitted kanban_block with partial-progress note | status=blocked |

Throughout, consecutive_failures stayed at 0 (none of the three outcomes counted). _rule_stuck_in_blocked never fired because each unblock reset its timer well under the 24h default. release_stale_claims extended the claim every 15 min during run 3 because _pid_alive was true; last_heartbeat_at had been stale for over an hour but was only recorded into the event payload, not consulted for the decision.

Net effect: the system has zero automated stop signal for "this task has been bouncing in/out of blocked repeatedly" — only a tired human noticing.

Proposed fixes

Gap 1 — add a count-based sibling to _rule_stuck_in_blocked

A new rule, e.g. _rule_blocked_thrashing, that fires when count(events.kind='blocked' for this task) >= N regardless of recency. Suggested:

kanban:
  blocked_count_limit: 3       # warning at 3, error at 5

Alternatively, count blocked outcomes into consecutive_failures when the reason is a known auto-block (iteration_exhausted, worker-self-stuck-detected, etc.) rather than a review-required handoff. See Gap 2.
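A minimal sketch of the count-based rule, to make the proposal concrete. It is not taken from kanban_diagnostics.py: the `Event` type, the function signature, and the string return values are hypothetical stand-ins, and the 3 / 5 thresholds come from the config comment above.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Event:
    """Hypothetical stand-in for a row in the task event log."""
    task_id: str
    kind: str  # e.g. "blocked", "unblocked", "commented"


def rule_blocked_thrashing(
    events: Iterable[Event],
    task_id: str,
    warn_at: int = 3,
    error_at: int = 5,
) -> Optional[str]:
    """Return a severity based on the lifetime number of 'blocked' events.

    Unlike an age-based rule, intervening 'unblocked' or 'commented'
    events do not reset anything: only the total count matters.
    """
    blocked = sum(
        1 for e in events if e.task_id == task_id and e.kind == "blocked"
    )
    if blocked >= error_at:
        return "error"
    if blocked >= warn_at:
        return "warning"
    return None


# The three-cycle scenario from the reproduction: blocked, unblocked, repeat.
history = [
    Event("t1", "blocked"), Event("t1", "unblocked"),
    Event("t1", "blocked"), Event("t1", "unblocked"),
    Event("t1", "blocked"),
]
print(rule_blocked_thrashing(history, "t1"))
```

Because the count never resets, this would have flagged the task at its third block, independent of how quickly each unblock followed.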

Gap 2 — taxonomize blocked reasons and feed auto-block outcomes into consecutive_failures

Today kanban_block is one channel for both:

  • Intentional review-required handoffs (the worker is healthy and waiting on a human)
  • Defensive self-reports of failure (budget exhausted, self-detected loop, stuck-too-long)

_rule_consecutive_failures shouldn't treat these the same. Suggestion: add a block_kind field (review_required | auto_failure) to the kanban_block payload, and have _record_run_outcome map auto_failure blocks to consecutive_failures += 1. The failure_limit breaker then catches Gap 2 naturally.

Minimal version: hardcode "Iteration budget exhausted" as auto-failure for now; add block_kind later.
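The classification could be sketched as below. This is an illustration under assumptions, not the existing _record_run_outcome: the payload shape (a dict with optional `block_kind` and `reason` keys) and the helper names are hypothetical. The substring fallback corresponds to the minimal version, and the explicit field to the later one.

```python
from enum import Enum


class BlockKind(str, Enum):
    REVIEW_REQUIRED = "review_required"
    AUTO_FAILURE = "auto_failure"


# Minimal version: reasons hardcoded as auto-failures until block_kind exists.
AUTO_FAILURE_MARKERS = ("Iteration budget exhausted",)


def classify_block(payload: dict) -> BlockKind:
    """Prefer an explicit block_kind; otherwise fall back to reason matching."""
    explicit = payload.get("block_kind")
    if explicit is not None:
        return BlockKind(explicit)  # raises ValueError on an unknown kind
    reason = payload.get("reason", "")  # 'reason' key is an assumption
    if any(marker in reason for marker in AUTO_FAILURE_MARKERS):
        return BlockKind.AUTO_FAILURE
    return BlockKind.REVIEW_REQUIRED


def next_consecutive_failures(current: int, payload: dict) -> int:
    """Increment the counter only for auto-failure blocks.

    Assumption: a review-required handoff leaves the counter unchanged.
    Whether it should instead reset the counter is a design decision.
    """
    if classify_block(payload) is BlockKind.AUTO_FAILURE:
        return current + 1
    return current


print(next_consecutive_failures(0, {"reason": "Iteration budget exhausted (80/80)"}))
print(next_consecutive_failures(0, {"reason": "needs human review"}))
```

Defaulting unknown reasons to review_required keeps today's behaviour for every block that is not positively identified as a failure.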

Gap 3 — make last_heartbeat_at an upper bound on claim_extended

Currently release_stale_claims only checks _pid_alive. The fix is small: extend the claim only if both _pid_alive(pid) and now - last_heartbeat_at < HEARTBEAT_STALE_SECONDS (suggested default 1800s, configurable). If the heartbeat is older than the threshold, fall through to the normal reclaim path — that branch already handles a "live but unresponsive" worker (SIGTERM/SIGKILL).

This preserves the #23025 design intent (don't kill on TTL-alone for slow LLMs that are mid-call) while still bounding "PID alive but agent looped". As a second-order benefit it makes enforce_max_runtime non-essential for stuck-detection — it can remain an explicit per-task SLA cap.
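The gating condition could look roughly like the sketch below. It is a standalone predicate, not the actual release_stale_claims code: the function name, the use of epoch seconds for last_heartbeat_at, and the handling of a missing heartbeat are all assumptions.

```python
import time
from typing import Optional

HEARTBEAT_STALE_SECONDS = 1800  # suggested default from this issue


def should_extend_claim(
    pid_alive: bool,
    last_heartbeat_at: Optional[float],
    now: Optional[float] = None,
    stale_after: float = HEARTBEAT_STALE_SECONDS,
) -> bool:
    """Extend the claim only if the PID is alive AND the heartbeat is fresh.

    Timestamps are epoch seconds (an assumption about the stored format).
    Returning False means: fall through to the normal reclaim path.
    """
    if not pid_alive:
        return False
    if last_heartbeat_at is None:
        # Assumption: with no heartbeat recorded yet, keep today's
        # PID-only behaviour rather than reclaiming a fresh worker.
        return True
    if now is None:
        now = time.time()
    return (now - last_heartbeat_at) < stale_after


# Run 3 from the reproduction: heartbeat frozen at t+10 min (600 s).
print(should_extend_claim(True, 600.0, now=1800.0))  # check at t+30 min
print(should_extend_claim(True, 600.0, now=2700.0))  # check at t+45 min
```

Assuming checks land at t+15, t+30, and t+45 min, the frozen heartbeat from run 3 would be 35 minutes old at the t+45 check, so the claim would stop being extended there instead of surviving to 91 minutes.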

Affected files

  • hermes_cli/kanban_diagnostics.py: _rule_stuck_in_blocked (count-based sibling), _rule_consecutive_failures (treat auto-block as failure)
  • hermes_cli/kanban_db.py: release_stale_claims (~L2384, gate claim_extended on heartbeat freshness), _record_run_outcome (block_kind plumbing)
  • gateway/run.py: iteration_budget exhaustion path (emit block_kind=auto_failure)

Why we're filing this

In our case the workaround was "human notices and takes over manually". For users running unattended kanban swarms (which is the design intent per the README), this pattern silently burns budget and stalls dependents. The three gaps compound — fixing any one of them would have stopped our scenario, but the combination is what makes it invisible.

Happy to send a PR for Gap 3 (smallest, most contained) if the design direction looks right.

Labels: P3 (Low: cosmetic, nice to have), comp/tools (Tool registry, model_tools, toolsets), type/bug (Something isn't working)
