# kanban: three gaps in blocked / iteration-exhausted handling that permit infinite "unblock → re-stuck" loops

## Summary

We hit a real task that cycled through `blocked` three times over ~3 hours, each cycle requiring manual `unblock` to resume — and the system never escalated, never tripped `failure_limit`, never auto-failed. After tracing the code paths we found three independent gaps that together make this pattern silent:

1. `_rule_stuck_in_blocked` measures only the age of the **current** `blocked` stint, and any `commented` / `unblocked` event resets the timer. A task that gets re-blocked every few minutes is invisible to it, no matter how many cycles it goes through.
2. `Iteration budget exhausted` maps to `kanban_block` (status=`blocked`), but `_rule_consecutive_failures` explicitly excludes the `blocked` outcome (see `kanban_diagnostics.py` line ~696, where the comment *"Other outcomes (timed_out, blocked, spawn_failed, gave_up)"* marks them as skipped). Budget-exhausted runs therefore never increment `consecutive_failures`, and the `kanban.failure_limit=5` (`DEFAULT_FAILURE_LIMIT`) breaker is bypassed.
3. `release_stale_claims` uses `_pid_alive(worker_pid)` only and ignores the `last_heartbeat_at` it reads from the row (see `kanban_db.py` ~L2384). This is deliberate per issue #23025 (don't kill slow-but-healthy LLMs in long tool-free calls), and the documented backstop is `enforce_max_runtime`. But `enforce_max_runtime` is **opt-in per task** (`max_runtime_seconds` defaults to `NULL`) — a task created without that field has *no upper bound at all* on wall-clock runtime as long as the PID stays alive. We observed a single run hold its claim for **91 minutes** with `last_heartbeat_at` frozen at `t+10min` because the worker entered a logic loop with no tool calls.

## Reproduction (real task, summarized)

| run | duration | terminating event | how it ended |
|---|---|---|---|
| 1 | 26 min | worker called `kanban_block` with `review-required` handoff | status=`blocked` |
| 2 | 11 min | `Iteration budget exhausted (80/80)` → worker emitted `kanban_block` | status=`blocked` |
| 3 | 91 min | worker self-detected logic loop, killed its child PID, emitted `kanban_block` with partial-progress note | status=`blocked` |

Throughout, `consecutive_failures` stayed at 0 (none of the three outcomes counted). `_rule_stuck_in_blocked` never fired because each `unblock` reset its timer well under the 24h default. `release_stale_claims` extended the claim every 15 min during run 3 because `_pid_alive` was true; `last_heartbeat_at` had been stale for over an hour but was only recorded into the event payload, not consulted for the decision.

Net effect: the system has zero automated stop signal for "this task has been bouncing in/out of `blocked` repeatedly" — only a tired human noticing.

## Proposed fixes

### Gap 1 — add a *count-based* sibling to `_rule_stuck_in_blocked`

A new rule, e.g. `_rule_blocked_thrashing`, that fires when `count(events.kind='blocked' for this task) >= N` regardless of recency. Suggested:

```yaml
kanban:
  blocked_count_limit: 3       # warning at 3, error at 5
```

Alternatively, count `blocked` outcomes toward `consecutive_failures` when the reason is a known auto-block (iteration_exhausted, worker-self-stuck-detected, etc.) rather than a `review-required` handoff. See Gap 2.
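
A rough sketch of what the count-based rule could look like. The function name `rule_blocked_thrashing`, the event shape (dicts with a `kind` key), and the separate warning/error thresholds are assumptions for illustration, not the existing `kanban_diagnostics.py` API:

```python
from typing import Iterable, Optional

# Hypothetical thresholds mirroring the "warning at 3, error at 5" suggestion.
BLOCKED_WARN_LIMIT = 3
BLOCKED_ERROR_LIMIT = 5


def rule_blocked_thrashing(
    events: Iterable[dict],
    warn_limit: int = BLOCKED_WARN_LIMIT,
    error_limit: int = BLOCKED_ERROR_LIMIT,
) -> Optional[str]:
    """Return "error", "warning", or None from the lifetime count of
    `blocked` events, ignoring how recently each one happened."""
    blocked_count = sum(1 for event in events if event.get("kind") == "blocked")
    if blocked_count >= error_limit:
        return "error"
    if blocked_count >= warn_limit:
        return "warning"
    return None


# Event history shaped like the reproduction: three block/unblock cycles.
history = [
    {"kind": "blocked"}, {"kind": "unblocked"},
    {"kind": "blocked"}, {"kind": "unblocked"},
    {"kind": "blocked"},
]
severity = rule_blocked_thrashing(history)
```

Because the count ignores timestamps, an `unblock` between cycles cannot reset it, which is the property the age-based rule lacks.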

### Gap 2 — taxonomize `blocked` reasons and feed auto-block outcomes into `consecutive_failures`

Today `kanban_block` is one channel for both:
- **Intentional** review-required handoffs (the worker is healthy and waiting on a human)
- **Defensive** self-reports of failure (budget exhausted, self-detected loop, stuck-too-long)

`_rule_consecutive_failures` shouldn't treat these the same. Suggestion: add a `block_kind` field (`review_required` | `auto_failure`) to the `kanban_block` payload, and have `_record_run_outcome` map `auto_failure` blocks to `consecutive_failures += 1`. The `failure_limit` breaker then catches Gap 2 naturally.

Minimal version: hardcode `"Iteration budget exhausted"` as auto-failure for now; add `block_kind` later.
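
One possible shape for this, sketched under assumptions: `TaskCounters`, `classify_block`, and `record_block_outcome` are hypothetical names (the real `_record_run_outcome` signature is not shown here), and leaving the counter untouched on a `review_required` block is a design choice that would need maintainer input:

```python
from dataclasses import dataclass
from typing import Optional

REVIEW_REQUIRED = "review_required"
AUTO_FAILURE = "auto_failure"

# Minimal version: reasons with these prefixes count as auto-failures
# even when the payload carries no explicit block_kind.
AUTO_FAILURE_REASON_PREFIXES = ("Iteration budget exhausted",)


@dataclass
class TaskCounters:
    """Hypothetical stand-in for the per-task failure bookkeeping."""
    consecutive_failures: int = 0


def classify_block(reason: str, block_kind: Optional[str] = None) -> str:
    """Prefer an explicit block_kind; otherwise fall back to reason matching."""
    if block_kind in (REVIEW_REQUIRED, AUTO_FAILURE):
        return block_kind
    if reason.startswith(AUTO_FAILURE_REASON_PREFIXES):
        return AUTO_FAILURE
    return REVIEW_REQUIRED


def record_block_outcome(
    counters: TaskCounters,
    reason: str,
    block_kind: Optional[str] = None,
    failure_limit: int = 5,
) -> bool:
    """Update counters for a blocked run; return True once the breaker should trip."""
    if classify_block(reason, block_kind) == AUTO_FAILURE:
        counters.consecutive_failures += 1
    # Assumption: a review_required block leaves the counter unchanged.
    return counters.consecutive_failures >= failure_limit


counters = TaskCounters()
record_block_outcome(counters, "review-required handoff")
record_block_outcome(counters, "Iteration budget exhausted (80/80)")
record_block_outcome(counters, "self-detected logic loop", block_kind=AUTO_FAILURE)
```

With this classification, runs 2 and 3 from the reproduction would each have incremented the counter, while run 1 would not.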

### Gap 3 — make `last_heartbeat_at` an upper bound on `claim_extended`

Currently `release_stale_claims` only checks `_pid_alive`. The fix is small: extend the claim only if **both** `_pid_alive(pid)` and `now - last_heartbeat_at < HEARTBEAT_STALE_SECONDS` (suggested default 1800s, configurable). If the heartbeat is older than the threshold, fall through to the normal reclaim path — that branch already handles a "live but unresponsive" worker (SIGTERM/SIGKILL).

This preserves the #23025 design intent (don't kill on TTL-alone for slow LLMs that are mid-call) while still bounding "PID alive but agent looped". As a second-order benefit it makes `enforce_max_runtime` non-essential for stuck-detection — it can remain an explicit per-task SLA cap.
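
The gating decision could be sketched as a pure function. `should_extend_claim` is a hypothetical helper, timestamps are assumed to be epoch seconds, and falling back to the PID-only check when no heartbeat has been recorded is an assumption rather than something the current `release_stale_claims` code is known to do:

```python
import time
from typing import Optional

HEARTBEAT_STALE_SECONDS = 1800  # suggested default; would be configurable


def should_extend_claim(
    pid_alive: bool,
    last_heartbeat_at: Optional[float],
    now: Optional[float] = None,
    stale_after: float = HEARTBEAT_STALE_SECONDS,
) -> bool:
    """Extend the claim only if the worker PID is alive AND its heartbeat is fresh."""
    if not pid_alive:
        return False
    if last_heartbeat_at is None:
        # Assumption: no heartbeat recorded yet, so keep today's PID-only behavior.
        return True
    if now is None:
        now = time.time()
    return (now - last_heartbeat_at) < stale_after


# Run 3 from the reproduction, in seconds since the run started:
# heartbeat frozen at t+10min, sweeps every 15 min.
frozen_heartbeat = 10 * 60
extended_at_30min = should_extend_claim(True, frozen_heartbeat, now=30 * 60)
extended_at_45min = should_extend_claim(True, frozen_heartbeat, now=45 * 60)
```

Under these assumptions the claim would still be extended at the 30-minute sweep (heartbeat 20 minutes old) but not at the 45-minute sweep (35 minutes old), so run 3 would have fallen through to the reclaim path well before the 91-minute mark.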

## Affected files

- `hermes_cli/kanban_diagnostics.py` — `_rule_stuck_in_blocked` (count-based sibling), `_rule_consecutive_failures` (treat auto-block as failure)
- `hermes_cli/kanban_db.py` — `release_stale_claims` (~L2384, gate `claim_extended` on heartbeat freshness), `_record_run_outcome` (block_kind plumbing)
- `gateway/run.py` — `iteration_budget` exhaustion path (emit `block_kind=auto_failure`)

## Why we're filing this

In our case the workaround was "human notices and takes over manually". For users running unattended kanban swarms (which is the design intent per the README), this pattern silently burns budget and stalls dependents. The three gaps compound — fixing any one of them would have stopped our scenario, but the combination is what makes it invisible.

Happy to send a PR for Gap 3 (smallest, most contained) if the design direction looks right.

