fix(kanban): close three blocked/iteration-exhausted handling gaps (#29747) #34428

Merged
teknium1 merged 3 commits into main from fix/kanban-restuck-loops-29747 on May 29, 2026
Conversation

@teknium1 teknium1 commented May 29, 2026

Contributor

Summary

Reporter diagnosed three independent gaps that together allow infinite "unblock → re-stuck" loops with no surfacing or escalation. All three are real on current main; this PR closes all three.

Changes

Gap 1 — _rule_stuck_in_blocked resets timer on any commented/unblocked

  • hermes_cli/kanban_diagnostics.py: new complementary rule, _rule_block_unblock_cycling. It counts block→unblock cycles in a sliding window (default: 3 cycles within 24h, configurable via block_cycle_threshold / block_cycle_window_seconds). It walks events in arrival order (event id), since multiple events can share the same created_at second. Fires as a warning with a CLI hint to inspect the block reasons.
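The sliding-window count described for Gap 1 can be sketched roughly as follows. This is a minimal illustration, not the code in hermes_cli/kanban_diagnostics.py: the `Event` shape, the `"blocked"` event kind, the helper names, and the "fire at threshold or above" comparison are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Event:
    # Hypothetical event shape; the real schema is not shown in this PR.
    id: int          # arrival order, strictly increasing
    kind: str        # e.g. "blocked", "unblocked", "commented"
    created_at: int  # epoch seconds; several events may share one value


def count_block_unblock_cycles(
    events: Iterable[Event], now: int, window_seconds: int = 24 * 3600
) -> int:
    """Count blocked -> unblocked transitions inside the sliding window."""
    cutoff = now - window_seconds
    cycles = 0
    blocked = False
    # Sort by id rather than created_at: timestamps have one-second
    # resolution, so a block and its unblock can carry the same value.
    for ev in sorted(events, key=lambda e: e.id):
        if ev.created_at < cutoff:
            continue  # outside the window; in this sketch it is ignored
        if ev.kind == "blocked":
            blocked = True
        elif ev.kind == "unblocked" and blocked:
            cycles += 1
            blocked = False
    return cycles


def is_cycling(
    events: Iterable[Event],
    now: int,
    threshold: int = 3,
    window_seconds: int = 24 * 3600,
) -> bool:
    """Assumed semantics: warn once the count reaches the threshold."""
    return count_block_unblock_cycles(events, now, window_seconds) >= threshold
```

Unlike a timer that resets on every unblock, this count only grows as a task keeps cycling, which is why the two rules complement each other.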

Gap 2 — Budget-exhausted runs bypass failure_limit circuit breaker

  • agent/conversation_loop.py: budget-exhaustion path now calls _record_task_failure(outcome="timed_out") instead of kanban_block. Budget exhaustion is genuinely timeout-shaped (the task ran out of allowed iterations), and the call routes through the unified failure counter, so repeated exhaustions trip the breaker and the task auto-blocks with gave_up after failure_limit retries.
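To make the Gap 2 mechanics concrete, here is a small sketch of a failure counter that only counts certain outcomes. It is an assumption-laden model, not the project's `_record_task_failure`: the `Task` fields, the function signature, and the `"ready"` retry status are hypothetical, while the counted outcomes (crashed, timed_out, spawn_failed) follow the commit message.

```python
from dataclasses import dataclass

# Outcomes that increment the counter, per the commit message.
COUNTED_OUTCOMES = {"crashed", "timed_out", "spawn_failed"}


@dataclass
class Task:
    # Hypothetical task record for illustration only.
    status: str = "ready"
    consecutive_failures: int = 0
    gave_up: bool = False


def record_task_failure(task: Task, outcome: str, failure_limit: int = 2) -> bool:
    """Return True when the circuit breaker trips on this call."""
    if outcome not in COUNTED_OUTCOMES:
        # A "blocked" outcome is a clean exit: the counter is untouched,
        # which is how budget-exhaustion loops used to slip past the breaker.
        return False
    task.consecutive_failures += 1
    if task.consecutive_failures >= failure_limit:
        task.status = "blocked"
        task.gave_up = True
        return True
    task.status = "ready"  # assumed: the task stays eligible for a retry
    return False
```

With `failure_limit=2`, two calls with `outcome="blocked"` leave `consecutive_failures` at 0, whereas two calls with `outcome="timed_out"` trip the breaker on the second call.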

Gap 3 — release_stale_claims ignores last_heartbeat_at

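  • hermes_cli/kanban_db.py: adds a heartbeat-stale backstop. If last_heartbeat_at is set AND older than DEFAULT_CLAIM_HEARTBEAT_MAX_STALE_SECONDS (default 1h), the claim is reclaimed even if the PID is alive. A NULL last_heartbeat_at preserves backward compatibility (no heartbeat yet = extend, as before). The reclaim event payload now includes a heartbeat_stale boolean so operators can see why a live-PID worker was reclaimed.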

Pairs cleanly with PR #34418 (#31752 runtime → heartbeat bridge): once _touch_activity keeps heartbeats fresh via normal API traffic, this backstop only fires for genuinely wedged workers.
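The reclaim decision for Gap 3 can be sketched as a pure function. This is a simplified model under stated assumptions: `should_reclaim` and its signature are hypothetical, "older than" is read as a strict comparison, and the real release_stale_claims presumably does more (claim expiry, event writes) than is shown here. Only the constant name and its 1h default come from the PR.

```python
from typing import Optional, Tuple

DEFAULT_CLAIM_HEARTBEAT_MAX_STALE_SECONDS = 3600  # 1h, per the PR description


def should_reclaim(
    pid_alive: bool,
    last_heartbeat_at: Optional[float],
    now: float,
    max_stale_seconds: float = DEFAULT_CLAIM_HEARTBEAT_MAX_STALE_SECONDS,
) -> Tuple[bool, bool]:
    """Return (reclaim, heartbeat_stale) for one stale-looking claim."""
    # A NULL heartbeat is never treated as stale (backward compatibility).
    heartbeat_stale = (
        last_heartbeat_at is not None
        and (now - last_heartbeat_at) > max_stale_seconds
    )
    if not pid_alive:
        return True, heartbeat_stale  # dead worker: reclaim as before
    # Live PID: reclaim only when the heartbeat backstop fires.
    return heartbeat_stale, heartbeat_stale
```

Returning the `heartbeat_stale` flag alongside the decision mirrors the idea of recording it in the reclaim event payload, so the reason for reclaiming a live-PID worker stays visible.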

Author map: added @baofuen.

Validation

| Scenario | Before | After |
| --- | --- | --- |
| 4 block→unblock cycles | invisible: _rule_stuck_in_blocked reset on each unblock | block_unblock_cycling warning, count=4 |
| 2 budget exhaustions (failure_limit=2) | kanban_block x2, consecutive_failures=0 after each; task re-spawns forever | _record_task_failure(timed_out) x2 → gave_up event, status=blocked |
| Live PID + heartbeat 1h+ stale | extended indefinitely | reclaimed (heartbeat_stale=true in event) |
| Live PID + fresh heartbeat | extended (unchanged) | extended (unchanged) |
| Live PID + NULL heartbeat | extended (backward compat) | extended (backward compat) |
| Targeted suites: kanban_db, kanban_diagnostics, kanban_cli, kanban_tools, kanban_worker_runs, budget_config | 405 pass | 405 pass |
| tests/agent/ (3629 tests) | pass | pass |

Closes #29747. Credit @baofuen for the precise three-gap diagnosis and the reproduction.

Infographic

[Image: kanban-three-gaps]

github-actions Bot commented May 29, 2026

Contributor

🔎 Lint report: fix/kanban-restuck-loops-29747 vs origin/main

ruff

Total: 0 on HEAD, 0 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 0 pre-existing issues carried over.

ty (type checker)

Total: 9425 on HEAD, 9425 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 4891 pre-existing issues carried over.

Diagnostics are surfaced as warnings — this check never fails the build.

teknium1 and others added 3 commits May 29, 2026 00:07
…29747)

Reporter diagnosed three independent gaps that together allowed infinite
'unblock → re-stuck' loops with no surfacing or escalation:

GAP 1: `_rule_stuck_in_blocked` resets timer on any `commented`/`unblocked`
event, so a task that cycles every few minutes is invisible to it
regardless of how many times it cycles.

Fix: new `_rule_block_unblock_cycling` rule (`hermes_cli/kanban_diagnostics.py`)
that counts block→unblock cycles in a sliding window. Default threshold
3 cycles within 24h, configurable via `block_cycle_threshold` /
`block_cycle_window_seconds`. Walks events in arrival order (event id)
since multiple events can share the same `created_at` second. Fires as a
warning with a CLI hint to inspect the block reasons.

GAP 2: Iteration-budget-exhausted runs in kanban workers map to
`kanban_block` (status=blocked, but a clean exit from the kernel's
perspective). `_rule_repeated_failures` reads `consecutive_failures`,
which `_record_task_failure` increments only for crashed/timed_out/
spawn_failed — `blocked` outcome bypasses the failure counter, so the
`kanban.failure_limit` circuit breaker never trips on budget-exhaustion
loops.

Fix: `agent/conversation_loop.py` budget-exhaustion path now calls
`_record_task_failure(outcome="timed_out")` instead of `kanban_block`.
Budget exhaustion is genuinely a timeout-shaped failure (the task ran out
of allowed iterations), so this is more honest semantics; it also routes
through the unified failure counter, so repeated budget exhaustions trip
the circuit breaker and the task auto-blocks with `gave_up` after
`failure_limit` retries.

GAP 3: `release_stale_claims` uses `_pid_alive(worker_pid)` only and
ignores `last_heartbeat_at`. Reporter observed a 91-min run that held
its claim with frozen heartbeat because the worker entered a logic loop
with no tool calls — `_pid_alive` kept returning True so the claim was
extended every 15 minutes indefinitely.

Fix: heartbeat-stale backstop. If `last_heartbeat_at` is set AND older
than `DEFAULT_CLAIM_HEARTBEAT_MAX_STALE_SECONDS` (default 1h), reclaim
even if the PID is alive. NULL `last_heartbeat_at` preserves backward
compatibility (no heartbeat yet = extend, as before). The reclaim event
payload now includes a `heartbeat_stale` boolean so operators see why a
live-PID worker was reclaimed.

This works cleanly in concert with PR #34418 (#31752 runtime → heartbeat
bridge): once `_touch_activity` keeps `last_heartbeat_at` fresh as a
side effect of normal API traffic, the backstop only fires for genuinely
wedged workers (no chunks, no tool results, no progress at all).

Co-authored-by: baofuen <45189813+baofuen@users.noreply.github.com>

The two tests in TestRunConversation now verify the new behavior:
  - test_kanban_block_called_on_iteration_exhaustion → verifies
    _record_task_failure(outcome='timed_out') is called instead of
    kanban_block
  - test_no_kanban_block_when_not_in_kanban_mode → verifies the bridge
    is a no-op when HERMES_KANBAN_TASK is unset

The function names are kept for diff stability; both assert against
_record_task_failure now, which is the correct contract per the gap-2
fix in this PR.
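
The contract those two tests check can be illustrated with `unittest.mock`. The sketch below is not the repository's test code: `on_budget_exhausted` is a hypothetical stand-in for the budget-exhaustion bridge, and the argument shape passed to the failure recorder is an assumption. Only the HERMES_KANBAN_TASK variable and the `outcome="timed_out"` argument come from the PR text.

```python
import os
from unittest import mock


def on_budget_exhausted(record_task_failure, kanban_block, env=None):
    """Hypothetical bridge: record a timed_out failure in kanban mode."""
    env = os.environ if env is None else env
    task_id = env.get("HERMES_KANBAN_TASK")
    if not task_id:
        return  # not running as a kanban worker: no-op
    # kanban_block is deliberately never called on this path.
    record_task_failure(task_id, outcome="timed_out")


def check_exhaustion_records_timeout():
    record, block = mock.Mock(), mock.Mock()
    on_budget_exhausted(record, block, env={"HERMES_KANBAN_TASK": "t1"})
    record.assert_called_once_with("t1", outcome="timed_out")
    block.assert_not_called()


def check_noop_outside_kanban_mode():
    record, block = mock.Mock(), mock.Mock()
    on_budget_exhausted(record, block, env={})
    record.assert_not_called()
    block.assert_not_called()
```

Passing the environment as a plain dict keeps the checks independent of the real process environment.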
teknium1 force-pushed the fix/kanban-restuck-loops-29747 branch from 26e5b28 to 79e8d04 on May 29, 2026 07:07
teknium1 merged commit 7d10105 into main on May 29, 2026
24 of 26 checks passed
teknium1 deleted the fix/kanban-restuck-loops-29747 branch on May 29, 2026 07:13

Labels

  • comp/agent: Core agent loop, run_agent.py, prompt builder
  • P3: Low (cosmetic, nice to have)
  • type/bug: Something isn't working

Development

Successfully merging this pull request may close these issues.

kanban: three gaps in blocked / iteration-exhausted handling that permit infinite "unblock → re-stuck" loops