fix(kanban): close three blocked/iteration-exhausted handling gaps (#29747) #34428
Merged
This was referenced May 29, 2026
Closed
…29747)

Reporter diagnosed three independent gaps that together allowed infinite 'unblock → re-stuck' loops with no surfacing or escalation:

GAP 1: `_rule_stuck_in_blocked` resets its timer on any `commented`/`unblocked` event, so a task that cycles every few minutes is invisible to it regardless of how many times it cycles.

Fix: new `_rule_block_unblock_cycling` rule (`hermes_cli/kanban_diagnostics.py`) that counts block→unblock cycles in a sliding window. Default threshold is 3 cycles within 24h, configurable via `block_cycle_threshold` / `block_cycle_window_seconds`. Walks events in arrival order (event id) since multiple events can share the same `created_at` second. Fires as a warning with a CLI hint to inspect the block reasons.

GAP 2: Iteration-budget-exhausted runs in kanban workers map to `kanban_block` (status=blocked, but a clean exit from the kernel's perspective). `_rule_repeated_failures` reads `consecutive_failures`, which `_record_task_failure` increments only for crashed/timed_out/spawn_failed. The `blocked` outcome bypasses the failure counter, so the `kanban.failure_limit` circuit breaker never trips on budget-exhaustion loops.

Fix: the `agent/conversation_loop.py` budget-exhaustion path now calls `_record_task_failure(outcome="timed_out")` instead of `kanban_block`. Budget exhaustion is genuinely a timeout-shaped failure (the task ran out of allowed iterations), so this is more honest semantics; it also routes through the unified failure counter, so repeated budget exhaustions trip the circuit breaker and the task auto-blocks with `gave_up` after `failure_limit` retries.

GAP 3: `release_stale_claims` uses `_pid_alive(worker_pid)` only and ignores `last_heartbeat_at`. Reporter observed a 91-min run that held its claim with a frozen heartbeat because the worker entered a logic loop with no tool calls: `_pid_alive` kept returning True, so the claim was extended every 15 minutes indefinitely.

Fix: heartbeat-stale backstop. If `last_heartbeat_at` is set AND older than `DEFAULT_CLAIM_HEARTBEAT_MAX_STALE_SECONDS` (default 1h), reclaim even if the PID is alive. NULL `last_heartbeat_at` preserves backward compatibility (no heartbeat yet = extend, as before). The reclaim event payload now includes a `heartbeat_stale` boolean so operators see why a live-PID worker was reclaimed.

This works cleanly in concert with PR #34418 (#31752 runtime → heartbeat bridge): once `_touch_activity` keeps `last_heartbeat_at` fresh as a side effect of normal API traffic, the backstop only fires for genuinely wedged workers (no chunks, no tool results, no progress at all).

Co-authored-by: baofuen <45189813+baofuen@users.noreply.github.com>
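The GAP 1 fix in the commit message above can be illustrated with a minimal sketch of sliding-window cycle counting. This is not the code from `hermes_cli/kanban_diagnostics.py`: the `Event` shape, the function names, and the choice to treat "threshold reached" as `>=` are all assumptions made for illustration. Only the defaults (3 cycles, 24h) and the ordering by event id come from the commit message.

```python
from dataclasses import dataclass


# Hypothetical event shape; the real kanban event schema may differ.
@dataclass(frozen=True)
class Event:
    id: int          # arrival order, strictly increasing
    kind: str        # e.g. "blocked", "unblocked", "commented"
    created_at: int  # epoch seconds; several events can share one second


def count_block_unblock_cycles(events, now, window_seconds=24 * 3600):
    """Count blocked -> unblocked transitions whose unblock lies in the window."""
    cycles = 0
    blocked = False
    # Order by id rather than created_at, because timestamps can tie.
    for ev in sorted(events, key=lambda e: e.id):
        if ev.kind == "blocked":
            blocked = True
        elif ev.kind == "unblocked" and blocked:
            blocked = False
            if now - ev.created_at <= window_seconds:
                cycles += 1
    return cycles


def is_cycling(events, now, threshold=3, window_seconds=24 * 3600):
    # Assumption: the rule fires once the count reaches the threshold.
    return count_block_unblock_cycles(events, now, window_seconds) >= threshold
```

Unlike a "time since last event" check, this count keeps growing while the task bounces between blocked and unblocked, so frequent cycling becomes visible instead of resetting the timer.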
The two tests in TestRunConversation now verify the new behavior:
- test_kanban_block_called_on_iteration_exhaustion → verifies
_record_task_failure(outcome='timed_out') is called instead of
kanban_block
- test_no_kanban_block_when_not_in_kanban_mode → verifies the bridge
is a no-op when HERMES_KANBAN_TASK is unset
The function names are kept for diff stability; both assert against
_record_task_failure now, which is the correct contract per the gap-2
fix in this PR.
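The contract these tests assert can be sketched with a small in-memory model. Everything here is hypothetical and simplified: `Task`, `record_task_failure`, and `kanban_block` are stand-ins for the project's real functions, and the set of counted outcomes is taken from the commit message (crashed, timed_out, spawn_failed). The point is only to show why routing budget exhaustion through the failure counter lets the `failure_limit` breaker trip.

```python
# Outcomes that increment the counter, per the commit message.
COUNTED_OUTCOMES = {"crashed", "timed_out", "spawn_failed"}


class Task:
    """Hypothetical stand-in for a kanban task row."""

    def __init__(self, failure_limit=2):
        self.failure_limit = failure_limit
        self.consecutive_failures = 0
        self.status = "ready"
        self.events = []


def record_task_failure(task, outcome):
    """Illustrative failure counter with a circuit breaker."""
    if outcome not in COUNTED_OUTCOMES:
        return
    task.consecutive_failures += 1
    task.events.append(outcome)
    if task.consecutive_failures >= task.failure_limit:
        # Breaker trips: stop retrying and surface the task.
        task.status = "blocked"
        task.events.append("gave_up")


def kanban_block(task):
    """Old budget-exhaustion path: blocks, but never touches the counter."""
    task.status = "blocked"
    task.events.append("blocked")
```

With `failure_limit=2`, two calls to `kanban_block` leave `consecutive_failures` at 0, while two calls to `record_task_failure(task, "timed_out")` end with a `gave_up` event.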
26e5b28 to
79e8d04
Compare
Summary
Reporter diagnosed three independent gaps that together allow infinite "unblock → re-stuck" loops with no surfacing or escalation. All three are real on current main; this PR closes all three.
Changes
Gap 1: `_rule_stuck_in_blocked` resets timer on any `commented`/`unblocked`
`hermes_cli/kanban_diagnostics.py`: new `_rule_block_unblock_cycling` complementary rule. Counts block→unblock cycles in a sliding window (default 3 cycles in 24h, configurable via `block_cycle_threshold` / `block_cycle_window_seconds`). Walks events in arrival order since multiple events share `created_at` seconds.

Gap 2: Budget-exhausted runs bypass `failure_limit` circuit breaker
`agent/conversation_loop.py`: budget-exhaustion path now calls `_record_task_failure(outcome="timed_out")` instead of `kanban_block`. Budget exhaustion is genuinely timeout-shaped; it routes through the unified failure counter so repeated exhaustions trip the breaker.

Gap 3: `release_stale_claims` ignores `last_heartbeat_at`
`hermes_cli/kanban_db.py`: heartbeat-stale backstop. If `last_heartbeat_at` is set AND older than `DEFAULT_CLAIM_HEARTBEAT_MAX_STALE_SECONDS` (1h), reclaim even if PID is alive. NULL heartbeat preserves backward compat. Reclaim event payload includes `heartbeat_stale` flag.

Pairs cleanly with PR #34418 (#31752 runtime → heartbeat bridge): once `_touch_activity` keeps heartbeats fresh via normal API traffic, this backstop only fires for genuinely wedged workers.

Author map: added @baofuen.
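The Gap 3 decision rule described above can be written as a small pure function. This is a sketch under stated assumptions, not the body of `release_stale_claims`: the function name `should_reclaim`, its signature, and the strict `>` comparison are hypothetical; the 1h default and the NULL-heartbeat behaviour come from the PR text.

```python
import time

# Mirrors the 1h default the PR gives for DEFAULT_CLAIM_HEARTBEAT_MAX_STALE_SECONDS.
CLAIM_HEARTBEAT_MAX_STALE_SECONDS = 3600


def should_reclaim(pid_alive, last_heartbeat_at, now=None,
                   max_stale=CLAIM_HEARTBEAT_MAX_STALE_SECONDS):
    """Return (reclaim, heartbeat_stale) for a single claim.

    pid_alive: result of a liveness probe on the worker PID.
    last_heartbeat_at: epoch seconds, or None if no heartbeat was recorded.
    """
    now = time.time() if now is None else now
    if not pid_alive:
        # Dead worker: reclaim as before, not because of the heartbeat.
        return True, False
    if last_heartbeat_at is None:
        # No heartbeat yet: keep extending (backward compatible).
        return False, False
    stale = (now - last_heartbeat_at) > max_stale
    # Live PID with a frozen heartbeat: reclaim and flag the reason.
    return stale, stale
```

For the reported 91-minute run with a frozen heartbeat, `should_reclaim(True, now - 91 * 60, now=now)` returns `(True, True)`, whereas a PID-only check would keep extending the claim.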
Validation
… | `_rule_stuck_in_blocked` reset on each unblock | `block_unblock_cycling` warning, count=4
… `failure_limit=2`) | `kanban_block` x2, `consecutive_failures=0` after each, task re-spawns forever | `_record_task_failure(timed_out)` x2 → `gave_up` event, `status=blocked`
… `heartbeat_stale=true` in event)
`tests/agent/` (3629 tests)

Closes #29747. Credit @baofuen for the precise three-gap diagnosis and the reproduction.