fix(kanban): close three blocked/iteration-exhausted handling gaps (#29747) by teknium1 · Pull Request #34428 · NousResearch/hermes-agent

teknium1 · 2026-05-29T06:41:09Z

Summary

Reporter diagnosed three independent gaps that together allow infinite "unblock → re-stuck" loops with no surfacing or escalation. All three are real on current main; this PR closes all three.

Changes

Gap 1 — _rule_stuck_in_blocked resets timer on any commented/unblocked

hermes_cli/kanban_diagnostics.py: new complementary rule, _rule_block_unblock_cycling. It counts block→unblock cycles in a sliding window (default: 3 cycles within 24h, configurable via block_cycle_threshold / block_cycle_window_seconds). It walks events in arrival order (by event id), since multiple events can share the same created_at second.
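
The counting described above can be sketched roughly as follows. This is a minimal illustration, not the code in hermes_cli/kanban_diagnostics.py: the `Event` shape, the function names, and the inclusive `>=` threshold comparison are all assumptions made for the example. It only counts cycles whose block and unblock both fall inside the window.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """Hypothetical event row; the real kanban event schema may differ."""
    id: int          # increases in arrival order
    kind: str        # e.g. "blocked", "unblocked", "commented"
    created_at: int  # unix seconds; several events may share one value


def count_block_unblock_cycles(events, now, window_seconds=24 * 3600):
    """Count completed block -> unblock cycles inside the sliding window."""
    cutoff = now - window_seconds
    cycles = 0
    blocked = False
    # Sort by id rather than created_at: ids keep arrival order even when
    # two events were written within the same second.
    for ev in sorted(events, key=lambda e: e.id):
        if ev.created_at < cutoff:
            continue  # outside the window
        if ev.kind == "blocked":
            blocked = True
        elif ev.kind == "unblocked" and blocked:
            cycles += 1
            blocked = False
    return cycles


def is_cycling(events, now, threshold=3, window_seconds=24 * 3600):
    """True once the cycle count reaches the threshold (assumed inclusive)."""
    return count_block_unblock_cycles(events, now, window_seconds) >= threshold
```

Unlike a "time since last state change" timer, a counter of this kind does not reset when the task is unblocked, which is what lets it see rapid re-stuck loops.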

Gap 2 — Budget-exhausted runs bypass failure_limit circuit breaker

agent/conversation_loop.py: budget-exhaustion path now calls _record_task_failure(outcome="timed_out") instead of kanban_block. Budget exhaustion is genuinely timeout-shaped (the task ran out of allowed iterations), so it now routes through the unified failure counter and repeated exhaustions trip the breaker.
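
To illustrate why the outcome matters, here is a simplified sketch of a failure counter with a circuit breaker. The `Task` class and `record_task_failure` function are hypothetical stand-ins, not the project's `_record_task_failure`; the set of counted outcomes and the "give up at `failure_limit`" behavior follow the description in this PR.

```python
from dataclasses import dataclass, field

# Outcomes that increment the counter, per the PR description.
COUNTED_OUTCOMES = {"crashed", "timed_out", "spawn_failed"}


@dataclass
class Task:
    """Hypothetical in-memory task; the real task row lives in the kanban DB."""
    status: str = "ready"
    consecutive_failures: int = 0
    events: list = field(default_factory=list)


def record_task_failure(task, outcome, failure_limit=2):
    """Increment the counter for counted outcomes; give up at the limit."""
    if outcome not in COUNTED_OUTCOMES:
        # e.g. "blocked": the counter is untouched, so the breaker never trips.
        return
    task.consecutive_failures += 1
    if task.consecutive_failures >= failure_limit:
        task.status = "blocked"
        task.events.append("gave_up")


# Two budget exhaustions reported as "timed_out" trip the breaker.
task = Task()
record_task_failure(task, "timed_out")
record_task_failure(task, "timed_out")
```

With the old mapping, the same two runs would have been reported with a `blocked` outcome, leaving `consecutive_failures` at 0 each time.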

Gap 3 — release_stale_claims ignores last_heartbeat_at

hermes_cli/kanban_db.py: heartbeat-stale backstop. If last_heartbeat_at is set AND older than DEFAULT_CLAIM_HEARTBEAT_MAX_STALE_SECONDS (1h), reclaim even if PID is alive. NULL heartbeat preserves backward compat. Reclaim event payload includes heartbeat_stale flag.

Pairs cleanly with PR #34418 (#31752 runtime → heartbeat bridge): once _touch_activity keeps heartbeats fresh via normal API traffic, this backstop only fires for genuinely wedged workers.
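
The reclaim decision for Gap 3 can be pictured as a small pure function. This is a sketch under assumptions: `should_reclaim` is a hypothetical name, timestamps are taken to be unix seconds, and a dead PID is assumed to be reclaimed as before. Only the constant name, its 1h default, and the NULL-heartbeat behavior come from the PR description.

```python
DEFAULT_CLAIM_HEARTBEAT_MAX_STALE_SECONDS = 3600  # 1h, per the PR description


def should_reclaim(pid_alive, last_heartbeat_at, now,
                   max_stale_seconds=DEFAULT_CLAIM_HEARTBEAT_MAX_STALE_SECONDS):
    """Return (reclaim, heartbeat_stale) for an expired claim.

    Hypothetical helper: the real release_stale_claims works on DB rows.
    """
    if not pid_alive:
        # Assumed pre-existing behavior: a dead worker loses its claim.
        return True, False
    if last_heartbeat_at is None:
        # NULL heartbeat: extend, as before (backward compatible).
        return False, False
    stale = (now - last_heartbeat_at) > max_stale_seconds
    # A live PID is reclaimed only when its heartbeat is too old.
    return stale, stale
```

Returning the `heartbeat_stale` flag alongside the decision mirrors the event payload change, so an operator can tell a wedged live worker apart from a dead one.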

Author map: added @baofuen.

Validation

| Scenario | Before | After |
|---|---|---|
| 4 block→unblock cycles | invisible: `_rule_stuck_in_blocked` reset on each unblock | `block_unblock_cycling` warning, count=4 |
| 2 budget exhaustions (`failure_limit=2`) | `kanban_block` x2, `consecutive_failures=0` after each; task re-spawns forever | `_record_task_failure(timed_out)` x2 → `gave_up` event, `status=blocked` |
| Live PID + heartbeat 1h+ stale | extended indefinitely | reclaimed (`heartbeat_stale=true` in event) |
| Live PID + fresh heartbeat | extended (unchanged) | extended (unchanged) |
| Live PID + NULL heartbeat | extended (backward compat) | extended (backward compat) |
| Targeted suites: kanban_db, kanban_diagnostics, kanban_cli, kanban_tools, kanban_worker_runs, budget_config | 405 pass | 405 pass |
| `tests/agent/` (3629 tests) | pass | pass |

Closes #29747. Credit @baofuen for the precise three-gap diagnosis and the reproduction.

github-actions · 2026-05-29T06:42:01Z

🔎 Lint report: `fix/kanban-restuck-loops-29747` vs `origin/main`

ruff

Total: 0 on HEAD, 0 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 0 pre-existing issues carried over.

ty (type checker)

Total: 9425 on HEAD, 9425 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 4891 pre-existing issues carried over.

Diagnostics are surfaced as warnings — this check never fails the build.

…29747)

Reporter diagnosed three independent gaps that together allowed infinite 'unblock → re-stuck' loops with no surfacing or escalation:

GAP 1: `_rule_stuck_in_blocked` resets timer on any `commented`/`unblocked` event, so a task that cycles every few minutes is invisible to it regardless of how many times it cycles.

Fix: new `_rule_block_unblock_cycling` rule (`hermes_cli/kanban_diagnostics.py`) that counts block→unblock cycles in a sliding window. Default threshold 3 cycles within 24h, configurable via `block_cycle_threshold` / `block_cycle_window_seconds`. Walks events in arrival order (event id) since multiple events can share the same `created_at` second. Fires as a warning with a CLI hint to inspect the block reasons.

GAP 2: Iteration-budget-exhausted runs in kanban workers map to `kanban_block` (status=blocked, but a clean exit from the kernel's perspective). `_rule_repeated_failures` reads `consecutive_failures`, which `_record_task_failure` increments only for crashed/timed_out/spawn_failed. The `blocked` outcome bypasses the failure counter, so the `kanban.failure_limit` circuit breaker never trips on budget-exhaustion loops.

Fix: `agent/conversation_loop.py` budget-exhaustion path now calls `_record_task_failure(outcome="timed_out")` instead of `kanban_block`. Budget exhaustion is genuinely a timeout-shaped failure (the task ran out of allowed iterations), so this is more honest semantics; it also routes through the unified failure counter, so repeated budget exhaustions trip the circuit breaker and the task auto-blocks with `gave_up` after `failure_limit` retries.

GAP 3: `release_stale_claims` uses `_pid_alive(worker_pid)` only and ignores `last_heartbeat_at`. Reporter observed a 91-min run that held its claim with frozen heartbeat because the worker entered a logic loop with no tool calls; `_pid_alive` kept returning True, so the claim was extended every 15 minutes indefinitely.

Fix: heartbeat-stale backstop. If `last_heartbeat_at` is set AND older than `DEFAULT_CLAIM_HEARTBEAT_MAX_STALE_SECONDS` (default 1h), reclaim even if the PID is alive. NULL `last_heartbeat_at` preserves backward compatibility (no heartbeat yet = extend, as before). The reclaim event payload now includes a `heartbeat_stale` boolean so operators see why a live-PID worker was reclaimed.

This works cleanly in concert with PR #34418 (#31752 runtime → heartbeat bridge): once `_touch_activity` keeps `last_heartbeat_at` fresh as a side effect of normal API traffic, the backstop only fires for genuinely wedged workers (no chunks, no tool results, no progress at all).

Co-authored-by: baofuen <45189813+baofuen@users.noreply.github.com>

The two tests in TestRunConversation now verify the new behavior:

- test_kanban_block_called_on_iteration_exhaustion → verifies _record_task_failure(outcome='timed_out') is called instead of kanban_block
- test_no_kanban_block_when_not_in_kanban_mode → verifies the bridge is a no-op when HERMES_KANBAN_TASK is unset

The function names are kept for diff stability; both assert against _record_task_failure now, which is the correct contract per the gap-2 fix in this PR.

alt-glitch added labels on May 29, 2026: P3 (Low: cosmetic, nice to have), type/bug (Something isn't working), comp/agent (Core agent loop, run_agent.py, prompt builder)

teknium1 and others added 3 commits May 29, 2026 00:07

chore: trigger CI

79e8d04

teknium1 force-pushed the fix/kanban-restuck-loops-29747 branch from 26e5b28 to 79e8d04 on May 29, 2026 07:07

teknium1 mentioned this pull request May 29, 2026

Kanban needs first-class waiting states for human, approval, and review gates #29171

Closed

teknium1 merged commit 7d10105 into main May 29, 2026
24 of 26 checks passed

teknium1 deleted the fix/kanban-restuck-loops-29747 branch May 29, 2026 07:13

teknium1 mentioned this pull request May 29, 2026

kanban: three gaps in blocked / iteration-exhausted handling that permit infinite "unblock → re-stuck" loops #29747

Closed
