fix(kanban): prevent infinite crash loop for parentless blocked tasks by sirEven · Pull Request #29732 · NousResearch/hermes-agent

sirEven · 2026-05-21T08:36:17Z

Problem

The kanban dispatcher's recompute_ready() promoted parentless blocked tasks back to ready on every tick because all([]) is True (vacuous truth). This defeated the circuit breaker: a task would crash, get auto-blocked (gave_up event), then immediately be promoted back, have its consecutive_failures counter reset to 0, and the cycle repeats forever.

Crash loop trace

gave_up → promote → claim → spawn → crash → gave_up → promote → ...

Observed in production: 509 crashes on a single task, 8,343 gave_up events, 500+ zombie processes created every 60 seconds. Each task with consecutive_failures reset to 1 on every cycle because gave_up trips at failure_limit=1 (protocol violations / systemic failures), then recompute_ready immediately promotes and resets the counter.

Root cause

In recompute_ready() (kanban_db.py line ~2125):

blocked tasks are considered for promotion (correct for dependent tasks whose parent just completed)
_has_sticky_block() correctly skips worker-initiated kanban_block calls, but circuit-breaker blocks (gave_up event, non-sticky) pass through
For parentless tasks: parents = conn.execute("SELECT ... FROM task_links WHERE child_id = ?", (task_id,)).fetchall() returns [], and all([]) == True — vacuous truth promotes them
Line 2135-2139 resets consecutive_failures = 0 on promotion, so the breaker never accumulates

Fix

Two changes:

1. Vacuous-truth guard (primary fix)

recompute_ready() now skips promotion of parentless blocked tasks. A task with no parents has no dependency to wait for; promoting it only re-queues a deterministic crash. These tasks stay blocked until an operator runs kanban_unblock (e.g. after fixing the underlying issue: bad API key, missing skill, auth error).

has_parents = len(parents) > 0
if all(p["status"] in ("done", "archived") for p in parents):
    if cur_status == "blocked":
        if has_parents:
            # Original behavior: promote and reset counters
        else:
            # New: skip parentless blocked tasks entirely
            continue

2. Failure-counter reset guard (belt-and-suspenders)

When a blocked task with parents is promoted (because a parent just completed), consecutive_failures is still reset to 0 — the conditions that caused the failure may have changed. This preserves the original auto-recovery semantics (#40c1decb3) for tasks with a legitimate dependency chain.

What's NOT changed

Parentless todo tasks still promote to ready normally (the guard only applies to blocked → ready transitions)
Parented blocked tasks still auto-recover when their parent completes
Worker-initiated kanban_block stays sticky (the kanban: dispatcher auto-promotes blocked task → respawn worker → protocol_violation loop #28712 guard is untouched)
kanban_unblock still works — operators can manually unblock parentless tasks after fixing the underlying issue

Testing

8 new tests in test_kanban_crash_loop.py covering all combinations:
- Parentless circuit-breaker block stays blocked (×5 iterations, ×10 dispatch ticks)
- Parentless task with manual kanban_block stays blocked (sticky + vacuous)
- Parented blocked task with done parent auto-promotes (existing semantics preserved)
- Parented blocked task with incomplete parent stays blocked
- Parentless blocked task failure counter not reset
- Unblocked parentless task becomes ready (operator can still recover)
- Parentless todo task promotes normally (guard only affects blocked)
1 updated test in test_kanban_blocked_sticky.py:
- test_unblock_clears_sticky_state_and_lets_block_recover now uses a task with parents to match the new semantics
All 557 existing kanban tests pass (159 in test_kanban_db.py, 6 in test_kanban_blocked_sticky.py, plus others)

The dispatcher's recompute_ready() promoted parentless blocked tasks back to 'ready' on every tick because all([]) is True (vacuous truth). This defeated the circuit breaker: a task would crash, get auto-blocked (gave_up event), then immediately be promoted back, have its consecutive_failures counter reset to 0, and the cycle repeats forever. The crash loop created hundreds of zombie processes per hour: gave_up → promote → claim → spawn → crash → gave_up → promote → ... Two fixes: 1. Vacuous-truth guard: recompute_ready() now skips promotion of parentless blocked tasks entirely. A task with no parents has no dependency to wait for; promoting it only re-queues a deterministic crash. The task stays blocked until an operator runs kanban_unblock (e.g. after fixing the underlying issue like a bad API key or missing skill). 2. Failure-counter reset guard: When a blocked task WITH parents is promoted (because a parent just completed), consecutive_failures is reset to 0 as before — the conditions that caused the failure may have changed. This preserves the original auto-recovery semantics for tasks that have a legitimate dependency chain. Parentless TODO tasks (status='todo') are still promoted to 'ready' normally — only the blocked → ready transition is guarded for tasks without parents. Also updates test_kanban_blocked_sticky.py: - test_unblock_clears_sticky_state_and_lets_block_recover now uses a task with parents, matching the new semantics where only parented blocked tasks auto-recover after circuit-breaker blocks. New test file test_kanban_crash_loop.py (8 tests): - test_parentless_circuit_breaker_block_is_not_promoted - test_parentless_circuit_breaker_block_stays_blocked_across_dispatch_ticks - test_parentless_task_with_manual_kanban_block_is_also_sticky - test_blocked_task_with_done_parents_still_auto_promotes - test_blocked_task_with_incomplete_parents_stays_blocked - test_parentless_blocked_task_failure_counter_not_reset - test_unblocked_parentless_task_becomes_ready - test_parentless_todo_task_is_promoted_normally

sirEven · 2026-05-21T09:26:13Z

CI failures are pre-existing on `main`

Both test failures in the CI run are pre-existing on main and unrelated to this PR (which only touches kanban_db.py, test_kanban_blocked_sticky.py, and test_kanban_crash_loop.py):

1. `test_all_seven_plugins_present_in_registry`

Root cause: Commit a0c031299 (by Jaaneek, May 19) added the xai web search provider plugin but did not update the test expectations. The plugin correctly follows the kind: backend auto-registration pattern — it should be in the list.

Fix already exists: Commit 2fb3c1b3 (by ethernet8023, May 20) on branches ethie/faster-tests and ethie/test-isolation-subprocess comprehensively addresses this: updated docstring, added xai to expected list, added capability flags, added parametrized test entries. It will merge when those branches land.

No action needed here — this is a branch-coordination issue, not a bug in our change.

2. `test_wraps_stdout_and_stderr_with_mirror`

Root cause: This test passes on main. The failure occurs on the sid/foundational-restructure branch where all hermes_cli.* imports were rewritten to hermes_agent.*. The re-export chain (hermes_agent.constants.get_hermes_home → hermes_agent.cli.config.get_hermes_home) creates import-binding subtleties where monkeypatch.setattr on one re-export link does not propagate to the late-bound import inside _install_hangup_protection. The _HERMES_HOME_CACHE clearing in the test is dead code — the attribute does not exist on either branch.

No action needed here — this is the restructure branch team's problem, not ours.

Summary

Failure	On `main`?	Caused by this PR?	Fix exists?
xai plugins test	Yes (pre-existing)	No	Yes, on `ethie/faster-tests`
hangup protection test	No (only on restructure branch)	No	Needs restructure branch work

All 557 kanban tests pass. The 24,628 passing tests confirm our change is clean.

sirEven · 2026-06-11T23:16:50Z

Closing in favor of upstream fixes that landed after this PR was opened:

6ab71d3 — fix(kanban): prevent infinite retry loop when worker exhausts iteration budget — adds a failures >= effective_limit guard in recompute_ready and preserves consecutive_failures across recovery, preventing the circuit-breaker bypass that caused the crash loop.
8e5a685 — fix(kanban): align recompute_ready guard with breaker's configured failure_limit — follow-up ensuring the guard respects the dispatcher's configured kanban.failure_limit instead of hardcoded DEFAULT_FAILURE_LIMIT.

Both approaches prevent the infinite loop. The upstream fix is more permissive (allows one retry before the breaker trips permanently), while this PR was stricter (never promotes parentless blocked tasks). Since the upstream fix is already merged and covers the same ground, this PR is superseded.

alt-glitch added type/bug Something isn't working comp/plugins Plugin system and bundled plugins P3 Low — cosmetic, nice to have labels May 21, 2026

alt-glitch mentioned this pull request Jun 2, 2026

Circuit-breaker (gave_up) blocks are auto-resurrected by recompute_ready → deterministic-failure tasks loop forever #37689

Open

sirEven closed this Jun 11, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(kanban): prevent infinite crash loop for parentless blocked tasks#29732

fix(kanban): prevent infinite crash loop for parentless blocked tasks#29732
sirEven wants to merge 1 commit into
NousResearch:mainfrom
sirEven:fix/kanban-crash-loop-parentless-tasks

sirEven commented May 21, 2026

Uh oh!

sirEven commented May 21, 2026

Uh oh!

sirEven commented Jun 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

sirEven commented May 21, 2026

Problem

Crash loop trace

Root cause

Fix

1. Vacuous-truth guard (primary fix)

2. Failure-counter reset guard (belt-and-suspenders)

What's NOT changed

Testing

Uh oh!

sirEven commented May 21, 2026

CI failures are pre-existing on main

1. test_all_seven_plugins_present_in_registry

2. test_wraps_stdout_and_stderr_with_mirror

Summary

Uh oh!

sirEven commented Jun 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

CI failures are pre-existing on `main`

1. `test_all_seven_plugins_present_in_registry`

2. `test_wraps_stdout_and_stderr_with_mirror`