Skip to content

fix(kanban): prevent infinite crash loop for parentless blocked tasks#29732

Closed
sirEven wants to merge 1 commit into
NousResearch:mainfrom
sirEven:fix/kanban-crash-loop-parentless-tasks
Closed

fix(kanban): prevent infinite crash loop for parentless blocked tasks#29732
sirEven wants to merge 1 commit into
NousResearch:mainfrom
sirEven:fix/kanban-crash-loop-parentless-tasks

Conversation

@sirEven

@sirEven sirEven commented May 21, 2026

Copy link
Copy Markdown
Contributor

Problem

The kanban dispatcher's recompute_ready() promoted parentless blocked tasks back to ready on every tick because all([]) is True (vacuous truth). This defeated the circuit breaker: a task would crash, get auto-blocked (gave_up event), then immediately be promoted back, have its consecutive_failures counter reset to 0, and the cycle repeats forever.

Crash loop trace

gave_up → promote → claim → spawn → crash → gave_up → promote → ...

Observed in production: 509 crashes on a single task, 8,343 gave_up events, 500+ zombie processes created every 60 seconds. Each task with consecutive_failures reset to 1 on every cycle because gave_up trips at failure_limit=1 (protocol violations / systemic failures), then recompute_ready immediately promotes and resets the counter.

Root cause

In recompute_ready() (kanban_db.py line ~2125):

  1. blocked tasks are considered for promotion (correct for dependent tasks whose parent just completed)
  2. _has_sticky_block() correctly skips worker-initiated kanban_block calls, but circuit-breaker blocks (gave_up event, non-sticky) pass through
  3. For parentless tasks: parents = conn.execute("SELECT ... FROM task_links WHERE child_id = ?", (task_id,)).fetchall() returns [], and all([]) == True — vacuous truth promotes them
  4. Line 2135-2139 resets consecutive_failures = 0 on promotion, so the breaker never accumulates

Fix

Two changes:

1. Vacuous-truth guard (primary fix)

recompute_ready() now skips promotion of parentless blocked tasks. A task with no parents has no dependency to wait for; promoting it only re-queues a deterministic crash. These tasks stay blocked until an operator runs kanban_unblock (e.g. after fixing the underlying issue: bad API key, missing skill, auth error).

has_parents = len(parents) > 0
if all(p["status"] in ("done", "archived") for p in parents):
    if cur_status == "blocked":
        if has_parents:
            # Original behavior: promote and reset counters
        else:
            # New: skip parentless blocked tasks entirely
            continue

2. Failure-counter reset guard (belt-and-suspenders)

When a blocked task with parents is promoted (because a parent just completed), consecutive_failures is still reset to 0 — the conditions that caused the failure may have changed. This preserves the original auto-recovery semantics (#40c1decb3) for tasks with a legitimate dependency chain.

What's NOT changed

Testing

  • 8 new tests in test_kanban_crash_loop.py covering all combinations:
    • Parentless circuit-breaker block stays blocked (×5 iterations, ×10 dispatch ticks)
    • Parentless task with manual kanban_block stays blocked (sticky + vacuous)
    • Parented blocked task with done parent auto-promotes (existing semantics preserved)
    • Parented blocked task with incomplete parent stays blocked
    • Parentless blocked task failure counter not reset
    • Unblocked parentless task becomes ready (operator can still recover)
    • Parentless todo task promotes normally (guard only affects blocked)
  • 1 updated test in test_kanban_blocked_sticky.py:
    • test_unblock_clears_sticky_state_and_lets_block_recover now uses a task with parents to match the new semantics
  • All 557 existing kanban tests pass (159 in test_kanban_db.py, 6 in test_kanban_blocked_sticky.py, plus others)

The dispatcher's recompute_ready() promoted parentless blocked tasks
back to 'ready' on every tick because all([]) is True (vacuous truth).
This defeated the circuit breaker: a task would crash, get auto-blocked
(gave_up event), then immediately be promoted back, have its
consecutive_failures counter reset to 0, and the cycle repeats forever.

The crash loop created hundreds of zombie processes per hour:
  gave_up → promote → claim → spawn → crash → gave_up → promote → ...

Two fixes:

1. Vacuous-truth guard: recompute_ready() now skips promotion of
   parentless blocked tasks entirely. A task with no parents has no
   dependency to wait for; promoting it only re-queues a deterministic
   crash. The task stays blocked until an operator runs
   kanban_unblock (e.g. after fixing the underlying issue like a bad
   API key or missing skill).

2. Failure-counter reset guard: When a blocked task WITH parents is
   promoted (because a parent just completed), consecutive_failures
   is reset to 0 as before — the conditions that caused the failure
   may have changed. This preserves the original auto-recovery semantics
   for tasks that have a legitimate dependency chain.

Parentless TODO tasks (status='todo') are still promoted to 'ready'
normally — only the blocked → ready transition is guarded for tasks
without parents.

Also updates test_kanban_blocked_sticky.py:
- test_unblock_clears_sticky_state_and_lets_block_recover now uses a
  task with parents, matching the new semantics where only parented
  blocked tasks auto-recover after circuit-breaker blocks.

New test file test_kanban_crash_loop.py (8 tests):
- test_parentless_circuit_breaker_block_is_not_promoted
- test_parentless_circuit_breaker_block_stays_blocked_across_dispatch_ticks
- test_parentless_task_with_manual_kanban_block_is_also_sticky
- test_blocked_task_with_done_parents_still_auto_promotes
- test_blocked_task_with_incomplete_parents_stays_blocked
- test_parentless_blocked_task_failure_counter_not_reset
- test_unblocked_parentless_task_becomes_ready
- test_parentless_todo_task_is_promoted_normally
@alt-glitch alt-glitch added type/bug Something isn't working comp/plugins Plugin system and bundled plugins P3 Low — cosmetic, nice to have labels May 21, 2026
@sirEven

sirEven commented May 21, 2026

Copy link
Copy Markdown
Contributor Author

CI failures are pre-existing on main

Both test failures in the CI run are pre-existing on main and unrelated to this PR (which only touches kanban_db.py, test_kanban_blocked_sticky.py, and test_kanban_crash_loop.py):

1. test_all_seven_plugins_present_in_registry

Root cause: Commit a0c031299 (by Jaaneek, May 19) added the xai web search provider plugin but did not update the test expectations. The plugin correctly follows the kind: backend auto-registration pattern — it should be in the list.

Fix already exists: Commit 2fb3c1b3 (by ethernet8023, May 20) on branches ethie/faster-tests and ethie/test-isolation-subprocess comprehensively addresses this: updated docstring, added xai to expected list, added capability flags, added parametrized test entries. It will merge when those branches land.

No action needed here — this is a branch-coordination issue, not a bug in our change.

2. test_wraps_stdout_and_stderr_with_mirror

Root cause: This test passes on main. The failure occurs on the sid/foundational-restructure branch where all hermes_cli.* imports were rewritten to hermes_agent.*. The re-export chain (hermes_agent.constants.get_hermes_homehermes_agent.cli.config.get_hermes_home) creates import-binding subtleties where monkeypatch.setattr on one re-export link does not propagate to the late-bound import inside _install_hangup_protection. The _HERMES_HOME_CACHE clearing in the test is dead code — the attribute does not exist on either branch.

No action needed here — this is the restructure branch team's problem, not ours.

Summary

Failure On main? Caused by this PR? Fix exists?
xai plugins test Yes (pre-existing) No Yes, on ethie/faster-tests
hangup protection test No (only on restructure branch) No Needs restructure branch work

All 557 kanban tests pass. The 24,628 passing tests confirm our change is clean.

@sirEven

sirEven commented Jun 11, 2026

Copy link
Copy Markdown
Contributor Author

Closing in favor of upstream fixes that landed after this PR was opened:

  • 6ab71d3fix(kanban): prevent infinite retry loop when worker exhausts iteration budget — adds a failures >= effective_limit guard in recompute_ready and preserves consecutive_failures across recovery, preventing the circuit-breaker bypass that caused the crash loop.
  • 8e5a685fix(kanban): align recompute_ready guard with breaker's configured failure_limit — follow-up ensuring the guard respects the dispatcher's configured kanban.failure_limit instead of hardcoded DEFAULT_FAILURE_LIMIT.

Both approaches prevent the infinite loop. The upstream fix is more permissive (allows one retry before the breaker trips permanently), while this PR was stricter (never promotes parentless blocked tasks). Since the upstream fix is already merged and covers the same ground, this PR is superseded.

@sirEven sirEven closed this Jun 11, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/plugins Plugin system and bundled plugins P3 Low — cosmetic, nice to have type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants