fix(kanban): prevent infinite crash loop for parentless blocked tasks#29732
fix(kanban): prevent infinite crash loop for parentless blocked tasks#29732sirEven wants to merge 1 commit into
Conversation
The dispatcher's recompute_ready() promoted parentless blocked tasks back to 'ready' on every tick because all([]) is True (vacuous truth). This defeated the circuit breaker: a task would crash, get auto-blocked (gave_up event), then immediately be promoted back, have its consecutive_failures counter reset to 0, and the cycle repeats forever. The crash loop created hundreds of zombie processes per hour: gave_up → promote → claim → spawn → crash → gave_up → promote → ... Two fixes: 1. Vacuous-truth guard: recompute_ready() now skips promotion of parentless blocked tasks entirely. A task with no parents has no dependency to wait for; promoting it only re-queues a deterministic crash. The task stays blocked until an operator runs kanban_unblock (e.g. after fixing the underlying issue like a bad API key or missing skill). 2. Failure-counter reset guard: When a blocked task WITH parents is promoted (because a parent just completed), consecutive_failures is reset to 0 as before — the conditions that caused the failure may have changed. This preserves the original auto-recovery semantics for tasks that have a legitimate dependency chain. Parentless TODO tasks (status='todo') are still promoted to 'ready' normally — only the blocked → ready transition is guarded for tasks without parents. Also updates test_kanban_blocked_sticky.py: - test_unblock_clears_sticky_state_and_lets_block_recover now uses a task with parents, matching the new semantics where only parented blocked tasks auto-recover after circuit-breaker blocks. New test file test_kanban_crash_loop.py (8 tests): - test_parentless_circuit_breaker_block_is_not_promoted - test_parentless_circuit_breaker_block_stays_blocked_across_dispatch_ticks - test_parentless_task_with_manual_kanban_block_is_also_sticky - test_blocked_task_with_done_parents_still_auto_promotes - test_blocked_task_with_incomplete_parents_stays_blocked - test_parentless_blocked_task_failure_counter_not_reset - test_unblocked_parentless_task_becomes_ready - test_parentless_todo_task_is_promoted_normally
CI failures are pre-existing on
|
| Failure | On main? |
Caused by this PR? | Fix exists? |
|---|---|---|---|
| xai plugins test | Yes (pre-existing) | No | Yes, on ethie/faster-tests |
| hangup protection test | No (only on restructure branch) | No | Needs restructure branch work |
All 557 kanban tests pass. The 24,628 passing tests confirm our change is clean.
|
Closing in favor of upstream fixes that landed after this PR was opened:
Both approaches prevent the infinite loop. The upstream fix is more permissive (allows one retry before the breaker trips permanently), while this PR was stricter (never promotes parentless blocked tasks). Since the upstream fix is already merged and covers the same ground, this PR is superseded. |
Problem
The kanban dispatcher's
recompute_ready()promoted parentless blocked tasks back toreadyon every tick becauseall([]) is True(vacuous truth). This defeated the circuit breaker: a task would crash, get auto-blocked (gave_upevent), then immediately be promoted back, have itsconsecutive_failurescounter reset to 0, and the cycle repeats forever.Crash loop trace
Observed in production: 509 crashes on a single task, 8,343
gave_upevents, 500+ zombie processes created every 60 seconds. Each task withconsecutive_failuresreset to 1 on every cycle becausegave_uptrips atfailure_limit=1(protocol violations / systemic failures), thenrecompute_readyimmediately promotes and resets the counter.Root cause
In
recompute_ready()(kanban_db.py line ~2125):blockedtasks are considered for promotion (correct for dependent tasks whose parent just completed)_has_sticky_block()correctly skips worker-initiatedkanban_blockcalls, but circuit-breaker blocks (gave_upevent, non-sticky) pass throughparents = conn.execute("SELECT ... FROM task_links WHERE child_id = ?", (task_id,)).fetchall()returns[], andall([]) == True— vacuous truth promotes themconsecutive_failures = 0on promotion, so the breaker never accumulatesFix
Two changes:
1. Vacuous-truth guard (primary fix)
recompute_ready()now skips promotion of parentless blocked tasks. A task with no parents has no dependency to wait for; promoting it only re-queues a deterministic crash. These tasks stayblockeduntil an operator runskanban_unblock(e.g. after fixing the underlying issue: bad API key, missing skill, auth error).2. Failure-counter reset guard (belt-and-suspenders)
When a blocked task with parents is promoted (because a parent just completed),
consecutive_failuresis still reset to 0 — the conditions that caused the failure may have changed. This preserves the original auto-recovery semantics (#40c1decb3) for tasks with a legitimate dependency chain.What's NOT changed
todotasks still promote toreadynormally (the guard only applies toblocked→readytransitions)kanban_blockstays sticky (the kanban: dispatcher auto-promotes blocked task → respawn worker → protocol_violation loop #28712 guard is untouched)kanban_unblockstill works — operators can manually unblock parentless tasks after fixing the underlying issueTesting
test_kanban_crash_loop.pycovering all combinations:kanban_blockstays blocked (sticky + vacuous)todotask promotes normally (guard only affectsblocked)test_kanban_blocked_sticky.py:test_unblock_clears_sticky_state_and_lets_block_recovernow uses a task with parents to match the new semanticstest_kanban_db.py, 6 intest_kanban_blocked_sticky.py, plus others)