Feature request: circuit-breaker on repeated worker bails with identical block reason
Summary
When a kanban worker bails on an external blocker (e.g. saturated CI runners, third-party API down, upstream dependency PR not merged), the dispatcher re-claims the task on its next tick. If the external blocker persists, the worker bails again with the same block reason. This can loop for hundreds of cycles before the human-in-the-loop notices, burning provider quota and flooding the kanban event history with identical "still blocked, unchanged" rows.
Repro
- Set up a kanban task whose work depends on an external condition the worker cannot fix (e.g. PR waiting on CI checks that are queued indefinitely because runners are saturated)
- Worker claims, observes the unchanged external condition, runs kanban_block with a reason like "CI queued, 0 progress, unchanged"
- Dispatcher's next tick re-claims, worker observes the same condition, blocks again with the same reason
- Loop continues until something external changes or a human intervenes
Expected
After N consecutive bails (suggest N=5) with substantially-identical block reasons, the dispatcher should:
- Auto-pause the task (status=blocked, no auto re-claim)
- Force an orchestrator handoff (or escalate if a handoff already exists and is past its SLA)
- Surface in hermes kanban list with a distinct diagnostic flag (e.g. circuit_open)
This is distinct from max_retries (which counts run failures, not voluntary bails) and from the triage-watcher handoff escalation (which triggers at 60min but does not stop the re-claim loop).
Actual
Observed in production: ~230 worker spawn cycles across 4 tasks over a ~7 hour window during a CI runner saturation event. Each spawn was a full agent boot + context load + situation re-discovery, all bailing in 24-30 seconds on the same unchanged external condition. The triage watcher did correctly escalate orchestrator handoffs at the 60-minute mark, but the dispatcher kept re-claiming the source tasks because nothing stopped it.
Sample event sequence (one task, abbreviated):
#107 blocked → "98th consecutive run — PR still queued, 0 progress, unchanged"
#108 blocked → "99th consecutive run — PR still queued, 0 progress, unchanged"
#109 blocked → "100th run — same CI infra wall, unchanged"
... (continues)
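The rows above differ mainly in an embedded run counter, so an exact string comparison would never treat them as the same reason. Below is a minimal sketch of one way to compare them, assuming block reasons are plain strings. The names normalize_reason and similar_reason are hypothetical, and the 0.6 threshold is an untuned guess, not a measured value. Reworded reasons (like #109 above) may or may not match depending on that threshold.

```python
import re
from difflib import SequenceMatcher

# Matches standalone numbers and ordinals such as "0", "98th", "100th".
_COUNTER = re.compile(r"\b\d+(?:st|nd|rd|th)?\b")


def normalize_reason(reason: str) -> str:
    """Lowercase, drop run counters, and collapse punctuation and whitespace."""
    text = _COUNTER.sub(" ", reason.lower())
    text = re.sub(r"[^a-z\s]+", " ", text)  # keep letters and whitespace only
    return " ".join(text.split())


def similar_reason(a: str, b: str, threshold: float = 0.6) -> bool:
    """True when the normalized reasons are at least `threshold` similar."""
    ratio = SequenceMatcher(None, normalize_reason(a), normalize_reason(b)).ratio()
    return ratio >= threshold


# Two bails that differ only in the run counter normalize to the same text.
first = "98th run: CI queued, 0 progress, unchanged"
second = "99th run: CI queued, 0 progress, unchanged"
assert normalize_reason(first) == normalize_reason(second)
assert similar_reason(first, second)
```

A plain equality check on the normalized form is cheaper and may be enough. The fuzzy ratio only matters if workers rephrase the reason between runs.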
Suspected cause
kanban_block records the reason but doesn't track consecutive-with-same-reason counts on the task. Dispatcher's claim selection only considers status=ready|blocked-with-unblock-time-passed and doesn't penalize tasks that have repeatedly bailed on the same external condition.
Suggested implementation sketch
- Track consecutive_identical_bails counter on the task, incremented when a new block event's reason is fuzzy-matched to the prior one (or simply substring-matched on a normalized form)
- Reset counter when the block reason changes substantively, when a comment is added by a non-worker (human/orchestrator intervention), or when the task transitions through done
- At consecutive_identical_bails >= N (default 5, configurable), refuse to re-claim and emit a circuit_open diagnostic + force a triage-watcher handoff if one doesn't exist
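The steps above could look roughly like the following. This is a sketch under stated assumptions, not the actual dispatcher code: the Task fields, record_block, reset_on_intervention, and is_claimable are all hypothetical names, and the unblock-time check in claim selection is omitted. It counts the first bail as 1 so the breaker trips on the Nth identical bail, and it uses a crude "drop tokens containing digits" normalization in place of real fuzzy matching.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    # Hypothetical task record; the real schema may differ.
    status: str = "ready"
    last_block_reason: Optional[str] = None
    consecutive_identical_bails: int = 0
    circuit_open: bool = False


def _normalize(reason: str) -> str:
    """Lowercase and drop tokens containing digits (run counters like '98th')."""
    words = [w for w in reason.lower().split() if not any(c.isdigit() for c in w)]
    return " ".join(words)


def record_block(task: Task, reason: str, threshold: int = 5) -> bool:
    """Record a voluntary bail; return True once the circuit is open."""
    normalized = _normalize(reason)
    if normalized == task.last_block_reason:
        task.consecutive_identical_bails += 1
    else:
        # Reason changed substantively: start a new streak at 1.
        task.consecutive_identical_bails = 1
    task.last_block_reason = normalized
    task.status = "blocked"
    if task.consecutive_identical_bails >= threshold:
        task.circuit_open = True  # caller would also force the handoff here
    return task.circuit_open


def reset_on_intervention(task: Task) -> None:
    """Call when a human or orchestrator comments, or the task passes through done."""
    task.last_block_reason = None
    task.consecutive_identical_bails = 0
    task.circuit_open = False


def is_claimable(task: Task) -> bool:
    """Claim filter with the breaker added; unblock-time handling is omitted."""
    return task.status in {"ready", "blocked"} and not task.circuit_open
```

Keeping the counter on the task rather than deriving it from event history means the dispatcher can check is_claimable without scanning past block events on every tick.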
Workaround
Operator-side: orchestrator must manually monitor blocked tasks and intervene before the loop runs hundreds of cycles. Encoded as a checklist in the kanban-orchestrator skill, but easy to miss when the orchestrator is on a long side-quest.
Related
- Triage watcher orchestrator-handoff SLA (60min) — works correctly but is downstream of the loop, not the loop itself
- max_retries on tasks — counts failures, not voluntary bails, so doesn't trip on this pattern