kanban dispatcher: add circuit-breaker for repeated worker bails with identical block reason #29320

@akamel001

Feature request: circuit-breaker on repeated worker bails with identical block reason

Summary

When a kanban worker bails on an external blocker (e.g. saturated CI runners, third-party API down, upstream dependency PR not merged), the dispatcher re-claims the task on its next tick. If the external blocker persists, the worker bails again with the same block reason. This can loop for hundreds of cycles before the human-in-the-loop notices, burning provider quota and flooding the kanban event history with identical "still blocked, unchanged" rows.

Repro

  1. Set up a kanban task whose work depends on an external condition the worker cannot fix (e.g. PR waiting on CI checks that are queued indefinitely because runners are saturated)
  2. Worker claims, observes the unchanged external condition, runs kanban_block with a reason like "CI queued, 0 progress, unchanged"
  3. Dispatcher's next tick re-claims, worker observes the same condition, blocks again with the same reason
  4. Loop continues until something external changes or a human intervenes

Expected

After N consecutive bails (suggested default: N=5) with substantially identical block reasons, the dispatcher should:

  • Auto-pause the task (status=blocked, no auto re-claim)
  • Force an orchestrator handoff (or escalate if a handoff already exists and is past its SLA)
  • Surface in hermes kanban list with a distinct diagnostic flag (e.g. circuit_open)

This is distinct from max_retries (which counts run failures, not voluntary bails) and from the triage-watcher handoff escalation (which triggers at 60min but does not stop the re-claim loop).

Actual

Observed in production: ~230 worker spawn cycles across 4 tasks over a ~7 hour window during a CI runner saturation event. Each spawn was a full agent boot + context load + situation re-discovery, all bailing in 24-30 seconds on the same unchanged external condition. The triage watcher did correctly escalate orchestrator handoffs at the 60-minute mark, but the dispatcher kept re-claiming the source tasks because nothing stopped it.

Sample event sequence (one task, abbreviated):

#107 blocked → "98th consecutive run — PR still queued, 0 progress, unchanged"
#108 blocked → "99th consecutive run — PR still queued, 0 progress, unchanged"
#109 blocked → "100th run — same CI infra wall, unchanged"
... (continues)
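
The sample reasons above differ mainly in the embedded run counter, so a literal string comparison would never see them as repeats. A minimal sketch of one way to compare them, assuming counters and punctuation can be stripped before matching; `normalize_reason`, `reasons_match`, and the 0.8 threshold are hypothetical names and values, not part of the existing kanban code:

```python
import re
from difflib import SequenceMatcher

# Hypothetical patterns: drop run counters such as "98th" or "0", then
# drop anything that is not a lowercase letter or whitespace.
_COUNTER = re.compile(r"\b\d+(?:st|nd|rd|th)?\b")
_NON_LETTER = re.compile(r"[^a-z\s]+")


def normalize_reason(reason: str) -> str:
    """Lowercase, remove counters and punctuation, collapse whitespace."""
    text = reason.lower()
    text = _COUNTER.sub(" ", text)
    text = _NON_LETTER.sub(" ", text)
    return " ".join(text.split())


def reasons_match(a: str, b: str, threshold: float = 0.8) -> bool:
    """Fuzzy-compare two block reasons on their normalized forms.

    The threshold is an assumed starting point and would need tuning
    against real block reasons.
    """
    ratio = SequenceMatcher(None, normalize_reason(a), normalize_reason(b)).ratio()
    return ratio >= threshold
```

Under this normalization, events #107 and #108 reduce to the same string. Event #109 is worded differently, so whether it counts as a repeat depends on the chosen threshold, which is why the matching rule probably needs to be configurable.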

Suspected cause

kanban_block records the reason but does not track, per task, how many consecutive blocks share the same reason. The dispatcher's claim selection only considers status (ready, or blocked with the unblock time passed) and does not penalize tasks that have repeatedly bailed on the same external condition.
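
To make the gap concrete, here is a sketch of a claim-eligibility check with the missing guard added. It is not the dispatcher's actual code: the `Task` shape, the `unblock_at` field, and `is_claimable` are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Task:
    # Hypothetical task shape; the real schema may differ.
    status: str                              # e.g. "ready" or "blocked"
    unblock_at: Optional[datetime] = None    # assumed re-claim time for blocked tasks
    consecutive_identical_bails: int = 0     # the counter proposed in this issue


def is_claimable(task: Task, now: datetime, max_identical_bails: int = 5) -> bool:
    """Return True if the dispatcher may claim the task on this tick."""
    # Proposed guard: an open circuit overrides the status-based rules below.
    if task.consecutive_identical_bails >= max_identical_bails:
        return False
    if task.status == "ready":
        return True
    if task.status == "blocked":
        # Suspected current behavior: blocked tasks become claimable
        # again once their unblock time has passed.
        return task.unblock_at is not None and task.unblock_at <= now
    return False
```

Without the first check, a blocked task whose unblock time keeps expiring is re-claimed indefinitely, which matches the loop described under Actual.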

Suggested implementation sketch

  • Track consecutive_identical_bails counter on the task, incremented when a new block event's reason is fuzzy-matched to the prior one (or simply substring-matched on a normalized form)
  • Reset counter when the block reason changes substantively, when a comment is added by a non-worker (human/orchestrator intervention), or when the task transitions through done
  • At consecutive_identical_bails >= N (default 5, configurable), refuse to re-claim and emit a circuit_open diagnostic + force a triage-watcher handoff if one doesn't exist

Workaround

Operator-side: orchestrator must manually monitor blocked tasks and intervene before the loop runs hundreds of cycles. Encoded as a checklist in the kanban-orchestrator skill, but easy to miss when the orchestrator is on a long side-quest.

Related

  • Triage watcher orchestrator-handoff SLA (60min) — works correctly but is downstream of the loop, not the loop itself
  • max_retries on tasks — counts failures, not voluntary bails, so doesn't trip on this pattern

Metadata

Assignees

No one assigned

    Labels

    P3 (Low: cosmetic, nice to have)
    comp/plugins (Plugin system and bundled plugins)
    type/feature (New feature or request)
