[Bug] kanban-worker exits cleanly (rc=0) on iteration-budget exhaustion without calling kanban_complete or kanban_block — protocol violation strands downstream tasks #23216

@QuarkAssistant

Summary

A kanban-worker subprocess that exhausts its iteration budget (max_turns) goes through the agent loop's "asking model to summarise" path and then exits with rc=0, without ever calling kanban_complete or kanban_block. The dispatcher correctly detects this as a protocol violation (hermes_cli/kanban_db.py:3127), but in production cron runs it surfaces as a confusing gave_up after 1 failure, with no clear recovery signal for the operator.

Environment

Real-world reproduction

A morning-cron-driven digest pipeline ran on 2026-05-10 at 07:18 CT. Lanes T1 through T8 completed cleanly. The T9 writer task (t_b1376310) was claimed by a kanban-worker subprocess (PID 13754) at 07:49 CT. The worker progressed through ~30 successful agent iterations, did initial preparation work (reading the upstream payload, validating evidence), then hit:

⚠️  Iteration budget reached (30/30) — response may be incomplete

The agent's final response listed unfinished steps:

Not completed:
- I did not yet assemble the final render-payload.json.
- I did not yet run workers.helpers.render_charts.
- I did not yet run workers.helpers.render_html.
- I did not yet generate Telegram text.
- I did not yet run workers.helpers.writer_postwrite.
- I did not yet complete the kanban task via complete_validated.

The worker process then exited rc=0. The dispatcher recorded:

event_kind = "protocol_violation"
error_text = "worker exited cleanly (rc=0) without calling kanban_complete or kanban_block — protocol violation"
gave_up { 'failures': 1, 'effective_limit': 1, 'limit_source': 'dispatcher' }

The downstream T10 deliverer task remained todo and never fired. The morning digest did not deliver to the user.

Root cause

run_agent.py:14232 is the iteration-exhaustion path:

f"⚠️ Iteration budget exhausted ({api_call_count}/{self.max_iterations}) "
"— asking model to summarise"

This path:

  1. Asks the model to produce a final summary message.
  2. Returns the summary as the conversation's final response.
  3. Exits the agent loop with rc=0.

The agent loop has no awareness of kanban-worker context. The kanban-worker contract (call kanban_complete or kanban_block before exiting) lives entirely in the kanban-worker SKILL prompt at skills/devops/kanban-worker/SKILL.md. Iteration-exhaustion bypasses the skill's contract because the model is given the summary directive directly by the agent loop, not by the skill.

The dispatcher (hermes_cli/kanban_db.py:3099-3170) detects this correctly but treats the protocol violation as a fatal error: it gives up after a single failure (effective_limit: 1) and leaves the operator no way to drive recovery.
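To make the detection rule concrete, here is a minimal sketch of how a dispatcher could classify a worker exit. It is not the code in hermes_cli/kanban_db.py; the WorkerExit type, the classify_exit function, and the "crashed" label are hypothetical names used only for illustration. Only the protocol_violation label and the two tool names come from the report above.

```python
from dataclasses import dataclass
from typing import Optional

# Tool calls that legitimately end a kanban-worker run (names from this issue).
TERMINAL_CALLS = ("kanban_complete", "kanban_block")


@dataclass
class WorkerExit:
    """Hypothetical record of how a worker subprocess ended."""
    returncode: int
    terminal_call: Optional[str]  # last terminal tool call seen, or None


def classify_exit(exit_info: WorkerExit) -> str:
    """Return an outcome label for a finished worker (illustrative only)."""
    if exit_info.terminal_call in TERMINAL_CALLS:
        # The worker honored the contract; the call itself is the outcome.
        return exit_info.terminal_call
    if exit_info.returncode == 0:
        # Clean exit without a terminal call: the case described in this issue.
        return "protocol_violation"
    # Non-zero exit without a terminal call (label is an assumption).
    return "crashed"
```

Under this sketch, the T9 run above corresponds to WorkerExit(returncode=0, terminal_call=None), which classifies as protocol_violation.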

Why the worker can't fix itself

The kanban-worker skill text already documents the contract clearly. But:

  • The model can't call kanban_block from inside the iteration-exhaustion summary because at that point the agent loop has already taken control of the prompt and is asking for a summary, not a final tool call.
  • Even if the model tried to call kanban_block from the summary path, the call could land in the summary text rather than as a real tool invocation. The worker would still exit with rc=0 and no block would be recorded, leaving the dispatcher confused either way.
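The difference between a real tool invocation and a mention in free text can be sketched as follows. The message shape here (a dict with "content" and an optional "tool_calls" list of dicts carrying a "name") is an assumption for illustration, not the agent loop's actual data structure, and made_terminal_call is a hypothetical helper.

```python
from typing import Any, Dict

TERMINAL_TOOLS = {"kanban_complete", "kanban_block"}


def made_terminal_call(message: Dict[str, Any]) -> bool:
    """True only if the message carries a structured terminal tool call.

    Text that merely mentions kanban_block does not count, because
    nothing is ever executed for it.
    """
    for call in message.get("tool_calls") or []:
        if call.get("name") in TERMINAL_TOOLS:
            return True
    return False


# A summary that talks about blocking, but invokes nothing.
summary_only = {"content": "I would call kanban_block here.", "tool_calls": None}

# A message that actually invokes the tool.
real_call = {"content": "", "tool_calls": [{"name": "kanban_block"}]}
```

With these two examples, made_terminal_call(summary_only) is False and made_terminal_call(real_call) is True, which is why a block "call" written into the summary text changes nothing from the dispatcher's point of view.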

Proposed fix shapes

Three reasonable fix surfaces, in order of invasiveness:

1. Runtime patch in run_agent.py

When the iteration budget is exhausted AND the agent is running under a kanban-worker context (detect via env var HERMES_KANBAN_TASK_ID or equivalent), auto-emit a kanban_block tool call as the final action before returning the summary. The block reason would be "iteration budget exhausted (N/N); state preserved at <workspace>".

Pros: most explicit, reliable.
Cons: cross-cuts the agent loop with kanban-specific behavior.
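A rough sketch of what this hook could look like, under stated assumptions: on_budget_exhausted, block_reason, and the emit_block callback are hypothetical names, and the real integration point in run_agent.py may look quite different. HERMES_KANBAN_TASK_ID is the env var suggested above ("or equivalent"), not a confirmed existing variable.

```python
import os
from typing import Callable, Mapping, Optional


def block_reason(used: int, limit: int, workspace: str) -> str:
    """Build the block reason string proposed in this issue."""
    return f"iteration budget exhausted ({used}/{limit}); state preserved at {workspace}"


def on_budget_exhausted(
    used: int,
    limit: int,
    workspace: str,
    emit_block: Callable[[str, str], None],
    env: Optional[Mapping[str, str]] = None,
) -> bool:
    """Emit a block if running as a kanban worker; return True if emitted.

    emit_block(task_id, reason) stands in for whatever mechanism would
    issue a real kanban_block tool call (hypothetical).
    """
    env = os.environ if env is None else env
    task_id = env.get("HERMES_KANBAN_TASK_ID")  # assumed env var name
    if not task_id:
        # Not a kanban-worker run: keep the existing summary-only behavior.
        return False
    emit_block(task_id, block_reason(used, limit, workspace))
    return True
```

Passing env and emit_block as parameters keeps the kanban-specific part small and testable, which may soften the cross-cutting concern somewhat.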

2. Dispatcher policy change in hermes_cli/kanban_db.py

Map the protocol_violation event to auto_blocked instead of gave_up on first occurrence (the effective_limit: 1 path). This way the task ends up explicitly blocked with a clear reason, rather than gave_up which suggests the dispatcher gave up on retrying.

Pros: smallest surface area, no run_agent changes.
Cons: doesn't fix the underlying contract violation, just relabels its outcome. It does, however, produce the right operator UX: the task is blocked, can be unblocked, and the dispatcher will then retry.
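The policy change could be as small as the following sketch. The next_status function and the "retry" label are hypothetical; the actual dispatcher state machine in hermes_cli/kanban_db.py is not shown in this issue, so this only illustrates the intended mapping.

```python
def next_status(event_kind: str, failures: int, effective_limit: int) -> str:
    """Decide a task's next status after a failed worker run (illustrative)."""
    if failures < effective_limit:
        # Retry budget remains; nothing changes for this case.
        return "retry"
    if event_kind == "protocol_violation":
        # Proposed: park the task as blocked so an operator can unblock it.
        return "auto_blocked"
    # All other failure kinds keep the current terminal outcome.
    return "gave_up"
```

For the run reported above (failures=1, effective_limit=1, event_kind="protocol_violation"), this sketch yields auto_blocked instead of gave_up.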

3. Skill prompt adjustment in skills/devops/kanban-worker/SKILL.md

Add explicit text: "If you sense you are approaching max_turns and have not yet completed the task, your last act must be a real kanban_block tool call, not a free-text response."

Pros: zero runtime change.
Cons: depends on model compliance; the summary prompt is added by the agent loop, not the skill, and the model may not honor the skill instruction once the summary directive is in play.

Asks

  1. Confirm whether the protocol-violation pattern is intended behavior or a known gap.
  2. Pick a fix shape (or combination) you'd accept upstream.
  3. If a runtime patch is welcome, I am willing to send a PR.

Related

  • Detection of the protocol violation: hermes_cli/kanban_db.py:3119-3170 (correct, just under-acted-upon).
  • Iteration-exhaustion summary path: run_agent.py:14232.
  • Per-task max_retries override (PR #21330, "feat(kanban): per-task max_retries override (supersedes #20972)", merged): provides a control surface for retry policy but doesn't address the violation path itself.

Labels

  • P3 (Low: cosmetic, nice to have)
  • comp/cron (Cron scheduler and job management)
  • comp/plugins (Plugin system and bundled plugins)
  • type/bug (Something isn't working)
