Summary
A kanban-worker subprocess that hits max_turns (iteration budget exhaustion) exits with rc=0 after the agent loop's "asking model to summarise" path, without ever calling kanban_complete or kanban_block. The dispatcher correctly detects this as a protocol violation (hermes_cli/kanban_db.py:3127), but in production crons this surfaces as a confusing gave_up after 1 failure, with no clear recovery signal for the operator.
Environment
Kanban-driven workload: a multi-stage DAG of profile-specific worker tasks (e.g. digest-writer).
Worker config: agent.max_turns: 30 in profile config (per-profile cap, lower than global 100).
Real-world reproduction
A morning-cron-driven digest pipeline ran on 2026-05-10 at 07:18 CT. Lanes T1 through T8 completed cleanly. The T9 writer task (t_b1376310) was claimed by a kanban-worker subprocess (PID 13754) at 07:49 CT. The worker progressed through ~30 successful agent iterations, did its initial preparation work (reading the upstream payload, validating evidence), then hit:
⚠️ Iteration budget reached (30/30) — response may be incomplete
The agent's final response listed unfinished steps:
Not completed:
- I did not yet assemble the final render-payload.json.
- I did not yet run workers.helpers.render_charts.
- I did not yet run workers.helpers.render_html.
- I did not yet generate Telegram text.
- I did not yet run workers.helpers.writer_postwrite.
- I did not yet complete the kanban task via complete_validated.
The worker process then exited rc=0. The dispatcher recorded:
The downstream T10 deliverer task remained todo and never fired. The morning digest did not deliver to the user.
Root cause
run_agent.py:14232 is the iteration-exhaustion path:
f"⚠️ Iteration budget exhausted ({api_call_count}/{self.max_iterations}) "
"— asking model to summarise"
This path:
Asks the model to produce a final summary message.
Returns the summary as the conversation's final response.
Exits the agent loop with rc=0.
The agent loop has no awareness of kanban-worker context. The kanban-worker contract (call kanban_complete or kanban_block before exiting) lives entirely in the kanban-worker SKILL prompt at skills/devops/kanban-worker/SKILL.md. Iteration-exhaustion bypasses the skill's contract because the model is given the summary directive directly by the agent loop, not by the skill.
The dispatcher (hermes_cli/kanban_db.py:3099-3170) detects this correctly but treats the protocol violation as a fatal error: with effective_limit: 1 it gives up after a single failure, leaving no path for operator-driven recovery.
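The detection rule described above can be sketched as follows. This is an illustration only, not the code in hermes_cli/kanban_db.py: the names WorkerExit and classify_exit, and the labels "ok" and "crashed", are hypothetical. Only the rule itself (rc=0 without kanban_complete or kanban_block is a protocol violation) comes from this report.

```python
from dataclasses import dataclass, field

# Tool calls that legitimately end a kanban-worker run, per the worker contract.
TERMINAL_CALLS = {"kanban_complete", "kanban_block"}


@dataclass
class WorkerExit:
    """Hypothetical record of how a worker subprocess ended."""
    returncode: int
    # Names of the tools the worker actually invoked during the run.
    tool_calls: list[str] = field(default_factory=list)


def classify_exit(exit_info: WorkerExit) -> str:
    """Classify a finished worker run (labels are illustrative)."""
    if exit_info.returncode != 0:
        return "crashed"
    if TERMINAL_CALLS.intersection(exit_info.tool_calls):
        return "ok"
    # Clean exit, but the task was neither completed nor blocked.
    return "protocol_violation"
```

Under this sketch, the T9 run above (rc=0, no terminal call) would be classified as protocol_violation, which matches what the dispatcher reported.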
Why the worker can't fix itself
The kanban-worker skill text already documents the contract clearly. But:
The model can't call kanban_block from inside the iteration-exhaustion summary because at that point the agent loop has already taken control of the prompt and is asking for a summary, not a final tool call.
A model attempting to call kanban_block from the summary path would still emit rc=0 if the block call landed in the summary text rather than as a real tool invocation, leaving the dispatcher confused either way.
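The difference between naming kanban_block in summary text and actually invoking it can be made concrete with a small sketch. The AssistantMessage shape and the called_tool helper are hypothetical simplifications, not the real message types used by the agent loop.

```python
from dataclasses import dataclass, field


@dataclass
class AssistantMessage:
    """Hypothetical, simplified shape of a model response."""
    content: str = ""
    # Structured tool invocations; each entry carries at least a "name" key.
    tool_calls: list[dict] = field(default_factory=list)


def called_tool(message: AssistantMessage, name: str) -> bool:
    """True only if the tool was invoked as a structured call, not merely mentioned."""
    return any(call.get("name") == name for call in message.tool_calls)


# A summary that only talks about blocking does not satisfy the contract.
summary = AssistantMessage(content="I should call kanban_block because I ran out of turns.")
# A real invocation does.
real_call = AssistantMessage(tool_calls=[{"name": "kanban_block", "arguments": {}}])
```

Here called_tool(summary, "kanban_block") is false while called_tool(real_call, "kanban_block") is true, which is why a block request that lands in the summary text leaves the task in the same state as no request at all.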
Proposed fix shapes
Three reasonable fix surfaces, in order of invasiveness:
1. Runtime patch in run_agent.py
When the iteration budget is exhausted AND the agent is running under a kanban-worker context (detect via env var HERMES_KANBAN_TASK_ID or equivalent), auto-emit a kanban_block tool call as the final action before returning the summary. The block reason would be "iteration budget exhausted (N/N); state preserved at <workspace>".
Pros: most explicit, reliable.
Cons: cross-cuts the agent loop with kanban-specific behavior.
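A minimal sketch of what option 1 might look like, assuming the worker context is signalled through HERMES_KANBAN_TASK_ID as suggested above. The function name exhaustion_block_call and the shape of the returned call (including the task_id and reason argument names) are hypothetical; the real kanban_block schema may differ.

```python
import os


def exhaustion_block_call(api_call_count, max_iterations, workspace, env=None):
    """Build a kanban_block call for an exhausted worker, or None outside a worker.

    `env` defaults to the process environment; passing a dict makes this testable.
    """
    env = os.environ if env is None else env
    task_id = env.get("HERMES_KANBAN_TASK_ID")
    if not task_id:
        # Not running under a kanban worker: keep the existing summary-only path.
        return None
    reason = (
        f"iteration budget exhausted ({api_call_count}/{max_iterations}); "
        f"state preserved at {workspace}"
    )
    # Hypothetical call shape; the agent loop would dispatch this before summarising.
    return {"name": "kanban_block", "arguments": {"task_id": task_id, "reason": reason}}
```

Keeping the kanban check behind a single helper like this would limit how far kanban-specific behavior spreads into the agent loop, which is the main cost listed above.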
2. Dispatcher policy change in hermes_cli/kanban_db.py
Map the protocol_violation event to auto_blocked instead of gave_up on first occurrence (the effective_limit: 1 path). This way the task ends up explicitly blocked with a clear reason, rather than gave_up, which suggests the dispatcher has abandoned the task.
Pros: smallest surface area, no run_agent changes.
Cons: doesn't fix the underlying contract violation, just relabels its outcome. But: produces the right operator UX (task is blocked, can be unblock'd, dispatcher will retry).
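Option 2 amounts to a small change in how a failure is mapped to a task outcome. The sketch below is an assumption about the shape of that policy, not the dispatcher's actual code: outcome_for_failure and the "retry" label are hypothetical, while protocol_violation, auto_blocked, gave_up and effective_limit are the names used in this report.

```python
def outcome_for_failure(event: str, failures: int, effective_limit: int) -> str:
    """Decide what happens to a task after a failed worker run (illustrative)."""
    if failures < effective_limit:
        # Budget left: let the dispatcher try again.
        return "retry"
    if event == "protocol_violation":
        # Proposed change: park the task as blocked so an operator can unblock it.
        return "auto_blocked"
    # Any other failure keeps today's behavior once the limit is reached.
    return "gave_up"
```

With effective_limit set to 1, the first protocol violation already reaches the limit, so the task lands in auto_blocked rather than gave_up.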
3. Skill prompt adjustment in skills/devops/kanban-worker/SKILL.md
Add explicit text: "If you sense you are approaching max_turns and have not yet completed the task, your last act must be a real kanban_block tool call, not a free-text response."
Pros: zero runtime change.
Cons: depends on model compliance; the summary prompt is added by the agent loop, not the skill, and the model may not honor the skill instruction once the summary directive is in play.
Asks
Confirm the protocol-violation pattern is intended behavior or a known gap.
Pick a fix shape (or combination) you'd accept upstream.
If a runtime patch is welcome, willing to send a PR.
Related
Detection of the protocol violation: hermes_cli/kanban_db.py:3119-3170 (correct, just under-acted-upon).
Iteration-exhaustion path: run_agent.py:14232.
Per-task max_retries override (PR #21330, "feat(kanban): per-task max_retries override (supersedes #20972)", merged): provides a control surface for retry policy but doesn't address the violation path itself.
Environment baseline: eeef486 plus two local cherry-picks: aaa700c65 (PR #12953, "fix(codex): avoid custom keepalive transport on chatgpt backend", keepalive bypass) and 4ce6c96e2 (PR #19485, "fix(auxiliary): resolve provider/model from live runtime, not stale config", runtime TLS).