Summary
Kanban worker exits cleanly (rc=0) with protocol_violation when the agent ends its final turn with a text response (finish_reason=stop) instead of calling kanban_complete or kanban_block. There is no nudge back to the agent, no fallback in the worker, and the dispatcher counts each occurrence as a consecutive_failures tick — so after the retry limit the task is hard-blocked despite the agent having completed (or partially completed) substantive work.
Environment
- Hermes version: v0.13.0 (2026.5.7)
- Platform: macOS (Darwin 25.3.0)
- Provider:
openai-codex (ChatGPT subscription OAuth)
- Model:
gpt-5.3-codex
- Board: custom
legal board on a single-host dir workspace (workspace: dir @ /path/to/repo)
- Worker invocation: dispatched by
ai.hermes.gateway launchd service (60s tick)
Reproduction
- Create a kanban task that asks the agent to author a substantial markdown file (in this repro: a California family-law authority note, ~5 KB target with structured sections).
- Gateway dispatcher claims the task and spawns a worker process.
- The agent works through ~28 tool turns / ~29 API calls over ~3 minutes —
search_files, read_file, terminal. One tool error in this trace (pdftotext: command not found, exit_code 127) but the agent recovered.
- Final API call returns:
Turn ended: reason=text_response(finish_reason=stop)
model=gpt-5.3-codex api_calls=29/60 budget=29/60 tool_turns=28
last_msg_role=assistant response_len=668
session=20260516_163300_efbf57
- The worker proceeds with post-turn housekeeping — the
bg-review skill-library updater is kicked off, terminal environment is cleaned up — then exits rc=0.
- The agent never called
kanban_complete or kanban_block at any point in the session, and never wrote any file either.
- Dispatcher logs:
protocol_violation: worker exited cleanly (rc=0) without calling
kanban_complete or kanban_block — protocol violation
and ticks the failure counter. After 3 such occurrences the task is set to blocked with diagnostic Agent crash x3.
In the original case I hit, the dispatcher even applied effective_limit: 1, limit_source: 'dispatcher' on the first failure (not the max-retries: 2 from the task config), making it harder to recover without manual unblock.
Why this is a worker-side bug, not a model bug
The agent reached finish_reason=stop with a 668-character text response — from the agent's perspective, the turn ended normally. There is no signal in the conversation history that "you ended your turn without checkpointing the kanban task." The worker's protocol enforcement is invisible to the agent itself.
I observed three different root causes for this same dispatcher-level symptom across three runs of the same task:
- Workspace contention — worker spawned while another worker for an earlier task was still using the same
dir workspace. Crashed in <60s.
- Model auth —
config.yaml default had been flipped to a model not available on the user's ChatGPT account (gpt-5.5-codex); HTTP 400 from the codex backend was non-retryable, the configured anthropic fallback failed because the auxiliary client only reads auth.json (not the fallback_providers block in config.yaml), so the worker exited rc=0.
- Text-response-without-checkpoint — described above; agent finished its turn cleanly and exited.
All three surface to the dispatcher with the identical protocol_violation string, which made root-cause diagnosis significantly slower than it needed to be.
Suggested fixes (any one would help)
-
Inject a checkpoint reminder on finish_reason=stop without prior checkpoint. Before the worker exits, if the agent ended its turn with a text response and never called kanban_complete / kanban_block, inject a <system-reminder>-style message back into the conversation: "You ended your turn without calling kanban_complete or kanban_block. Please call the appropriate one to checkpoint your work." Give the agent one more turn to comply.
-
Differentiate the dispatcher error message. protocol_violation should at minimum distinguish between:
- "worker process crashed unexpectedly" (true crash)
- "agent reached
finish_reason=stop without calling protocol tool" (clean exit, no checkpoint)
- "model/auth error caused immediate exit" (worker never got a turn)
Different upstream causes deserve different remediation paths.
-
Render the checkpoint requirement at every turn, not just the initial prompt. Kanban-work system prompts typically describe the checkpoint requirement in the opening text — by turn 28 it's outside the model's effective recency window for instruction-following. A short, persistent reminder appended to each user/tool turn (à la Claude Code's <system-reminder> blocks) would substantially reduce this failure mode.
-
Don't auto-block on the first checkpoint-omission failure. The dispatcher's effective_limit: 1 override of the task's max-retries: 2 makes this failure mode unrecoverable without manual unblock. Where the worker exit shape is "agent thinks it's done, just forgot to checkpoint," a single automatic retry with an injected reminder is much more useful than immediate hard-block.
Logs
I have full agent.log and gateway.log excerpts containing all three failure shapes (~250 KB each, with matter-specific paths). Happy to attach a redacted subset on request; the session IDs and timing above should be enough to triangulate on the Hermes side. Relevant session IDs:
20260516_145050_c0bdcb — failure shape 1 (workspace contention + model auth, mixed signal)
20260516_152454_13e840 — failure shape 2 (clean model auth failure on gpt-5.5-codex)
20260516_163300_efbf57 — failure shape 3 (clean gpt-5.3-codex run, agent finished with text response without checkpoint — the canonical repro of this issue)
Workaround
For now: if you suspect this failure mode (worker ran for non-trivial wall-clock time, tool_turns > 5, but no file changes on disk), assume the agent did exploratory work and decided "I'm done" without checkpointing. Either:
hermes kanban --board <slug> unblock <task_id> and let the dispatcher retry, OR
- Transfer the work to another agent / do it manually, then
unblock → complete --result "..." --metadata '{...}'. Note: complete doesn't accept --note; use comment first if you want a narrative record.
Thanks for the great tool — this is a sharp edge but the kanban pattern itself is excellent and worth the polish.
Summary
Kanban worker exits cleanly (rc=0) with
protocol_violationwhen the agent ends its final turn with a text response (finish_reason=stop) instead of callingkanban_completeorkanban_block. There is no nudge back to the agent, no fallback in the worker, and the dispatcher counts each occurrence as aconsecutive_failurestick — so after the retry limit the task is hard-blocked despite the agent having completed (or partially completed) substantive work.Environment
openai-codex(ChatGPT subscription OAuth)gpt-5.3-codexlegalboard on a single-hostdirworkspace (workspace: dir @ /path/to/repo)ai.hermes.gatewaylaunchd service (60s tick)Reproduction
search_files,read_file,terminal. One tool error in this trace (pdftotext: command not found, exit_code 127) but the agent recovered.bg-reviewskill-library updater is kicked off, terminal environment is cleaned up — then exits rc=0.kanban_completeorkanban_blockat any point in the session, and never wrote any file either.blockedwith diagnosticAgent crash x3.In the original case I hit, the dispatcher even applied
effective_limit: 1, limit_source: 'dispatcher'on the first failure (not themax-retries: 2from the task config), making it harder to recover without manualunblock.Why this is a worker-side bug, not a model bug
The agent reached
finish_reason=stopwith a 668-character text response — from the agent's perspective, the turn ended normally. There is no signal in the conversation history that "you ended your turn without checkpointing the kanban task." The worker's protocol enforcement is invisible to the agent itself.I observed three different root causes for this same dispatcher-level symptom across three runs of the same task:
dirworkspace. Crashed in <60s.config.yamldefault had been flipped to a model not available on the user's ChatGPT account (gpt-5.5-codex); HTTP 400 from the codex backend was non-retryable, the configured anthropic fallback failed because the auxiliary client only readsauth.json(not thefallback_providersblock inconfig.yaml), so the worker exited rc=0.All three surface to the dispatcher with the identical
protocol_violationstring, which made root-cause diagnosis significantly slower than it needed to be.Suggested fixes (any one would help)
Inject a checkpoint reminder on
finish_reason=stopwithout prior checkpoint. Before the worker exits, if the agent ended its turn with a text response and never calledkanban_complete/kanban_block, inject a<system-reminder>-style message back into the conversation: "You ended your turn without callingkanban_completeorkanban_block. Please call the appropriate one to checkpoint your work." Give the agent one more turn to comply.Differentiate the dispatcher error message.
protocol_violationshould at minimum distinguish between:finish_reason=stopwithout calling protocol tool" (clean exit, no checkpoint)Different upstream causes deserve different remediation paths.
Render the checkpoint requirement at every turn, not just the initial prompt. Kanban-work system prompts typically describe the checkpoint requirement in the opening text — by turn 28 it's outside the model's effective recency window for instruction-following. A short, persistent reminder appended to each user/tool turn (à la Claude Code's
<system-reminder>blocks) would substantially reduce this failure mode.Don't auto-block on the first checkpoint-omission failure. The dispatcher's
effective_limit: 1override of the task'smax-retries: 2makes this failure mode unrecoverable without manualunblock. Where the worker exit shape is "agent thinks it's done, just forgot to checkpoint," a single automatic retry with an injected reminder is much more useful than immediate hard-block.Logs
I have full
agent.logandgateway.logexcerpts containing all three failure shapes (~250 KB each, with matter-specific paths). Happy to attach a redacted subset on request; the session IDs and timing above should be enough to triangulate on the Hermes side. Relevant session IDs:20260516_145050_c0bdcb— failure shape 1 (workspace contention + model auth, mixed signal)20260516_152454_13e840— failure shape 2 (clean model auth failure ongpt-5.5-codex)20260516_163300_efbf57— failure shape 3 (cleangpt-5.3-codexrun, agent finished with text response without checkpoint — the canonical repro of this issue)Workaround
For now: if you suspect this failure mode (worker ran for non-trivial wall-clock time,
tool_turns> 5, but no file changes on disk), assume the agent did exploratory work and decided "I'm done" without checkpointing. Either:hermes kanban --board <slug> unblock <task_id>and let the dispatcher retry, ORunblock→complete --result "..." --metadata '{...}'. Note:completedoesn't accept--note; usecommentfirst if you want a narrative record.Thanks for the great tool — this is a sharp edge but the kanban pattern itself is excellent and worth the polish.