Kanban worker reports 'protocol_violation' when agent ends turn with text response instead of calling kanban_complete/kanban_block

## Summary

Kanban worker exits cleanly (rc=0) with `protocol_violation` when the agent ends its final turn with a text response (`finish_reason=stop`) instead of calling `kanban_complete` or `kanban_block`. There is no nudge back to the agent, no fallback in the worker, and the dispatcher counts each occurrence as a `consecutive_failures` tick — so after the retry limit the task is hard-blocked despite the agent having completed (or partially completed) substantive work.

## Environment

- **Hermes version:** v0.13.0 (2026.5.7)
- **Platform:** macOS (Darwin 25.3.0)
- **Provider:** `openai-codex` (ChatGPT subscription OAuth)
- **Model:** `gpt-5.3-codex`
- **Board:** custom `legal` board on a single-host `dir` workspace (`workspace: dir @ /path/to/repo`)
- **Worker invocation:** dispatched by `ai.hermes.gateway` launchd service (60s tick)

## Reproduction

1. Create a kanban task that asks the agent to author a substantial markdown file (in this repro: a California family-law authority note, ~5 KB target with structured sections).
2. Gateway dispatcher claims the task and spawns a worker process.
3. The agent works through ~28 tool turns / ~29 API calls over ~3 minutes — `search_files`, `read_file`, `terminal`. One tool error in this trace (`pdftotext: command not found, exit_code 127`) but the agent recovered.
4. Final API call returns:
   ```
   Turn ended: reason=text_response(finish_reason=stop)
     model=gpt-5.3-codex api_calls=29/60 budget=29/60 tool_turns=28
     last_msg_role=assistant response_len=668
     session=20260516_163300_efbf57
   ```
5. The worker proceeds with post-turn housekeeping — the `bg-review` skill-library updater is kicked off, terminal environment is cleaned up — then exits rc=0.
6. **The agent never called `kanban_complete` or `kanban_block`** at any point in the session, and never wrote any file either.
7. Dispatcher logs:
   ```
   protocol_violation: worker exited cleanly (rc=0) without calling
     kanban_complete or kanban_block — protocol violation
   ```
   and ticks the failure counter. After 3 such occurrences the task is set to `blocked` with diagnostic `Agent crash x3`.

In the original case I hit, the dispatcher even applied `effective_limit: 1, limit_source: 'dispatcher'` on the first failure (not the `max-retries: 2` from the task config), making it harder to recover without manual `unblock`.

## Why this is a worker-side bug, not a model bug

The agent reached `finish_reason=stop` with a 668-character text response — from the agent's perspective, the turn ended normally. There is no signal in the conversation history that "you ended your turn without checkpointing the kanban task." The worker's protocol enforcement is invisible to the agent itself.

I observed three different root causes for this same dispatcher-level symptom across three runs of the same task:

1. **Workspace contention** — worker spawned while another worker for an earlier task was still using the same `dir` workspace. Crashed in <60s.
2. **Model auth** — `config.yaml` default had been flipped to a model not available on the user's ChatGPT account (`gpt-5.5-codex`); HTTP 400 from the codex backend was non-retryable, the configured anthropic fallback failed because the auxiliary client only reads `auth.json` (not the `fallback_providers` block in `config.yaml`), so the worker exited rc=0.
3. **Text-response-without-checkpoint** — described above; agent finished its turn cleanly and exited.

All three surface to the dispatcher with the identical `protocol_violation` string, which made root-cause diagnosis significantly slower than it needed to be.

## Suggested fixes (any one would help)

1. **Inject a checkpoint reminder on `finish_reason=stop` without prior checkpoint.** Before the worker exits, if the agent ended its turn with a text response and never called `kanban_complete` / `kanban_block`, inject a `<system-reminder>`-style message back into the conversation: "You ended your turn without calling `kanban_complete` or `kanban_block`. Please call the appropriate one to checkpoint your work." Give the agent one more turn to comply.

2. **Differentiate the dispatcher error message.** `protocol_violation` should at minimum distinguish between:
   - "worker process crashed unexpectedly" (true crash)
   - "agent reached `finish_reason=stop` without calling protocol tool" (clean exit, no checkpoint)
   - "model/auth error caused immediate exit" (worker never got a turn)

   Different upstream causes deserve different remediation paths.

3. **Render the checkpoint requirement at every turn**, not just the initial prompt. Kanban-work system prompts typically describe the checkpoint requirement in the opening text — by turn 28 it's outside the model's effective recency window for instruction-following. A short, persistent reminder appended to each user/tool turn (à la Claude Code's `<system-reminder>` blocks) would substantially reduce this failure mode.

4. **Don't auto-block on the first checkpoint-omission failure.** The dispatcher's `effective_limit: 1` override of the task's `max-retries: 2` makes this failure mode unrecoverable without manual `unblock`. Where the worker exit shape is "agent thinks it's done, just forgot to checkpoint," a single automatic retry with an injected reminder is much more useful than immediate hard-block.

## Logs

I have full `agent.log` and `gateway.log` excerpts containing all three failure shapes (~250 KB each, with matter-specific paths). Happy to attach a redacted subset on request; the session IDs and timing above should be enough to triangulate on the Hermes side. Relevant session IDs:

- `20260516_145050_c0bdcb` — failure shape 1 (workspace contention + model auth, mixed signal)
- `20260516_152454_13e840` — failure shape 2 (clean model auth failure on `gpt-5.5-codex`)
- `20260516_163300_efbf57` — failure shape 3 (clean `gpt-5.3-codex` run, agent finished with text response without checkpoint — the canonical repro of this issue)

## Workaround

For now: if you suspect this failure mode (worker ran for non-trivial wall-clock time, `tool_turns` > 5, but no file changes on disk), assume the agent did exploratory work and decided "I'm done" without checkpointing. Either:

- `hermes kanban --board <slug> unblock <task_id>` and let the dispatcher retry, OR
- Transfer the work to another agent / do it manually, then `unblock` → `complete --result "..." --metadata '{...}'`. Note: `complete` doesn't accept `--note`; use `comment` first if you want a narrative record.

Thanks for the great tool — this is a sharp edge but the kanban pattern itself is excellent and worth the polish.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Kanban worker reports 'protocol_violation' when agent ends turn with text response instead of calling kanban_complete/kanban_block #27178

Summary

Environment

Reproduction

Why this is a worker-side bug, not a model bug

Suggested fixes (any one would help)

Logs

Workaround

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Kanban worker reports 'protocol_violation' when agent ends turn with text response instead of calling kanban_complete/kanban_block #27178

Description

Summary

Environment

Reproduction

Why this is a worker-side bug, not a model bug

Suggested fixes (any one would help)

Logs

Workaround

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions