[Bug] kanban-worker exits cleanly (rc=0) on iteration-budget exhaustion without calling kanban_complete or kanban_block — protocol violation strands downstream tasks #23216

## Summary

A kanban-worker subprocess that hits `max_turns` (iteration budget exhaustion) exits with `rc=0` after the agent loop's "asking model to summarise" path, without ever calling `kanban_complete` or `kanban_block`. The dispatcher correctly detects this as a protocol violation (`hermes_cli/kanban_db.py:3127`), but in production crons this surfaces as a confusing `gave_up` after 1 failure, with no clear recovery signal for the operator.

## Environment

- Hermes Agent: v0.13.0 (v2026.5.7), commit `eeef486` baseline + two local cherry-picks (`aaa700c65` = PR #12953 keepalive bypass, `4ce6c96e2` = PR #19485 runtime TLS).
- Kanban-driven workload: a multi-stage DAG of profile-specific worker tasks (e.g. `digest-writer`).
- Worker config: `agent.max_turns: 30` in profile config (per-profile cap, lower than global 100).

## Real-world reproduction

A morning-cron-driven digest pipeline ran on 2026-05-10 at 07:18 CT. Lanes T1 through T8 completed cleanly. The T9 writer task (`t_b1376310`) was claimed by a kanban-worker subprocess (PID 13754) at 07:49 CT. The worker progressed through ~30 successful agent iterations, did initial preparation work (reading the upstream payload, validating evidence), then hit:

```
⚠️  Iteration budget reached (30/30) — response may be incomplete
```

The agent's final response listed unfinished steps:

```
Not completed:
- I did not yet assemble the final render-payload.json.
- I did not yet run workers.helpers.render_charts.
- I did not yet run workers.helpers.render_html.
- I did not yet generate Telegram text.
- I did not yet run workers.helpers.writer_postwrite.
- I did not yet complete the kanban task via complete_validated.
```

The worker process then exited `rc=0`. The dispatcher recorded:

```
event_kind = "protocol_violation"
error_text = "worker exited cleanly (rc=0) without calling kanban_complete or kanban_block — protocol violation"
gave_up { 'failures': 1, 'effective_limit': 1, 'limit_source': 'dispatcher' }
```

The downstream T10 deliverer task remained `todo` and never fired. The morning digest did not deliver to the user.

## Root cause

`run_agent.py:14232` is the iteration-exhaustion path:

```python
f"⚠️ Iteration budget exhausted ({api_call_count}/{self.max_iterations}) "
"— asking model to summarise"
```

This path:

1. Asks the model to produce a final summary message.
2. Returns the summary as the conversation's final response.
3. Exits the agent loop with `rc=0`.

The agent loop has no awareness of kanban-worker context. The kanban-worker contract (call `kanban_complete` or `kanban_block` before exiting) lives entirely in the kanban-worker SKILL prompt at `skills/devops/kanban-worker/SKILL.md`. Iteration-exhaustion bypasses the skill's contract because the model is given the summary directive directly by the agent loop, not by the skill.
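
To make the failure mode concrete, here is a toy model of the exhaustion path. It is not the actual `run_agent.py` code: `run_toy_agent_loop`, `honoured_contract`, and `TERMINAL_TOOLS` are hypothetical names, and the only details taken from this report are the two terminal tool names and the fact that the loop returns `rc=0` whether or not one of them was called.

```python
# Toy model only: names and structure are assumptions, not Hermes internals.
TERMINAL_TOOLS = {"kanban_complete", "kanban_block"}


def run_toy_agent_loop(planned_tool_calls, max_iterations):
    """Execute planned tool calls until the budget is used up.

    Returns (rc, executed_calls, final_response).
    """
    executed = []
    for name in planned_tool_calls[:max_iterations]:
        executed.append(name)
        if name in TERMINAL_TOOLS:
            # Normal path: the worker closed the task itself.
            return 0, executed, f"closed via {name}"
    # Exhaustion path: the loop asks for a free-text summary.
    # No tool call is made here, yet the return code is still 0.
    summary = "Not completed: remaining steps listed as free text"
    return 0, executed, summary


def honoured_contract(executed):
    """True if a terminal kanban tool was actually invoked."""
    return any(name in TERMINAL_TOOLS for name in executed)


# A plan that needs 35 turns under a 30-turn cap never reaches kanban_complete.
plan = ["read_file"] * 34 + ["kanban_complete"]
rc, executed, _ = run_toy_agent_loop(plan, max_iterations=30)
print(rc, len(executed), honoured_contract(executed))
```

In this model the return code alone cannot distinguish a finished task from an exhausted one, which is why the dispatcher has to check for the terminal call separately.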

The dispatcher (`hermes_cli/kanban_db.py:3099-3170`) detects this correctly, but it treats the protocol violation as fatal: with `effective_limit: 1` it gives up after a single failure and leaves the operator no recovery path.

## Why the worker can't fix itself

The kanban-worker skill text already documents the contract clearly. But:

- The model can't call `kanban_block` from inside the iteration-exhaustion summary because at that point the agent loop has already taken control of the prompt and is asking for a summary, not a final tool call.
- If a model attempted to call `kanban_block` from the summary path and the call landed in the summary text rather than as a real tool invocation, the worker would still exit `rc=0`, and the dispatcher would still have no terminal call to act on.

## Proposed fix shapes

Three reasonable fix surfaces, in order of invasiveness:

### 1. Runtime patch in `run_agent.py`

When the iteration budget is exhausted AND the agent is running under a kanban-worker context (detect via env var `HERMES_KANBAN_TASK_ID` or equivalent), auto-emit a `kanban_block` tool call as the final action before returning the summary. The block reason would be `"iteration budget exhausted (N/N); state preserved at <workspace>"`.

Pros: most explicit, reliable.
Cons: cross-cuts the agent loop with kanban-specific behavior.
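
A minimal sketch of what this hook could look like, assuming the exhaustion path can invoke tools through some callable. `finalize_on_budget_exhaustion`, the `call_tool` callable, the `kanban_block` argument shape, and the `HERMES_KANBAN_WORKSPACE` variable are all hypothetical; `HERMES_KANBAN_TASK_ID` is the detection variable suggested above and may not match the real one.

```python
import os


def finalize_on_budget_exhaustion(summary, api_call_count, max_iterations,
                                  call_tool, environ=None):
    """Emit kanban_block before returning the summary, if under a kanban task.

    call_tool(name, args) is a stand-in for however the agent loop
    dispatches tool calls; its signature is an assumption.
    """
    env = os.environ if environ is None else environ
    task_id = env.get("HERMES_KANBAN_TASK_ID")  # assumed variable name
    if task_id:
        # Hypothetical variable; falls back to the placeholder from the proposal.
        workspace = env.get("HERMES_KANBAN_WORKSPACE", "<workspace>")
        reason = (
            f"iteration budget exhausted ({api_call_count}/{max_iterations}); "
            f"state preserved at {workspace}"
        )
        call_tool("kanban_block", {"task_id": task_id, "reason": reason})
    # Outside a kanban context the behavior is unchanged.
    return summary


# Usage with a recording stub in place of the real tool dispatcher.
calls = []
finalize_on_budget_exhaustion(
    "partial summary", 30, 30,
    call_tool=lambda name, args: calls.append((name, args)),
    environ={"HERMES_KANBAN_TASK_ID": "t_b1376310"},
)
print(calls)
```

Keeping the kanban check behind a single environment lookup limits how far the kanban-specific behavior spreads into the agent loop, though it does not remove the coupling.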

### 2. Dispatcher policy change in `hermes_cli/kanban_db.py`

Map the `protocol_violation` event to `auto_blocked` instead of `gave_up` on first occurrence (the `effective_limit: 1` path). This way the task ends up explicitly `blocked` with a clear reason, rather than `gave_up` which suggests the dispatcher gave up on retrying.

Pros: smallest surface area, no run_agent changes.
Cons: doesn't fix the underlying contract violation; it only relabels the outcome. It does, however, produce the right operator UX: the task is `blocked`, can be `unblock`'d, and the dispatcher will then retry.
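
The policy change can be sketched as a small decision function. This is an illustration under assumptions, not the logic in `hermes_cli/kanban_db.py`: `resolve_outcome`, its parameters, and the `"retry"` result are hypothetical, while `protocol_violation`, `gave_up`, and `auto_blocked` are the names used in this report.

```python
def resolve_outcome(event_kind, failures, effective_limit,
                    map_violation_to_blocked):
    """Decide what happens to a task after a failed worker run.

    map_violation_to_blocked=False models the behavior observed today;
    True models the proposed policy.
    """
    if failures < effective_limit:
        # Retry budget remains, so the violation is not yet terminal.
        return "retry"
    if event_kind == "protocol_violation" and map_violation_to_blocked:
        # Proposed: park the task as blocked so an operator can unblock it.
        return "auto_blocked"
    # Observed today: the dispatcher stops and downstream tasks stay todo.
    return "gave_up"


# The incident above: one failure against effective_limit 1.
print(resolve_outcome("protocol_violation", 1, 1, map_violation_to_blocked=False))
print(resolve_outcome("protocol_violation", 1, 1, map_violation_to_blocked=True))
```

Other failure kinds keep their existing `gave_up` outcome in this sketch, so the change stays scoped to the clean-exit case.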

### 3. Skill prompt adjustment in `skills/devops/kanban-worker/SKILL.md`

Add explicit text: "If you sense you are approaching `max_turns` and have not yet completed the task, your last act must be a real `kanban_block` tool call, not a free-text response."

Pros: zero runtime change.
Cons: depends on model compliance; the summary prompt is added by the agent loop, not the skill, and the model may not honor the skill instruction once the summary directive is in play.

## Asks

1. Confirm whether the protocol-violation pattern is intended behavior or a known gap.
2. Pick a fix shape (or combination) you'd accept upstream.
3. If a runtime patch is welcome, I'm willing to send a PR.

## Related

- Detection of the protocol violation: `hermes_cli/kanban_db.py:3119-3170` (correct, just under-acted-upon).
- Iteration-exhaustion summary path: `run_agent.py:14232`.
- Per-task `max_retries` override (PR #21330, merged): provides a control surface for retry policy but doesn't address the violation path itself.

