Summary
Some sessions get stuck "spinning": the agent repeats a near-identical message (e.g. "Right — let me do X…") before every tool call, re-narrating the same plan over and over instead of making progress. It keeps going until the context window fills up and the session is force-compacted.
Affected versions
0.20.3, 0.21.0, 0.21.1 (and dev until fixed).
Introduced by #1178. The same class of bug was previously fixed in #634 — #1178 reintroduced it in a different place.
Who's affected
Sessions running against strict ChatML chat templates — Qwen3 on llama.cpp / vLLM in particular. First observed on a live Slack session (channel D0AC6CKBK5K).
What's actually happening
Every turn we add a "volatile context" block (memory recall, current time, working context, etc.) to the conversation as a user message, placed right after the real user message.
Strict ChatML models read that trailing user-role block as a brand-new user turn. So the model restarts its response, scans back for the last real user request, and re-emits its opening line before each tool call — and repeats this on every tool-loop iteration until the context is exhausted.
Fix direction
Place the volatile context block before the real user message instead of after, so the real user message stays the last thing the model sees. This keeps the KV-cache benefits from #1178 (the bytes still live in history at a fixed position) while removing the trailing user-role block that triggers the loop. Fix in progress.
Summary
Some sessions get stuck "spinning": the agent repeats a near-identical message (e.g. "Right — let me do X…") before every tool call, re-narrating the same plan over and over instead of making progress. It keeps going until the context window fills up and the session is force-compacted.
Affected versions
0.20.3, 0.21.0, 0.21.1 (and
devuntil fixed).Introduced by #1178. The same class of bug was previously fixed in #634 — #1178 reintroduced it in a different place.
Who's affected
Sessions running against strict ChatML chat templates — Qwen3 on llama.cpp / vLLM in particular. First observed on a live Slack session (channel
D0AC6CKBK5K).What's actually happening
Every turn we add a "volatile context" block (memory recall, current time, working context, etc.) to the conversation as a user message, placed right after the real user message.
Strict ChatML models read that trailing user-role block as a brand-new user turn. So the model restarts its response, scans back for the last real user request, and re-emits its opening line before each tool call — and repeats this on every tool-loop iteration until the context is exhausted.
Fix direction
Place the volatile context block before the real user message instead of after, so the real user message stays the last thing the model sees. This keeps the KV-cache benefits from #1178 (the bytes still live in history at a fixed position) while removing the trailing user-role block that triggers the loop. Fix in progress.