Description
Name and Version
./llama-cli --version
ggml_vulkan: No devices found.
load_backend: loaded Vulkan backend from /app/libggml-vulkan.so
load_backend: loaded CPU backend from /app/libggml-cpu-zen4.so
version: 7993 (6d0ee8d4d)
built with GNU 15.2.0 for Linux x86_64
Operating systems
Linux
GGML backends
Vulkan
Hardware
AMD Strix Halo, 128 GB
Models
Qwen/Qwen3-Coder-Next-GGUF Q8_0
unsloth/Qwen3-Coder-Next-GGUF UD_Q5_K_XL, UD_Q8_K_XL
Problem description & steps to reproduce
With Codex on Qwen3-Coder-Next via the Responses API, I often see something like this after about 10-20 successful tool calls (the EOS token is not part of the output; it directly follows the colon, so model generation must stop there):
Let me check that other file:<|im_end|>
The agent turn then ends prematurely (because the model emitted the EOS token and no tool call was generated). If I tell the model to continue, it does so for a while, but after 1 to 10 tool calls the same issue pops up again (until the task is finally done; this can take a lot of "please continue!" messages).
For messages where the tool call is not generated the colon token appears to always be this one (and is directly followed by the EOS token):
next token: 25 ':'
For messages where the tool call is generated a different colon token is used (directly before token 151657 '<tool_call>'):
next token: 1447 ':
'
I am using the recommended sampler settings (--temp 1.0 --min-p 0.01 --top-p 0.95 --top-k 40) with unsloth's UD-Q8_K_XL (and also Qwen's Q8_0); I haven't noticed any other quality issues. The model is supposed to be very good at tool calls, as it was fine-tuned on all major formats, not just Qwen3-XML (see https://github.com/QwenLM/Qwen3-Coder/blob/main/qwen3_coder_next_tech_report.pdf?spm=a2ty_o06.30285417.0.0.3bdec921p0xaBK&file=qwen3_coder_next_tech_report.pdf Section 4.2.2 + Table 12). At least one other person has experienced the same behavior and confirmed that other harnesses seemingly give better results.
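For reference, the sampler settings above correspond to llama-server flags like this (a sketch only; the model filename, context size, and host/port are placeholders, not taken from the report):

```shell
# Sketch: serve the model with the recommended sampler settings.
# Model path and -c/--host/--port values are illustrative placeholders.
./llama-server \
  -m Qwen3-Coder-Next-UD-Q8_K_XL.gguf \
  --temp 1.0 --min-p 0.01 --top-p 0.95 --top-k 40 \
  --jinja -c 65536 \
  --host 127.0.0.1 --port 8080
```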
(Note that I have seen successful/uninterrupted agent turns with 100+ tool calls, but they are rare!)
Reproducing the issue
To avoid having to install Codex and deal with additional complexities, please find below
codex_q3n_req_clean.json
a request of just under 40k tokens that failed 10 times in a row on llama.cpp (at least on autobranch; tested quants on the Vulkan backend: Qwen/Qwen3-Coder-Next-GGUF Q8_0, unsloth/Qwen3-Coder-Next-GGUF UD-Q8_K_XL), in the sense that it only generates a tool-call preamble/announcement and then emits ':' (25) '<|im_end|>' (151645).
curl -X POST http://<llama.cpp-endpoint-host>/v1/responses -H "Content-Type: application/json" -d @./codex_q3n_req.json
I have tested that exact request against OpenRouter, and five times in a row it generated ":\n\n" followed by the expected function_call output item. Note that the generated tool-call preamble was slightly different each time (all of course ending in that colon + two newlines), so I don't think OpenRouter just cached a (lucky) response.
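To automate replaying the attached request and tallying failures, a small helper like this could classify each generation; this is a hypothetical sketch (not part of llama.cpp), keying on the bare trailing colon (token 25 ':') that precedes the EOS in the failing case:

```python
def classify_output(text: str, tool_call_open: str = "<tool_call>") -> str:
    """Classify an assistant generation for this repro.

    'ok'        -> a tool call was emitted (Qwen3-XML opening tag present)
    'truncated' -> the failing pattern: preamble ends in a bare colon,
                   i.e. token 25 ':' directly followed by EOS
    'unknown'   -> anything else
    """
    if tool_call_open in text:
        return "ok"
    if text.endswith(":"):
        return "truncated"
    return "unknown"

# The two shapes described in this report:
print(classify_output("Let me check that other file:"))            # truncated
print(classify_output("Let me check it:\n\n<tool_call>...</tool_call>"))  # ok
```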
If I add the tool-call preamble, exactly as the model would generate it, to the message array, the tool call isn't generated (expected). However, if I change that preamble to end with two newlines after the colon, llama.cpp has no problem generating the tool call.
Attached is the request with the added role=assistant message ending in a colon and two newlines.
codex_q3n_req_v2.json
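The two-newline workaround above could be applied mechanically before resending a request. The following is a hypothetical sketch (helper name and the flat role/content message shape are assumptions; adapt it to the actual structure of the attached JSON):

```python
import json

def patch_preamble(request: dict) -> dict:
    """If the last assistant message ends in a bare colon, append two
    newlines, which (per this report) lets llama.cpp go on to generate
    the tool call. Assumes messages live in a flat 'input' list with
    string 'content' fields."""
    for msg in reversed(request.get("input", [])):
        if msg.get("role") == "assistant":
            content = msg.get("content", "")
            if isinstance(content, str) and content.endswith(":"):
                msg["content"] = content + "\n\n"
            break
    return request

req = {"input": [
    {"role": "user", "content": "fix the bug"},
    {"role": "assistant", "content": "Let me check that other file:"},
]}
print(json.dumps(patch_preamble(req)["input"][-1]))
```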
More info, with others reporting that the issue also affects the master branch and the models Minimax-m2 and (potentially?) GLM4.5-Air, in the discussion here: #18675 (comment)
cc: @pwilkin @joshvoigts @crashr @Mushoz
First Bad Commit
No response
Relevant log output
See attachments.