Eval bug: Qwen3-Coder-Next prematurely generates EOS instead of tool call (or continued response) #19513

@bfroemel

Description

Name and Version

./llama-cli --version
ggml_vulkan: No devices found.
load_backend: loaded Vulkan backend from /app/libggml-vulkan.so
load_backend: loaded CPU backend from /app/libggml-cpu-zen4.so
version: 7993 (6d0ee8d4d)
built with GNU 15.2.0 for Linux x86_64

Operating systems

Linux

GGML backends

Vulkan

Hardware

AMD Strix Halo, 128 GB

Models

Qwen/Qwen3-Coder-Next-GGUF Q8_0
unsloth/Qwen3-Coder-Next-GGUF UD_Q5_K_XL, UD_Q8_K_XL

Problem description & steps to reproduce

With Codex running against Qwen3-Coder-Next via the Responses API, I often see something like this after about 10-20 successful tool calls (the EOS token is not part of the output; it directly follows the colon, so model generation must have stopped there):

Let me check that other file:<|im_end|>

And the agent turn ends prematurely (because the model emitted the EOS token and no tool call was generated). If I tell the model to continue, it does so for a while, but after 1 to 10 more tool calls the same issue pops up again (until the task is finally done; this can take a lot of "please continue!" messages).

For messages where the tool call is not generated, the colon appears to always be this token (and is directly followed by the EOS token):

next token:    25 ':'

For messages where the tool call is generated, a different colon token is used (a colon followed by two newlines), appearing directly before token 151657 '<tool_call>':

next token:  1447 ':

'

I am using the recommended sampler settings (--temp 1.0 --min-p 0.01 --top-p 0.95 --top-k 40) with unsloth's UD-Q8_K_XL (and also Qwen's Q8_0); I haven't noticed any other quality issues. The model is supposed to be very good at tool calls, as it was fine-tuned on all major formats, not just Qwen3-XML (see https://github.com/QwenLM/Qwen3-Coder/blob/main/qwen3_coder_next_tech_report.pdf, Section 4.2.2 and Table 12). At least one other person has experienced the same behavior, while confirming that other harnesses can seemingly give better results.
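For reference, a server invocation with those sampler settings might look like the following sketch. The model path is a placeholder, and enabling --jinja (for chat-template-based tool-call handling in llama-server) is my assumption about the setup, not something stated above:

```shell
# Sketch only: model path is a placeholder; --jinja enables chat-template
# tool-call parsing in llama-server.
./llama-server \
  -m ./Qwen3-Coder-Next-UD-Q8_K_XL.gguf \
  --temp 1.0 --min-p 0.01 --top-p 0.95 --top-k 40 \
  --jinja
```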

(Note that I have seen successful/uninterrupted agent turns with 100+ tool calls, but they are rare!)

Reproducing the issue
To avoid having to install Codex and deal with additional complexity, please find
codex_q3n_req_clean.json
below: a ~40k-token request that failed 10 times in a row on llama.cpp (at least on the autobranch; quants tested on the Vulkan backend: Qwen/Qwen3-Coder-Next-GGUF Q8_0, unsloth/Qwen3-Coder-Next-GGUF UD-Q8_K_XL), in the sense that it generates only a tool-call preamble/announcement and then emits ':' (25) '<|im_end|>' (151645).

curl -X POST http://<llama.cpp-endpoint-host>/v1/responses -H "Content-Type: application/json" -d @./codex_q3n_req.json

I have tested that exact request against OpenRouter, and 5 times out of 5 it generated ":\n\n" followed by the expected function_call output item. Note that the generated tool-call preamble was slightly different each time (all of course ending in that colon plus two newlines), so I don't think OpenRouter just cached a (lucky) response.
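To make the comparison concrete, here is a small sketch of how one can classify a /v1/responses result as "tool call produced" vs. "stopped at EOS". The field names follow the OpenAI Responses API shape (an `output` list of typed items); the two sample payloads are fabricated for illustration, not actual captured responses:

```python
def produced_tool_call(resp: dict) -> bool:
    """True if the response contains a function_call output item."""
    return any(item.get("type") == "function_call"
               for item in resp.get("output", []))

# Fabricated examples of the two behaviors described above.
openrouter_like = {"output": [
    {"type": "message", "content": [
        {"type": "output_text", "text": "Let me check that other file:\n\n"}]},
    {"type": "function_call", "name": "shell",
     "arguments": "{\"cmd\": \"cat foo\"}"},
]}
llamacpp_like = {"output": [
    {"type": "message", "content": [
        {"type": "output_text", "text": "Let me check that other file:"}]},
]}

print(produced_tool_call(openrouter_like))  # True
print(produced_tool_call(llamacpp_like))    # False
```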

If I add the tool-call preamble to the message array exactly as the model would generate it, the tool call isn't generated (expected). However, if I change that preamble to end with two newlines after the colon, llama.cpp has no problem generating the tool call.

Attached is the request with the added role=assistant message ending in a colon and two newlines.
codex_q3n_req_v2.json
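The v2 modification can be sketched as follows. The request shape (an `input` list of role/content items, per the Responses API) and the preamble text are assumptions for illustration; the real edit was made to the attached JSON file:

```python
# Sketch of the v2 workaround: append the model's tool-call preamble as an
# assistant message, but make sure it ends with ":\n\n" (tokenized as 1447)
# rather than a bare ":" (token 25). All field values below are placeholders.
req = {
    "model": "qwen3-coder-next",  # placeholder model name
    "input": [{"role": "user", "content": "Fix the failing test."}],
}
preamble = "Let me check that other file:"  # example of a generated preamble
req["input"].append({"role": "assistant", "content": preamble + "\n\n"})

print(req["input"][-1]["content"].endswith(":\n\n"))  # True
```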

More info, with others reporting that the issue also affects the master branch and the models MiniMax-M2 and (potentially?) GLM-4.5-Air, in the discussion here: #18675 (comment)

cc: @pwilkin @joshvoigts @crashr @Mushoz

First Bad Commit

No response

Relevant log output

See attachments.
