Description
Name and Version
./llama-cli --version
ggml_vulkan: No devices found.
load_backend: loaded Vulkan backend from /app/libggml-vulkan.so
load_backend: loaded CPU backend from /app/libggml-cpu-zen4.so
version: 7993 (6d0ee8d4d)
built with GNU 15.2.0 for Linux x86_64
Operating systems
Linux
GGML backends
Vulkan
Hardware
AMD Strix Halo, 128 GB
Models
Qwen/Qwen3-Coder-Next-GGUF Q8_0
unsloth/Qwen3-Coder-Next-GGUF UD_Q5_K_XL, UD_Q8_K_XL
Problem description & steps to reproduce
With Codex on Qwen3-Coder-Next via the Responses API, I often see something like this after about 10-20 successful tool calls (the EOS token is not part of the output; it directly follows the colon, so model generation must stop there):
Let me check that other file:<|im_end|>
The agent turn then ends prematurely (because the model emitted the EOS token and no tool call was generated). If I tell the model to continue, it does so for a while, but after 1 to 10 tool calls the same issue pops up again (until the task is finally done; this can take a lot of "please continue!" messages).
For messages where the tool call is not generated the colon token appears to always be this one (and is directly followed by the EOS token):
next token: 25 ':'
For messages where the tool call is generated a different colon token is used (directly before token 151657 '<tool_call>'):
next token: 1447 ':
'
I am using the recommended sampler settings (--temp 1.0 --min-p 0.01 --top-p 0.95 --top-k 40) with unsloth's UD-Q8_K_XL (and also Qwen's Q8_0); I haven't noticed any other quality issues. The model is supposed to be very good at tool calls, as it was fine-tuned on all major formats, not just Qwen3-XML (see https://github.com/QwenLM/Qwen3-Coder/blob/main/qwen3_coder_next_tech_report.pdf?spm=a2ty_o06.30285417.0.0.3bdec921p0xaBK&file=qwen3_coder_next_tech_report.pdf Section 4.2.2 + Table 12). At least one other person has experienced the same behavior and confirmed that other harnesses seemingly give better results.
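For reference, the sampler settings above correspond to llama-server flags like this (a sketch only; the model filename, context size, and host/port are placeholders, not taken from the report):

```shell
# Sketch: serve the model with the recommended sampler settings.
# Model path and -c/--host/--port values are illustrative placeholders.
./llama-server \
  -m Qwen3-Coder-Next-UD-Q8_K_XL.gguf \
  --temp 1.0 --min-p 0.01 --top-p 0.95 --top-k 40 \
  --jinja -c 65536 \
  --host 127.0.0.1 --port 8080
```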
(Note that I have seen successful/uninterrupted agent turns with 100+ tool calls, but they are rare!)
Reproducing the issue
To avoid having to install Codex and deal with additional complexities, please find below
codex_q3n_req_clean.json
a request of just under 40k tokens that failed 10 times in a row on llama.cpp (at least on autobranch; tested quants on the Vulkan backend: Qwen/Qwen3-Coder-Next-GGUF Q8_0, unsloth/Qwen3-Coder-Next-GGUF UD-Q8_K_XL), in the sense that it only generates a tool-call preamble/announcement and then emits ':' (25) '<|im_end|>' (151645).
curl -X POST http://<llama.cpp-endpoint-host>/v1/responses -H "Content-Type: application/json" -d @./codex_q3n_req.json
I have tested that exact request against OpenRouter, and five times in a row it generated ":\n\n" followed by the expected function_call output item. Note that the generated tool-call preamble was slightly different each time (all of course ending in that colon + two newlines), so I don't think OpenRouter just cached a (lucky) response.
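To automate replaying the attached request and tallying failures, a small helper like this could classify each generation; this is a hypothetical sketch (not part of llama.cpp), keying on the bare trailing colon (token 25 ':') that precedes the EOS in the failing case:

```python
def classify_output(text: str, tool_call_open: str = "<tool_call>") -> str:
    """Classify an assistant generation for this repro.

    'ok'        -> a tool call was emitted (Qwen3-XML opening tag present)
    'truncated' -> the failing pattern: preamble ends in a bare colon,
                   i.e. token 25 ':' directly followed by EOS
    'unknown'   -> anything else
    """
    if tool_call_open in text:
        return "ok"
    if text.endswith(":"):
        return "truncated"
    return "unknown"

# The two shapes described in this report:
print(classify_output("Let me check that other file:"))            # truncated
print(classify_output("Let me check it:\n\n<tool_call>...</tool_call>"))  # ok
```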
If I add the tool-call preamble, exactly as the model would generate it, to the message array, the tool call isn't generated (expected). However, if I change that preamble to end with two newlines after the colon, llama.cpp has no problem generating the tool call.
Attached is the request with the added role=assistant message ending in a colon and two newlines.
codex_q3n_req_v2.json
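The two-newline workaround above could be applied mechanically before resending a request. The following is a hypothetical sketch (helper name and the flat role/content message shape are assumptions; adapt it to the actual structure of the attached JSON):

```python
import json

def patch_preamble(request: dict) -> dict:
    """If the last assistant message ends in a bare colon, append two
    newlines, which (per this report) lets llama.cpp go on to generate
    the tool call. Assumes messages live in a flat 'input' list with
    string 'content' fields."""
    for msg in reversed(request.get("input", [])):
        if msg.get("role") == "assistant":
            content = msg.get("content", "")
            if isinstance(content, str) and content.endswith(":"):
                msg["content"] = content + "\n\n"
            break
    return request

req = {"input": [
    {"role": "user", "content": "fix the bug"},
    {"role": "assistant", "content": "Let me check that other file:"},
]}
print(json.dumps(patch_preamble(req)["input"][-1]))
```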
More info, with others reporting that the issue also affects the master branch and the models Minimax-m2 and (potentially?) GLM4.5-Air, in the discussion here: #18675 (comment)
cc: @pwilkin @joshvoigts @crashr @Mushoz
First Bad Commit
No response
Relevant log output
See attachments.