common : merge qwen3-coder and nemotron nano 3 parsers #19765
pwilkin merged 2 commits into ggml-org:master
Conversation
pwilkin
left a comment
Can you please add some tests on schema parameters, especially an array (such as a "todolist" tool)?
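For context, a tool with an array-typed schema parameter might be declared roughly like this (a hypothetical sketch; the names and fields are illustrative and not taken from the actual test suite):

```python
# Hypothetical "todolist"-style tool with an array parameter, of the kind the
# requested schema tests would exercise. Not copied from the llama.cpp tests.
todolist_tool = {
    "type": "function",
    "function": {
        "name": "todolist",
        "description": "Replace the current todo list with the given items.",
        "parameters": {
            "type": "object",
            "properties": {
                "items": {
                    "type": "array",
                    "items": {"type": "string"},
                    "description": "Todo entries, in order.",
                }
            },
            "required": ["items"],
        },
    },
}
```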
|
I'll try to test it today with OpenCode to see how it works. |
As I understand it, the model generates the following tokens: But when we parse it, we strip Which causes issues? This makes sense to me. It's a bit annoying because ultimately it depends on the model + template. In this case, we need to strip a single |
|
@aldehir Oh, this is a fun one :) This causes issues because in the tokenizer, When you pass this back as ":\n", this is no longer the same token. Therefore, the model at some point gets confused and stops doing tool calls once enough of those bad historical calls accumulate, because it doesn't know what to do after a token it wasn't trained to call tools on. |
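A rough sketch of the mechanism described above, assuming the tokenizer vocabulary contains a merged token whose string is ":\n" (the example text and the use of the Hugging Face tokenizer are illustrative; this is not llama.cpp code):

```python
# Illustrative sketch of the round-trip problem, assuming a merged ":\n" token
# exists in the vocabulary. Requires the tokenizer files to be available locally
# or downloadable from the Hub.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-Coder-Next")

# Token ids the model actually produced, with ":\n" as one merged token.
ids_generated = tok.encode("Let me call the tool:\n", add_special_tokens=False)

# After parsing, the trailing "\n" is stripped from the content and later
# re-inserted by the chat template as a separate piece of text.
ids_roundtrip = (
    tok.encode("Let me call the tool:", add_special_tokens=False)
    + tok.encode("\n", add_special_tokens=False)
)

# If ":\n" is a single vocab entry, the two sequences differ, and on the next
# turn the model is conditioned on a token boundary it was never trained on.
print(ids_generated == ids_roundtrip)
```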
Upon further inspection, I am struggling to see the issue. The chat template will strip the message content and append a https://huggingface.co/Qwen/Qwen3-Coder-Next/blob/main/chat_template.jinja Then, it prepends a Assuming the end of the content is |
|
@aldehir Are you sure that the template adds the newline again? I believe the tool call preamble (as it occurs with codex use, potentially related to the Responses API) is rendered as a "normal" assistant message: Anyway, I can confirm that it is currently an issue with this PR; @pwilkin's fix (#18675 (comment)) led to the apparently expected format and the issue was gone. |
|
Oh, now I remember - I fell for the same "trap" :) #18675 (comment) (I marked it as off-topic; it contains the actually rendered prompt, which is missing the required newlines)
I see. It's a bit weird that the preamble would be its own message. This means the client needs to perform another request to trigger a tool call, which doesn't quite align with the idea that retaining |
|
Sure enough, the Responses API is the culprit: rather than generating a single message with |
|
Here, just for reference, is a snippet from a Responses API request as generated by codex (this example is missing the 2 newlines at the end of the tool preamble message), input array elements: I'm not seeing a case where content + tool call could currently be possible with https://github.com/openingnow/llama.cpp/blob/292f6908cdc6abb5c38581e34fa141973e5aba82/tools/server/server-common.cpp#L1072 I believe that in the Responses API there is no way for a type "message" item to also contain "function_call" elements. https://developers.openai.com/api/reference/resources/responses#(resource)%20responses%20%3E%20(model)%20response_input_item%20%3E%20(schema) |
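Since the quoted snippet did not survive here, the following is an illustrative sketch of how such input items are typically shaped (ids, tool names, and values are made up; this is not the codex request referenced above):

```python
# Illustrative only: not the original codex snippet. Shows how one assistant
# generation ends up split across several Responses API input items.
input_items = [
    {   # the tool-call preamble, rendered as a plain assistant message
        "type": "message",
        "role": "assistant",
        "content": [{"type": "output_text", "text": "Let me check that file."}],
    },
    {   # the tool call itself is a separate item, not part of the message above
        "type": "function_call",
        "call_id": "call_abc123",                 # made-up id
        "name": "read_file",                      # made-up tool name
        "arguments": "{\"path\": \"README.md\"}",
    },
    {   # the tool result, matched back to the call via call_id
        "type": "function_call_output",
        "call_id": "call_abc123",
        "output": "file contents ...",
    },
]
```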
|
Yes, I believe this is called a "leaky abstraction" 😊. We should add a post-processing step in the conversion that merges consecutive assistant messages. That seems like the most correct solution. |
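A minimal sketch of such a merging step, assuming chat-completions-style message dicts (the actual conversion code in llama.cpp may look quite different):

```python
# Minimal sketch, not the actual implementation: merge consecutive assistant
# messages produced when a Responses API client splits one generation into
# several input items.
def merge_consecutive_assistant(messages: list[dict]) -> list[dict]:
    merged: list[dict] = []
    for msg in messages:
        prev = merged[-1] if merged else None
        if prev is not None and prev.get("role") == "assistant" and msg.get("role") == "assistant":
            # Concatenate text content and accumulate tool calls onto the earlier message.
            prev["content"] = (prev.get("content") or "") + (msg.get("content") or "")
            if msg.get("tool_calls"):
                prev.setdefault("tool_calls", []).extend(msg["tool_calls"])
        else:
            merged.append(dict(msg))
    return merged
```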
|
Aight, merging this one and we'll do a followup on the Responses problem. |
|
@bfroemel expect to see a PR for the Responses API here shortly. If you don't mind, I'd like to enlist your services to help test it out. |
|
Hmm, so the model generates content output + tool call(s) - everything in a single generation, so a single response. codex (as a Responses API client) processes that response and builds the conversation in the input array for the next request, together with the tool call responses. There it ends up as several elements (output_text message, tool calls, tool responses). IMO that's fine and wouldn't immediately require post-processing (trying to merge multiple Responses input elements into single chat-completions elements), as long as it ends up as the expected prompt for the model (I looked at the rendered prompt - it's imo indistinguishable)? But ofc fine with any other solution that might be cleaner/more maintainable - especially if @pwilkin's fix could be problematic for other models. /edit: adding a rendered prompt snippet (with the required newlines) |
|
This is what I expect: The additional |
* common : migrate qwen3-coder to PEG parsing variant
* cont : add JSON parameter test
Users are experiencing several issues with Qwen3-Coder-Next. Until #18675 is merged in, this PR serves as a stop-gap by replacing the existing Qwen3-Coder parsing with the Nemotron Nano 3 PEG parsing variant already present.
This PR also adds parallel tool calling and fixes JSON schema support.
fixes #19382
fixes #19430
fixes #19304
supersedes #19503 and #19753
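For reference, here is a rough sketch of the kind of model output the merged parser is expected to handle: two consecutive tool calls, one of them with an array argument passed as JSON. The tag layout is reconstructed from Qwen3-Coder's tool-call syntax and the exact whitespace may differ from the real template; the tool and parameter names are illustrative.

```python
# Illustrative model output only; reconstructed from Qwen3-Coder's tool-call
# syntax, not copied from the test suite.
sample_output = (
    "<tool_call>\n"
    "<function=todolist>\n"
    "<parameter=items>\n"
    '["write tests", "update docs"]\n'
    "</parameter>\n"
    "</function>\n"
    "</tool_call>\n"
    "<tool_call>\n"
    "<function=read_file>\n"
    "<parameter=path>\n"
    "README.md\n"
    "</parameter>\n"
    "</function>\n"
    "</tool_call>"
)
```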