Slightly better graph parallel for Qwen3-Next#1307
Merged
Conversation
Yeah, there is a little boost for the Qwen3.5 IQ2_KL.
main:
pr-1307:
Owner Author
The PR doesn't do anything for Qwen3.5 (no graph parallel there yet), so these are random fluctuations.
Contributor
Nexesenex added a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request on Feb 25, 2026
Nexesenex added a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request on Feb 25, 2026
Nexesenex added a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request on Feb 25, 2026
Nexesenex added a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request on Feb 26, 2026
abc-nix pushed a commit to abc-nix/ik_llama.cpp that referenced this pull request on Feb 26, 2026
* Make sure we pick the reduced tensor from the right GPU
* Minor
abc-nix pushed a commit to abc-nix/ik_llama.cpp that referenced this pull request on Feb 26, 2026
* Better estimate for max. number of compute nodes
* server: fix crash from adaptive p (ikawrakow#1304)
* Fix tool call for Qwen3.5 (ikawrakow#1300)
* Graph parallel for Qwen3-Next (ikawrakow#1292)
* Fix llm_arch_is_hybrid (ikawrakow#1305)
* Fix max nodes (again) (ikawrakow#1306)
* Fix typo in merge-up-gate-experts argument (ikawrakow#1311)
* llama-quantize: --dry-run option (ikawrakow#1309)
* Slightly better graph parallel for Qwen3-Next (ikawrakow#1307)
* Minor delta-net tweak (ikawrakow#1308)
* adaptive p: collect probability before logit bias (ikawrakow#1314)
* server: propagate task index to response objects for batch requests (ikawrakow#1303)
* llama-quantize: partial requant feature (ikawrakow#1313)
* Display the size of the tensors overridden during tensor loading (ikawrakow#1318)
* Fused delta-net (ikawrakow#1315)
* Fix KT quantization yet again (ikawrakow#1321)
* server: enable checkpoint for recurrent models (ikawrakow#1310)
* Faster quantization for MoE models with many experts (ikawrakow#1322)
* Fused delta-net 2 (ikawrakow#1320)
* Adding support for dense Qwen3.5 models (ikawrakow#1326)
* add directio to llama-bench
Nexesenex added a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request on Feb 26, 2026
…" LCPP part This reverts commit a8dcf26.
Nexesenex added further commits to Nexesenex/ik_llama.cpp.nxs that referenced this pull request (Feb 26 through Mar 7, 2026), including reverts of commits a8dcf26 and 591d5cc.

The Qwen3-Next graph parallel implementation computes the delta-net attention on one GPU, see #1292. As the preceding split graph is always the FFN portion, which ends with a reduce operation, the FFN result, which is the input for the delta-net, is available on each participating GPU. This PR makes sure that the delta-net takes as input the FFN result on the GPU on which the delta-net is computed, thus avoiding a copy.

A second tweak is that for the final matrix multiplication with the output tensor we also take the FFN result from the GPU where the output tensor is stored. This may give a minor performance improvement for other models as well (a quick check with GLM-4.5-AIR showed ~1% improvement for PP and TG).
These two tweaks combined give a few percent better PP and TG for Qwen3-Next when using split mode graph. Here is what I get on a 2x3090 system for IQ4_XS-quantized Qwen3-Next: