
Slightly better graph parallel for Qwen3-Next #1307

Merged: ikawrakow merged 2 commits into main from ik/graph_parallel_tweak on Feb 24, 2026

Conversation

@ikawrakow (Owner)

The Qwen3-Next graph-parallel implementation computes the delta-net attention on a single GPU, see #1292. Because the preceding graph split is always the FFN portion, which ends with a reduce operation, the FFN result (the input to the delta-net) is already present on every participating GPU. This PR makes sure the delta-net takes its input from the copy of the FFN result on the GPU where the delta-net is computed, thus avoiding a copy between devices.

A second tweak: for the final matrix multiplication with the output tensor, the FFN result is likewise taken from the GPU where the output tensor is stored. This may give a minor performance improvement for other models as well (a quick check with GLM-4.5-AIR showed about 1% improvement for PP and TG).

These two tweaks combined give a few percent better PP and TG for Qwen3-Next when using split mode graph. Here is what I get on a 2x3090 system for IQ4_XS-quantized Qwen3-Next:

(performance plots: q3next_pp, q3next_tg)

ikawrakow force-pushed the ik/graph_parallel_tweak branch from 74bc685 to 35da97d on February 23, 2026
@magikRUKKOLA commented Feb 23, 2026

ERRDATA: random fluctuations observed

Yeah, there is a little boost for the Qwen3.5 IQ2_KL.

main:

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 4096 | 1024 | 0 | 4.669 | 877.28 | 31.084 | 32.94 |
| 4096 | 1024 | 4096 | 4.794 | 854.35 | 30.341 | 33.75 |
| 4096 | 1024 | 8192 | 4.922 | 832.12 | 30.706 | 33.35 |
| 4096 | 1024 | 12288 | 5.054 | 810.50 | 31.140 | 32.88 |
| 4096 | 1024 | 16384 | 5.178 | 791.00 | 31.501 | 32.51 |
| 4096 | 1024 | 20480 | 5.306 | 772.01 | 32.079 | 31.92 |
| 4096 | 1024 | 24576 | 5.442 | 752.62 | 32.252 | 31.75 |
| 4096 | 1024 | 28672 | 5.561 | 736.54 | 32.735 | 31.28 |
| 4096 | 1024 | 32768 | 5.691 | 719.72 | 32.990 | 31.04 |
| 4096 | 1024 | 36864 | 5.819 | 703.92 | 33.530 | 30.54 |
| 4096 | 1024 | 40960 | 5.965 | 686.71 | 33.650 | 30.43 |
| 4096 | 1024 | 45056 | 6.080 | 673.67 | 34.005 | 30.11 |
| 4096 | 1024 | 49152 | 6.217 | 658.88 | 34.599 | 29.60 |
| 4096 | 1024 | 53248 | 6.339 | 646.17 | 34.928 | 29.32 |
| 4096 | 1024 | 57344 | 6.469 | 633.13 | 35.417 | 28.91 |
| 4096 | 1024 | 61440 | 6.608 | 619.84 | 35.809 | 28.60 |
| 4096 | 1024 | 65536 | 6.716 | 609.86 | 35.979 | 28.46 |

pr-1307:

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 4096 | 1024 | 0 | 4.646 | 881.64 | 30.362 | 33.73 |
| 4096 | 1024 | 4096 | 4.779 | 857.07 | 30.373 | 33.71 |
| 4096 | 1024 | 8192 | 4.901 | 835.77 | 30.686 | 33.37 |
| 4096 | 1024 | 12288 | 5.028 | 814.69 | 31.211 | 32.81 |
| 4096 | 1024 | 16384 | 5.156 | 794.41 | 31.628 | 32.38 |
| 4096 | 1024 | 20480 | 5.282 | 775.52 | 31.925 | 32.07 |
| 4096 | 1024 | 24576 | 5.409 | 757.27 | 32.263 | 31.74 |
| 4096 | 1024 | 28672 | 5.554 | 737.53 | 32.611 | 31.40 |
| 4096 | 1024 | 32768 | 5.659 | 723.86 | 33.082 | 30.95 |
| 4096 | 1024 | 36864 | 5.811 | 704.88 | 33.540 | 30.53 |
| 4096 | 1024 | 40960 | 5.922 | 691.67 | 33.710 | 30.38 |
| 4096 | 1024 | 45056 | 6.052 | 676.84 | 34.233 | 29.91 |
| 4096 | 1024 | 49152 | 6.178 | 662.99 | 34.505 | 29.68 |
| 4096 | 1024 | 53248 | 6.313 | 648.81 | 34.962 | 29.29 |
| 4096 | 1024 | 57344 | 6.439 | 636.09 | 35.288 | 29.02 |
| 4096 | 1024 | 61440 | 6.576 | 622.89 | 35.409 | 28.92 |
| 4096 | 1024 | 65536 | 6.703 | 611.11 | 36.141 | 28.33 |

@ikawrakow (Owner, Author)

@magikRUKKOLA

The PR doesn't do anything for Qwen-3.5 (no graph parallel there yet), so these are random fluctuations.

@ubergarm (Contributor)

ubergarm commented Feb 23, 2026

(plot: sweep-bench-Qwen3-Coder-Next-PR1307)

-sm layer main@68bd30d9

```shell
model=/mnt/raid/models/ggml-org/Qwen3-Coder-Next-GGUF/Qwen3-Coder-Next-Q8_0.gguf
CUDA_VISIBLE_DEVICES="0,1" \
./build/bin/llama-sweep-bench \
  --model "$model" \
  -c 135168 \
  --merge-qkv \
  -ger \
  -sm layer \
  -ngl 99 \
  --threads 1 \
  --n-predict 64 \
  -ub 4096 -b 4096 \
  --warmup-batch
```
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 4096 | 64 | 0 | 1.859 | 2203.51 | 0.778 | 82.27 |
| 4096 | 64 | 4096 | 1.899 | 2156.42 | 0.768 | 83.32 |
| 4096 | 64 | 8192 | 1.941 | 2110.09 | 0.778 | 82.26 |
| 4096 | 64 | 12288 | 2.003 | 2044.79 | 0.796 | 80.41 |
| 4096 | 64 | 16384 | 2.063 | 1985.70 | 0.804 | 79.64 |
| 4096 | 64 | 20480 | 2.118 | 1933.79 | 0.815 | 78.51 |
| 4096 | 64 | 24576 | 2.171 | 1886.88 | 0.827 | 77.37 |
| 4096 | 64 | 28672 | 2.235 | 1832.93 | 0.837 | 76.50 |
| 4096 | 64 | 32768 | 2.288 | 1790.40 | 0.854 | 74.95 |
| 4096 | 64 | 36864 | 2.354 | 1739.69 | 0.857 | 74.67 |
| 4096 | 64 | 40960 | 2.410 | 1699.38 | 0.867 | 73.79 |
| 4096 | 64 | 45056 | 2.471 | 1657.49 | 0.880 | 72.71 |
| 4096 | 64 | 49152 | 2.512 | 1630.57 | 0.888 | 72.07 |
| 4096 | 64 | 53248 | 2.574 | 1591.25 | 0.899 | 71.22 |
| 4096 | 64 | 57344 | 2.637 | 1553.49 | 0.910 | 70.36 |
| 4096 | 64 | 61440 | 2.695 | 1519.81 | 0.920 | 69.57 |
| 4096 | 64 | 65536 | 2.752 | 1488.55 | 0.936 | 68.38 |
| 4096 | 64 | 69632 | 2.796 | 1465.01 | 0.941 | 68.05 |
| 4096 | 64 | 73728 | 2.859 | 1432.76 | 0.951 | 67.30 |
| 4096 | 64 | 77824 | 2.912 | 1406.46 | 0.964 | 66.41 |
| 4096 | 64 | 81920 | 2.981 | 1373.92 | 0.971 | 65.88 |
| 4096 | 64 | 86016 | 3.048 | 1343.81 | 0.988 | 64.76 |
| 4096 | 64 | 90112 | 3.108 | 1317.96 | 0.994 | 64.39 |
| 4096 | 64 | 94208 | 3.162 | 1295.31 | 1.003 | 63.84 |
| 4096 | 64 | 98304 | 3.229 | 1268.41 | 1.019 | 62.80 |
| 4096 | 64 | 102400 | 3.298 | 1241.88 | 1.025 | 62.47 |
| 4096 | 64 | 106496 | 3.350 | 1222.74 | 1.034 | 61.89 |
| 4096 | 64 | 110592 | 3.408 | 1201.75 | 1.049 | 60.99 |
| 4096 | 64 | 114688 | 3.478 | 1177.59 | 1.056 | 60.60 |
| 4096 | 64 | 118784 | 3.514 | 1165.74 | 1.075 | 59.53 |
| 4096 | 64 | 122880 | 3.581 | 1143.87 | 1.079 | 59.32 |
| 4096 | 64 | 126976 | 3.643 | 1124.24 | 1.089 | 58.78 |
| 4096 | 64 | 131072 | 3.713 | 1103.25 | 1.105 | 57.93 |

-sm graph main@68bd30d9

```shell
model=/mnt/raid/models/ggml-org/Qwen3-Coder-Next-GGUF/Qwen3-Coder-Next-Q8_0.gguf
CUDA_VISIBLE_DEVICES="0,1" \
./build/bin/llama-sweep-bench \
  --model "$model" \
  -c 135168 \
  -ger \
  -sm graph \
  -ngl 99 \
  --threads 1 \
  --n-predict 64 \
  -ub 4096 -b 4096 \
  --warmup-batch
```
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 4096 | 64 | 0 | 2.304 | 1777.62 | 0.984 | 65.06 |
| 4096 | 64 | 4096 | 2.427 | 1687.66 | 0.965 | 66.29 |
| 4096 | 64 | 8192 | 2.351 | 1742.39 | 0.973 | 65.75 |
| 4096 | 64 | 12288 | 2.349 | 1743.45 | 0.976 | 65.60 |
| 4096 | 64 | 16384 | 2.415 | 1696.05 | 0.982 | 65.15 |
| 4096 | 64 | 20480 | 2.421 | 1691.54 | 1.006 | 63.64 |
| 4096 | 64 | 24576 | 2.453 | 1669.98 | 1.010 | 63.39 |
| 4096 | 64 | 28672 | 2.478 | 1652.84 | 1.012 | 63.23 |
| 4096 | 64 | 32768 | 2.503 | 1636.27 | 1.015 | 63.03 |
| 4096 | 64 | 36864 | 2.541 | 1611.90 | 1.021 | 62.67 |
| 4096 | 64 | 40960 | 2.568 | 1595.17 | 1.028 | 62.26 |
| 4096 | 64 | 45056 | 2.602 | 1574.42 | 1.048 | 61.08 |
| 4096 | 64 | 49152 | 2.632 | 1556.10 | 1.049 | 61.04 |
| 4096 | 64 | 53248 | 2.658 | 1540.90 | 1.053 | 60.80 |
| 4096 | 64 | 57344 | 2.696 | 1519.36 | 1.056 | 60.62 |
| 4096 | 64 | 61440 | 2.727 | 1501.77 | 1.064 | 60.17 |
| 4096 | 64 | 65536 | 2.747 | 1491.30 | 1.079 | 59.29 |
| 4096 | 64 | 69632 | 2.792 | 1466.85 | 1.082 | 59.14 |
| 4096 | 64 | 73728 | 2.813 | 1456.27 | 1.085 | 58.97 |
| 4096 | 64 | 77824 | 2.853 | 1435.58 | 1.092 | 58.61 |
| 4096 | 64 | 81920 | 2.881 | 1421.91 | 1.096 | 58.38 |
| 4096 | 64 | 86016 | 2.915 | 1405.32 | 1.107 | 57.79 |
| 4096 | 64 | 90112 | 2.941 | 1392.88 | 1.111 | 57.62 |
| 4096 | 64 | 94208 | 2.978 | 1375.39 | 1.116 | 57.36 |
| 4096 | 64 | 98304 | 3.000 | 1365.32 | 1.118 | 57.23 |
| 4096 | 64 | 102400 | 3.042 | 1346.30 | 1.123 | 56.97 |
| 4096 | 64 | 106496 | 3.074 | 1332.34 | 1.129 | 56.66 |
| 4096 | 64 | 110592 | 3.118 | 1313.62 | 1.141 | 56.08 |
| 4096 | 64 | 114688 | 3.131 | 1308.08 | 1.142 | 56.06 |
| 4096 | 64 | 118784 | 3.167 | 1293.41 | 1.146 | 55.84 |
| 4096 | 64 | 122880 | 3.199 | 1280.21 | 1.153 | 55.52 |
| 4096 | 64 | 126976 | 3.227 | 1269.21 | 1.157 | 55.31 |
| 4096 | 64 | 131072 | 3.254 | 1258.58 | 1.169 | 54.77 |

-sm graph PR1307 ik/graph_parallel_tweak@35da97d5

```shell
model=/mnt/raid/models/ggml-org/Qwen3-Coder-Next-GGUF/Qwen3-Coder-Next-Q8_0.gguf
CUDA_VISIBLE_DEVICES="0,1" \
./build/bin/llama-sweep-bench \
  --model "$model" \
  -c 135168 \
  -ger \
  -sm graph \
  -ngl 99 \
  --threads 1 \
  --n-predict 64 \
  -ub 4096 -b 4096 \
  --warmup-batch
```
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 4096 | 64 | 0 | 2.019 | 2028.63 | 0.910 | 70.30 |
| 4096 | 64 | 4096 | 2.038 | 2010.05 | 0.897 | 71.35 |
| 4096 | 64 | 8192 | 2.051 | 1996.71 | 0.898 | 71.28 |
| 4096 | 64 | 12288 | 2.068 | 1980.88 | 0.906 | 70.62 |
| 4096 | 64 | 16384 | 2.096 | 1954.19 | 0.918 | 69.73 |
| 4096 | 64 | 20480 | 2.125 | 1927.30 | 0.929 | 68.91 |
| 4096 | 64 | 24576 | 2.150 | 1905.39 | 0.941 | 68.05 |
| 4096 | 64 | 28672 | 2.187 | 1872.52 | 0.947 | 67.60 |
| 4096 | 64 | 32768 | 2.215 | 1848.85 | 0.951 | 67.29 |
| 4096 | 64 | 36864 | 2.230 | 1836.76 | 0.958 | 66.83 |
| 4096 | 64 | 40960 | 2.267 | 1807.15 | 0.961 | 66.63 |
| 4096 | 64 | 45056 | 2.300 | 1780.66 | 0.974 | 65.68 |
| 4096 | 64 | 49152 | 2.326 | 1760.63 | 0.977 | 65.50 |
| 4096 | 64 | 53248 | 2.361 | 1735.13 | 0.984 | 65.03 |
| 4096 | 64 | 57344 | 2.401 | 1705.98 | 0.991 | 64.56 |
| 4096 | 64 | 61440 | 2.428 | 1687.29 | 0.994 | 64.37 |
| 4096 | 64 | 65536 | 2.458 | 1666.43 | 1.011 | 63.31 |
| 4096 | 64 | 69632 | 2.491 | 1644.53 | 1.012 | 63.26 |
| 4096 | 64 | 73728 | 2.525 | 1622.12 | 1.015 | 63.06 |
| 4096 | 64 | 77824 | 2.561 | 1599.08 | 1.025 | 62.47 |
| 4096 | 64 | 81920 | 2.596 | 1577.83 | 1.027 | 62.29 |
| 4096 | 64 | 86016 | 2.637 | 1553.41 | 1.042 | 61.42 |
| 4096 | 64 | 90112 | 2.664 | 1537.81 | 1.045 | 61.23 |
| 4096 | 64 | 94208 | 2.711 | 1511.03 | 1.047 | 61.13 |
| 4096 | 64 | 98304 | 2.722 | 1504.53 | 1.051 | 60.91 |
| 4096 | 64 | 102400 | 2.768 | 1479.87 | 1.058 | 60.48 |
| 4096 | 64 | 106496 | 2.792 | 1467.23 | 1.061 | 60.30 |
| 4096 | 64 | 110592 | 2.834 | 1445.06 | 1.075 | 59.56 |
| 4096 | 64 | 114688 | 2.856 | 1434.32 | 1.078 | 59.34 |
| 4096 | 64 | 118784 | 2.882 | 1421.25 | 1.082 | 59.16 |
| 4096 | 64 | 122880 | 2.930 | 1397.73 | 1.086 | 58.91 |
| 4096 | 64 | 126976 | 2.947 | 1390.05 | 1.094 | 58.50 |
| 4096 | 64 | 131072 | 2.976 | 1376.35 | 1.104 | 57.96 |

@ikawrakow ikawrakow merged commit 7065488 into main Feb 24, 2026
Nexesenex added a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request Feb 25, 2026
abc-nix pushed a commit to abc-nix/ik_llama.cpp that referenced this pull request Feb 26, 2026
* Make sure we pick the reduced tensor from the right GPU

* Minor
abc-nix pushed a commit to abc-nix/ik_llama.cpp that referenced this pull request Feb 26, 2026
* Better estimate for max. number of compute nodes

* Just in case

server: fix crash from adaptive p (ikawrakow#1304)

Co-authored-by: firecoperana <firecoperana>

Fix tool call for Qwen3.5 (ikawrakow#1300)

* Fix tool call for Qwen3.5

Loosely based on mainline changes from:
* ggml-org/llama.cpp#19635
* ggml-org/llama.cpp#19765

Also need to change the grammar to allow the model to make multiple
tool calls in a row. This was likely broken for Qwen3 Coder prior to
this commit.

* Fix the grammar for the subsequent parameters after the first one

Graph parallel for Qwen3-Next (ikawrakow#1292)

* WIP

* This works, but is slower than split mode layer

Fix llm_arch_is_hybrid (ikawrakow#1305)

Fix max nodes (again) (ikawrakow#1306)

Fix typo in merge-up-gate-experts argument (ikawrakow#1311)

llama-quantize: --dry-run option (ikawrakow#1309)

Slightly better graph parallel for Qwen3-Next (ikawrakow#1307)

* Make sure we pick the reduced tensor from the right GPU

* Minor

Minor delta-net tweak (ikawrakow#1308)

* Make sure we pick the reduced tensor from the right GPU

* Minor

* Minor delta-net tweak

adaptive p: collect probability before logit bias (ikawrakow#1314)

server: propagate task index to response objects for batch requests (ikawrakow#1303)

When multiple prompts are sent in a single /v1/completions request,
each response needs to carry the correct index so the client can
match results to their corresponding prompts. The index field was
not being set on partial responses, final responses, or embedding
responses, causing batch results to all report index 0.

Set res->index = slot.task->index in send_partial_response,
send_final_response, and send_embedding.

Generated with [Devin](https://cli.devin.ai/docs)

Co-authored-by: Joshua Jolley <jjolley@clearwateranalytics.com>
Co-authored-by: Devin <noreply@cognition.ai>

Llama-quantize: Partial requant feature (ikawrakow#1313)

* Partial Requant feature for llama-quantize

- Inspired by the recently ported --dry-run feature.
- Allows partially requantizing a split quantized .gguf by requantizing only the splits missing from the destination directory.
- Works both for GGUFs split tensor by tensor and for GGUFs split in groups of several tensors (though the latter is not well tested beyond 2 tensors per split).
- Vibe coded.

* Create output directory if it doesn't exist in llama-quantize

* Create output directory if it doesn't exist in gguf-split

* Add exit when directory fails to be created on Windows

* Use std::filesystem

* cleanup

Display the size of the tensors overridden during tensor loading (ikawrakow#1318)

* Display the size of the tensors overridden during tensor loading

Ex:

`Tensor blk.60.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.60.ffn_up_exps.weight buffer type overriden to CPU`

become

`Tensor blk.60.ffn_up_exps.weight (size = 668467200 bytes) buffer type overriden to CPU
Tensor blk.60.ffn_gate_exps.weight (size = 668467200 bytes) buffer type overriden to CPU`

And move to debug level the size display of the unnamed buffer overrides shown later.

Ex: `llm_load_tensors:        CPU buffer size =   XXX.XX MiB`

That duplicate display clutters the screen without being very informative.

* change bytes display to MiB.

Co-authored-by: Kawrakow <iwankawrakow@gmail.com>

---------

Co-authored-by: Kawrakow <iwankawrakow@gmail.com>

Fused delta-net (ikawrakow#1315)

* Revive fused delta-net

* Add command line argument for fused delta net

* Simplify/improve CUDA delta-net

* Add -fdn to llama-bench

* More CUDA fused delta net optimizations

* CPU optimizations

* Much faster fused delta-net on the CPU

It seems it is faster than the chunked implementation!

* Change meaning of fdn from bool flag to threshold value

* Use eps = 1e-6

* Give some nodes a name

Fix KT quantization yet again (ikawrakow#1321)

* Fix KT quantization yet again

* Add same 1e-16f check for all quants in iqk_quantize.cpp

* Fixes for k-quants

* Also this one

server: enable checkpoint for recurrent models (ikawrakow#1310)

* server: enable checkpoint for recurrent models

create checkpoint after cancel

fix ban string and rm context during rewind

add checkpoint interval

only save recurrent cache

* save checkpoint during pp

---------

Co-authored-by: firecoperana <firecoperana>

Faster quantization for MoE models with many experts (ikawrakow#1322)

Fused delta net 2 (ikawrakow#1320)

* Revive fused delta-net

* Add command line argument for fused delta net

* Simplify/improve CUDA delta-net

* Add -fdn to llama-bench

* More CUDA fused delta net optimizations

* CPU optimizations

* Much faster fused delta-net on the CPU

It seems it is faster than the chunked implementation!

* Change meaning of fdn from bool flag to threshold value

* Use eps = 1e-6

* Give some nodes a name

* Don't re-apply L2 norm - it has already been done

* This seems quite a bit better

* More tweaks

* Restore per context buffer size log

Not everybody uses models split in 2000 parts, and those who do
actually want to see the buffer sizes.

Adding support for dense Qwen-3.5 models (ikawrakow#1326)

add directio to llama-bench