
Graph parallel for Qwen3-Next #1292

Merged — ikawrakow merged 2 commits into main from ik/sm_graph_q3next on Feb 23, 2026
Conversation

@ikawrakow (Owner) commented Feb 20, 2026

I wanted to see how far one can get with graph parallel for Qwen3-Next. Spoiler alert: not very far.

Nevertheless, here is the PR.

Parallelizing the recurrent attention over 2 or more GPUs is basically hopeless, so that part runs on a single GPU. The standard attention layers (1 out of 4 in Qwen3-Next) and the FFN are split between the GPUs. Why doesn't this help performance? My guess is that both the FFN matrix multiplications and the standard attention computations are far too small to derive a significant benefit from being split between GPUs (the FFN hidden dimension is just 512, and there are just 2 KV attention heads). At the same time we pay the price of synchronization / data exchange between the GPUs, so the net effect is lower performance.
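As a rough illustration of that imbalance, here is a back-of-envelope sketch. Only the 512 FFN hidden dimension comes from the text above; the embedding dimension, GPU throughput, and PCIe figures are assumptions for illustration:

```python
# Back-of-envelope: per-token FFN compute vs. inter-GPU sync cost
# during token generation (batch of 1 token).

hidden = 2048        # embedding dim (assumed for illustration)
ffn_hidden = 512     # per-expert FFN hidden dim (from the PR text)

# FLOPs for one expert's up/gate/down matmuls on a single token:
# 3 matmuls, 2 FLOPs per multiply-add.
flops = 3 * 2 * hidden * ffn_hidden

gpu_tflops = 35e12   # rough dense fp16 throughput of a 3090 (assumption)
compute_us = flops / gpu_tflops * 1e6

# Exchanging the fp16 hidden state between 2 GPUs over PCIe:
pcie_bps = 25e9      # ~25 GB/s effective bandwidth (assumption)
transfer_us = (hidden * 2) / pcie_bps * 1e6
latency_us = 10.0    # typical per-transfer launch/sync latency (assumption)

print(f"compute ~{compute_us:.2f} us, sync ~{transfer_us + latency_us:.2f} us")
```

With these (admittedly crude) numbers the matmul takes a fraction of a microsecond while the synchronization latency alone is an order of magnitude larger, which is consistent with the slowdown observed below.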

Despite not actually benefiting from graph parallel, I may still merge this PR because one of its effects is that Qwen3-Next now uses the existing standard attention graph-building methods, so the bespoke implementation has been removed. The PR looks bigger than it actually is because it contains the changes from #1288, which have not been merged yet.

Anyway, here are sweep-bench results for split mode layer and split mode graph (this PR) on a 2x3090 system. The model is quantized with IQ4_XS, and I used a Q8_0 KV cache to be able to go to a context of 100k tokens. We see that PP with split mode graph becomes better at around 64k tokens; TG is roughly on par with split mode layer only near 100k tokens.
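For reference, an invocation along these lines should reproduce the setup (the model path is hypothetical, and the exact flag set is assumed from the llama-sweep-bench conventions used later in this thread; -sm graph is the mode added by this PR):

```shell
# Hypothetical model path; -ctk/-ctv q8_0 give the Q8_0 KV cache
# needed to reach ~100k context. Swap -sm graph for -sm layer to
# get the baseline run.
./build/bin/llama-sweep-bench \
  --model qwen3-next-iq4_xs.gguf \
  -c 102400 \
  -ctk q8_0 -ctv q8_0 \
  -ngl 99 \
  -ub 2048 -b 2048 \
  -sm graph
```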

Split mode graph

PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
2048 128 0 1.142 1793.52 1.582 80.91
2048 128 2048 1.022 2003.78 1.573 81.39
2048 128 4096 0.974 2103.06 1.586 80.72
2048 128 6144 0.979 2092.72 1.597 80.16
2048 128 8192 0.992 2064.25 1.611 79.46
2048 128 10240 1.001 2045.93 1.624 78.80
2048 128 12288 1.009 2029.11 1.632 78.42
2048 128 14336 1.016 2014.99 1.648 77.65
2048 128 16384 1.023 2002.24 1.668 76.75
2048 128 18432 1.032 1984.71 1.686 75.90
2048 128 20480 1.038 1972.92 1.701 75.24
2048 128 22528 1.043 1963.13 1.724 74.26
2048 128 24576 1.052 1946.95 1.732 73.91
2048 128 26624 1.057 1937.13 1.744 73.39
2048 128 28672 1.064 1925.70 1.754 72.96
2048 128 30720 1.073 1907.94 1.764 72.55
2048 128 32768 1.078 1899.64 1.774 72.17
2048 128 34816 1.081 1894.87 1.782 71.82
2048 128 36864 1.091 1876.36 1.792 71.41
2048 128 38912 1.097 1866.28 1.801 71.07
2048 128 40960 1.107 1849.23 1.816 70.47
2048 128 43008 1.110 1844.65 1.843 69.47
2048 128 45056 1.117 1833.49 1.850 69.19
2048 128 47104 1.127 1816.43 1.858 68.89
2048 128 49152 1.134 1805.67 1.863 68.69
2048 128 51200 1.140 1795.90 1.874 68.31
2048 128 53248 1.147 1784.87 1.884 67.95
2048 128 55296 1.152 1778.36 1.895 67.56
2048 128 57344 1.160 1764.84 1.905 67.20
2048 128 59392 1.169 1752.30 1.908 67.10
2048 128 61440 1.177 1739.71 1.917 66.78
2048 128 63488 1.183 1730.89 1.944 65.86
2048 128 65536 1.190 1721.12 1.953 65.55
2048 128 67584 1.201 1705.49 1.963 65.20
2048 128 69632 1.206 1697.68 1.970 64.99
2048 128 71680 1.213 1688.58 1.979 64.69
2048 128 73728 1.221 1676.63 1.987 64.42
2048 128 75776 1.233 1661.66 1.996 64.13
2048 128 77824 1.236 1657.49 2.013 63.59
2048 128 79872 1.248 1641.32 2.020 63.35
2048 128 81920 1.254 1633.39 2.043 62.66
2048 128 83968 1.265 1618.68 2.057 62.22
2048 128 86016 1.270 1612.53 2.063 62.04
2048 128 88064 1.282 1597.75 2.068 61.89
2048 128 90112 1.288 1590.46 2.080 61.53
2048 128 92160 1.294 1583.26 2.092 61.20
2048 128 94208 1.305 1569.33 2.118 60.43
2048 128 96256 1.311 1562.42 2.110 60.67

Split mode layer

PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
2048 128 0 0.897 2283.81 1.158 110.51
2048 128 2048 0.799 2564.45 1.154 110.93
2048 128 4096 0.811 2525.24 1.165 109.85
2048 128 6144 0.822 2491.46 1.184 108.10
2048 128 8192 0.834 2455.28 1.208 105.98
2048 128 10240 0.848 2414.29 1.234 103.74
2048 128 12288 0.862 2375.13 1.278 100.12
2048 128 14336 0.870 2352.75 1.280 100.00
2048 128 16384 0.879 2330.19 1.295 98.86
2048 128 18432 0.894 2289.64 1.324 96.69
2048 128 20480 0.907 2258.41 1.333 96.02
2048 128 22528 0.916 2234.72 1.367 93.63
2048 128 24576 0.930 2203.23 1.383 92.57
2048 128 26624 0.942 2173.18 1.398 91.58
2048 128 28672 0.955 2144.06 1.415 90.48
2048 128 30720 0.968 2115.83 1.434 89.29
2048 128 32768 0.981 2087.85 1.469 87.12
2048 128 34816 0.992 2063.89 1.495 85.59
2048 128 36864 0.999 2049.52 1.507 84.92
2048 128 38912 1.012 2023.99 1.521 84.15
2048 128 40960 1.025 1998.37 1.542 83.02
2048 128 43008 1.042 1964.56 1.571 81.48
2048 128 45056 1.054 1942.22 1.589 80.57
2048 128 47104 1.069 1915.01 1.606 79.70
2048 128 49152 1.080 1896.23 1.625 78.75
2048 128 51200 1.097 1866.63 1.641 77.99
2048 128 53248 1.100 1861.49 1.673 76.51
2048 128 55296 1.116 1834.68 1.693 75.61
2048 128 57344 1.139 1797.52 1.709 74.91
2048 128 59392 1.144 1789.85 1.728 74.10
2048 128 61440 1.161 1763.33 1.746 73.32
2048 128 63488 1.177 1739.84 1.783 71.80
2048 128 65536 1.184 1729.79 1.799 71.16
2048 128 67584 1.202 1703.98 1.816 70.49
2048 128 69632 1.214 1686.35 1.834 69.79
2048 128 71680 1.231 1664.32 1.851 69.15
2048 128 73728 1.239 1652.93 1.890 67.72
2048 128 75776 1.256 1630.05 1.908 67.07
2048 128 77824 1.274 1607.10 1.927 66.42
2048 128 79872 1.288 1589.84 1.947 65.75
2048 128 81920 1.302 1573.43 1.965 65.14
2048 128 83968 1.311 1562.40 1.990 64.32
2048 128 86016 1.326 1544.34 2.015 63.54
2048 128 88064 1.342 1525.84 2.035 62.91
2048 128 90112 1.342 1525.99 2.050 62.45
2048 128 92160 1.357 1509.05 2.068 61.90
2048 128 94208 1.378 1486.01 2.091 61.21
2048 128 96256 1.390 1473.45 2.122 60.32

@ubergarm (Contributor)

Yes, -sm graph seems to be working in my limited testing, and it can eventually become faster at long context depths.

(chart: sweep-bench, Qwen3-Coder-Next, PR1292)

-sm layer

model=/mnt/raid/models/ggml-org/Qwen3-Coder-Next-GGUF/Qwen3-Coder-Next-Q8_0.gguf
CUDA_VISIBLE_DEVICES="0,1" \
./build/bin/llama-sweep-bench \
  --model "$model" \
  -c 135168 \
  -fa on \
  --merge-qkv \
  -ger \
  -sm layer \
  -ngl 99 \
  --threads 1 \
  --n-predict 64 \
  -ub 2048 -b 2048 \
  --warmup-batch
  # alternate -sm graph and -sm layer for these two tests
PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
2048 64 0 0.870 2353.68 0.775 82.54
2048 64 2048 0.877 2336.33 0.762 83.95
2048 64 4096 0.888 2306.60 0.763 83.83
2048 64 6144 0.899 2278.82 0.768 83.32
2048 64 8192 0.909 2253.72 0.775 82.53
2048 64 10240 0.923 2219.65 0.785 81.50
2048 64 12288 0.935 2190.40 0.796 80.36
2048 64 14336 0.942 2173.39 0.799 80.12
2048 64 16384 0.958 2138.06 0.803 79.69
2048 64 18432 0.967 2117.92 0.806 79.37
2048 64 20480 0.983 2083.52 0.811 78.88
2048 64 22528 0.995 2057.90 0.824 77.65
2048 64 24576 1.013 2021.20 0.827 77.43
2048 64 26624 1.025 1998.44 0.830 77.08
2048 64 28672 1.036 1977.51 0.836 76.57
2048 64 30720 1.051 1949.52 0.839 76.31
2048 64 32768 1.067 1919.87 0.852 75.10
2048 64 34816 1.075 1905.99 0.854 74.97
2048 64 36864 1.090 1878.64 0.856 74.74
2048 64 38912 1.106 1851.97 0.862 74.23
2048 64 40960 1.125 1820.78 0.867 73.80
2048 64 43008 1.136 1803.43 0.878 72.89
2048 64 45056 1.145 1787.90 0.881 72.63
2048 64 47104 1.165 1758.37 0.882 72.56
2048 64 49152 1.178 1739.07 0.888 72.04
2048 64 51200 1.189 1722.99 0.894 71.61
2048 64 53248 1.192 1718.40 0.899 71.19
2048 64 55296 1.211 1691.65 0.910 70.35
2048 64 57344 1.227 1668.64 0.912 70.17
2048 64 59392 1.249 1640.36 0.916 69.87
2048 64 61440 1.255 1632.52 0.921 69.46
2048 64 63488 1.278 1601.92 0.926 69.13
2048 64 65536 1.285 1594.29 0.936 68.34
2048 64 67584 1.306 1568.20 0.938 68.22
2048 64 69632 1.320 1551.54 0.946 67.67
2048 64 71680 1.330 1540.41 0.948 67.48
2048 64 73728 1.351 1515.61 0.953 67.18
2048 64 75776 1.364 1501.78 0.964 66.39
2048 64 77824 1.380 1484.42 0.965 66.35
2048 64 79872 1.394 1468.92 0.968 66.15
2048 64 81920 1.399 1464.37 0.972 65.84
2048 64 83968 1.424 1438.34 0.978 65.45
2048 64 86016 1.431 1431.39 0.990 64.62
2048 64 88064 1.453 1409.48 0.992 64.48
2048 64 90112 1.453 1409.53 0.994 64.37
2048 64 92160 1.469 1394.22 1.000 64.02
2048 64 94208 1.494 1371.07 1.003 63.79
2048 64 96256 1.513 1353.61 1.011 63.31
2048 64 98304 1.520 1347.46 1.020 62.74
2048 64 100352 1.539 1330.74 1.022 62.62
2048 64 102400 1.549 1322.38 1.025 62.44
2048 64 104448 1.552 1319.20 1.029 62.19
2048 64 106496 1.580 1295.80 1.037 61.69
2048 64 108544 1.589 1288.76 1.050 60.97
2048 64 110592 1.604 1277.13 1.049 60.98
2048 64 112640 1.618 1265.89 1.053 60.77
2048 64 114688 1.628 1257.64 1.057 60.55
2048 64 116736 1.646 1244.22 1.063 60.22
2048 64 118784 1.652 1240.07 1.073 59.65
2048 64 120832 1.664 1230.89 1.079 59.34
2048 64 122880 1.676 1221.81 1.079 59.29
2048 64 124928 1.693 1209.88 1.083 59.11
2048 64 126976 1.700 1204.52 1.089 58.74
2048 64 129024 1.718 1192.34 1.100 58.20
2048 64 131072 1.736 1179.95 1.105 57.94
2048 64 133120 1.754 1167.44 1.106 57.85

-sm graph

PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
2048 64 0 1.063 1927.05 0.980 65.32
2048 64 2048 1.120 1828.65 0.966 66.26
2048 64 4096 1.071 1912.96 0.968 66.13
2048 64 6144 1.072 1911.04 0.965 66.32
2048 64 8192 1.077 1901.32 0.961 66.57
2048 64 10240 1.086 1886.50 0.967 66.17
2048 64 12288 1.091 1877.21 0.976 65.59
2048 64 14336 1.095 1870.08 0.980 65.31
2048 64 16384 1.106 1852.43 0.984 65.03
2048 64 18432 1.107 1850.18 0.991 64.59
2048 64 20480 1.114 1838.05 0.994 64.36
2048 64 22528 1.123 1823.94 1.008 63.51
2048 64 24576 1.129 1813.33 1.010 63.37
2048 64 26624 1.134 1805.60 1.011 63.30
2048 64 28672 1.142 1793.90 1.011 63.31
2048 64 30720 1.148 1784.33 1.015 63.08
2048 64 32768 1.156 1772.29 1.016 62.97
2048 64 34816 1.163 1760.76 1.018 62.88
2048 64 36864 1.169 1752.12 1.023 62.55
2048 64 38912 1.178 1737.97 1.026 62.38
2048 64 40960 1.183 1730.67 1.027 62.33
2048 64 43008 1.193 1716.95 1.038 61.64
2048 64 45056 1.199 1708.02 1.041 61.45
2048 64 47104 1.205 1699.26 1.041 61.46
2048 64 49152 1.216 1683.94 1.044 61.32
2048 64 51200 1.217 1683.13 1.046 61.21
2048 64 53248 1.229 1666.57 1.048 61.09
2048 64 55296 1.239 1653.14 1.049 60.98
2048 64 57344 1.247 1642.30 1.052 60.85
2048 64 59392 1.254 1632.81 1.055 60.66
2048 64 61440 1.266 1617.56 1.060 60.37
2048 64 63488 1.268 1615.40 1.061 60.34
2048 64 65536 1.271 1611.18 1.071 59.77
2048 64 67584 1.291 1586.48 1.077 59.44
2048 64 69632 1.295 1580.93 1.077 59.43
2048 64 71680 1.306 1567.64 1.080 59.27
2048 64 73728 1.311 1562.14 1.083 59.12
2048 64 75776 1.318 1554.26 1.084 59.05
2048 64 77824 1.329 1541.39 1.087 58.89
2048 64 79872 1.335 1533.70 1.088 58.81
2048 64 81920 1.341 1526.89 1.092 58.59
2048 64 83968 1.351 1516.46 1.095 58.46
2048 64 86016 1.365 1500.81 1.105 57.94
2048 64 88064 1.364 1501.46 1.104 57.97
2048 64 90112 1.375 1489.32 1.107 57.80
2048 64 92160 1.377 1487.36 1.109 57.72
2048 64 94208 1.390 1473.34 1.109 57.69
2048 64 96256 1.400 1462.59 1.111 57.62
2048 64 98304 1.408 1454.95 1.116 57.37
2048 64 100352 1.417 1445.39 1.120 57.12
2048 64 102400 1.422 1440.13 1.122 57.05
2048 64 104448 1.431 1431.33 1.123 57.00
2048 64 106496 1.443 1418.84 1.127 56.78
2048 64 108544 1.452 1410.13 1.135 56.38
2048 64 110592 1.458 1404.76 1.137 56.27
2048 64 112640 1.470 1393.66 1.138 56.22
2048 64 114688 1.479 1384.70 1.141 56.08
2048 64 116736 1.476 1387.32 1.144 55.96
2048 64 118784 1.494 1370.77 1.146 55.87
2048 64 120832 1.495 1369.77 1.148 55.76
2048 64 122880 1.501 1364.81 1.150 55.66
2048 64 124928 1.513 1353.29 1.153 55.52
2048 64 126976 1.527 1341.45 1.156 55.38
2048 64 129024 1.531 1337.60 1.171 54.67
2048 64 131072 1.534 1335.24 1.166 54.87
2048 64 133120 1.544 1326.71 1.168 54.78

@chulucninh09

Energy-wise, graph mode has more overhead and consumes more energy in this case, right?

@ikawrakow (Owner, Author)

> Energy-wise, graph mode has more overhead and consumes more energy in this case, right?

Likely yes. But you can watch GPU utilization with nvtop and compare it to a run with split mode layer.

@chulucninh09

Interestingly, due to using only 1 GPU for the recurrent layers, GPU utilization is not 100%.

(screenshot: nvtop showing GPU utilization below 100%)

abc-nix pushed a commit to abc-nix/ik_llama.cpp that referenced this pull request Feb 26, 2026
* WIP

* This works, but is slower than split mode layer
abc-nix pushed a commit to abc-nix/ik_llama.cpp that referenced this pull request Feb 26, 2026
* Better estimate for max. number of compute nodes

* Just in case

server: fix crash from adaptive p (ikawrakow#1304)

Co-authored-by: firecoperana <firecoperana>

Fix tool call for Qwen3.5 (ikawrakow#1300)

* Fix tool call for Qwen3.5

Loosely based on mainline changes from:
* ggml-org/llama.cpp#19635
* ggml-org/llama.cpp#19765

Also need to change the grammar to allow the model to make multiple
tool calls in a row. This was likely broken for Qwen3 Coder prior to
this commit.

* Fix the grammar for the subsequent parameters after the first one

Graph parallel for Qwen3-Next (ikawrakow#1292)

* WIP

* This works, but is slower than split mode layer

Fix llm_arch_is_hybrid (ikawrakow#1305)

Fix max nodes (again) (ikawrakow#1306)

Fix typo in merge-up-gate-experts argument (ikawrakow#1311)

llama-quantize: --dry-run option (ikawrakow#1309)

Slightly better graph parallel for Qwen3-Next (ikawrakow#1307)

* Make sure we pick the reduced tensor from the right GPU

* Minor

Minor delta-net tweak (ikawrakow#1308)

* Make sure we pick the reduced tensor from the right GPU

* Minor

* Minor delta-net tweak

adaptive p: collect probability before logit bias (ikawrakow#1314)

server: propagate task index to response objects for batch requests (ikawrakow#1303)

When multiple prompts are sent in a single /v1/completions request,
each response needs to carry the correct index so the client can
match results to their corresponding prompts. The index field was
not being set on partial responses, final responses, or embedding
responses, causing batch results to all report index 0.

Set res->index = slot.task->index in send_partial_response,
send_final_response, and send_embedding.

Generated with [Devin](https://cli.devin.ai/docs)

Co-authored-by: Joshua Jolley <jjolley@clearwateranalytics.com>
Co-authored-by: Devin <noreply@cognition.ai>

Llama-quantize: Partial requant feature (ikawrakow#1313)

* Partial Requant feature for llama-quantize

- Inspired by the recently ported --dry-run feature.
- Allows partially requantizing a split quantized .gguf by requantizing only the missing splits in the destination directory.
- Works both for GGUFs split tensor by tensor and for those split in groups of several tensors (though the latter is not well tested beyond 2 tensors per split).
- Vibe coded.

* Create output directory if it doesn't exist in llama-quantize

* Create output directory if it doesn't exist in gguf-split

* Add exit when directory fails to be created on Windows

* Use std::filesystem

* cleanup

Display the size of the tensors overriden during the tensor loading (ikawrakow#1318)

* Display the size of the tensors overriden during the tensor loading

Ex:

`Tensor blk.60.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.60.ffn_up_exps.weight buffer type overriden to CPU`

become

`Tensor blk.60.ffn_up_exps.weight (size = 668467200 bytes) buffer type overriden to CPU
Tensor blk.60.ffn_gate_exps.weight (size = 668467200 bytes) buffer type overriden to CPU`

Also demote to debug level the size display of the unnamed buffer overrides shown later.

Ex: `llm_load_tensors:        CPU buffer size =   XXX.XX MiB`

That double display clutters the screen without being very informative.

* change bytes display to MiB.

Co-authored-by: Kawrakow <iwankawrakow@gmail.com>

---------

Co-authored-by: Kawrakow <iwankawrakow@gmail.com>

Fused delta-net (ikawrakow#1315)

* Revive fused delta-net

* Add command line argument for fused delta net

* Simplify/improve CUDA delta-net

* Add -fdn to llama-bench

* More CUDA fused delta net optimizations

* CPU optimizations

* Much faster fused delta-net on the CPU

It seems it is faster than the chunked implementation!

* Change meaning of fdn from bool flag to threshold value

* Use eps = 1e-6

* Give some nodes a name

Fix KT quantization yet again (ikawrakow#1321)

* Fix KT quantization yet again

* Add same 1e-16f check for all quants in iqk_quantize.cpp

* Fixes for k-quants

* Also this one

server: enable checkpoint for recurrent models (ikawrakow#1310)

* server: enable checkpoint for recurrent models

create checkpoint after cancel

fix ban string and rm context during rewind

add checkpoint interval

only save recurrent cache

* save checkpoint during pp

---------

Co-authored-by: firecoperana <firecoperana>

Faster quantization for MoE models with many experts (ikawrakow#1322)

Fused delta net 2 (ikawrakow#1320)

* Revive fused delta-net

* Add command line argument for fused delta net

* Simplify/improve CUDA delta-net

* Add -fdn to llama-bench

* More CUDA fused delta net optimizations

* CPU optimizations

* Much faster fused delta-net on the CPU

It seems it is faster than the chunked implementation!

* Change meaning of fdn from bool flag to threshold value

* Use eps = 1e-6

* Give some nodes a name

* Don't re-apply L2 norm - it has already been done

* This seems quite a bit better

* More tweaks

* Restore per context buffer size log

Not everybody uses models split into 2000 parts, and those who do
actually want to see the buffer sizes.

Adding support for dense Qwen-3.5 models (ikawrakow#1326)

add directio to llama-bench
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants