Graph parallel for Qwen3-Next by ikawrakow · Pull Request #1292 · ikawrakow/ik_llama.cpp

ikawrakow · 2026-02-20T16:46:26Z

I wanted to see how far one can get with graph parallel for Qwen3-Next. Spoiler alert: not very far.

Nevertheless, here is the PR.

Parallelizing the recurrent attention over 2 or more GPUs is basically hopeless, so that runs on a single GPU. Standard attention layers (1 out of 4 for Qwen3-Next) and FFN are split between GPUs. Why doesn't this help performance? My guess is that both, the FFN matrix multiplications and the standard attention computations, are much too small to derive a significant benefit from splitting them between GPUs (FFN hidden dimension is just 512, there are just 2 KV attention heads). At the same time we pay the price of synchronization / data exchange between the GPUs, so the net effect is lower performance.

Despite not actually benefiting from graph parallel I may still merge this PR because one of the effects is that Qwen3-Next now uses the existing standard attention graph building methods, so the bespoke implementation has been removed. The PR looks bigger than it actually is because it contains the changes of #1288 that have not been merged yet.

Anyway, here sweep-bench results for split mode layer and split mode graph (this PR) on a 2x3090 system. The model is quantized with IQ4_XS, and I have used Q8_0 KV cache to be able to go to a context of 100k tokens. We see that PP with split mode graph becomes better at around 64k tokens. TG is roughly on par with split mode layer only near 100k tokens.

Split mode graph

PP	TG	N_KV	T_PP s	S_PP t/s	T_TG s	S_TG t/s
2048	128	0	1.142	1793.52	1.582	80.91
2048	128	2048	1.022	2003.78	1.573	81.39
2048	128	4096	0.974	2103.06	1.586	80.72
2048	128	6144	0.979	2092.72	1.597	80.16
2048	128	8192	0.992	2064.25	1.611	79.46
2048	128	10240	1.001	2045.93	1.624	78.80
2048	128	12288	1.009	2029.11	1.632	78.42
2048	128	14336	1.016	2014.99	1.648	77.65
2048	128	16384	1.023	2002.24	1.668	76.75
2048	128	18432	1.032	1984.71	1.686	75.90
2048	128	20480	1.038	1972.92	1.701	75.24
2048	128	22528	1.043	1963.13	1.724	74.26
2048	128	24576	1.052	1946.95	1.732	73.91
2048	128	26624	1.057	1937.13	1.744	73.39
2048	128	28672	1.064	1925.70	1.754	72.96
2048	128	30720	1.073	1907.94	1.764	72.55
2048	128	32768	1.078	1899.64	1.774	72.17
2048	128	34816	1.081	1894.87	1.782	71.82
2048	128	36864	1.091	1876.36	1.792	71.41
2048	128	38912	1.097	1866.28	1.801	71.07
2048	128	40960	1.107	1849.23	1.816	70.47
2048	128	43008	1.110	1844.65	1.843	69.47
2048	128	45056	1.117	1833.49	1.850	69.19
2048	128	47104	1.127	1816.43	1.858	68.89
2048	128	49152	1.134	1805.67	1.863	68.69
2048	128	51200	1.140	1795.90	1.874	68.31
2048	128	53248	1.147	1784.87	1.884	67.95
2048	128	55296	1.152	1778.36	1.895	67.56
2048	128	57344	1.160	1764.84	1.905	67.20
2048	128	59392	1.169	1752.30	1.908	67.10
2048	128	61440	1.177	1739.71	1.917	66.78
2048	128	63488	1.183	1730.89	1.944	65.86
2048	128	65536	1.190	1721.12	1.953	65.55
2048	128	67584	1.201	1705.49	1.963	65.20
2048	128	69632	1.206	1697.68	1.970	64.99
2048	128	71680	1.213	1688.58	1.979	64.69
2048	128	73728	1.221	1676.63	1.987	64.42
2048	128	75776	1.233	1661.66	1.996	64.13
2048	128	77824	1.236	1657.49	2.013	63.59
2048	128	79872	1.248	1641.32	2.020	63.35
2048	128	81920	1.254	1633.39	2.043	62.66
2048	128	83968	1.265	1618.68	2.057	62.22
2048	128	86016	1.270	1612.53	2.063	62.04
2048	128	88064	1.282	1597.75	2.068	61.89
2048	128	90112	1.288	1590.46	2.080	61.53
2048	128	92160	1.294	1583.26	2.092	61.20
2048	128	94208	1.305	1569.33	2.118	60.43
2048	128	96256	1.311	1562.42	2.110	60.67

Split mode layer

PP	TG	N_KV	T_PP s	S_PP t/s	T_TG s	S_TG t/s
2048	128	0	0.897	2283.81	1.158	110.51
2048	128	2048	0.799	2564.45	1.154	110.93
2048	128	4096	0.811	2525.24	1.165	109.85
2048	128	6144	0.822	2491.46	1.184	108.10
2048	128	8192	0.834	2455.28	1.208	105.98
2048	128	10240	0.848	2414.29	1.234	103.74
2048	128	12288	0.862	2375.13	1.278	100.12
2048	128	14336	0.870	2352.75	1.280	100.00
2048	128	16384	0.879	2330.19	1.295	98.86
2048	128	18432	0.894	2289.64	1.324	96.69
2048	128	20480	0.907	2258.41	1.333	96.02
2048	128	22528	0.916	2234.72	1.367	93.63
2048	128	24576	0.930	2203.23	1.383	92.57
2048	128	26624	0.942	2173.18	1.398	91.58
2048	128	28672	0.955	2144.06	1.415	90.48
2048	128	30720	0.968	2115.83	1.434	89.29
2048	128	32768	0.981	2087.85	1.469	87.12
2048	128	34816	0.992	2063.89	1.495	85.59
2048	128	36864	0.999	2049.52	1.507	84.92
2048	128	38912	1.012	2023.99	1.521	84.15
2048	128	40960	1.025	1998.37	1.542	83.02
2048	128	43008	1.042	1964.56	1.571	81.48
2048	128	45056	1.054	1942.22	1.589	80.57
2048	128	47104	1.069	1915.01	1.606	79.70
2048	128	49152	1.080	1896.23	1.625	78.75
2048	128	51200	1.097	1866.63	1.641	77.99
2048	128	53248	1.100	1861.49	1.673	76.51
2048	128	55296	1.116	1834.68	1.693	75.61
2048	128	57344	1.139	1797.52	1.709	74.91
2048	128	59392	1.144	1789.85	1.728	74.10
2048	128	61440	1.161	1763.33	1.746	73.32
2048	128	63488	1.177	1739.84	1.783	71.80
2048	128	65536	1.184	1729.79	1.799	71.16
2048	128	67584	1.202	1703.98	1.816	70.49
2048	128	69632	1.214	1686.35	1.834	69.79
2048	128	71680	1.231	1664.32	1.851	69.15
2048	128	73728	1.239	1652.93	1.890	67.72
2048	128	75776	1.256	1630.05	1.908	67.07
2048	128	77824	1.274	1607.10	1.927	66.42
2048	128	79872	1.288	1589.84	1.947	65.75
2048	128	81920	1.302	1573.43	1.965	65.14
2048	128	83968	1.311	1562.40	1.990	64.32
2048	128	86016	1.326	1544.34	2.015	63.54
2048	128	88064	1.342	1525.84	2.035	62.91
2048	128	90112	1.342	1525.99	2.050	62.45
2048	128	92160	1.357	1509.05	2.068	61.90
2048	128	94208	1.378	1486.01	2.091	61.21
2048	128	96256	1.390	1473.45	2.122	60.32

ubergarm · 2026-02-20T20:29:00Z

Yes, -sm graph seems to be working in limited testing and eventually can become faster at long context depths.

👈 Details

-sm layer

model=/mnt/raid/models/ggml-org/Qwen3-Coder-Next-GGUF/Qwen3-Coder-Next-Q8_0.gguf
CUDA_VISIBLE_DEVICES="0,1" \
./build/bin/llama-sweep-bench \
  --model "$model" \
  -c 135168 \
  -fa on \
  --merge-qkv \
  -ger \
  -sm layer \
  -ngl 99 \
  --threads 1 \
  --n-predict 64 \
  -ub 2048 -b 2048 \
  --warmup-batch
  # alternate -sm graph and -sm layer for these two tests

PP	TG	N_KV	T_PP s	S_PP t/s	T_TG s	S_TG t/s
2048	64	0	0.870	2353.68	0.775	82.54
2048	64	2048	0.877	2336.33	0.762	83.95
2048	64	4096	0.888	2306.60	0.763	83.83
2048	64	6144	0.899	2278.82	0.768	83.32
2048	64	8192	0.909	2253.72	0.775	82.53
2048	64	10240	0.923	2219.65	0.785	81.50
2048	64	12288	0.935	2190.40	0.796	80.36
2048	64	14336	0.942	2173.39	0.799	80.12
2048	64	16384	0.958	2138.06	0.803	79.69
2048	64	18432	0.967	2117.92	0.806	79.37
2048	64	20480	0.983	2083.52	0.811	78.88
2048	64	22528	0.995	2057.90	0.824	77.65
2048	64	24576	1.013	2021.20	0.827	77.43
2048	64	26624	1.025	1998.44	0.830	77.08
2048	64	28672	1.036	1977.51	0.836	76.57
2048	64	30720	1.051	1949.52	0.839	76.31
2048	64	32768	1.067	1919.87	0.852	75.10
2048	64	34816	1.075	1905.99	0.854	74.97
2048	64	36864	1.090	1878.64	0.856	74.74
2048	64	38912	1.106	1851.97	0.862	74.23
2048	64	40960	1.125	1820.78	0.867	73.80
2048	64	43008	1.136	1803.43	0.878	72.89
2048	64	45056	1.145	1787.90	0.881	72.63
2048	64	47104	1.165	1758.37	0.882	72.56
2048	64	49152	1.178	1739.07	0.888	72.04
2048	64	51200	1.189	1722.99	0.894	71.61
2048	64	53248	1.192	1718.40	0.899	71.19
2048	64	55296	1.211	1691.65	0.910	70.35
2048	64	57344	1.227	1668.64	0.912	70.17
2048	64	59392	1.249	1640.36	0.916	69.87
2048	64	61440	1.255	1632.52	0.921	69.46
2048	64	63488	1.278	1601.92	0.926	69.13
2048	64	65536	1.285	1594.29	0.936	68.34
2048	64	67584	1.306	1568.20	0.938	68.22
2048	64	69632	1.320	1551.54	0.946	67.67
2048	64	71680	1.330	1540.41	0.948	67.48
2048	64	73728	1.351	1515.61	0.953	67.18
2048	64	75776	1.364	1501.78	0.964	66.39
2048	64	77824	1.380	1484.42	0.965	66.35
2048	64	79872	1.394	1468.92	0.968	66.15
2048	64	81920	1.399	1464.37	0.972	65.84
2048	64	83968	1.424	1438.34	0.978	65.45
2048	64	86016	1.431	1431.39	0.990	64.62
2048	64	88064	1.453	1409.48	0.992	64.48
2048	64	90112	1.453	1409.53	0.994	64.37
2048	64	92160	1.469	1394.22	1.000	64.02
2048	64	94208	1.494	1371.07	1.003	63.79
2048	64	96256	1.513	1353.61	1.011	63.31
2048	64	98304	1.520	1347.46	1.020	62.74
2048	64	100352	1.539	1330.74	1.022	62.62
2048	64	102400	1.549	1322.38	1.025	62.44
2048	64	104448	1.552	1319.20	1.029	62.19
2048	64	106496	1.580	1295.80	1.037	61.69
2048	64	108544	1.589	1288.76	1.050	60.97
2048	64	110592	1.604	1277.13	1.049	60.98
2048	64	112640	1.618	1265.89	1.053	60.77
2048	64	114688	1.628	1257.64	1.057	60.55
2048	64	116736	1.646	1244.22	1.063	60.22
2048	64	118784	1.652	1240.07	1.073	59.65
2048	64	120832	1.664	1230.89	1.079	59.34
2048	64	122880	1.676	1221.81	1.079	59.29
2048	64	124928	1.693	1209.88	1.083	59.11
2048	64	126976	1.700	1204.52	1.089	58.74
2048	64	129024	1.718	1192.34	1.100	58.20
2048	64	131072	1.736	1179.95	1.105	57.94
2048	64	133120	1.754	1167.44	1.106	57.85

-sm graph

PP	TG	N_KV	T_PP s	S_PP t/s	T_TG s	S_TG t/s
2048	64	0	1.063	1927.05	0.980	65.32
2048	64	2048	1.120	1828.65	0.966	66.26
2048	64	4096	1.071	1912.96	0.968	66.13
2048	64	6144	1.072	1911.04	0.965	66.32
2048	64	8192	1.077	1901.32	0.961	66.57
2048	64	10240	1.086	1886.50	0.967	66.17
2048	64	12288	1.091	1877.21	0.976	65.59
2048	64	14336	1.095	1870.08	0.980	65.31
2048	64	16384	1.106	1852.43	0.984	65.03
2048	64	18432	1.107	1850.18	0.991	64.59
2048	64	20480	1.114	1838.05	0.994	64.36
2048	64	22528	1.123	1823.94	1.008	63.51
2048	64	24576	1.129	1813.33	1.010	63.37
2048	64	26624	1.134	1805.60	1.011	63.30
2048	64	28672	1.142	1793.90	1.011	63.31
2048	64	30720	1.148	1784.33	1.015	63.08
2048	64	32768	1.156	1772.29	1.016	62.97
2048	64	34816	1.163	1760.76	1.018	62.88
2048	64	36864	1.169	1752.12	1.023	62.55
2048	64	38912	1.178	1737.97	1.026	62.38
2048	64	40960	1.183	1730.67	1.027	62.33
2048	64	43008	1.193	1716.95	1.038	61.64
2048	64	45056	1.199	1708.02	1.041	61.45
2048	64	47104	1.205	1699.26	1.041	61.46
2048	64	49152	1.216	1683.94	1.044	61.32
2048	64	51200	1.217	1683.13	1.046	61.21
2048	64	53248	1.229	1666.57	1.048	61.09
2048	64	55296	1.239	1653.14	1.049	60.98
2048	64	57344	1.247	1642.30	1.052	60.85
2048	64	59392	1.254	1632.81	1.055	60.66
2048	64	61440	1.266	1617.56	1.060	60.37
2048	64	63488	1.268	1615.40	1.061	60.34
2048	64	65536	1.271	1611.18	1.071	59.77
2048	64	67584	1.291	1586.48	1.077	59.44
2048	64	69632	1.295	1580.93	1.077	59.43
2048	64	71680	1.306	1567.64	1.080	59.27
2048	64	73728	1.311	1562.14	1.083	59.12
2048	64	75776	1.318	1554.26	1.084	59.05
2048	64	77824	1.329	1541.39	1.087	58.89
2048	64	79872	1.335	1533.70	1.088	58.81
2048	64	81920	1.341	1526.89	1.092	58.59
2048	64	83968	1.351	1516.46	1.095	58.46
2048	64	86016	1.365	1500.81	1.105	57.94
2048	64	88064	1.364	1501.46	1.104	57.97
2048	64	90112	1.375	1489.32	1.107	57.80
2048	64	92160	1.377	1487.36	1.109	57.72
2048	64	94208	1.390	1473.34	1.109	57.69
2048	64	96256	1.400	1462.59	1.111	57.62
2048	64	98304	1.408	1454.95	1.116	57.37
2048	64	100352	1.417	1445.39	1.120	57.12
2048	64	102400	1.422	1440.13	1.122	57.05
2048	64	104448	1.431	1431.33	1.123	57.00
2048	64	106496	1.443	1418.84	1.127	56.78
2048	64	108544	1.452	1410.13	1.135	56.38
2048	64	110592	1.458	1404.76	1.137	56.27
2048	64	112640	1.470	1393.66	1.138	56.22
2048	64	114688	1.479	1384.70	1.141	56.08
2048	64	116736	1.476	1387.32	1.144	55.96
2048	64	118784	1.494	1370.77	1.146	55.87
2048	64	120832	1.495	1369.77	1.148	55.76
2048	64	122880	1.501	1364.81	1.150	55.66
2048	64	124928	1.513	1353.29	1.153	55.52
2048	64	126976	1.527	1341.45	1.156	55.38
2048	64	129024	1.531	1337.60	1.171	54.67
2048	64	131072	1.534	1335.24	1.166	54.87
2048	64	133120	1.544	1326.71	1.168	54.78

chulucninh09 · 2026-02-22T06:23:56Z

Energy-wise, the graph is more overhead and consume more energy in this case right?

ikawrakow · 2026-02-22T06:28:21Z

Energy-wise, the graph is more overhead and consume more energy in this case right?

Likely yes. But you can watch GPU utilization with nvtop and compare that to running with split mode layer.

chulucninh09 · 2026-02-24T00:42:35Z

Interestingly, due to using only 1 GPU for recurrent layer, GPU utilization is not 100%

* WIP * This works, but is slower than split mode layer

* Better estimate for max. nuber of compute nodes * Just in case server: fix crash from adaptive p (ikawrakow#1304) Co-authored-by: firecoperana <firecoperana> Fix tool call for Qwen3.5 (ikawrakow#1300) * Fix tool call for Qwen3.5 Loosely based on mainline changes from: * ggml-org/llama.cpp#19635 * ggml-org/llama.cpp#19765 Also need to change the grammar to allow the model to make multiple tool calls in a row. This was likely broken for Qwen3 Coder prior to this commit. * Fix the grammar for the subsequent parameters after the first one Graph parallel for Qwen3-Next (ikawrakow#1292) * WIP * This works, but is slower than split mode layer Fix llm_arch_is_hybrid (ikawrakow#1305) Fix max nodes (again) (ikawrakow#1306) Fix typo in merge-up-gate-experts argument (ikawrakow#1311) llama-quantize: --dry-run option (ikawrakow#1309) Slightly better graph parallel for Qwen3-Next (ikawrakow#1307) * Make sure we pick the reduced tensor from the right GPU * Minor Minor delta-net tweak (ikawrakow#1308) * Make sure we pick the reduced tensor from the right GPU * Minor * Minor delta-net tweak adaptive p: collect probability before logit bias (ikawrakow#1314) server: propagate task index to response objects for batch requests (ikawrakow#1303) When multiple prompts are sent in a single /v1/completions request, each response needs to carry the correct index so the client can match results to their corresponding prompts. The index field was not being set on partial responses, final responses, or embedding responses, causing batch results to all report index 0. Set res->index = slot.task->index in send_partial_response, send_final_response, and send_embedding. Generated with [Devin](https://cli.devin.ai/docs) Co-authored-by: Joshua Jolley <jjolley@clearwateranalytics.com> Co-authored-by: Devin <noreply@cognition.ai> Llama-quantize: Partial requant feature (ikawrakow#1313) * Partial Requant feature for llama-quantize - Inspired by the recently portcopied --dry-run feature. - Allows to partially requantize a split quantized .gguf by requantizing only the missing splits in the destination directory. - Works both for GGUF which are split tensors by tensors, or by group of several tensors (though this one is not very much tested beyond 2 tensors by split). - Vibe coded. * Create output directory if it doesn't exist in llama-quantize * Create output directory if it doesn't exist in gguf-split * Add exit when directory fails to be created on Windows * Use std::filesystem * cleanup Display the size of the tensors overriden during the tensor loading (ikawrakow#1318) * Display the size of the tensors overriden during the tensor loading Ex: `Tensor blk.60.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.60.ffn_up_exps.weight buffer type overriden to CPU` become `Tensor blk.60.ffn_up_exps.weight (size = 668467200 bytes) buffer type overriden to CPU Tensor blk.60.ffn_gate_exps.weight (size = 668467200 bytes) buffer type overriden to CPU` And pass in debug the later displayed size of the unnamed buffer overrides. Ex : `llm_load_tensors: CPU buffer size = XXX.XX MiB` That double display is cluttering the screen without being very informative. * change bytes display to MiB. Co-authored-by: Kawrakow <iwankawrakow@gmail.com> --------- Co-authored-by: Kawrakow <iwankawrakow@gmail.com> Fused delta-net (ikawrakow#1315) * Revive fused delta-net * Add command line argument for fused delta net * Simplify/improve CUDA delta-net * Add -fdn to llama-bench * More CUDA fused delta net optimizations * CPU optimizations * Much faster fused delta-net on the CPU It seems it is faster than the chunked implementation! * Change meaning of fdn from bool flag to threshold value * Use eps = 1e-6 * Give some nodes a name Fix KT quantization yet again (ikawrakow#1321) * Fix KT quantization yet again * Add same 1e-16f check for all quants in iqk_uantize.cpp * Fixes for k-quants * Also this one server: enable checkpoint for recurrent models (ikawrakow#1310) * server: enable checkpoint for recurrent models create checkpoint after cancel fix ban string and rm context during rewind add checkpoint interval only save recurrent cache * save checkpoint during pp --------- Co-authored-by: firecoperana <firecoperana> Faster quantization for MoE models with many experts (ikawrakow#1322) Fused delta net 2 (ikawrakow#1320) * Revive fused delta-net * Add command line argument for fused delta net * Simplify/improve CUDA delta-net * Add -fdn to llama-bench * More CUDA fused delta net optimizations * CPU optimizations * Much faster fused delta-net on the CPU It seems it is faster than the chunked implementation! * Change meaning of fdn from bool flag to threshold value * Use eps = 1e-6 * Give some nodes a name * Don't re-apply L2 norm - it has already been done * This seems quite a bit better * More tweaks * Restore per context buffer size log Not everybody uses models split in 2000 parts, and those who do, actually want to see the biffer sizes. iAdding support for dense Qwen-3.5 models (ikawrakow#1326) add directio to llama-bench

ikawrakow added 2 commits February 21, 2026 07:34

WIP

6c2f7ad

This works, but is slower than split mode layer

463359f

ikawrakow force-pushed the ik/sm_graph_q3next branch from 01d99ad to 463359f Compare February 21, 2026 07:34

ikawrakow merged commit 5dacb53 into main Feb 23, 2026

ikawrakow mentioned this pull request Feb 23, 2026

Slightly better graph parallel for Qwen3-Next #1307

Merged

abc-nix pushed a commit to abc-nix/ik_llama.cpp that referenced this pull request Feb 26, 2026

Graph parallel for Qwen3-Next (ikawrakow#1292)

843ec8b

* WIP * This works, but is slower than split mode layer

ikawrakow mentioned this pull request Feb 26, 2026

Graph parallel for dense Qwen-3.5 models #1331

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Graph parallel for Qwen3-Next#1292

Graph parallel for Qwen3-Next#1292
ikawrakow merged 2 commits intomainfrom
ik/sm_graph_q3next

ikawrakow commented Feb 20, 2026 •

edited

Loading

Uh oh!

ubergarm commented Feb 20, 2026

-sm layer

-sm graph

Uh oh!

chulucninh09 commented Feb 22, 2026

Uh oh!

ikawrakow commented Feb 22, 2026

Uh oh!

chulucninh09 commented Feb 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

ikawrakow commented Feb 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Split mode graph

Split mode layer

Uh oh!

ubergarm commented Feb 20, 2026

-sm layer

-sm graph

Uh oh!

chulucninh09 commented Feb 22, 2026

Uh oh!

ikawrakow commented Feb 22, 2026

Uh oh!

chulucninh09 commented Feb 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

ikawrakow commented Feb 20, 2026 •

edited

Loading