Slightly better graph parallel for Qwen3-Next#1307
Merged
Conversation
Yeah, there is a little boost for the Qwen3.5 IQ2_KL.
main:
pr-1307:
Owner Author
The PR doesn't do anything for Qwen3.5 (no graph parallel there yet), so these are random fluctuations.
Contributor
Nexesenex added a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request on Feb 25, 2026
Nexesenex added a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request on Feb 25, 2026
Nexesenex added a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request on Feb 25, 2026
Nexesenex added a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request on Feb 26, 2026
abc-nix pushed a commit to abc-nix/ik_llama.cpp that referenced this pull request on Feb 26, 2026
* Make sure we pick the reduced tensor from the right GPU
* Minor
abc-nix pushed a commit to abc-nix/ik_llama.cpp that referenced this pull request on Feb 26, 2026
* Better estimate for max. number of compute nodes
* server: fix crash from adaptive p (ikawrakow#1304)
* Fix tool call for Qwen3.5 (ikawrakow#1300)
* Graph parallel for Qwen3-Next (ikawrakow#1292)
* Fix llm_arch_is_hybrid (ikawrakow#1305)
* Fix max nodes (again) (ikawrakow#1306)
* Fix typo in merge-up-gate-experts argument (ikawrakow#1311)
* llama-quantize: --dry-run option (ikawrakow#1309)
* Slightly better graph parallel for Qwen3-Next (ikawrakow#1307)
* Minor delta-net tweak (ikawrakow#1308)
* adaptive p: collect probability before logit bias (ikawrakow#1314)
* server: propagate task index to response objects for batch requests (ikawrakow#1303)
* llama-quantize: partial requant feature (ikawrakow#1313)
* Display the size of the tensors overridden during tensor loading (ikawrakow#1318)
* Fused delta-net (ikawrakow#1315)
* Fix KT quantization yet again (ikawrakow#1321)
* server: enable checkpoint for recurrent models (ikawrakow#1310)
* Faster quantization for MoE models with many experts (ikawrakow#1322)
* Fused delta-net 2 (ikawrakow#1320)
* Adding support for dense Qwen3.5 models (ikawrakow#1326)
* add directio to llama-bench
Nexesenex added a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request on Feb 26, 2026
…" LCPP part This reverts commit a8dcf26.
Nexesenex added further commits to Nexesenex/ik_llama.cpp.nxs that referenced this pull request (Feb 26 through Mar 7, 2026), including reverts of commits a8dcf26 and 591d5cc.

The Qwen3-Next graph parallel implementation computes the delta-net attention on one GPU, see #1292. As the preceding split graph is always the FFN portion, which ends with a reduce operation, the FFN result, which is the input for the delta-net, is available on each participating GPU. This PR makes sure that the delta-net takes as input the FFN result on the GPU on which the delta-net is computed, thus avoiding a copy.

A second tweak is that for the final matrix multiplication with the output tensor we also take the FFN result from the GPU where the output tensor is stored. This may give a minor performance improvement for other models as well (a quick check with GLM-4.5-AIR showed ~1% improvement for PP and TG).
These two tweaks combined give a few percent better PP and TG for Qwen3-Next when using split mode graph. Here is what I get on a 2x3090 system for IQ4_XS-quantized Qwen3-Next: