Better estimate for max. number of compute nodes #1296
Conversation
Giving this a try with hybrid CPU+GPU using ubergarm/Qwen3.5-397B-A17B Q3_K 179.97 GiB (3.90 BPW). Interestingly, increasing batch sizes improves PP fairly significantly despite offloading fewer routed experts to VRAM. There is a little loss in TG speed, likely because fewer weights fit in VRAM once room is made for the bigger batch sizes.
I did see one new error when increasing the batch size to 4096, probably because I didn't have quite enough VRAM available; details below.

```
CUDA error: an unsupported value or parameter was passed to the function
current device: 1, in function ggml_cuda_op_mul_mat_cublas at /home/w/projects/ik_llama.cpp/ggml/src/ggml-cuda.cu:1486
cublasSgemm_v2(ctx.cublas_handle(id), CUBLAS_OP_T, CUBLAS_OP_N, row_diff, src1_ncols, ne10, &alpha, src0_ddf_i, ne00, src1_ddf1_i, ne10, &beta, dst_dd_i, ldc)
/home/w/projects/ik_llama.cpp/ggml/src/ggml-cuda.cu:131: CUDA error
```

**default batches +28 exps**

```shell
./build/bin/llama-sweep-bench \
  --model "$model" \
  --ctx-size 40960 \
  -ger \
  --merge-qkv \
  -sm layer \
  -ngl 999 \
  -ot "blk\.(0|1|2|3|4|5|6|7|8|9|10|11|12|13)\.ffn_(gate|up|down)_exps.*=CUDA0" \
  -ot "blk\.(46|47|48|49|50|51|52|53|54|55|56|57|58|59)\.ffn_(gate|up|down)_exps.*=CUDA1" \
  --cpu-moe \
  --threads 24 \
  --no-mmap \
  --warmup-batch \
  -n 64
```
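The `-ot` override patterns above are plain regexes matched against tensor names, so it's easy to check offline which tensors a pattern will pin to a device. A small sketch (the tensor names below follow the usual `blk.N.ffn_*_exps` naming and are examples, not dumped from this model):

```python
import re

# Same pattern as the first -ot flag above: expert FFN tensors of blocks 0-13 go to CUDA0
pattern = re.compile(r"blk\.(0|1|2|3|4|5|6|7|8|9|10|11|12|13)\.ffn_(gate|up|down)_exps.*")

tensors = [
    "blk.0.ffn_gate_exps.weight",
    "blk.13.ffn_down_exps.weight",
    "blk.14.ffn_up_exps.weight",   # block 14: not matched, left on CPU by --cpu-moe
    "blk.1.attn_q.weight",         # attention tensor: not an expert, not matched
]

matched = [t for t in tensors if pattern.match(t)]
print(matched)
```

Note that the explicit alternation `(0|1|...|13)` is needed because something like `blk\.[0-9]+` would also capture block 14; the trailing `\.` after the group is what stops `13` from matching inside `14`.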
**-ub 2048 -b 2048 +26 exps**

```shell
./build/bin/llama-sweep-bench \
  --model "$model" \
  --ctx-size 36864 \
  -ger \
  --merge-qkv \
  -sm layer \
  -ngl 999 \
  -ot "blk\.(0|1|2|3|4|5|6|7|8|9|10|11|12)\.ffn_(gate|up|down)_exps.*=CUDA0" \
  -ot "blk\.(47|48|49|50|51|52|53|54|55|56|57|58|59)\.ffn_(gate|up|down)_exps.*=CUDA1" \
  -ub 2048 -b 2048 \
  --cpu-moe \
  --threads 24 \
  --no-mmap \
  --warmup-batch \
  -n 64
```
**-ub 4096 -b 4096 +24 exps**

```shell
./build/bin/llama-sweep-bench \
  --model "$model" \
  --ctx-size 36864 \
  -ger \
  --merge-qkv \
  -sm layer \
  -ngl 999 \
  -ot "blk\.(0|1|2|3|4|5|6|7|8|9|10|11)\.ffn_(gate|up|down)_exps.*=CUDA0" \
  -ot "blk\.(48|49|50|51|52|53|54|55|56|57|58|59)\.ffn_(gate|up|down)_exps.*=CUDA1" \
  -ub 4096 -b 4096 \
  --cpu-moe \
  --threads 24 \
  --no-mmap \
  --warmup-batch \
  -n 64
```
**-ub 8192 -b 8192 +22 exps**

```shell
./build/bin/llama-sweep-bench \
  --model "$model" \
  --ctx-size 40960 \
  -ger \
  --merge-qkv \
  -sm layer \
  -ngl 999 \
  -ot "blk\.(0|1|2|3|4|5|6|7|8|9|10)\.ffn_(gate|up|down)_exps.*=CUDA0" \
  -ot "blk\.(49|50|51|52|53|54|55|56|57|58|59)\.ffn_(gate|up|down)_exps.*=CUDA1" \
  -ub 8192 -b 8192 \
  --cpu-moe \
  --threads 24 \
  --no-mmap \
  --warmup-batch \
  -n 64
```
**-ub 4096 -b 4096 +26 exps: ERROR**
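The pattern in the runs above, dropping roughly one offloaded expert layer per GPU each time the u-batch doubles, is just a VRAM balance between weights and activation buffers. A back-of-the-envelope sketch of that tradeoff; every number here (layer size, free VRAM, hidden size, buffer count) is an assumption for illustration, not a measurement from this model:

```python
GIB = 1024**3

# Hypothetical sizes for illustration only, not measured from this model.
expert_layer_bytes = 2 * GIB   # one layer's ffn_(gate|up|down)_exps at ~3.9 BPW
free_vram = 30 * GIB           # assumed VRAM left per GPU after dense weights + KV

def max_expert_layers(ubatch, hidden=4096, n_buffers=64):
    """Compute/activation buffers grow roughly linearly with the u-batch size,
    so a bigger u-batch leaves room for fewer offloaded expert layers."""
    batch_buffer_bytes = ubatch * hidden * 4 * n_buffers  # f32 activations (assumed)
    return int((free_vram - batch_buffer_bytes) // expert_layer_bytes)

layers = [max_expert_layers(ub) for ub in (512, 2048, 4096, 8192)]
print(layers)  # non-increasing as the u-batch grows
```

The exact counts depend entirely on the assumed constants; the point is only the direction of the tradeoff, which matches the +28/+26/+24/+22 progression above.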
Using -ub 8192 shows a significant improvement in PP for me. Interestingly, I also see a small uplift in TG (4%) using the same parameters on the new PR.

**Main (3 hours old, 66323b9) -b 4096 -ub 4096**

**New PR -b 4096 -ub 4096**

**New PR -b 8192 -ub 8192**

(7800X3D 8-core, DDR5-6000, 5090, PCIe 5)
One last thing, if I may: the f16 cache seems to be slightly faster than Q8_0 in TG at long context, which is a bit odd.

**New PR -ctk f16 -ctv f16**

**New PR -ctk Q8_0 -ctv Q8_0**
I've seen similar when offloading on a newish GPU: the unquantized f16 KV cache can be faster and shows a less steep drop-off compared to the q8_0 KV cache. IIRC GLM-4.7 slowed down much faster with the q8_0 KV cache. In general I do my llama-sweep-bench runs with an unquantized KV cache when running on GPU, but use q8_0 when running CPU-only.
Have you ever observed a quantized KV cache being faster than f16 on CUDA? On the CPU it is the other way around, with the quantized cache being the faster one.
Admittedly, it's been a very long time since I tested f16 caches, since there has been little reason for me to use them. I've only started thinking about it with this model, since both its KV cache memory footprint and TG loss are very small even at long context. I'm now using a 5090, which has very fast memory for its compute, and my main bottleneck is RAM anyway, so I wasn't expecting much of a difference. Tbh I was surprised too, but I simply assumed the bottleneck was memory bandwidth rather than computation. That wasn't on llama.cpp though; back in the GPU-only inference days the gold standard for speed was exllamaV2, which was much faster for GPU inference albeit slower with mixed inference.

Anyway, 262k of context is only 8 GB at f16, TG loss at long context is completely negated by self-speculative decoding, the model seems less susceptible to weight quantization loss, and the ik_llama API now allows caching KV data on SSD or RAM and force-swapping slots via ST extension slash commands, so PP speed is less relevant now. With long context suddenly being so viable, f16 makes a lot more sense nowadays, especially since it's a little faster too. I think I'll start including it in tests.
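The "262k context is only 8 GB at f16" figure is consistent with simple arithmetic if only a fraction of the layers carry full-attention KV, as in hybrid models. The layer count, KV-head count, and head size below are illustrative assumptions chosen to match, not the model's published config:

```python
def kv_cache_bytes(n_ctx, n_attn_layers=16, n_kv_heads=2, head_dim=256, bytes_per_el=2):
    """f16 KV cache size: K and V (factor 2), per attention layer, per KV head,
    per head dimension, per context position, at 2 bytes per f16 element."""
    return 2 * n_attn_layers * n_kv_heads * head_dim * n_ctx * bytes_per_el

gib = kv_cache_bytes(262144) / 1024**3
print(f"{gib:.1f} GiB")  # 8.0 GiB under these assumed dimensions
```

A dense model with, say, 60 full-attention layers and a larger KV width would be several times bigger at the same context, which is why this footprint stands out.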
I guess, yes, it has been. The experience that a quantized cache gives better performance on CUDA is from the times when FA was implemented with vector or tile kernels. When the implementation that uses MMA instructions came along, which was a long time ago, that advantage went away.
- Better estimate for max. number of compute nodes; just in case
- server: fix crash from adaptive p (ikawrakow#1304)
- Fix tool call for Qwen3.5 (ikawrakow#1300): loosely based on mainline changes ggml-org/llama.cpp#19635 and ggml-org/llama.cpp#19765; also changes the grammar to allow the model to make multiple tool calls in a row (likely broken for Qwen3 Coder before this) and fixes the grammar for the subsequent parameters after the first one
- Graph parallel for Qwen3-Next (ikawrakow#1292): works, but is slower than split mode layer
- Fix llm_arch_is_hybrid (ikawrakow#1305)
- Fix max nodes (again) (ikawrakow#1306)
- Fix typo in merge-up-gate-experts argument (ikawrakow#1311)
- llama-quantize: --dry-run option (ikawrakow#1309)
- Slightly better graph parallel for Qwen3-Next (ikawrakow#1307): make sure we pick the reduced tensor from the right GPU
- Minor delta-net tweak (ikawrakow#1308)
- adaptive p: collect probability before logit bias (ikawrakow#1314)
- server: propagate task index to response objects for batch requests (ikawrakow#1303): when multiple prompts are sent in a single /v1/completions request, each response needs to carry the correct index so the client can match results to their prompts; the index field was not being set on partial, final, or embedding responses, so batch results all reported index 0. Now sets res->index = slot.task->index in send_partial_response, send_final_response, and send_embedding
- llama-quantize: partial requant feature (ikawrakow#1313): inspired by the recently added --dry-run feature; allows partially requantizing a split quantized .gguf by requantizing only the missing splits in the destination directory. Works both for GGUFs split tensor by tensor and for splits grouping several tensors (the latter not tested much beyond 2 tensors per split). Also creates the output directory if it doesn't exist, in both llama-quantize and gguf-split, using std::filesystem
- Display the size of the tensors overridden during tensor loading (ikawrakow#1318): e.g. `Tensor blk.60.ffn_up_exps.weight (size = 668467200 bytes) buffer type overriden to CPU`, with the size shown in MiB; the later `llm_load_tensors: CPU buffer size = XXX.XX MiB` lines for unnamed buffer overrides are demoted to debug output, since that double display clutters the screen without being very informative
- Fused delta-net (ikawrakow#1315): revives fused delta-net, adds a command line argument for it and -fdn to llama-bench, plus CUDA and CPU optimizations; the fused CPU path seems to be faster than the chunked implementation. The meaning of fdn changes from a bool flag to a threshold value; uses eps = 1e-6 and gives some nodes a name
- Fix KT quantization yet again (ikawrakow#1321): adds the same 1e-16f check for all quants in iqk_quantize.cpp, plus fixes for k-quants
- server: enable checkpoint for recurrent models (ikawrakow#1310): create a checkpoint after cancel, fix ban strings, remove context during rewind, add a checkpoint interval, only save the recurrent cache, and save checkpoints during PP
- Faster quantization for MoE models with many experts (ikawrakow#1322)
- Fused delta-net 2 (ikawrakow#1320): don't re-apply the L2 norm (it has already been done), further tweaks, and restore the per-context buffer size log (not everybody uses models split into 2000 parts, and those who do actually want to see the buffer sizes)
- Adding support for dense Qwen-3.5 models (ikawrakow#1326)
- add directio to llama-bench

Now you can have whatever u-batch size you want with Qwen3-Next and Qwen-3.5-MoE (or any other supported model).
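For context on what estimating the max number of compute nodes means: the graph allocator needs an upper bound on the node count before the graph is built, and that count grows with the number of weight tensors (and, for delta-net style models, with how the u-batch is chunked). A sketch of the kind of heuristic involved; mainline llama.cpp uses a similar `max(8192, 5 * n_tensors)` rule, and the function below is an illustration of that shape, not ik_llama.cpp's actual formula:

```python
def max_graph_nodes(n_tensors, floor=8192, nodes_per_tensor=5):
    """Upper bound on compute-graph nodes: each weight tensor spawns a handful
    of ops (mul_mat, norms, activations), so scale linearly with a safety floor."""
    return max(floor, nodes_per_tensor * n_tensors)

print(max_graph_nodes(100))   # small model: the floor dominates -> 8192
print(max_graph_nodes(4000))  # big MoE with many expert tensors -> 20000
```

Underestimating this bound is what causes graph-allocation failures at unusual u-batch sizes, which is why a better estimate unlocks arbitrary u-batch sizes for Qwen3-Next and Qwen-3.5-MoE.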