Better estimate for max. number of compute nodes #1296

Merged
ikawrakow merged 2 commits into main from ik/max_nodes on Feb 22, 2026
Conversation

@ikawrakow (Owner) commented Feb 21, 2026

Now you can have whatever u-batch size you want with Qwen3-Next and Qwen-3.5-MoE (or any other supported model).

@ubergarm (Contributor)

Giving this a try with hybrid CPU+GPU inference using ubergarm/Qwen3.5-397B-A17B Q3_K 179.97 GiB (3.90 BPW) with -sm layer (as there is no -sm graph support for the qwen35moe arch yet, pretty sure).

Interestingly, increasing batch sizes improves PP fairly significantly despite offloading fewer routed experts to VRAM. There is a small loss in TG speed, likely due to fewer weights in VRAM to make room for the bigger batch sizes.

(sweep-bench plot: Qwen3.5-397B-A17B-Q3_K, PR1296)

I did see one new error when increasing the batch size to 4096, probably because there wasn't quite enough VRAM available; details below.

CUDA error: an unsupported value or parameter was passed to the function
  current device: 1, in function ggml_cuda_op_mul_mat_cublas at /home/w/projects/ik_llama.cpp/ggml/src/ggml-cuda.cu:1486
  cublasSgemm_v2(ctx.cublas_handle(id), CUBLAS_OP_T, CUBLAS_OP_N, row_diff, src1_ncols, ne10, &alpha, src0_ddf_i, ne00, src1_ddf1_i, ne10, &beta, dst_dd_i, ldc)
/home/w/projects/ik_llama.cpp/ggml/src/ggml-cuda.cu:131: CUDA error

default batches +28 exps

./build/bin/llama-sweep-bench \
    --model "$model" \
    --ctx-size 40960 \
    -ger \
    --merge-qkv \
    -sm layer \
    -ngl 999 \
    -ot "blk\.(0|1|2|3|4|5|6|7|8|9|10|11|12|13)\.ffn_(gate|up|down)_exps.*=CUDA0" \
    -ot "blk\.(46|47|48|49|50|51|52|53|54|55|56|57|58|59)\.ffn_(gate|up|down)_exps.*=CUDA1" \
    --cpu-moe \
    --threads 24 \
    --no-mmap \
    --warmup-batch \
    -n 64
PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
512 64 0 2.385 214.70 2.653 24.12
512 64 512 2.407 212.69 2.614 24.49
512 64 1024 2.342 218.64 2.644 24.20
512 64 1536 1.963 260.80 2.624 24.39
512 64 2048 2.526 202.70 2.605 24.57
512 64 2560 2.176 235.27 2.589 24.72
512 64 3072 2.132 240.18 2.576 24.84
512 64 3584 2.535 201.96 2.695 23.74
512 64 4096 1.919 266.80 2.589 24.72
512 64 4608 2.525 202.77 2.589 24.72
512 64 5120 2.527 202.61 2.668 23.99
512 64 5632 2.477 206.70 2.639 24.25
512 64 6144 2.392 214.03 2.655 24.11
512 64 6656 2.513 203.73 2.714 23.58
512 64 7168 1.972 259.66 2.640 24.24
512 64 7680 1.921 266.51 2.595 24.66
512 64 8192 1.918 266.98 2.603 24.59
512 64 8704 1.998 256.26 2.600 24.62
512 64 9216 1.925 265.95 2.614 24.48
512 64 9728 2.535 202.01 2.611 24.51
512 64 10240 2.497 205.05 2.655 24.11
512 64 10752 2.375 215.59 2.642 24.23
512 64 11264 1.977 258.98 2.638 24.26
512 64 11776 2.087 245.32 2.611 24.51
512 64 12288 2.358 217.17 2.672 23.96
512 64 12800 2.227 229.94 2.705 23.66
512 64 13312 2.585 198.08 2.692 23.77
512 64 13824 1.956 261.79 2.683 23.85
512 64 14336 1.949 262.71 2.630 24.33
512 64 14848 2.181 234.80 2.664 24.02
512 64 15360 2.532 202.23 2.628 24.35
512 64 15872 1.969 260.05 2.631 24.32
512 64 16384 2.411 212.33 2.667 24.00
512 64 16896 2.470 207.28 2.691 23.78
512 64 17408 1.975 259.26 2.640 24.25
512 64 17920 2.501 204.74 2.651 24.14
512 64 18432 2.281 224.50 2.742 23.34
512 64 18944 2.516 203.50 2.638 24.26
512 64 19456 2.639 194.03 2.649 24.16
512 64 19968 2.042 250.70 2.683 23.85
512 64 20480 2.495 205.24 2.735 23.40
512 64 20992 2.014 254.16 2.690 23.80
512 64 21504 2.146 238.58 2.651 24.14
512 64 22016 1.978 258.83 2.676 23.92
512 64 22528 2.437 210.13 2.687 23.82
512 64 23040 1.999 256.08 2.699 23.72
512 64 23552 2.566 199.57 2.663 24.04
512 64 24064 1.985 257.87 2.668 23.98
512 64 24576 1.993 256.93 2.661 24.05
512 64 25088 2.533 202.17 2.669 23.98
512 64 25600 2.203 232.39 2.715 23.57
512 64 26112 2.514 203.65 2.756 23.22
512 64 26624 2.556 200.29 2.700 23.70
512 64 27136 2.652 193.07 2.717 23.56
512 64 27648 2.481 206.41 2.720 23.53
512 64 28160 2.409 212.57 2.724 23.49
512 64 28672 2.186 234.21 2.743 23.33
512 64 29184 2.009 254.87 2.683 23.85
512 64 29696 2.413 212.22 2.712 23.60
512 64 30208 2.455 208.53 2.737 23.39
512 64 30720 2.017 253.90 2.703 23.68
512 64 31232 2.469 207.39 2.682 23.86
512 64 31744 2.610 196.16 2.777 23.05
512 64 32256 2.346 218.24 2.779 23.03
512 64 32768 2.624 195.14 2.723 23.50
512 64 33280 2.043 250.60 2.781 23.01
512 64 33792 2.730 187.54 2.711 23.61
512 64 34304 2.059 248.72 2.744 23.32
512 64 34816 2.681 190.96 2.717 23.56
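
The -ot override patterns in the commands above follow a simple shape, so they can be generated rather than hand-typed. A minimal sketch (make_override is a hypothetical helper of mine, not part of ik_llama.cpp; the regex format is copied from the commands in this thread):

```python
# Sketch: generate the -ot buffer-type override patterns used above,
# instead of hand-typing the layer lists.

def make_override(layers, device):
    """Build one -ot pattern pinning the routed-expert tensors of `layers` to `device`."""
    alts = "|".join(str(l) for l in layers)
    return rf"blk\.({alts})\.ffn_(gate|up|down)_exps.*={device}"

# First 14 layers on CUDA0, last 14 on CUDA1 (the "default batches +28 exps" split):
cuda0 = make_override(range(0, 14), "CUDA0")
cuda1 = make_override(range(46, 60), "CUDA1")
print(cuda0)
print(cuda1)
```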

-ub 2048 -b 2048 +26 exps

./build/bin/llama-sweep-bench \
    --model "$model" \
    --ctx-size 36864 \
    -ger \
    --merge-qkv \
    -sm layer \
    -ngl 999 \
    -ot "blk\.(0|1|2|3|4|5|6|7|8|9|10|11|12)\.ffn_(gate|up|down)_exps.*=CUDA0" \
    -ot "blk\.(47|48|49|50|51|52|53|54|55|56|57|58|59)\.ffn_(gate|up|down)_exps.*=CUDA1" \
    -ub 2048 -b 2048 \
    --cpu-moe \
    --threads 24 \
    --no-mmap \
    --warmup-batch \
    -n 64
PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
2048 64 0 5.870 348.90 2.744 23.32
2048 64 2048 5.918 346.07 2.626 24.37
2048 64 4096 5.987 342.06 2.635 24.29
2048 64 6144 6.044 338.85 2.793 22.92
2048 64 8192 6.029 339.70 2.677 23.91
2048 64 10240 6.092 336.16 2.660 24.06
2048 64 12288 6.138 333.64 2.708 23.64
2048 64 14336 6.164 332.24 2.793 22.91
2048 64 16384 6.241 328.16 2.693 23.77
2048 64 18432 6.218 329.37 2.674 23.93
2048 64 20480 6.271 326.57 2.666 24.01
2048 64 22528 6.286 325.83 2.768 23.13
2048 64 24576 6.325 323.80 2.707 23.64
2048 64 26624 6.346 322.70 2.692 23.77
2048 64 28672 6.378 321.09 2.700 23.71
2048 64 30720 6.434 318.30 2.740 23.35
2048 64 32768 6.467 316.67 2.753 23.24
2048 64 34816 6.520 314.11 2.754 23.24

-ub 4096 -b 4096 +24 exps

./build/bin/llama-sweep-bench \
    --model "$model" \
    --ctx-size 36864 \
    -ger \
    --merge-qkv \
    -sm layer \
    -ngl 999 \
    -ot "blk\.(0|1|2|3|4|5|6|7|8|9|10|11)\.ffn_(gate|up|down)_exps.*=CUDA0" \
    -ot "blk\.(48|49|50|51|52|53|54|55|56|57|58|59)\.ffn_(gate|up|down)_exps.*=CUDA1" \
    -ub 4096 -b 4096 \
    --cpu-moe \
    --threads 24 \
    --no-mmap \
    --warmup-batch \
    -n 64
PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
4096 64 0 8.990 455.60 2.757 23.22
4096 64 4096 9.185 445.93 2.658 24.08
4096 64 8192 9.259 442.36 2.721 23.52
4096 64 12288 9.380 436.68 2.662 24.04
4096 64 16384 9.508 430.80 2.750 23.28
4096 64 20480 9.656 424.18 2.724 23.49
4096 64 24576 9.747 420.23 2.716 23.57
4096 64 28672 9.898 413.81 2.734 23.41
4096 64 32768 10.023 408.66 2.795 22.90

-ub 8192 -b 8192 +22 exps

./build/bin/llama-sweep-bench \
    --model "$model" \
    --ctx-size 40960 \
    -ger \
    --merge-qkv \
    -sm layer \
    -ngl 999 \
    -ot "blk\.(0|1|2|3|4|5|6|7|8|9|10)\.ffn_(gate|up|down)_exps.*=CUDA0" \
    -ot "blk\.(49|50|51|52|53|54|55|56|57|58|59)\.ffn_(gate|up|down)_exps.*=CUDA1" \
    -ub 8192 -b 8192 \
    --cpu-moe \
    --threads 24 \
    --no-mmap \
    --warmup-batch \
    -n 64
PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
8192 64 0 15.743 520.35 2.744 23.32
8192 64 8192 16.385 499.98 2.739 23.36
8192 64 16384 16.712 490.18 2.772 23.09
8192 64 24576 17.205 476.13 2.821 22.69
8192 64 32768 17.703 462.76 2.795 22.89
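
Pulling the N_KV = 0 row out of each of the four sweeps above shows how PP scales with u-batch size even as fewer experts fit in VRAM. A quick sketch, using the S_PP values copied from the first row of each table:

```python
# PP throughput at N_KV = 0 for the four u-batch sizes benchmarked above,
# normalized to the -ub 512 default (values copied from the tables).
pp_at_zero = {512: 214.70, 2048: 348.90, 4096: 455.60, 8192: 520.35}
base = pp_at_zero[512]
for ub, tps in sorted(pp_at_zero.items()):
    print(f"-ub {ub:5d}: {tps:6.2f} t/s ({tps / base:.2f}x)")
```

Roughly a 2.4x PP speedup going from -ub 512 to -ub 8192, at the cost of six fewer offloaded expert layers.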

-ub 4096 -b 4096 +26 exps ERROR

./build/bin/llama-sweep-bench \
    --model "$model" \
    --ctx-size 36864 \
    -ger \
    --merge-qkv \
    -sm layer \
    -ngl 999 \
    -ot "blk\.(0|1|2|3|4|5|6|7|8|9|10|11|12)\.ffn_(gate|up|down)_exps.*=CUDA0" \
    -ot "blk\.(47|48|49|50|51|52|53|54|55|56|57|58|59)\.ffn_(gate|up|down)_exps.*=CUDA1" \
    -ub 4096 -b 4096 \
    --cpu-moe \
    --threads 24 \
    --no-mmap \
    --warmup-batch \
    -n 64
PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
4096 64 0 8.756 467.79 2.714 23.59
4096 64 4096 8.964 456.93 2.639 24.25
4096 64 8192 9.016 454.33 2.673 23.95
CUDA error: an unsupported value or parameter was passed to the function
  current device: 1, in function ggml_cuda_op_mul_mat_cublas at /home/w/projects/ik_llama.cpp/ggml/src/ggml-cuda.cu:1486
  cublasSgemm_v2(ctx.cublas_handle(id), CUBLAS_OP_T, CUBLAS_OP_N, row_diff, src1_ncols, ne10, &alpha, src0_ddf_i, ne00, src1_ddf1_i, ne10, &beta, dst_dd_i, ldc)
/home/w/projects/ik_llama.cpp/ggml/src/ggml-cuda.cu:131: CUDA error

@MrHills-rs

Using -ub 8192 shows significant improvement in PP for me.

Interestingly, I see a small uplift in TG (4%) using the same parameters in the new PR.

Main (commit 66323b9, ~3 hours old) -b 4096 -ub 4096

PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
4096 128 0 6.347 645.33 7.677 16.67
4096 128 4096 6.365 643.47 7.693 16.64
4096 128 8192 6.427 637.33 7.687 16.65
4096 128 12288 6.477 632.37 7.732 16.56
4096 128 16384 6.531 627.21 7.758 16.50
4096 128 20480 6.549 625.48 7.746 16.52
4096 128 24576 6.590 621.54 7.787 16.44
4096 128 28672 6.611 619.53 7.825 16.36
4096 128 32768 6.683 612.86 7.828 16.35
4096 128 36864 6.726 609.02 7.854 16.30
4096 128 40960 6.756 606.28 7.872 16.26
4096 128 45056 6.800 602.35 7.893 16.22
4096 128 49152 6.851 597.90 7.963 16.07
4096 128 53248 6.894 594.15 7.992 16.02
4096 128 57344 6.946 589.67 8.058 15.88
4096 128 61440 7.013 584.05 8.073 15.86

New PR -b 4096 -ub 4096

PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
4096 128 0 6.354 644.63 7.452 17.18
4096 128 4096 6.378 642.16 7.447 17.19
4096 128 8192 6.432 636.86 7.450 17.18
4096 128 12288 6.490 631.12 7.486 17.10
4096 128 16384 6.543 626.05 7.514 17.04
4096 128 20480 6.565 623.88 7.513 17.04
4096 128 24576 6.612 619.51 7.537 16.98
4096 128 28672 6.634 617.46 7.561 16.93
4096 128 32768 6.710 610.46 7.573 16.90
4096 128 36864 6.755 606.34 7.603 16.84
4096 128 40960 6.779 604.22 7.624 16.79
4096 128 45056 6.834 599.38 7.642 16.75
4096 128 49152 6.878 595.54 7.655 16.72
4096 128 53248 6.927 591.33 7.699 16.62
4096 128 57344 6.964 588.19 7.758 16.50
4096 128 61440 7.032 582.46 7.826 16.36
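
The TG uplift can be quantified directly from the S_TG columns of the two -b 4096 tables above. A quick sketch, with two sample rows copied from the tables:

```python
# Quantify the TG uplift between main (66323b9) and this PR, from the
# S_TG columns of the two -b 4096 -ub 4096 tables above.
def uplift_pct(old_tg, new_tg):
    """Percentage change from old_tg to new_tg."""
    return (new_tg - old_tg) / old_tg * 100.0

print(round(uplift_pct(16.67, 17.18), 1))  # N_KV = 0
print(round(uplift_pct(16.07, 16.72), 1))  # N_KV = 49152
```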

New PR -b 8192 -ub 8192

PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
8192 128 0 9.369 874.33 7.467 17.14
8192 128 8192 9.540 858.66 7.478 17.12
8192 128 16384 9.673 846.89 7.521 17.02
8192 128 24576 9.774 838.11 7.575 16.90
8192 128 32768 9.955 822.89 7.615 16.81
8192 128 40960 10.135 808.32 7.636 16.76
8192 128 49152 10.322 793.61 7.709 16.60
8192 128 57344 10.519 778.80 7.791 16.43

(7800X3D 8-core, DDR5-6000, RTX 5090, PCIe 5)

      ./build/bin/llama-sweep-bench \
      -m ~/AI/ik/models/Qwen3.5-397B-A17B-smol-IQ2_XS.gguf \
      --slot-save-path ~/AI/ik/slots \
      --context-shift on \
      -ot "blk\.(?:[0-9]|[1-4][0-9]|[5][0-5])\.ffn.*_exps.*=CPU" \
      -c 65536 \
      -n 128 \
      -b X -ub X \
      -ctk q8_0 -ctv q8_0 \
      --cache-ram-n-min 128 \
      --cache-ram-similarity 1 \
      --slot-prompt-similarity 0.1 \
      --threads 8 -ngl 95 \
      -cuda fusion=1,offload-batch-size=4,mmq-id-size=128 \
      -amb 512 \
      --host 127.0.0.1 \
      --port 8080 \
      --webui none \
      --repeat-last-n 2048 \
      --reasoning-format none --jinja \
      --chat-template-file ~/AI/ik3/jinja/qwen-3.5-toolsinject.jinja
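
As a sanity check, the layer alternation in the -ot pattern above selects layers 0 through 55 for the CPU. A quick sketch (fullmatch tests each candidate number in isolation, as the surrounding `blk\.` and `\.` do in the full pattern):

```python
import re

# The layer-selection part of the -ot pattern above.
layer_pat = re.compile(r"(?:[0-9]|[1-4][0-9]|[5][0-5])")
kept = [n for n in range(96) if layer_pat.fullmatch(str(n))]
print(kept[0], kept[-1], len(kept))  # layers 0..55 pinned to CPU
```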

@MrHills-rs commented Feb 21, 2026

One last thing, if I may: f16 cache seems to be slightly faster than Q8_0 in TG at long context, which is a bit odd.

New PR -ctk f16 -ctv f16

PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
4096 128 0 6.384 641.55 7.467 17.14
4096 128 4096 6.409 639.09 7.416 17.26
4096 128 8192 6.458 634.25 7.425 17.24
4096 128 12288 6.519 628.36 7.473 17.13
4096 128 16384 6.577 622.81 7.481 17.11
4096 128 20480 6.595 621.08 7.478 17.12
4096 128 24576 6.631 617.68 7.506 17.05
4096 128 28672 6.651 615.88 7.508 17.05
4096 128 32768 6.723 609.23 7.530 17.00
4096 128 36864 6.772 604.84 7.560 16.93
4096 128 40960 6.793 602.95 7.556 16.94
4096 128 45056 6.856 597.39 7.586 16.87
4096 128 49152 6.897 593.85 7.610 16.82
4096 128 53248 6.948 589.52 7.608 16.82
4096 128 57344 6.984 586.49 7.652 16.73
4096 128 61440 7.057 580.39 7.690 16.64

New PR -ctk Q8_0 -ctv Q8_0

PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
4096 128 0 6.354 644.63 7.452 17.18
4096 128 4096 6.378 642.16 7.447 17.19
4096 128 8192 6.432 636.86 7.450 17.18
4096 128 12288 6.490 631.12 7.486 17.10
4096 128 16384 6.543 626.05 7.514 17.04
4096 128 20480 6.565 623.88 7.513 17.04
4096 128 24576 6.612 619.51 7.537 16.98
4096 128 28672 6.634 617.46 7.561 16.93
4096 128 32768 6.710 610.46 7.573 16.90
4096 128 36864 6.755 606.34 7.603 16.84
4096 128 40960 6.779 604.22 7.624 16.79
4096 128 45056 6.834 599.38 7.642 16.75
4096 128 49152 6.878 595.54 7.655 16.72
4096 128 53248 6.927 591.33 7.699 16.62
4096 128 57344 6.964 588.19 7.758 16.50
4096 128 61440 7.032 582.46 7.826 16.36

      ./build/bin/llama-sweep-bench \
      -m ~/AI/ik/models/Qwen3.5-397B-A17B-smol-IQ2_XS.gguf \
      --slot-save-path ~/AI/ik/slots \
      --context-shift on \
      -ot "blk\.(?:[0-9]|[1-4][0-9]|[5][0-5])\.ffn.*_exps.*=CPU" \
      -c 65536 \
      -n 128 \
      -b 4096 -ub 4096 \
      -ctk f16 -ctv f16 \
      --cache-ram-n-min 128 \
      --cache-ram-similarity 1 \
      --slot-prompt-similarity 0.1 \
      --threads 8 -ngl 95 \
      -cuda fusion=1,offload-batch-size=4,mmq-id-size=128 \
      -amb 512 \
      --host 127.0.0.1 \
      --port 8080 \
      --webui none \
      --repeat-last-n 2048 \
      --reasoning-format none --jinja \
      --chat-template-file ~/AI/ik3/jinja/qwen-3.5-toolsinject.jinja

@ubergarm (Contributor)

@MrHills-rs

One last thing, if I may: f16 cache seems to be slightly faster than Q8_0 in TG at long context, which is a bit odd.

I've seen similar behavior when offloading on a newish GPU: unquantized f16 kv-cache can be faster and has a less steep drop-off compared to q8_0 kv-cache. IIRC it was GLM-4.7 that slowed down much faster with q8_0 kv-cache.

In general I run llama-sweep-bench with unquantized kv-cache on GPU, but use q8_0 when running CPU-only.

@ikawrakow (Owner, Author)

@MrHills-rs

Have you ever observed quantized KV cache being faster than f16 on CUDA? This would be a really big surprise because quantized KV cache is computed via dequantize to f16 -> use f16 FA implementation, so it is always slower. The extra dequantize step is particularly bad for TG, so TG at long context with quantized KV cache is significantly slower compared to f16 KV cache.

On the CPU it is the other way around, with Q8_0 KV cache being the fastest. This is so because the CPUs I have available do not have native support for f16 arithmetic, so the computation is implemented via conversion to f32 and then f32 arithmetic. But even with native f16 support Q8_0 would be slightly faster because one can process more elements per SIMD instruction. In case your CPU has native bf16 support, bf16 KV cache would be faster than f16, but still slower than Q8_0.
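
The extra dequantize pass described above can be sketched numerically. The block layout below matches ggml's Q8_0 (blocks of 32 values sharing one scale), but the pure-Python helpers are my illustration, not ik_llama.cpp code:

```python
# Sketch of the extra pass a quantized KV cache pays on CUDA: before the
# f16 flash-attention kernel can run, every Q8_0 block (one scale + 32
# int8 values) must be dequantized back to floats -- a full extra
# read/write over the cache that an f16 cache avoids.

def q8_0_quantize(x):
    """Quantize floats (length a multiple of 32) into (scale, [int8]*32) blocks."""
    out = []
    for i in range(0, len(x), 32):
        block = x[i:i + 32]
        amax = max(abs(v) for v in block)
        scale = amax / 127.0 if amax > 0 else 1.0
        out.append((scale, [round(v / scale) for v in block]))
    return out

def q8_0_dequantize(blocks):
    """The dequantize pass a quantized cache pays before attention."""
    return [scale * q for scale, qs in blocks for q in qs]

x = [(-1) ** i * (i % 7) * 0.5 for i in range(64)]
y = q8_0_dequantize(q8_0_quantize(x))
print(max(abs(a - b) for a, b in zip(x, y)))  # error bounded by half a step per block
```

The round trip is nearly lossless, but it is an extra pass over the whole cache; with f16, the FA kernel reads the cache directly.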

@MrHills-rs

@MrHills-rs

Have you ever observed quantized KV cache being faster than f16 on CUDA? This would be a really big surprise because quantized KV cache is computed via dequantize to f16 -> use f16 FA implementation, so it is always slower. The extra dequantize step is particularly bad for TG, so TG at long context with quantized KV cache is significantly slower compared to f16 KV cache.

On the CPU it is the other way around, with Q8_0 KV cache being the fastest. This is so because the CPUs I have available do not have native support for f16 arithmetic, so the computation is implemented via conversion to f32 and then f32 arithmetic. But even with native f16 support Q8_0 would be slightly faster because one can process more elements per SIMD instruction. In case your CPU has native bf16 support, bf16 KV cache would be faster than f16, but still slower than Q8_0.

Admittedly, it's been a very long time since I tested f16 caches, as there has been little reason for me to use them. I've only started thinking about it with this model, since both its kv-cache memory footprint and TG loss are very small even at long context.

I'm now using a 5090, which has very fast memory for its compute, and my main bottleneck is RAM anyway, so I wasn't expecting much of a difference.
Back when I had a 4090 (20% less compute, but 45% less bandwidth) and MoE wasn't a thing, all tensors were in VRAM, and I used to see a 40% drop in performance going from q4 to q8, and another 30% from q8 to fp16.

To be honest, I was surprised too, but I simply assumed that the bottleneck was memory bandwidth rather than computation.

This wasn't on llama.cpp, though. Back in the GPU-only inference days, the gold standard for speed was exllamaV2, which was much faster for GPU inference albeit slower with mixed inference.

Anyway, 262k -c is only 8 GB at f16, TG loss at long context is completely negated by self-speculative decoding, the model seems to be less susceptible to weight-quantization loss, and the ik_llama API now allows caching KV data on SSD or RAM and force-swapping slots via ST extension slash commands, so PP speed is less relevant now. With long context suddenly being so viable, f16 makes a lot more sense nowadays, especially since it's a little faster too. I think I'll start including it in tests.

@ikawrakow (Owner, Author)

Admittedly it's been a very long time since I tested f16 caches

I guess, yes, it has been. The experience that quantized cache gives better performance on CUDA dates from the time when FA was implemented with vector or tile kernels. Since the implementation using MMA instructions came along, which was a long time ago, f16 KV cache has been much faster than quantized.

ikawrakow merged commit 89b1e2b into main on Feb 22, 2026
abc-nix pushed a commit to abc-nix/ik_llama.cpp that referenced this pull request Feb 26, 2026
* Better estimate for max. nuber of compute nodes

* Just in case