Better estimate for max. number of compute nodes #1296
Conversation
Giving this a try with hybrid CPU+GPU using ubergarm/Qwen3.5-397B-A17B Q3_K 179.97 GiB (3.90 BPW). Interestingly, increasing batch sizes improves PP fairly significantly despite offloading fewer routed experts to VRAM. There is a little loss in TG speed, likely because fewer weights fit in VRAM once room is made for the bigger batch sizes.
I did see one new error when increasing the batch size to 4096, probably because I didn't have quite enough VRAM available; details below.

```
CUDA error: an unsupported value or parameter was passed to the function
current device: 1, in function ggml_cuda_op_mul_mat_cublas at /home/w/projects/ik_llama.cpp/ggml/src/ggml-cuda.cu:1486
cublasSgemm_v2(ctx.cublas_handle(id), CUBLAS_OP_T, CUBLAS_OP_N, row_diff, src1_ncols, ne10, &alpha, src0_ddf_i, ne00, src1_ddf1_i, ne10, &beta, dst_dd_i, ldc)
/home/w/projects/ik_llama.cpp/ggml/src/ggml-cuda.cu:131: CUDA error
```

**default batches +28 exps**

```shell
./build/bin/llama-sweep-bench \
  --model "$model" \
  --ctx-size 40960 \
  -ger \
  --merge-qkv \
  -sm layer \
  -ngl 999 \
  -ot "blk\.(0|1|2|3|4|5|6|7|8|9|10|11|12|13)\.ffn_(gate|up|down)_exps.*=CUDA0" \
  -ot "blk\.(46|47|48|49|50|51|52|53|54|55|56|57|58|59)\.ffn_(gate|up|down)_exps.*=CUDA1" \
  --cpu-moe \
  --threads 24 \
  --no-mmap \
  --warmup-batch \
  -n 64
```
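The `-ot` override patterns above are plain regexes matched against tensor names, so it's easy to check offline which tensors a pattern will pin to a device. A small sketch (the tensor names below follow the usual `blk.N.ffn_*_exps` naming and are examples, not dumped from this model):

```python
import re

# Same pattern as the first -ot flag above: expert FFN tensors of blocks 0-13 go to CUDA0
pattern = re.compile(r"blk\.(0|1|2|3|4|5|6|7|8|9|10|11|12|13)\.ffn_(gate|up|down)_exps.*")

tensors = [
    "blk.0.ffn_gate_exps.weight",
    "blk.13.ffn_down_exps.weight",
    "blk.14.ffn_up_exps.weight",   # block 14: not matched, left on CPU by --cpu-moe
    "blk.1.attn_q.weight",         # attention tensor: not an expert, not matched
]

matched = [t for t in tensors if pattern.match(t)]
print(matched)
```

Note that the explicit alternation `(0|1|...|13)` is needed because something like `blk\.[0-9]+` would also capture block 14; the trailing `\.` after the group is what stops `13` from matching inside `14`.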
**-ub 2048 -b 2048 +26 exps**

```shell
./build/bin/llama-sweep-bench \
  --model "$model" \
  --ctx-size 36864 \
  -ger \
  --merge-qkv \
  -sm layer \
  -ngl 999 \
  -ot "blk\.(0|1|2|3|4|5|6|7|8|9|10|11|12)\.ffn_(gate|up|down)_exps.*=CUDA0" \
  -ot "blk\.(47|48|49|50|51|52|53|54|55|56|57|58|59)\.ffn_(gate|up|down)_exps.*=CUDA1" \
  -ub 2048 -b 2048 \
  --cpu-moe \
  --threads 24 \
  --no-mmap \
  --warmup-batch \
  -n 64
```
**-ub 4096 -b 4096 +24 exps**

```shell
./build/bin/llama-sweep-bench \
  --model "$model" \
  --ctx-size 36864 \
  -ger \
  --merge-qkv \
  -sm layer \
  -ngl 999 \
  -ot "blk\.(0|1|2|3|4|5|6|7|8|9|10|11)\.ffn_(gate|up|down)_exps.*=CUDA0" \
  -ot "blk\.(48|49|50|51|52|53|54|55|56|57|58|59)\.ffn_(gate|up|down)_exps.*=CUDA1" \
  -ub 4096 -b 4096 \
  --cpu-moe \
  --threads 24 \
  --no-mmap \
  --warmup-batch \
  -n 64
```
**-ub 8192 -b 8192 +22 exps**

```shell
./build/bin/llama-sweep-bench \
  --model "$model" \
  --ctx-size 40960 \
  -ger \
  --merge-qkv \
  -sm layer \
  -ngl 999 \
  -ot "blk\.(0|1|2|3|4|5|6|7|8|9|10)\.ffn_(gate|up|down)_exps.*=CUDA0" \
  -ot "blk\.(49|50|51|52|53|54|55|56|57|58|59)\.ffn_(gate|up|down)_exps.*=CUDA1" \
  -ub 8192 -b 8192 \
  --cpu-moe \
  --threads 24 \
  --no-mmap \
  --warmup-batch \
  -n 64
```
**-ub 4096 -b 4096 +26 exps: ERROR**
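The pattern in the runs above, dropping roughly one offloaded expert layer per GPU each time the u-batch doubles, is just a VRAM balance between weights and activation buffers. A back-of-the-envelope sketch of that tradeoff; every number here (layer size, free VRAM, hidden size, buffer count) is an assumption for illustration, not a measurement from this model:

```python
GIB = 1024**3

# Hypothetical sizes for illustration only, not measured from this model.
expert_layer_bytes = 2 * GIB   # one layer's ffn_(gate|up|down)_exps at ~3.9 BPW
free_vram = 30 * GIB           # assumed VRAM left per GPU after dense weights + KV

def max_expert_layers(ubatch, hidden=4096, n_buffers=64):
    """Compute/activation buffers grow roughly linearly with the u-batch size,
    so a bigger u-batch leaves room for fewer offloaded expert layers."""
    batch_buffer_bytes = ubatch * hidden * 4 * n_buffers  # f32 activations (assumed)
    return int((free_vram - batch_buffer_bytes) // expert_layer_bytes)

layers = [max_expert_layers(ub) for ub in (512, 2048, 4096, 8192)]
print(layers)  # non-increasing as the u-batch grows
```

The exact counts depend entirely on the assumed constants; the point is only the direction of the tradeoff, which matches the +28/+26/+24/+22 progression above.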
Using -ub 8192 shows a significant improvement in PP for me. Interestingly, I also see a small uplift in TG (4%) using the same parameters on the new PR.

**Main (3 hours old, 66323b9) -b 4096 -ub 4096**

**New PR -b 4096 -ub 4096**

**New PR -b 8192 -ub 8192**

(7800X3D 8-core, DDR5-6000, 5090, PCIe 5)
One last thing, if I may: the f16 cache seems to be slightly faster than Q8_0 in TG at long context, which is a bit odd.

**New PR -ctk f16 -ctv f16**

**New PR -ctk Q8_0 -ctv Q8_0**
I've seen similar when offloading on a newish GPU: the unquantized f16 KV cache can be faster and shows a less steep drop-off compared to the q8_0 KV cache. IIRC GLM-4.7 slowed down much faster with the q8_0 KV cache. In general I do my llama-sweep-bench runs with an unquantized KV cache when running on GPU, but use q8_0 when running CPU-only.
Have you ever observed a quantized KV cache being faster than f16 on CUDA? On the CPU it is the other way around, with the quantized cache being the faster one.
Admittedly, it's been a very long time since I tested f16 caches, since there has been little reason for me to use them. I've only started thinking about it with this model, since both its KV cache memory footprint and TG loss are very small even at long context. I'm now using a 5090, which has very fast memory for its compute, and my main bottleneck is RAM anyway, so I wasn't expecting much of a difference. Tbh I was surprised too, but I simply assumed the bottleneck was memory bandwidth rather than computation. That wasn't on llama.cpp though; back in the GPU-only inference days the gold standard for speed was exllamaV2, which was much faster for GPU inference albeit slower with mixed inference.

Anyway, 262k of context is only 8 GB at f16, TG loss at long context is completely negated by self-speculative decoding, the model seems less susceptible to weight quantization loss, and the ik_llama API now allows caching KV data on SSD or RAM and force-swapping slots via ST extension slash commands, so PP speed is less relevant now. With long context suddenly being so viable, f16 makes a lot more sense nowadays, especially since it's a little faster too. I think I'll start including it in tests.
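The "262k context is only 8 GB at f16" figure is consistent with simple arithmetic if only a fraction of the layers carry full-attention KV, as in hybrid models. The layer count, KV-head count, and head size below are illustrative assumptions chosen to match, not the model's published config:

```python
def kv_cache_bytes(n_ctx, n_attn_layers=16, n_kv_heads=2, head_dim=256, bytes_per_el=2):
    """f16 KV cache size: K and V (factor 2), per attention layer, per KV head,
    per head dimension, per context position, at 2 bytes per f16 element."""
    return 2 * n_attn_layers * n_kv_heads * head_dim * n_ctx * bytes_per_el

gib = kv_cache_bytes(262144) / 1024**3
print(f"{gib:.1f} GiB")  # 8.0 GiB under these assumed dimensions
```

A dense model with, say, 60 full-attention layers and a larger KV width would be several times bigger at the same context, which is why this footprint stands out.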
I guess, yes, it has been. The experience that a quantized cache gives better performance on CUDA is from the times when FA was implemented with vector or tile kernels. When the implementation that uses MMA instructions came along, which was a long time ago, that advantage went away.
- Better estimate for max. number of compute nodes; just in case
- server: fix crash from adaptive p (ikawrakow#1304)
- Fix tool call for Qwen3.5 (ikawrakow#1300): loosely based on mainline changes ggml-org/llama.cpp#19635 and ggml-org/llama.cpp#19765; also changes the grammar to allow the model to make multiple tool calls in a row (likely broken for Qwen3 Coder before this) and fixes the grammar for the subsequent parameters after the first one
- Graph parallel for Qwen3-Next (ikawrakow#1292): works, but is slower than split mode layer
- Fix llm_arch_is_hybrid (ikawrakow#1305)
- Fix max nodes (again) (ikawrakow#1306)
- Fix typo in merge-up-gate-experts argument (ikawrakow#1311)
- llama-quantize: --dry-run option (ikawrakow#1309)
- Slightly better graph parallel for Qwen3-Next (ikawrakow#1307): make sure we pick the reduced tensor from the right GPU
- Minor delta-net tweak (ikawrakow#1308)
- adaptive p: collect probability before logit bias (ikawrakow#1314)
- server: propagate task index to response objects for batch requests (ikawrakow#1303): when multiple prompts are sent in a single /v1/completions request, each response needs to carry the correct index so the client can match results to their prompts; the index field was not being set on partial, final, or embedding responses, so batch results all reported index 0. Now sets res->index = slot.task->index in send_partial_response, send_final_response, and send_embedding
- llama-quantize: partial requant feature (ikawrakow#1313): inspired by the recently added --dry-run feature; allows partially requantizing a split quantized .gguf by requantizing only the missing splits in the destination directory. Works both for GGUFs split tensor by tensor and for splits grouping several tensors (the latter not tested much beyond 2 tensors per split). Also creates the output directory if it doesn't exist, in both llama-quantize and gguf-split, using std::filesystem
- Display the size of the tensors overridden during tensor loading (ikawrakow#1318): e.g. `Tensor blk.60.ffn_up_exps.weight (size = 668467200 bytes) buffer type overriden to CPU`, with the size shown in MiB; the later `llm_load_tensors: CPU buffer size = XXX.XX MiB` lines for unnamed buffer overrides are demoted to debug output, since that double display clutters the screen without being very informative
- Fused delta-net (ikawrakow#1315): revives fused delta-net, adds a command line argument for it and -fdn to llama-bench, plus CUDA and CPU optimizations; the fused CPU path seems to be faster than the chunked implementation. The meaning of fdn changes from a bool flag to a threshold value; uses eps = 1e-6 and gives some nodes a name
- Fix KT quantization yet again (ikawrakow#1321): adds the same 1e-16f check for all quants in iqk_quantize.cpp, plus fixes for k-quants
- server: enable checkpoint for recurrent models (ikawrakow#1310): create a checkpoint after cancel, fix ban strings, remove context during rewind, add a checkpoint interval, only save the recurrent cache, and save checkpoints during PP
- Faster quantization for MoE models with many experts (ikawrakow#1322)
- Fused delta-net 2 (ikawrakow#1320): don't re-apply the L2 norm (it has already been done), further tweaks, and restore the per-context buffer size log (not everybody uses models split into 2000 parts, and those who do actually want to see the buffer sizes)
- Adding support for dense Qwen-3.5 models (ikawrakow#1326)
- add directio to llama-bench

Now you can have whatever u-batch size you want with Qwen3-Next and Qwen-3.5-MoE (or any other supported model).
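For context on what estimating the max number of compute nodes means: the graph allocator needs an upper bound on the node count before the graph is built, and that count grows with the number of weight tensors (and, for delta-net style models, with how the u-batch is chunked). A sketch of the kind of heuristic involved; mainline llama.cpp uses a similar `max(8192, 5 * n_tensors)` rule, and the function below is an illustration of that shape, not ik_llama.cpp's actual formula:

```python
def max_graph_nodes(n_tensors, floor=8192, nodes_per_tensor=5):
    """Upper bound on compute-graph nodes: each weight tensor spawns a handful
    of ops (mul_mat, norms, activations), so scale linearly with a safety floor."""
    return max(floor, nodes_per_tensor * n_tensors)

print(max_graph_nodes(100))   # small model: the floor dominates -> 8192
print(max_graph_nodes(4000))  # big MoE with many expert tensors -> 20000
```

Underestimating this bound is what causes graph-allocation failures at unusual u-batch sizes, which is why a better estimate unlocks arbitrary u-batch sizes for Qwen3-Next and Qwen-3.5-MoE.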