
Faster quantization for MoE models with many experts #1322

Merged
ikawrakow merged 1 commit into main from ik/faster_moe_quantize on Feb 26, 2026
Conversation

@ikawrakow
Owner

While working on #1321 I was bothered by the slow quantization of the MoE tensors. Based on simple napkin math, it seemed the MoE tensors take significantly longer per model weight than the other tensors. The Qwen3-Coder-Next model I was working with has a very large number of experts (512), but each expert is quite small (2048 x 512). The way quantization is done on the main branch, a new set of threads is started for each expert. The CPU I'm currently working on has 64 threads, so that is 512 x 64 thread launches per MoE tensor, or 512 x 64 x 3 x 48 = 4,718,592 thread launches to quantize the expert tensors in this model. With the system under heavy load (as it is during quantization), launching 4+ million threads is not exactly cheap.
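To make the napkin math concrete, here is a minimal hypothetical sketch of the per-expert scheduling described above. The helper `quantize_rows`, the function names, and the shapes are all assumptions for illustration, not the actual ik_llama.cpp code:

```cpp
// Hypothetical sketch of the scheduling on main: a fresh set of worker threads is
// spawned and joined for every expert of every MoE tensor. With 512 experts and
// 64 threads that is 512 x 64 launches per tensor, and with 3 MoE tensors per
// layer over 48 layers: 512 * 64 * 3 * 48 = 4,718,592 thread launches.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Assumed helper: quantize rows [row_first, row_last) of one (n_rows x n_cols)
// expert matrix. Stub body for illustration only.
static void quantize_rows(const float * src, char * dst, int64_t row_first, int64_t row_last,
                          int64_t n_cols, size_t dst_row_size) {
    for (int64_t r = row_first; r < row_last; ++r) {
        // real code would quantize src + r*n_cols into dst + r*dst_row_size
        (void)src; (void)dst; (void)n_cols; (void)dst_row_size;
    }
}

static void quantize_experts_per_row(const float * src, char * dst, int64_t n_experts,
                                     int64_t n_rows, int64_t n_cols, size_t dst_row_size,
                                     int n_threads) {
    const int64_t rows_per_thread = (n_rows + n_threads - 1) / n_threads;
    for (int64_t e = 0; e < n_experts; ++e) {                 // one pass per expert
        const float * esrc = src + e * n_rows * n_cols;
        char        * edst = dst + e * n_rows * dst_row_size;
        std::vector<std::thread> workers;
        for (int t = 0; t < n_threads; ++t) {                 // n_threads launches per expert
            const int64_t first = t * rows_per_thread;
            const int64_t last  = std::min(n_rows, first + rows_per_thread);
            if (first >= last) break;
            workers.emplace_back(quantize_rows, esrc, edst, first, last, n_cols, dst_row_size);
        }
        for (auto & w : workers) w.join();                    // join before the next expert
    }
}
```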

This PR is a quick band-aid, hopefully making @ubergarm's life easier while preparing the quantized models for the new Qwen3 entries.

Instead of launching N threads per expert, we now let each thread quantize a whole expert. This reduces the number of thread launches by a factor of 512 for Qwen3-Coder-Next. As an example of the effect, IQ4_K quantization time drops to about 300 seconds, down from 520 seconds on the main branch.

I'm calling it a band-aid because no attempt is made to accommodate people who like using strange numbers of threads (I have seen 13, 47, and such). The optimization only takes effect if the number of experts is an exact multiple of the number of threads, or if there are at least twice as many experts as threads.
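A corresponding hypothetical sketch of the new scheduling, including a guard that mirrors the condition stated above; again, the names and the `quantize_rows` helper (same as in the previous sketch) are assumptions, not the actual implementation:

```cpp
// Hypothetical sketch of the new scheduling: threads are launched once per tensor
// and each thread quantizes whole experts, so the per-expert thread launches go away.
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Same assumed helper as in the previous sketch.
void quantize_rows(const float * src, char * dst, int64_t row_first, int64_t row_last,
                   int64_t n_cols, size_t dst_row_size);

static void quantize_experts_per_thread(const float * src, char * dst, int64_t n_experts,
                                        int64_t n_rows, int64_t n_cols, size_t dst_row_size,
                                        int n_threads) {
    // Only worthwhile when the experts divide evenly over the threads, or when
    // there are at least twice as many experts as threads.
    const bool expert_parallel = (n_experts % n_threads == 0) || (n_experts >= 2 * n_threads);
    if (!expert_parallel) {
        return;   // fall back to the old per-expert, row-parallel scheduling (previous sketch)
    }
    std::atomic<int64_t> next_expert{0};
    auto work = [&]() {
        // Each thread grabs whole experts via an atomic counter until none are left.
        for (int64_t e = next_expert.fetch_add(1); e < n_experts; e = next_expert.fetch_add(1)) {
            quantize_rows(src + e * n_rows * n_cols, dst + e * n_rows * dst_row_size,
                          0, n_rows, n_cols, dst_row_size);
        }
    };
    std::vector<std::thread> workers;
    for (int t = 0; t < n_threads; ++t) workers.emplace_back(work);   // launched once per tensor
    for (auto & w : workers) w.join();
}
```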

@ubergarm
Contributor

ubergarm commented Feb 25, 2026

AMD EPYC 9755 128-Core Processor using 128 threads on a single socket, making the Qwen3-Coder-Next IQ1_KT quant:

  • Before PR
    • main: quantize time = 696140.13 ms
  • With PR
    • main: quantize time = 407093.06 ms

So from 11.60 minutes down to 6.78 minutes, a very nice 1.71x speedup here!


One more test for little Qwen3.5-35B-A3B Q8_0

  • Before PR
    • main: quantize time = 74079.11 ms
  • With PR
    • main: quantize time = 19256.94 ms

So from 1.23 minutes down to 0.32 minutes, a 3.85x speedup! (I even ran the "before PR" main branch shortly after the fast one, so the difference is not just due to disk caching etc.)

@ikawrakow ikawrakow merged commit 87b35da into main Feb 26, 2026
abc-nix pushed a commit to abc-nix/ik_llama.cpp that referenced this pull request Feb 26, 2026
* Better estimate for max. number of compute nodes

* Just in case

server: fix crash from adaptive p (ikawrakow#1304)

Co-authored-by: firecoperana <firecoperana>

Fix tool call for Qwen3.5 (ikawrakow#1300)

* Fix tool call for Qwen3.5

Loosely based on mainline changes from:
* ggml-org/llama.cpp#19635
* ggml-org/llama.cpp#19765

Also need to change the grammar to allow the model to make multiple
tool calls in a row. This was likely broken for Qwen3 Coder prior to
this commit.

* Fix the grammar for the subsequent parameters after the first one

Graph parallel for Qwen3-Next (ikawrakow#1292)

* WIP

* This works, but is slower than split mode layer

Fix llm_arch_is_hybrid (ikawrakow#1305)

Fix max nodes (again) (ikawrakow#1306)

Fix typo in merge-up-gate-experts argument (ikawrakow#1311)

llama-quantize: --dry-run option (ikawrakow#1309)

Slightly better graph parallel for Qwen3-Next (ikawrakow#1307)

* Make sure we pick the reduced tensor from the right GPU

* Minor

Minor delta-net tweak (ikawrakow#1308)

* Make sure we pick the reduced tensor from the right GPU

* Minor

* Minor delta-net tweak

adaptive p: collect probability before logit bias (ikawrakow#1314)

server: propagate task index to response objects for batch requests (ikawrakow#1303)

When multiple prompts are sent in a single /v1/completions request,
each response needs to carry the correct index so the client can
match results to their corresponding prompts. The index field was
not being set on partial responses, final responses, or embedding
responses, causing batch results to all report index 0.

Set res->index = slot.task->index in send_partial_response,
send_final_response, and send_embedding.

Generated with [Devin](https://cli.devin.ai/docs)

Co-authored-by: Joshua Jolley <jjolley@clearwateranalytics.com>
Co-authored-by: Devin <noreply@cognition.ai>

Llama-quantize: Partial requant feature (ikawrakow#1313)

* Partial Requant feature for llama-quantize

- Inspired by the recently ported --dry-run feature.
- Allows partially requantizing a split quantized .gguf by requantizing only the missing splits in the destination directory.
- Works both for GGUFs split tensor by tensor and for those split into groups of several tensors (though the latter is not tested much beyond 2 tensors per split).
- Vibe coded.

* Create output directory if it doesn't exist in llama-quantize

* Create output directory if it doesn't exist in gguf-split

* Add exit when directory fails to be created on Windows

* Use std::filesystem

* cleanup

Display the size of the tensors overridden during the tensor loading (ikawrakow#1318)

* Display the size of the tensors overridden during the tensor loading

Ex:

`Tensor blk.60.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.60.ffn_up_exps.weight buffer type overriden to CPU`

become

`Tensor blk.60.ffn_up_exps.weight (size = 668467200 bytes) buffer type overriden to CPU
Tensor blk.60.ffn_gate_exps.weight (size = 668467200 bytes) buffer type overriden to CPU`

Also demote to debug level the later-displayed size of the unnamed buffer overrides.

Ex : `llm_load_tensors:        CPU buffer size =   XXX.XX MiB`

That double display is cluttering the screen without being very informative.

* change bytes display to MiB.

Co-authored-by: Kawrakow <iwankawrakow@gmail.com>

---------

Co-authored-by: Kawrakow <iwankawrakow@gmail.com>

Fused delta-net (ikawrakow#1315)

* Revive fused delta-net

* Add command line argument for fused delta net

* Simplify/improve CUDA delta-net

* Add -fdn to llama-bench

* More CUDA fused delta net optimizations

* CPU optimizations

* Much faster fused delta-net on the CPU

It seems it is faster than the chunked implementation!

* Change meaning of fdn from bool flag to threshold value

* Use eps = 1e-6

* Give some nodes a name

Fix KT quantization yet again (ikawrakow#1321)

* Fix KT quantization yet again

* Add same 1e-16f check for all quants in iqk_quantize.cpp

* Fixes for k-quants

* Also this one

server: enable checkpoint for recurrent models (ikawrakow#1310)

* server: enable checkpoint for recurrent models

create checkpoint after cancel

fix ban string and rm context during rewind

add checkpoint interval

only save recurrent cache

* save checkpoint during pp

---------

Co-authored-by: firecoperana <firecoperana>

Faster quantization for MoE models with many experts (ikawrakow#1322)

Fused delta net 2 (ikawrakow#1320)

* Revive fused delta-net

* Add command line argument for fused delta net

* Simplify/improve CUDA delta-net

* Add -fdn to llama-bench

* More CUDA fused delta net optimizations

* CPU optimizations

* Much faster fused delta-net on the CPU

It seems it is faster than the chunked implementation!

* Change meaning of fdn from bool flag to threshold value

* Use eps = 1e-6

* Give some nodes a name

* Don't re-apply L2 norm - it has already been done

* This seems quite a bit better

* More tweaks

* Restore per context buffer size log

Not everybody uses models split into 2000 parts, and those who do actually want to see the buffer sizes.

Adding support for dense Qwen-3.5 models (ikawrakow#1326)

add directio to llama-bench

2 participants