ggml: Add initial MXFP6 CPU implementation#22671
Conversation
|
I'd prefer a layout with a larger block size that doesn't require repacking. |
This comment was marked as outdated.
This comment was marked as outdated.
|
Hi @jeffbolznv Thanks for the suggestion, I went back at it and now have a slightly faster implementation, but without using any repack. It's still keeping my preloaded tile idea, which is the fastest, and we can put those right into the gguf and not really an issue. 50+ variations of CUDA trials running all day found best tile was just doubling the layout of the previous tile. (so now 832B). I've got CPU rewritten to use that directly now too. On CUDA, with the bigger block and some new tweaks, it gained about 5% for both pp512/tg128 Qwen3.5-4b. the model loads ~1.5s faster without needing the repack. About 18,200 pp512, 191tg. Qwen3.6-35A-A3B is about 7500/170. |
I don't thing you can have high perf without repacking. And the block size is really related to hardware, tensor sizes. |
|
@Djip007 interestingly enough, the newer "tile-in-gguf" layout on cpu is also faster: |
Didn't know what CPU you have, and if it is for the Qwen3.5-4b. but for me look really slow. _mm512_dpwssd_epi32 can use dot2 on signed int16: so base bloc is A[M/M0][K/K0][M0=16][K0=2] this are for AVX512... |
|
Pushed a new refactored version. No more block layout and no more repack into tiles. The weight and input scale are included in the tile header as well, so there's no more code plumbing needed to load up separate tensors or deal with any derived tensor pathway. They are just easily factored in directly in the vecdot. |
|
nice! did you manage to store the in the gguf file? or have an idea for that? (I think of what we can do for FP8 ... ) |
Yes |
I thing at native E4M3 models, but can be good too. (I had some CPU/RDNA3 kernel for that, but did not finish (start) the gguf file. I "only" had some on load convertion from BF16 model to packed E4M3 / E3M4 , with different scaling schema.
Yes! 👍 |
|
For the weight_scales / input_scales |
|
No, they actually are at the front/top, the header
One |
|
To clarify just a bit more about why this is here: These are placeholders in the GGUF, but not actively used in the CPU version, so it appears it doesn't do anything now. They get filled on the CUDA version only.
|
Ho. That what I do not like (and gg used not too) because this mean model graph depend on quantized type used (ie: all model have to be update). but it is more for #22042 or some other Discussions 1 point : be careful with this header on mmap file... I don't know if you can write it. (for CPU tensors or similare) |
We can certainly make the CPU side do the same as CUDA, it already handles the mmap issue and does much of the math (maybe as you think we could improve this further, as far as coming up with the best GGUF layout). Perhaps I need clean up my CUDA MXPF6 side and post that code to a fork for relevance, but this how I handle that there: |
const ggml_tensor * weight_scale_t = tensor->src[0];
const ggml_tensor * input_scale_t = tensor->src[1];Is the scale define as source of the weight tensor? |
Yes! and This does not mess up any of the other models so there is no need to change anything for anything else (unless improvements for future types) **correction, this was when using repack, after the new layout, newer cuda version doesnt need op_params for a placeholder anymore |
That's what I missed! |
62dd50c to
86d1b6e
Compare
|
Uploaded new MXFP6 models onto HF: To demonstrate how the no-repack tile layout stored in the GGUF can work without big GGML changes, and keeping the scales in the same tile as the CPU version, I posted the CUDA implementation here (not for PR and not clean enough, but as POC): This converts q8_1 tiles to FP8 for MMVQ, so doing native Blackwell MXFP6 x FP8 MMA. This makes MXFP6 almost 20% faster than NVFP4 for tg on this model while keeping the extra quality. Hoping for any feedback on making this even better! The ppl/kld delta on Qwen3.6-27b vs NVFP4 is really striking! MXFP4 vs BF16: and for NVFP4: |
b6f701a to
5789082
Compare
|
Thanks for the excellent MXFP6 work. I was impressed by this PR and the related discussion, so I tried an experimental ROCm implementation This is not PR-ready yet, but I wanted to share some early numbers in case they are useful. Environment:
Speed results: quant size pp128 pp512 pp1024 tg128 Short WikiText-2 perplexity check, ctx 512, chunks 64: quant PPL A few observations from the RDNA4 side:
My current impression is that a hardware-neutral GGUF MXFP6_E2M3 layout plus backend-specific load-time repacking is probably the right I will keep experimenting before publishing code, but I wanted to share these early ROCm/RDNA4 data points. |
Thanks @kominsoo appreciate the kind words. A few questions:
I am actually working on a post now in the discussion forum to talk about the new NVFP4/MXFP6 quantizer with a new technique, you can check that in a bit. It needs the tensor scale to work best, you can give it a try with advanced-gguf-quantizer. I think the best final use case will be combining NVFP4 and MXFP6 to balance out each other as needed. |
|
Thanks, those are exactly the points I need to check next. I did use your mxfp6-cuda branch as reference, but I want to verify whether my |
|
Following up now that I've verified actual generation output, not just bench numbers — and I owe a correction: the Q: Tensor scales? Q: Reference to your CUDA branch? Q: How was TG optimized? Corrected & Optimized Benchmark Numbers (Tested on AMD Radeon AI PRO R9700, gfx1201, ROCm 7.2, |
That's great! Thanks for this data. I was minutes ago just now working on improving NVFP4 tg speeds on the repack version by trying FP8 as I had done with MXFP6, while you replied. I also found that dp4a Q8 was much better for single token than the tile path, the same as you were seeing here, but got around it by making a row shadow for MMVQ. Maybe that could help for ROCm too: |
dbdd3e3 to
d6c4c96
Compare

I'm bringing in this PR an initial implementation with basic
MXFP6-E2M3support for CPU only.MXFP6-E3M2is another FP6 variant; support for that can be added later.Native CUDA/Blackwell support is working and is intended to follow after this PR. It is possible to create native AMD ROCm versions in the future, as well as versions for other backends.
The first model is on HF here: Qwen3.5-4B-MXFP6-GGUF. More are ready and will be uploaded soon.
Related discussion here: #22498
This PR does not include any quantizer besides that used for reference and testing.
Why add MXFP6 into llama.cpp now?
Details
However, essentially there isn't an implementation anywhere that is fast or optimized for it, so it not feasible to use the format (until now). NVIDIA Blackwell GPUs have native hardware blockscaling acceleration - but no kernels with native hardware support were ever released by anyone (as far as I know).
Qwen3.6-35B-A3B-GGUFQ6_K is 28.5GB and has only slightly better quality. An MXFP6 model is approximately 27GB and is faster on prefill, already with very early not-well tuned kernels.Qwen3.5-4B PPL/KLD/Speed Report:
Details
CPU final PPL vs BF16 had a difference of-0.012933(CPU ):Final estimate: PPL = 9.9361 +/- 0.07125(CUDA):Mean PPL(Q): 9.904247 ± 0.070767BF16:Mean PPL(Q): 9.949033 ± 0.071570The negative ppl score is odd but likely because of tuning and running ppl on the wiki2 dataset.CUDA Final Ppl Delta vs BF16:0.044786(Note, MXFP6xMXFP6 loss during MMQ vs MXFP6xQ8)~~CUDA Final Ppl Delta vs CPU:
0.031853~~Full KL Divergence was only done on CUDA for speed consideration. CUDA is outside scope of this PR but is the primary intended use target, so results shown for reference for what to expect. Q8 and Q6_K still are better in quality, but MXFP6 win on prefill speed (especially combined with NVFP4) and there is still much more room to improve both.
Q8
pp512: 16802, tg128: :200:'Ppl:9.956220,mean:1.001236, Mean kld: 0.001959,max: 11.359487, top p: 97.554%, RMS Δp 1.299%`
MXFP6
pp512: 17822, tg128: 180:Ppl:9.904247,mean:0.996009, Mean kld: 0.021651,max: 17.561033, top p: 92.635%, RMS Δp 4.026%MXFP6 (188 MXFP6 layers, 13 NVFP4 )
pp512: 21166, tg128: 242Ppl: 10.161346,mean: 1.021864 , Mean kld:0.045214,max: 21.575535, top p: 88.191%, RMS Δp 5.787%NVFP4 (Same imatrix/tuner as above, plus 4over6):
pp512: 21312, tg128: 245Ppl:10.287908,mean:1.034592, Mean kld: 0.082396,max: 21.185059, top p: 86.785,RMS Δp 7.725%Q6_K:
pp512: 14341, tg128: 227Ppl: 9.992565,mean: 1.004891, Mean kld: 0.005779, max: 10.447377, top p:96.218%, RMS Δp 2.194%NVFP4 (Converted HF/ModelOpt):
pp512: 20239, tg128: 224Ppl:10.838460,mean:1.089958, Mean kld: 0.104422,max: 21.996130, top p: 85.184%,RMS Δp 8.815%Q4_K:
pp512: 16137, tg128: 283Ppl:10.395488,mean:1.045411, Mean kld: 0.046251,max: 19.578995, top p: 90.415%,RMS Δp 5.772As seen from above data, there is a lot of promise for how we can use and leverage MXFP6. There remains much undetermined or near infinite flexibility how to optimize
MXFP6alone or withNVFP4orQ8to balance speed vs quality. Let's stay focused for now, but I have already combined NVFP4/MXFP6 with MXFP8, and using FP8 activations with the.f4f6f8mma can further improve speed.Newest Full Qwen3.5-4B kld log:
Details
Block design with Repack:Details
The on disk GGUF layout is as follows:You might think this seems inefficient, especially for the 8 bit containers that Blackwell needs. But there are a few reasons it was kept this way. It is necessary to explain the CUDA block design now (outside the scope of this PR, but relevant since that is the intended use case) to explain.~~The CUDA layout is dynamic and is decided at loadtime with a fast repack. It is using the same proposed NVFP4/Blackwell "repack-mma into Blackwell layout tiles" as shown elsewhere.
So for the gguf layout,
block_mxfp6is kept row major, and as small as possible. It can be repacked the same way on load to other backends and hardware designs in the future without needing to ever change the disk layout again; the CUDA layout can also be improved upon and changed over time as new hardware or superior optimizations come up with something better. ~~~~For MXFP6, a 3 lane 416 byte tile was tested and determined to be the most optimal for now:
block_mxfp6into tiles during inferenceEg, for
Qwen3.6-35A3B-A3B-MXFP6:A perfect AoSoA aligned 544-byte tile needs a 34609 MiB CUDA buffer. For a 32606 MiB card this offloads. On the 416-byte tile layout, this is 26,467 MiB/CUDA, 379 MiB/CPU. Testing shows staging to a tile would be 2.255ns/mma vs 1.485 ns/mma, so about 30% slower. It would add complexity but a future potential option would be to determine the VRAM at loading, then choosing which what tile layout to use depending on the model size.
So the tile version is:~~
How will this work when the Blackwell MMA PTX needs an 8-bit container?
To get around this "2 bits taking up VRAM" with 0s, immediately at MMA we insert the 0s at that moment:~~
So the required padding is added to registers just in time, but the rest of the time in VRAM is the 3 lane compact form. The optional tensor scales are applied as derived tensors directly into the vecdot, protecting precision and quality.Detailed experimental results on layout choices and speed:Faster Tile Layout Now in GGUF, No Repack Needed
Details
Refactored version now maintains the tile layout directly on disk. Repack is not necessary; the layout is already interleaved and quantized directly into tiles ready to go for MMA. The CPU version reads from the tile layout. Small adapter keeps this working with GGML functions expecting block/rows. Extensive testing thus far found this to be the fastest layout on CUDA and no traditional `load_tiles`, `load_ldmatrix`, `cpasync` etc is needed, the tile goes directly into registers. ``` define QK_MXFP6 64 #define QK_MXFP6_SUB 32 #define QK_MXFP6_PACKED_BYTES 24 #define MXFP6_TILE_ROWS 16 #define MXFP6_TILE_FRAGS 2 #define MXFP6_TILE_LANES 32 #define MXFP6_TILE_PAYLOADS 3 #define MXFP6_TILE_BYTES 832 #define MXFP6_ROW_BYTES (MXFP6_TILE_BYTES / MXFP6_TILE_ROWS)typedef struct GGML_ALIGN(16) {
uint32_t lane[MXFP6_TILE_LANES][MXFP6_TILE_PAYLOADS];
uint8_t scale[MXFP6_TILE_LANES];
} tile_mxfp6_frag;
typedef struct GGML_ALIGN(16) {
tile_mxfp6_frag frag[MXFP6_TILE_FRAGS];
} tile_mxfp6;
struct ggml_tensor;
typedef struct {
const struct ggml_tensor * tensor;
const void * tile;
int64_t row;
int64_t channel;
} ggml_tile_to_row_ref;
typedef struct GGML_ALIGN(16) {
float weight_scale;
float input_scale;
const float * weight_scales;
const float * input_scales;
#if !defined(__cplusplus)
tile_mxfp6 tiles[];
#endif
} tensor_mxfp6;
./test-backend-ops -p "mxfp6" -b CPU
... previous tests ...
ch_dims=[1,1]): OK
MUL_MAT_VEC_FUSION(type=mxfp6_e2m3,glu_op=1,m=1,n=32,k=256,use_id=1,n_mats=16,n_used=8,b=1,with_bias=1,with_gate=1,batch_dims=[4,2]): OK
MUL_MAT_VEC_FUSION(type=mxfp6_e2m3,glu_op=1,m=1,n=32,k=256,use_id=1,n_mats=16,n_used=8,b=1,with_bias=1,with_gate=1,batch_dims=[1,1]): OK
192/192 tests passed
Backend CPU: OK
2/2 backends passed
OK
.Ran both ci-cpu and ci-cuda to check the model prior to posting.
100% tests passed, 0 tests failed out of 44
....etc...
real 0m35.022s
user 0m42.320s
sys 0m8.429s
Label Time Summary:
main = 23.61 sec*proc (42 tests)
..... PASS