Using MXFP6 to improve NVFP4 on llama.cpp #22498

michaelw9999 · 2026-04-29T04:25:46Z

michaelw9999
Apr 29, 2026

I've been exploring MXFP6 alongside NVFP4, so I wanted to share data from some experiments I ran and hear thoughts/consensus.

It seems this is rarely discussed (though vLLM has support), and MXFP6 may look like yet another quant, but it is supported on Blackwell as well as AMD's datacenter MI350, and it is an official OCP microscaling standard, so its use case is multi-platform. I'm not sure about AMD's or Intel's future plans for incorporating it on consumer GPUs. There's a nice AMD MXFP6 write-up here, so it seems they are promoting it, or at least were. External studies have shown that FP4/FP6 combinations can seriously improve quality over NVFP4 alone, possibly without much performance loss, and I wanted to see whether we could do that in llama.cpp for NVFP4 on Blackwell hardware.

PR #16777 looked at bringing in an MXFP6_MoE type about 6 months ago. That never materialized, and a question at the time was "what is the benefit?" That is hard to argue for: right now there are essentially no models available for it (Hugging Face has 3 outdated ones), so for the moment it remains quite "useless". That was true then and remains true now.

But now that we have NVFP4, MXFP6 could be "next" for keeping fast performance while further improving model quality. It could either stand on its own or be used in combination with NVFP4 (or anything else).

I created some working POC implementations to see what the data might show about its potential usefulness. Consider these preliminary, unoptimized results, meant only for deciding whether to pursue this further. I used llama-quantize with Qwen3.5-4B, no imatrix, to produce MXFP8 (UE8M0 scales), MXFP6 (E2M3), and a combined NVFP4/MXFP6 quant, and then ran them against the same BF16 KLD baseline, along with Q4_K, Q6_K, and Q8_0.
MMQ was kept as NVFP4 x NVFP4, or MXFP6 x MXFP6, or MXFP8 x MXFP8; MMVQ was kept as NVFP4/MXFP6/MXFP8 x Q8.

Type	Mean PPL	Mean KLD	Same top-p
NVFP4	11.557739*	0.091930	86.667%
NVFP4-MXFP6	11.279966	0.095233	85.797%
Q4_K	11.238092	0.041044	91.176%
MXFP6	10.909077	0.014601	93.750%
MXFP8	10.860777	0.013564	94.056%
Q6_K	10.819785	0.006159	96.483%
Q8_0	10.817675	0.002344	97.537%

*11.55 from future NVFP4 version
For the NVFP4-MXFP6 mixed precision, I moved just a few layers off NVFP4 to MXFP6 as an ad hoc experiment, purely as a first quick test. Note the reduction in PPL; however, KLD rose slightly and same top-p dropped slightly.
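
For anyone unfamiliar with the columns in the table above, here is a minimal sketch of how a mean KLD and a "same top" rate can be computed from per-token logits of a reference model and a quantized model. It only illustrates the metrics and is not llama.cpp's implementation. The function names and toy inputs are hypothetical, and it assumes "Same top-p" means the share of positions where both models give the highest probability to the same token.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one token position."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) in nats; p is the reference (e.g. BF16) distribution."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

def compare_models(ref_logits, quant_logits):
    """Return (mean KLD, percentage of positions with the same top token)."""
    klds = []
    same_top = 0
    for ref, quant in zip(ref_logits, quant_logits):
        p, q = softmax(ref), softmax(quant)
        klds.append(kl_divergence(p, q))
        if argmax(p) == argmax(q):
            same_top += 1
    n = len(klds)
    return sum(klds) / n, 100.0 * same_top / n
```

A lower mean KLD and a higher same-top percentage both mean the quantized model tracks the reference more closely. PPL can move independently of both, which is one way a row can show a better PPL but a worse KLD.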

NVFP4 here is the prefill speed winner followed by NVFP4/MXFP6.

Type	Size	pp512 (t/s)	tg128 (t/s)
NVFP4	3.06 GiB	20228.69 ± 81.76	221.54 ± 0.72
NVFP4-MXFP6	3.10 GiB	18929.66 ± 33.14	182.52 ± 5.99
Q8_0	4.16 GiB	17916.18 ± 13.14	210.31 ± 0.63
MXFP6	3.10 GiB	17203.35 ± 35.06	145.61 ± 5.29
Q4_K	2.51 GiB	17042.89 ± 57.17	270.72 ± 1.49
Q6_K	3.22 GiB	15226.51 ± 20.72	239.46 ± 0.41
MXFP8	4.16 GiB	14826.69 ± 41.53	182.81 ± 0.53

*NVFP4 and MXFP6 via repack AoSoA
*MXFP8 mma via f32.e4m3.e4m3.f32.ue8m0
*MXFP6 mma via f32.e2m3.e2m3.f32.ue8m0

It appears from this preliminary data that MXFP6 is a strong contender: much better PPL/KLD than NVFP4 without much increase in model size (this depends on which layers are moved; MXFP6 = 6.25 BPW), and, even without any tuning effort, faster prefill than Q6_K/Q4_K (ignoring tg, which still needs optimizing). So it seems feasible to improve NVFP4 model quality by keeping some layers as MXFP6 without significantly increasing size or losing performance (at least from this one example), and this could be worth pursuing. Conversely, some tensors that would otherwise not be suitable for quantizing down to NVFP4 might work as MXFP6, perhaps making the model smaller, so testing would be needed to find the right balance.
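
As a quick check on the 6.25 BPW figure, here is a small sketch of the bits-per-weight arithmetic for block-scaled formats. It assumes tightly packed elements with one 8-bit scale per block (32 elements for the MX formats, 16 for NVFP4) and ignores any per-tensor scale or padding, so actual file sizes will differ somewhat. The helper names are hypothetical.

```python
# (element bits, block size, scale bits) per format; packed-storage assumption.
FORMATS = {
    "MXFP4": (4, 32, 8),
    "NVFP4": (4, 16, 8),
    "MXFP6": (6, 32, 8),
    "MXFP8": (8, 32, 8),
}

def bits_per_weight(name):
    """Element bits plus the block scale amortized over the block."""
    element_bits, block_size, scale_bits = FORMATS[name]
    return element_bits + scale_bits / block_size

def mixed_bpw(frac_mxfp6):
    """Average BPW when a fraction of the weights is MXFP6 and the rest NVFP4."""
    return (1.0 - frac_mxfp6) * bits_per_weight("NVFP4") + frac_mxfp6 * bits_per_weight("MXFP6")

print(bits_per_weight("MXFP6"))  # 6.25
print(bits_per_weight("NVFP4"))  # 4.5
```

Under these assumptions each weight moved from NVFP4 to MXFP6 costs 1.75 extra bits, so the size impact of a mixed model scales directly with the fraction of weights promoted.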

am17an · 2026-04-29T16:07:42Z

am17an
Apr 29, 2026
Collaborator

From what you have posted it looks like Q4_K already provides better quantization benefits with roughly the same PP and vastly better TG.

2 replies

michaelw9999 Apr 29, 2026
Author

Q4_K is fast in isolation, but with all else being equal on the nominal quantizer settings (e.g., not messing with MOSTLY/new_type shifts), the difference in quality is really quite striking:

Q4_K	11.238092	0.041044	91.176%
MXFP6	10.909077	0.014601	93.750%

TG could still be optimized, so I would take any speeds with a grain of salt. If Q8 is kept for MMVQ, it could probably come up substantially (or by making a better MXFP8 kernel).
However, what I shared was still just MXFP6 x MXFP6.

What remains to be tried is taking advantage of hardware MXFP4 x MXFP6 block scaling with kind::mxf8f6f4 instructions, e.g.:

mma.sync.aligned.m16n8k32.row.col.kind::mxf8f6f4.block_scale.scale_vec::1X.f32.e2m3.e2m1.f32.ue8m0
  {%Rd0, %Rd1, %Rd2, %Rd3},
  {%Ra0, %Ra1, %Ra2, %Ra3},
  {%Rb0, %Rb1},
  {%Rc0, %Rc1, %Rc2, %Rc3},
  scaleAData, {0, 1}, scaleBData, {0, 1};

Note that this only supports UE8M0 as the scale type for FP4; it remains to be seen how that could work in a mixed NVFP4 case. (As for MXFP8 in the prior post, those runs were not using .kind::mxf8f6f4.)
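
To make the role of the ue8m0 scales concrete, here is a rough numerical model of what a block-scaled multiply-accumulate computes for one output element along one K slice. It is only a sketch: it ignores fragment layout, scale selectors, and hardware rounding, it takes already-decoded element values as plain floats, and the helper names are hypothetical.

```python
def decode_ue8m0(byte):
    """Decode an exponent-only 8-bit scale: 2**(byte - 127), with 0xFF meaning NaN."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("scale must fit in one byte")
    if byte == 0xFF:
        return float("nan")
    return 2.0 ** (byte - 127)

def block_scaled_dot(a_vals, a_scale_byte, b_vals, b_scale_byte, acc=0.0):
    """acc + (scale_a * scale_b) * sum(a_i * b_i) for one block of K elements."""
    if len(a_vals) != len(b_vals):
        raise ValueError("A and B blocks must have the same length")
    scale = decode_ue8m0(a_scale_byte) * decode_ue8m0(b_scale_byte)
    return acc + scale * sum(x * y for x, y in zip(a_vals, b_vals))

# Toy usage: A block scaled by 2**1, B block scaled by 2**-1.
print(block_scaled_dot([1.5, -3.0], 128, [2.0, 0.5], 126))  # 1.5
```

Because both scales are pure powers of two, applying them is only an exponent adjustment. That is also why a non-power-of-two scale, such as the FP8 scale NVFP4 uses, does not map directly onto this path.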

I am testing an idea similar to the "4 over 6" idea for NVFP4: making a smarter quantizer that can make a choice:
*FP6 improved scale selection with a quick search (as we've easily done for NVFP4)
*FP4 promotion: if a normalized FP4 value lands around a non-existent 5.0, or otherwise has high SSE, use FP6 instead.
*FP6 demotion: if FP4 can provide the same result, use FP4. In some cases, when everything lines up, the error on FP4 is 0.0, so keeping FP6 just wastes bytes.
It would have to determine how and when to aggregate across the tensor or tile to keep things moving fast.
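
The promotion/demotion idea can be sketched roughly as below. This is a toy per-block chooser, not the quantizer being tested: it assumes power-of-two scales for both formats (so the FP4 side behaves like MXFP4 rather than NVFP4), tries only three neighbouring scales as the "quick search", and decides purely on SSE with a hypothetical tolerance `tol`. All names are made up for illustration.

```python
import math

# Non-negative magnitudes representable by each element format.
FP4_E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP6_E2M3 = [m / 8 for m in range(8)] + [
    (1 + m / 8) * 2 ** (e - 1) for e in range(1, 4) for m in range(8)
]

def quantize_block(block, grid, scale):
    """Round each value to the nearest grid point at this scale; return (dequantized, SSE)."""
    deq = []
    for x in block:
        q = min(grid, key=lambda g: abs(abs(x) / scale - g))  # saturates at the grid maximum
        deq.append(math.copysign(q * scale, x))
    sse = sum((x - y) ** 2 for x, y in zip(block, deq))
    return deq, sse

def best_sse(block, grid):
    """Quick search over a few neighbouring power-of-two scales; return the lowest SSE."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 0.0
    base = math.floor(math.log2(amax)) - 2  # both grids top out in the 2**2 binade
    return min(quantize_block(block, grid, 2.0 ** e)[1] for e in (base - 1, base, base + 1))

def choose_format(block, tol=0.0):
    """Use FP4 when it is as good as FP6 (within tol); otherwise promote to FP6."""
    sse4 = best_sse(block, FP4_E2M1)
    sse6 = best_sse(block, FP6_E2M3)
    return "fp4" if sse4 <= sse6 + tol else "fp6"

print(choose_format([6.0, -3.0, 1.5, 0.5]))  # fp4: every value sits on the FP4 grid
print(choose_format([5.0, 1.0]))             # fp6: 5.0 falls between FP4's 4.0 and 6.0
```

A per-block choice like this would still need to be aggregated per tile or tensor, since mixing element widths at block granularity complicates the kernel layout.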

am17an Apr 30, 2026
Collaborator

I was talking about 4 bit quantization
