Using MXFP6 to improve NVFP4 on llama.cpp #22498
michaelw9999 started this conversation in Ideas
Replies: 1 comment 2 replies
From what you have posted, it looks like Q4_K already provides better quantization benefits, with roughly the same PP and vastly better TG.
MXFP6 is something I've been exploring along with NVFP4, so I wanted to share data from a few experiments I did and hear thoughts/consensus.
It seems this is rarely discussed (though vLLM has support), and MXFP6 may look like yet another quant, but it is supported on Blackwell as well as AMD's datacenter MI350, and it is an official OCP microscaling standard, so its use case is multi-platform. I'm not sure about AMD's or Intel's future plans for bringing it to consumer GPUs. There's a nice AMD MXFP6 write-up here, so it seems they are promoting it, or at least were. External studies have shown that FP4/FP6 combinations can seriously improve quality over NVFP4 alone, possibly without much performance loss, and I wanted to see whether that is something we could do in llama.cpp for NVFP4 on Blackwell hardware.
PR #16777 looked at bringing in an MXFP6_MoE type about 6 months ago. That never materialized, and a question at the time was "what is the benefit?" That is hard to argue with: right now there are essentially no models available for it (Hugging Face has 3 outdated ones), so at the moment it remains quite "useless". That was true then and remains true now.
But now that we have NVFP4, MXFP6 could be "next" for keeping fast performance while further improving model quality. It could either stand on its own or be used in combination with NVFP4 (or anything else).
I created some working POC implementations to see what the data might show about its potential usefulness. Consider these preliminary, unoptimized results, meant only for deciding whether to pursue this further. I used llama-quantize with Qwen3.5-4B and no imatrix to produce MXFP8 (UE8M0), MXFP6 (E2M3), and a combined NVFP4/MXFP6 quant, then ran them against the same BF16 KLD baseline along with Q4_K, Q6_K, and Q8.
MMQ was kept as NVFP4 x NVFP4, or MXFP6 x MXFP6, or MXFP8 x MXFP8; MMVQ was kept as NVFP4/MXFP6/MXFP8 x Q8.
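For anyone unfamiliar with the format, here is a minimal Python sketch of what an MXFP6 (E2M3) block looks like numerically: 32 elements share one power-of-two scale, and each element is rounded to one of the 32 FP6 E2M3 magnitudes. This is not the POC code and not llama.cpp's implementation; the function names are hypothetical, the scale selection follows my reading of the OCP MX convention (largest power of two not exceeding the block maximum, divided by the largest element exponent), and tie-breaking is simplified.

```python
import math

BLOCK = 32        # elements sharing one scale in an MX block
E2M3_EMAX = 2     # exponent of the largest E2M3 normal: 7.5 = 1.875 * 2**2


def e2m3_magnitudes():
    """All non-negative values representable in FP6 E2M3 (bias 1, no inf/NaN)."""
    vals = []
    for e in range(4):
        for m in range(8):
            if e == 0:
                vals.append(m / 8.0)                      # subnormals: 0 .. 0.875
            else:
                vals.append(2.0 ** (e - 1) * (1 + m / 8.0))
    return vals


E2M3 = e2m3_magnitudes()   # ascending, 32 entries, largest is 7.5


def quantize_block(block):
    """Return (shared_exp, elements) where elements are E2M3-representable floats."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 0, [0.0] * len(block)
    # frexp gives amax = m * 2**e with 0.5 <= m < 1, so floor(log2(amax)) == e - 1.
    shared_exp = (math.frexp(amax)[1] - 1) - E2M3_EMAX
    shared_exp = max(-127, min(127, shared_exp))          # range of an 8-bit exponent scale
    scale = 2.0 ** shared_exp
    q = []
    for v in block:
        mag = min(abs(v) / scale, E2M3[-1])               # saturate at 7.5
        # Nearest representable magnitude; ties go to the smaller value (simplification).
        nearest = min(E2M3, key=lambda c: abs(c - mag))
        q.append(math.copysign(nearest, v))
    return shared_exp, q


def dequantize_block(shared_exp, q):
    scale = 2.0 ** shared_exp
    return [scale * x for x in q]


# 32 elements * 6 bits + one 8-bit scale, per 32 elements.
MXFP6_BPW = (BLOCK * 6 + 8) / BLOCK
```

Values that are already representable survive a round trip exactly; everything else lands within half a step of the shared scale, except values just under the next power of two, which saturate at 7.5 times the scale.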
*11.55 from future NVFP4 version
For the NVFP4-MXFP6 mixed-precision quant, I moved just a few layers off NVFP4 to MXFP6 as a quick ad hoc first test. Note the reduction in ppl; kld/top-p went down.
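As a reminder of what these metrics compare, here is a rough Python sketch. It assumes that kld means the KL divergence of the quantized model's next-token distribution from the BF16 baseline's, averaged over token positions, and that top-p refers to how often both models agree on the most likely token. Both readings are assumptions on my part, and the helper names are hypothetical rather than taken from any llama.cpp tool.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one position's logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def kl_divergence(base_logits, quant_logits):
    """KL(base || quant) for one token position, in nats."""
    lp = log_softmax(base_logits)
    lq = log_softmax(quant_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))


def same_top_token(base_logits, quant_logits):
    """True when both models rank the same token first."""
    top = lambda xs: max(range(len(xs)), key=xs.__getitem__)
    return top(base_logits) == top(quant_logits)


def summarize(base_positions, quant_positions):
    """Mean KLD and top-token agreement rate over a list of positions."""
    pairs = list(zip(base_positions, quant_positions))
    mean_kld = sum(kl_divergence(b, q) for b, q in pairs) / len(pairs)
    agree = sum(same_top_token(b, q) for b, q in pairs) / len(pairs)
    return mean_kld, agree
```

Under this reading, a lower mean KLD and a higher agreement rate both mean the quantized model tracks the BF16 baseline more closely.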
NVFP4 here is the prefill speed winner followed by NVFP4/MXFP6.
*NVFP4 and MXFP6 via repack AoSoA
*MXFP8 mma via f32.e4m3.e4m3.f32.ue8m0
*MXFP6 mma via f32.e2m3.e2m3.f32.ue8m0
From this preliminary data, MXFP6 appears to be a strong contender: much better ppl/kld than NVFP4 without much increase in model size (depending on which layers are moved; MXFP6 = 6.25 BPW), and, without any tuning effort yet, already faster than Q6/Q4_K (ignoring tg, which still needs optimizing). So it seems feasible to improve NVFP4 model quality by keeping some layers as MXFP6 without significantly increasing size or losing performance (at least in this one example), and this could be worth pursuing. Conversely, some tensors that would not otherwise be suitable for quantizing down to NVFP4 might work as MXFP6, perhaps making the model smaller, so testing would be needed to find the right balance.
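To put rough numbers on the size trade-off, here is a small Python sketch of effective bits per weight for a mixed model. The MXFP6 figure of 6.25 BPW is the one quoted above; the NVFP4 entry assumes 4-bit elements with one 8-bit scale per 16 elements and ignores any per-tensor scale, and the 90/10 split in the usage note is purely hypothetical, not the split used in the experiment.

```python
# (element bits, block size, scale bits) per block format.
FORMATS = {
    "NVFP4": (4, 16, 8),   # assumption: one 8-bit scale per 16 elements
    "MXFP6": (6, 32, 8),   # 6.25 BPW, as quoted above
    "MXFP8": (8, 32, 8),
}


def bpw(fmt):
    """Bits per weight of a single block format, including its block scale."""
    elem_bits, block, scale_bits = FORMATS[fmt]
    return elem_bits + scale_bits / block


def mixed_bpw(fractions):
    """Weighted BPW for a model; fractions maps format -> share of weights."""
    if abs(sum(fractions.values()) - 1.0) > 1e-9:
        raise ValueError("weight fractions must sum to 1")
    return sum(share * bpw(fmt) for fmt, share in fractions.items())


# Hypothetical split: 90% of weights stay NVFP4, 10% move to MXFP6.
example = mixed_bpw({"NVFP4": 0.9, "MXFP6": 0.1})
```

With that hypothetical split the model lands at roughly 4.675 BPW versus 4.5 for pure NVFP4, which is the sense in which moving a few layers costs little size.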