ggml : add NVFP4 quantization type support #19769
Force-pushed from 9cd0f58 to 86dd3fc
As is clearly laid out in the llama.cpp contributing guidelines:
I would really love NVFP4 support and I appreciate the work done here, but as @JohannesGaessler has already mentioned, the ratio of maintainer-needed work to verified information is way too high with this PR. Please:
It would be great if nvfp4 could be stored in larger blocks that are at least a multiple of 4B (16B would be better).
I agree that memory alignment is relevant; as long as the tensor dimensions are multiples of e.g. 256, it should be feasible to permute the data upon load though (except for maybe CPU+GPU hybrid inference, where the overhead could be relevant).
Btw, @pwilkin these are not really necessary for NVFP4 - adding support for this data type would not depend on the outcome of these. They are good for sanity checks, but other than that do not matter much. The main use case of NVFP4 is to load models that are already trained in that format - not to quantize models with it. Regarding the alignment - I guess we can make blocks of 256, which would result in an alignment of 16 bytes. Though we risk not being able to load tensors with a dimension that is not a multiple of 256. There was the same dilemma for MXFP4, and gpt-oss unfortunately has shapes that are only divisible by 64 but not 256.
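As a quick back-of-the-envelope check on the "blocks of 256 → 16-byte alignment" arithmetic, here is a hypothetical super-block layout; the struct name and field order are illustrative only, not ggml's actual definitions:

```c
#include <stdint.h>

// Hypothetical 256-element NVFP4 super-block (not ggml's actual layout):
// 16 sub-blocks of 16 elements, one UE4M3 scale byte per sub-block,
// two 4-bit (E2M1) elements packed per byte.
typedef struct {
    uint8_t scales[16];      // 16 sub-block scales      -> 16 bytes
    uint8_t qs[256 / 2];     // 256 x 4-bit values       -> 128 bytes
} block_nvfp4_256;           // 144 bytes total, a multiple of 16

// A bare 16-element block is 1 + 8 = 9 bytes, which only guarantees 1-byte
// alignment; grouping into 256-element super-blocks is what buys the 16-byte
// alignment, at the cost of requiring dimensions divisible by 256.
// (The 64-element layout adopted later in the thread is 4 + 32 = 36 bytes,
//  hence the 4-byte alignment discussed further down.)
_Static_assert(sizeof(block_nvfp4_256) % 16 == 0, "16-byte alignable");
```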
NVFP4 also has a separate per-tensor float scale which this PR doesn't take into account, unless I'm wrong. Also, this whole PR is pretty much AI generated from what I can see. I had plans to add nvfp4 support after mxfp4, but another developer had promised to do it and since has not delivered, so I will also create a PR for nvfp4 support in the meantime.
@ggerganov I know, but I meant it exactly as a sanity check.
Yeah, I'm pretty frustrated, as I was also thinking about working on it and was hoping this PR goes somewhere, but it seems to be going nowhere so far :/
It's taken into account. And regarding AI: as mentioned in the PR, I leaned on AI and followed the principles and patterns applied in the MXFP4 PR. I'll remove the half-baked backend implementations and stick with NEON + generic CPU implementation for now. Again, this is a WIP which proves the concept and implements a lot of the boilerplate. I'll also increase the block size to 64.
It is not. Please see the f32 scale as presented here: https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/. As a reminder: you are supposed to know the content of the PR even if the PR is written with AI help. See the contributing guidelines.
Addressed these comments. Here are results for Qwen3-4B
Okay, not sure if that works, but if it does then it's great, since it simplifies the implementation quite a bit. The current state of your PR is not OK though; I see random changes in the CUDA and Vulkan code. Can you fix it?
Force-pushed from 5f8f21b to ffab58b
Thanks, I noticed that as well. The problem was a one-time thing from the shelving commit targeting an older master. The PR should be clean now.
@ORippler ping
Sorry for the delayed response, I was busy with #20391. I'd love for us to have 16-byte alignment via AoSoA (we already have an AoSoA (the first array of is simply the pointer to
How is this divisibility problem handled for other formats such as
TLDR: I'd love to get 16-byte alignment, but I know I am obviously late to the party (this PR has already been open for 3 weeks and has gone a long way). Since we can repack for the CUDA backend, I am fine if it's merged as is (though Vulkan and other IHVs that benefit from this alignment would miss out, as it would be a backend-specific implementation). Would still appreciate an answer to my points so I can learn and apply them during repacking (should repacking turn out to be a prerequisite for perf).
Can't seem to dismiss my review due to missing rights, but do consider it dismissed as stale.
I guess the mapping of models to quants is indeed currently sparse? Bummer.
That's exactly it - you cannot use this quant for that tensor then, which obviously is unacceptable for NVFP4. For other formats, another one that fits is chosen instead.
In theory, repacking in the backend should solve the problem. I guess the repack implementation could be shared by multiple backends to avoid duplicated work for the repacking. I guess we could hold off on merging this until we prototype it and make sure there aren't any surprises?
Ouch. :)
No worries, we don't have other alternatives either way, so if the repack does not work out we'll have to live with the 4-byte alignment.
Well, come to think of it, can we not have two NVFP4 quants? One with 16-byte alignment, and this one to fall back on if that won't fit?
Sounds like too much redundancy and extra complexity for not much benefit.
True, let's hope repacking pans out.
4-byte alignment is already quite good. Each CUDA thread reading 4 bytes in a warp leads to a 128-byte transaction, which is ideal.
* WIP: add NVFP4 quantization support
* tests
* improve NVFP4 dot product implementation performance and fix bad super call
* typo
* Use nvfp4 kvalues
* vulkan : fix NVFP4 shader compilation by including kvalues_mxfp4 lookup table
* vulkan and perf fixes
* wip
* Fix metal
* fix vulkan
* Rename threshold & fix wrong scale
* Fix MOE
* Shelf backend implementations (CUDA, Metal, Vulkan, arch-specific SIMD)

  Remove NVFP4 support from GPU backends and architecture-specific optimized dot products. These should be added in separate PRs so backend specialists can review them independently.

  Reverted files:
  - ggml-cuda: common.cuh, convert.cu, mmq.cu/cuh, mmvq.cu, vecdotq.cuh, quantize.cu/cuh, mma.cuh, ggml-cuda.cu, fattn-tile.cuh
  - ggml-metal: ggml-metal.metal, ggml-metal-device.cpp, ggml-metal-impl.h, ggml-metal-ops.cpp
  - ggml-vulkan: ggml-vulkan.cpp, all vulkan-shaders/*
  - ggml-cpu arch: arm/quants.c, x86/quants.c, powerpc/quants.c, s390/quants.c

  Core NVFP4 support (type definition, CPU fallback dot product, quantization, dequantization, conversion) is retained.
* Fix arch-fallback.h: add NVFP4 generic fallback for all platforms

  After shelving backend-specific SIMD implementations, the generic CPU dot product needs to be aliased on ARM, x86, PowerPC, and s390 platforms that previously relied on arch-specific versions.
* quantize: add NVFP4 as a quantization type option
* Fix ggml_fp32_to_ue4m3: handle subnormal values

  Previously, values with ue4m3_exp <= 0 were clamped to 0, causing all small scales to underflow. This made NVFP4 quantization via llama-quantize produce garbage (PPL = 5.8M) since typical transformer weights have amax/6.0 in the range 0.001-0.01, which falls in the UE4M3 subnormal range.

  Now subnormals are properly encoded as man * 2^-9 (exp=0, man=1..7), matching the decode path in ggml_ue4m3_to_fp32.

  Result: NVFP4 requantization now produces PPL = 15.25 (vs F16 = 14.33), comparable to Q4_1 (PPL = 15.81) at slightly lower BPW (4.70 vs 5.15).
* Restore ARM NEON NVFP4 dot product implementation

  Restores the optimized ggml_vec_dot_nvfp4_q8_0 for ARM NEON using vqtbl1q_s8 lookup and ggml_vdotq_s32 dot products.

  tg128 performance: 4.37 t/s (generic) -> 13.66 t/s (NEON) = 3.1x speedup
* Optimize ARM NEON NVFP4 dot product: LUT + vpaddq + vfmaq

  - Add ue4m3_scale_lut[128] to ggml-common.h replacing branch-heavy ggml_ue4m3_to_fp32() in the hot loop
  - Use vpaddq_s32 for pairwise int32 reduction instead of vaddvq_s32
  - Accumulate with vfmaq_f32 into float32x4_t vector accumulators

  tg128: 8.1 -> 31.0 t/s (3.8x speedup, 77% of Q4_1 speed)
* ARM NEON NVFP4: rearrange q8 to match nibble layout

  Alternative approach: rearrange q8 data to match the NVFP4 lo/hi nibble layout instead of rearranging the looked-up NVFP4 values. Eliminates vcombine_s8(vget_low, vget_low) shuffles.

  Performance is equivalent (~18.5 t/s) - the bottleneck is the 2x block overhead from QK=16 vs QK=32, not the shuffle instructions.
* CPU only backend 64 super-block layout
* cleanup
* Remove unused LUT
* int
* exclude NVFP4 from unsupported ops in metal build
* remove quantization for now
* store scales as native UE4M3, preserve original model bits when possible
* Update convert_hf_to_gguf.py

  Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* correct comment
* format
* reduce duplication and cleanup
* Address comments
* move detection to prepare_tensors
* Use math instead of const
* Move
* fix comment
* Shelf quantize tests
* Rebase and move check
* cleanup
* lint
* Update gguf-py/gguf/scripts/gguf_convert_endian.py

  Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Use fallback quant config
* Simplify

  Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* organize
* Refactor
* Update convert_hf_to_gguf.py

  Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update convert_hf_to_gguf.py

  Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update convert_hf_to_gguf.py

  Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* add quantize_nvfp4 (required for test_quants.py)
* add quantize_nvfp4 (required for test_quants.py)
* add quantize_nvfp4 (required for test_quants.py)
* fix return type

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
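To make the `ggml_fp32_to_ue4m3` subnormal fix described in the commit above concrete, here is a hedged sketch of the encode path it describes; the UE4M3 convention (4 exponent bits with bias 7, 3 mantissa bits, no sign) and the rounding details are assumptions of the sketch, not the actual ggml code:

```c
#include <math.h>
#include <stdint.h>

// UE4M3 convention assumed here:
//   normal    (exp > 0):  (1 + man/8) * 2^(exp - 7), smallest normal = 2^-6 ~= 0.0156
//   subnormal (exp == 0):  man * 2^-9,                covering ~0.002 .. 0.0137
// Typical NVFP4 block scales (amax/6.0 ~ 0.001..0.01) fall below 2^-6, so
// clamping exp <= 0 to zero wiped them out; rounding into the subnormal
// range, as the commit describes, fixes that.
static uint8_t fp32_to_ue4m3_sketch(float v) {
    if (!(v > 0.0f)) {
        return 0;
    }
    if (v < 0x1p-6f) {                          // subnormal range: exponent field = 0
        int man = (int) lroundf(v * 0x1p9f);    // round v / 2^-9 to nearest
        if (man > 7) man = 7;
        return (uint8_t) man;                   // bits: 0 0000 mmm, man = 1..7 for typical scales
    }
    int e;
    const float m = frexpf(v, &e);              // v = m * 2^e, m in [0.5, 1)
    int exp = e - 1 + 7;                        // biased exponent for a mantissa in [1, 2)
    if (exp > 15) exp = 15;                     // crude saturation; real code clamps to the max finite value
    int man = (int) lroundf((m * 2.0f - 1.0f) * 8.0f);
    if (man > 7) man = 7;                       // keep the sketch simple: no carry into the exponent
    return (uint8_t) ((exp << 3) | man);        // bits: 0 eeee mmm
}
```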
* 'master' of github.com:ggml-org/llama.cpp: (33 commits)
  - convert : better mtp check and fix return [no ci] (ggml-org#20419)
  - vulkan: fix SSM_CONV PP scaling with large ubatch sizes (ggml-org#20379)
  - New conversations now auto-select the first loaded model (ggml-org#20403)
  - ggml-virtgpu: Fix some build commands (ggml-org#20341)
  - metal : avoid divisions in bin kernel (ggml-org#20426)
  - ci: Setup self-hosted CI for Intel Linux Vulkan backend (ggml-org#20154)
  - vulkan: fix l2_norm epsilon handling (ggml-org#20350)
  - vulkan: fix OOB check in flash_attn_mask_opt (ggml-org#20296)
  - vulkan: Fix ErrorOutOfHostMemory on Intel GPU when loading large models with --no-mmap (ggml-org#20059)
  - opencl: use larger workgroup size for get_rows (ggml-org#20316)
  - opencl: add cumsum op (ggml-org#18981)
  - hip: compile debug builds with -O2 on hip to avoid a compiler bug (ggml-org#20392)
  - common/parser: add GigaChatV3/3.1 models support (ggml-org#19931)
  - model : add support for Phi4ForCausalLMV (ggml-org#20168)
  - graph : add optional scale parameter to build_lora_mm [no ci] (ggml-org#20427)
  - common : fix --n-cpu-moe, --cpu-moe for models with fused gate + up (ggml-org#20416)
  - ggml-webgpu: Add supports for `GGML_OP_REPEAT` (ggml-org#20230)
  - llama : enable chunked fused GDN path (ggml-org#20340)
  - llama : whitespace cleanup (ggml-org#20422)
  - ggml : add NVFP4 quantization type support (ggml-org#19769)
  - ...
For synchronous data copies I agree; for asynchronous copies, chunks of 16 bytes work better in my experience.
I've got the current version working with CUDA, converting to packed SoA (without 4/6 or any fancy stuff), but it's not as fast as it should be (about 13,000 tk/s on Qwen3-4B). Should I post it anywhere, or do we have a thread to discuss follow-up NVFP4 tasks? I'm also having issues converting models and have fixes for the py script. Hope I can contribute something. Thanks
@michaelw9999 I think individual PRs. Small, isolated ones. If improvements are incremental, they should rather be separate PRs IMO. For example, one with basic CUDA support, one for 4/6 and maybe some fancy stuff, etc.
The CUDA code should have the following pieces for basic support: NVFP4 dequantization + cuBLAS, MMVQ support, MMQ support via dp4a, MMQ support via tensor cores. For new contributors, please submit these only as individual and self-contained PRs; for more experienced contributors I think it's fine to do multiple things at once. Fancy stuff should come after that, with evidence that it is an improvement.
Thanks very much for the NVFP4 work!! I found two very interesting NVFP4 models on huggingface:
I tried to convert them to gguf, but both failed.
I was just wondering if these are the kind of models that are intended to work with the NVFP4 support I have seen going into llama.cpp in the last days. If yes, I think I might have a go at trying to figure out why they fail. Not sure I will be able to find out how to fix it, but I'm eager to get my new expensive GPU to run at its best...
Hello, I am getting the error "Quant method is not yet supported: 'modelopt'" when trying to convert NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 (https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4/) to .gguf. Error log: #20411 (comment)
Seems they have per-tensor
Most likely stating the obvious: for the MMVQ and MMQ dp4a paths, it makes sense to do computations in BF16/FP16, as throughput is equal for FP and ALU in CUDA cores and we can save the I2F conversion via fp4 intrinsics (on the hardware that supports those, of course). Just wanted to point this out, as the CPU path in this PR does ALU followed by I2F.
4 bytes is the minimum we need to be able to issue LDGSTS via
Regarding MMVQ: currently the activations are unconditionally converted to q8_1; if we intend to use floating-point math, we will need to extend this. More generally, if we add a path using floating-point math, it may make sense to use it for small matrices to remove the overhead from quantizing the activations. This table doesn't seem to list the throughput of
I'm not super experienced with the ggml/gguf internals, so feedback is very welcome. Note on AI usage: Claude Opus 4.6 was used for navigating the codebase, debugging, and writing parts of the code. All changes have been reviewed and tested manually. Open to reworking anything that doesn't meet the project's standards.
This adds support for NVIDIA's NVFP4 quantization format (FP4 E2M1 weights, UE4M3 per-block scale, 16 elements per block). This is the format produced by NVIDIA ModelOpt's NVFP4 algo. The main difference from MXFP4 is the scale encoding (UE4M3 vs E8M0).
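For readers less familiar with the format, here is a minimal sketch of what a 16-element NVFP4 block and its dequantization look like under the description above. All names are hypothetical (not the identifiers this PR adds), and the nibble order and UE4M3 decode convention are assumptions of the sketch; ModelOpt's separate per-tensor fp32 scale, discussed in the review thread, is not shown.

```c
#include <math.h>
#include <stdint.h>

#define QK_NVFP4 16

// Hypothetical block layout: 16 FP4 (E2M1) weights plus one UE4M3 scale byte.
typedef struct {
    uint8_t d;                  // per-block scale, UE4M3 encoded
    uint8_t qs[QK_NVFP4 / 2];   // 16 x 4-bit E2M1 values, two per byte
} block_nvfp4_sketch;           // 9 bytes per 16 weights = 4.5 bits per weight

// E2M1 represents {0, 0.5, 1, 1.5, 2, 3, 4, 6} with a sign bit (bit 3).
static const float kvalues_nvfp4[16] = {
     0.0f,  0.5f,  1.0f,  1.5f,  2.0f,  3.0f,  4.0f,  6.0f,
    -0.0f, -0.5f, -1.0f, -1.5f, -2.0f, -3.0f, -4.0f, -6.0f,
};

// UE4M3 decode: 4 exponent bits (bias 7), 3 mantissa bits, no sign;
// exp == 0 is the subnormal range man * 2^-9.
static float ue4m3_to_fp32_sketch(uint8_t x) {
    const int exp = (x >> 3) & 0x0F;
    const int man =  x       & 0x07;
    return exp == 0 ? (float) man * 0x1p-9f
                    : (1.0f + (float) man / 8.0f) * ldexpf(1.0f, exp - 7);
}

// Dequantize one block: weight = E2M1 value * block scale.
// The low-nibble-first element order is an assumption of this sketch.
static void dequantize_block_nvfp4_sketch(const block_nvfp4_sketch * b, float * y) {
    const float d = ue4m3_to_fp32_sketch(b->d);
    for (int i = 0; i < QK_NVFP4 / 2; ++i) {
        y[2*i + 0] = kvalues_nvfp4[b->qs[i] & 0x0F] * d;  // low nibble
        y[2*i + 1] = kvalues_nvfp4[b->qs[i] >>   4] * d;  // high nibble
    }
}
```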
What's in here:
Tested with models from https://huggingface.co/NVFP4 on an Apple M5 MacBook (CPU, NEON). Ran llama-bench and a basic server smoke test. Would appreciate help with that if someone has a good baseline to compare against.
Here is a Qwen3-4B model to test with.