
convert : support mixed-precision ModelOpt models with per-tensor NVFP4/FP8 quantization #20539

Open
richarddd wants to merge 1 commit into ggml-org:master from richarddd:fix/nvfp4-mixed-precision-convert

Conversation

@richarddd
Copy link
Contributor

Adds support for converting mixed-precision ModelOpt models (e.g. nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4) that use a per-tensor quant_algo with both NVFP4 and FP8 layers, instead of a single global quant_algo: "NVFP4". NVFP4 tensors (2D scales) are repacked natively, while FP8 tensors (1D scales) are dequantized to float.

Fixes: #20504
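The per-tensor dispatch described above can be sketched roughly as follows. This is a hypothetical illustration only, not the PR's actual code; the function name and return values are made up, and the only assumption taken from the description is that NVFP4 layers carry 2D block scales while FP8 layers carry 1D (per-tensor) scales:

```python
# Illustrative sketch (not the PR's code): choose a conversion path from the
# rank of a layer's weight_scale tensor, per the mixed-precision scheme above.

def choose_conversion(scale_shape: tuple) -> str:
    """Return which conversion path an illustrative converter would take."""
    if len(scale_shape) == 2:
        # NVFP4: 2D block scales -> repack the packed 4-bit data natively.
        return "repack-nvfp4"
    # FP8: 1D / per-tensor scale -> dequantize the weight to float.
    return "dequantize-float"
```

For example, a weight_scale of shape (4096, 128) would take the NVFP4 repack path, while a shape (1,) scale would be treated as FP8 and dequantized.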

@richarddd richarddd requested a review from CISC as a code owner March 14, 2026 07:21
@github-actions github-actions bot added the python python script changes label Mar 14, 2026
@vbooka1
Copy link

vbooka1 commented Mar 14, 2026

Either llama.cpp or convert_hf_to_gguf.py is broken; model https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 does not work with builds 8304 (the first supporting Nemotron 3) or 8334 (latest). I'm getting the error: llama_model_load: error loading model: done_getting_tensors: wrong number of tensors; expected 844, got 843.

Please check.

convert.txt

launch.txt

@richarddd
Copy link
Contributor Author

> Either llama.cpp or convert_hf_to_gguf.py is broken; model https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 does not work with builds 8304 (first supporting Nemotron 3) and 8334 (latest): I'm getting the error llama_model_load: error loading model: done_getting_tensors: wrong number of tensors; expected 844, got 843.
>
> Please check.

Yeah, something is off. I didn't properly smoke test due to lack of memory.

@richarddd richarddd marked this pull request as draft March 14, 2026 20:47
@CISC
Copy link
Member

CISC commented Mar 14, 2026

@vbooka1 @richarddd Fixed by #20506

@richarddd richarddd marked this pull request as ready for review March 15, 2026 06:16
@richarddd richarddd force-pushed the fix/nvfp4-mixed-precision-convert branch from 3530623 to 585e8da Compare March 15, 2026 06:19
# Detect NVFP4 by checking for weight_scale tensors in the model.
if quant_algo != "NVFP4":
    if any(k.endswith(".weight_scale") for k in self.model_tensors.keys()):
        quant_algo = "NVFP4"
Member
I think I would prefer if you went through quantized_layers in the config and checked for NVFP4 instead, to avoid unnecessary processing.
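The reviewer's suggestion could look something like the sketch below. This is an assumption-laden illustration: the "quantized_layers" key name and its per-layer "quant_algo" layout are inferred from the comment, not verified against ModelOpt's actual config format:

```python
# Hypothetical sketch of the suggested config-based check (key names assumed,
# not verified): scan per-layer quantization entries instead of iterating
# over every tensor in the model looking for weight_scale suffixes.

def config_has_nvfp4(quant_config: dict) -> bool:
    """True if any per-layer entry in the (assumed) config declares NVFP4."""
    layers = quant_config.get("quantized_layers", {})
    return any(entry.get("quant_algo") == "NVFP4" for entry in layers.values())
```

The upside of this approach is cost: the config is a small dict read once, whereas scanning tensor names touches every entry in a potentially very large tensor map.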


Labels

python (python script changes)


Development

Successfully merging this pull request may close these issues:

Misc. bug: convert_hf_to_gguf.py does not support Nemotron3 NVFP4

3 participants