convert : support mixed-precision ModelOpt models with per-tensor NVFP4/FP8 quantization#20539
richarddd wants to merge 1 commit into ggml-org:master
Conversation
Please check.
Yeah, something is off. I didn't properly smoke test due to lack of memory.
@vbooka1 @richarddd Fixed by #20506 |
Force-pushed from 3530623 to 585e8da
```python
# Detect NVFP4 by checking for weight_scale tensors in the model.
if quant_algo != "NVFP4":
    if any(k.endswith(".weight_scale") for k in self.model_tensors.keys()):
        quant_algo = "NVFP4"
```
I think I would prefer if you went through quantized_layers in the config and checked for NVFP4 instead to avoid unnecessary processing.
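A minimal sketch of what the suggested config-based check could look like. The `quantized_layers` key and the per-layer entry shape are assumptions inferred from the comment above, not a confirmed ModelOpt schema; the point is only that a small dict lookup replaces a scan over every tensor name.

```python
# Hypothetical sketch: detect NVFP4 from the quantization config instead
# of scanning all tensor names. The "quantized_layers" mapping and its
# per-layer {"quant_algo": ...} entries are assumed, not a verified schema.
import json


def config_has_nvfp4(quant_config: dict) -> bool:
    """Return True if any layer in the config is quantized with NVFP4."""
    layers = quant_config.get("quantized_layers", {})
    return any(entry.get("quant_algo") == "NVFP4" for entry in layers.values())


# Example config mixing FP8 and NVFP4 layers (illustrative layer names).
config = json.loads("""{
    "quant_algo": "FP8",
    "quantized_layers": {
        "model.layers.0.mlp.down_proj": {"quant_algo": "NVFP4"},
        "model.layers.0.self_attn.q_proj": {"quant_algo": "FP8"}
    }
}""")
print(config_has_nvfp4(config))  # → True
```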
Adds support for converting mixed-precision ModelOpt models (e.g. nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4) that use a per-tensor `quant_algo` with both NVFP4 and FP8 layers, instead of a single global `quant_algo: "NVFP4"`. NVFP4 tensors (2D scales) are repacked natively, while FP8 tensors (1D scales) are dequantized to float.
Fixes: #20504
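The description above distinguishes the two paths by the rank of the scale tensor: NVFP4 block scales are 2D, FP8 per-tensor scales are scalar/1D. A small sketch of that dispatch, with hypothetical helper and tensor names (only the rank-based routing mirrors the PR text):

```python
# Sketch of the dispatch described in the PR: route a quantized weight by
# the rank of its weight_scale tensor. Function and tensor names here are
# illustrative, not the actual convert-script API.
import numpy as np


def route_quantized_tensor(scale: np.ndarray) -> str:
    if scale.ndim == 2:
        return "repack-nvfp4"   # 2D block scales: keep native NVFP4 packing
    return "dequantize-fp8"     # scalar/1D scale: fold back to float

# Illustrative scale shapes for the two layer kinds.
nvfp4_scale = np.ones((128, 32), dtype=np.float32)  # per-block scales
fp8_scale = np.ones((1,), dtype=np.float32)         # per-tensor scale

print(route_quantized_tensor(nvfp4_scale))  # → repack-nvfp4
print(route_quantized_tensor(fp8_scale))    # → dequantize-fp8
```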