🔖 Feature description
Support MXFP4 (Microscaling) Format for QAT and Post-Training Quantization via torchao/Model-Optimizer.
Currently, Axolotl users who try 4-bit floating-point formats can run into hardware-specific constraints (e.g., the `nvfp4` error, since that format is exclusive to Blackwell sm100). This feature request proposes adding support for MXFP4 (E2M1 elements with a shared per-block scale), a hardware-agnostic OCP standard that is supported on NVIDIA Hopper (H100/H800) and can be emulated efficiently on Ampere.
Implementing MXFP4 QAT will allow:
- Higher training stability compared to INT4/FP4.
- Better post-training weight compression for LLMs like `gpt-oss`.
- Alignment with NVIDIA's `model-optimizer` and `torchao` roadmaps.
✔️ Solution
Integrate `torchao.quantization.quantize_` with MXFP4-specific configs, or use NVIDIA's `modelopt` (Model Optimizer) workflow within Axolotl's quantization CLI.
Key components:
- Add `mxfp4` as a valid option for `quantization.weight_dtype` in the YAML config.
- Implement the MXFP4 fake-quantization logic in `axolotl.utils.quantization` during the QAT phase.
- Ensure compatibility with `torchao`'s MX format implementations (specifically `mx_fp4`).
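To make the fake-quantization component concrete, here is a minimal pure-Python sketch of what a quantize-then-dequantize pass over one MXFP4 block could look like. It assumes the OCP MX layout (E2M1 elements, one power-of-two E8M0-style scale per block of 32) and is not the `torchao` or Axolotl implementation; `shared_scale` and `fake_quantize_block` are hypothetical names used only for illustration. Ties are rounded toward the smaller magnitude, which is a simplification.

```python
import math

# Magnitudes representable by an FP4 E2M1 element (sign handled separately).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
E2M1_EMAX = 2    # exponent of the largest E2M1 value: 6 = 1.5 * 2**2
BLOCK_SIZE = 32  # elements sharing one scale in the OCP MX formats


def shared_scale(block):
    """Hypothetical helper: power-of-two scale for one block of values."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 1.0
    # frexp gives amax = m * 2**e with 0.5 <= m < 1, so floor(log2(amax)) = e - 1.
    exp = (math.frexp(amax)[1] - 1) - E2M1_EMAX
    exp = max(-127, min(127, exp))  # keep the exponent in the E8M0 range
    return 2.0 ** exp


def fake_quantize_block(block):
    """Hypothetical helper: snap values to the MXFP4 grid, return floats."""
    scale = shared_scale(block)
    out = []
    for v in block:
        mag = min(abs(v) / scale, E2M1_GRID[-1])  # saturate at the max magnitude
        q = min(E2M1_GRID, key=lambda g: abs(g - mag))  # nearest grid point
        out.append(math.copysign(q, v) * scale)
    return out


print(fake_quantize_block([12.0, 1.0, -5.5, 0.3]))  # [12.0, 1.0, -6.0, 0.0]
```

In an actual QAT setup this rounding would run on tensors in the forward pass, with gradients passed through unchanged (a straight-through estimator); that part is not shown here.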
❓ Alternatives
Currently, users are forced to use `int4_weight_only` or `fp8`: the former lacks the dynamic range of MXFP4, and the latter doesn't provide the same 4-bit memory savings.
📝 Additional Context
As LLMs like `gpt-oss` (120B+) grow, 4-bit quantization becomes critical for inference. MXFP4 aims for a sweet spot between 8-bit accuracy and 4-bit efficiency by sharing one 8-bit power-of-two (E8M0) scale across each block of 32 E2M1 elements, as defined in the OCP Microscaling spec (the 16-element block size belongs to NVFP4).
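As a rough illustration of the memory argument, the sketch below estimates weight storage for a 120B-parameter model under a few formats. It assumes every parameter is stored in the given format and that MXFP4 uses an 8-bit scale per 32 elements; real checkpoints usually keep some layers in higher precision, so treat the numbers as ballpark figures. The helper names are hypothetical.

```python
def bits_per_weight(element_bits, scale_bits=0, block_size=1):
    """Effective bits per weight, amortizing the shared scale over a block."""
    return element_bits + scale_bits / block_size


def weight_gb(num_params, bpw):
    """Weight storage in GB (10**9 bytes) for a given bits-per-weight."""
    return num_params * bpw / 8 / 1e9


NUM_PARAMS = 120e9  # the "120B+" scale mentioned above

formats = {
    "bf16": bits_per_weight(16),
    "fp8": bits_per_weight(8),
    "mxfp4": bits_per_weight(4, scale_bits=8, block_size=32),  # 4.25 bits
}

for name, bpw in formats.items():
    print(f"{name}: {bpw:.2f} bits/weight, ~{weight_gb(NUM_PARAMS, bpw):.2f} GB")
```

Under these assumptions the shared scale adds only 0.25 bits per weight, so MXFP4 stays close to half the footprint of `fp8`.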
axolotl-ai-cloud/axolotl#3333