feat: Support MXFP4 quantized dense models on AMD CDNA2/CDNA3 GPUs #19143
HaiShaw merged 2 commits into sgl-project:main
Conversation
Summary of Changes
Hello @fengli1702, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly enhances SGLang's compatibility with AMD GPUs by introducing support for MXFP4 quantized models. By integrating the Petit kernel, it allows users to run FP4-quantized models, such as those from ModelOpt, on AMD CDNA2/CDNA3 architectures, which previously lacked native FP4 capabilities. This change broadens the range of hardware where SGLang can efficiently deploy advanced language models, providing a crucial performance uplift for AMD users.
Code Review
This pull request adds support for MXFP4 quantized models on AMD GPUs by leveraging the Petit library. The changes introduce a new quantization scheme petit_mxfp4 and modify the existing petit_nvfp4 logic. While the overall approach is sound, there are some critical issues in python/sglang/srt/layers/quantization/petit.py where the attempt to handle both nvfp4 and mxfp4 formats is incomplete and contains bugs. Additionally, there's a minor code duplication issue that could be addressed for better maintainability. My review provides specific feedback on these points.
```python
def is_layer_excluded(self, prefix: str, exclude_modules: List[str]) -> bool:
    for pattern in exclude_modules:
        regex_str = pattern.replace(".", r"\.").replace("*", r".*")
        if re.fullmatch(regex_str, prefix):
            return True
    return False
```
I removed the duplication by extracting layer-pattern exclusion into a shared helper (is_layer_excluded_by_patterns) and reused it in both petit.py and petit_mxfp4.py.
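Based on the exclusion logic quoted above, the shared helper could look roughly like this (module placement and exact signature per the final commit):

```python
import re
from typing import List


def is_layer_excluded_by_patterns(prefix: str, exclude_modules: List[str]) -> bool:
    # Treat each pattern as glob-like: escape dots, expand "*" to ".*",
    # and require a full match against the layer's prefix.
    for pattern in exclude_modules:
        regex_str = pattern.replace(".", r"\.").replace("*", r".*")
        if re.fullmatch(regex_str, prefix):
            return True
    return False
```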
Force-pushed 2651b2f to 7c88c83
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 600bb8594e
```python
if cls.is_petit_mxfp4_compatible(hf_quant_cfg):
    return cls.get_name()
```
Restrict petit_mxfp4 auto-selection to explicit opt-in
This override unconditionally returns petit_mxfp4 for any ROCm checkpoint whose quant config mentions MXFP4, which hijacks existing MXFP4 flows that were using mxfp4 (including MoE). In this commit, PetitMxfp4Config.get_quant_method only handles LinearBase, while MXFP4 MoE handling is implemented in mxfp4.py, so auto-switching all MXFP4 models here can drop the MoE quantization path and break model loading/inference for those checkpoints; this should be gated by user intent (for example, only when user_quant == "petit_mxfp4").
I gated PetitMxfp4Config.override_quantization_method to only auto-select when the user explicitly sets quantization=petit_mxfp4, preserving existing MXFP4/MoE flows.
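A sketch of what the gated override could look like, assuming user_quant carries the user's explicit quantization choice (the compatibility probe below is a placeholder; exact signatures per the final commit):

```python
from typing import Optional


class PetitMxfp4Config:
    @classmethod
    def get_name(cls) -> str:
        return "petit_mxfp4"

    @classmethod
    def is_petit_mxfp4_compatible(cls, hf_quant_cfg: dict) -> bool:
        # Placeholder for the real compatibility probe.
        return "MXFP4" in str(hf_quant_cfg.get("quant_method", "")).upper()

    @classmethod
    def override_quantization_method(
        cls, hf_quant_cfg: dict, user_quant: Optional[str]
    ) -> Optional[str]:
        # Only auto-select when the user explicitly opted in, so existing
        # MXFP4/MoE flows keep their original quantization path.
        if user_quant == cls.get_name() and cls.is_petit_mxfp4_compatible(
            hf_quant_cfg
        ):
            return cls.get_name()
        return None
```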
```python
is_checkpoint_nvfp4_serialized = "NVFP4" in quant_method or (
    is_mxfp4 and _is_hip
)
```
Normalize NVFP4 case before setting serialized-checkpoint flag
After making NVFP4 support checks case-insensitive, this line still uses a case-sensitive substring test, so configs with quant_algo: "nvfp4" now pass validation but are marked as non-serialized (False) and later fail with the dynamic-quantization error path. Use a case-insensitive check here to keep validation and runtime behavior consistent.
I made NVFP4 serialized-checkpoint detection case-insensitive in petit.py so lowercase nvfp4 configs stay consistent between validation and runtime behavior.
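For illustration, the fix amounts to normalizing case once before testing, as in this hypothetical helper (name and signature are not from the final commit):

```python
def _is_nvfp4_serialized(quant_method: str, is_mxfp4: bool, is_hip: bool) -> bool:
    # Case-insensitive NVFP4 detection keeps validation and the
    # serialized-checkpoint flag consistent for configs like quant_algo: "nvfp4".
    return "NVFP4" in quant_method.upper() or (is_mxfp4 and is_hip)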
Force-pushed 7c88c83 to 624e3ff
BowenBao left a comment:
Thank you for the contribution! I'm not sure whether https://huggingface.co/amd/Llama-3.3-70B-Instruct-MXFP4-Preview is the model mentioned in the PR description; it is quantized by Quark.
If not, could you check whether it can be supported by petit_mxfp4? Thank you.
| "modelopt_fp8": ["modelopt"], | ||
| "modelopt_fp4": ["modelopt"], | ||
| "petit_nvfp4": ["modelopt"], | ||
| "petit_mxfp4": ["mxfp4", "modelopt"], |
Probably only "quark" is the compatible quant method; I'm not sure there are ModelOpt MXFP4 models.
The "mxfp4" quant method supports MoE only at the moment, AFAIK.
Thanks for the review. Yes, the target model is amd/Llama-3.3-70B-Instruct-MXFP4-Preview: it is an MXFP4 checkpoint whose quant_method field is quark.
I updated petit_mxfp4 compatibility to include quark (and removed modelopt from this path); the mapping is now aligned with the methods this flow actually supports. See the sketch below.
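The resulting table would look roughly like this (the table name is illustrative; exact final contents per the merged commit):

```python
# Quant-config "quant_method" values each scheme accepts (sketch).
QUANT_METHOD_COMPAT = {
    "modelopt_fp8": ["modelopt"],
    "modelopt_fp4": ["modelopt"],
    "petit_nvfp4": ["modelopt"],
    "petit_mxfp4": ["quark"],  # quark added, modelopt removed, per the reply
}
```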
| """Config class for Petit FP4.""" | ||
| """Config class for Petit FP4. | ||
|
|
||
| Supports both NVFP4 (for NVIDIA GPUs) and MXFP4 (for AMD GPUs). |
If this also supports MXFP4, should the config class be renamed? E.g., PetitNvFp4Config -> PetitFp4Config.
What's the relation of this to python/sglang/srt/layers/quantization/petit_mxfp4.py? Both seem to support MXFP4?
I refactored this to make the boundary explicit:
- petit_nvfp4.py: NVFP4-only implementation (PetitNvFp4Config).
- petit_mxfp4.py: MXFP4-only implementation.
- petit.py: now only a backward-compatible import shim for NVFP4 symbols.
So there is no longer functional overlap between petit.py and petit_mxfp4.py.
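A shim of this kind can be as small as re-exporting the moved symbols (a sketch; the re-exported names follow the reply above):

```python
# python/sglang/srt/layers/quantization/petit.py (sketch)
# Backward-compatible shim: old import paths keep working after the NVFP4
# implementation moved to petit_nvfp4.py.
from sglang.srt.layers.quantization.petit_nvfp4 import PetitNvFp4Config  # noqa: F401
```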
```python
def _check_petit_mxfp4_supported(
    quant_method: str, group_size: Optional[int]
) -> tuple[bool, Optional[str]]:
    if "MXFP4" not in quant_method.upper():
```
Does petit_mxfp4 support running a Quark model? The quant_method would show up as "quark" in that case.
Yes. petit_mxfp4 now supports the quant_method="quark" path.
I updated the support check to accept Quark-MXFP4 and added config validation for Quark MXFP4 characteristics. If quant_method is quark but the config is not MXFP4-compatible, it fails with a clear error message.
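A rough sketch of how the relaxed check could accept the Quark path while still validating MXFP4 weight characteristics (helper name and messages are illustrative; MXFP4's 32-element block size is per the OCP MX spec):

```python
from typing import Optional


def _check_petit_mxfp4_supported(
    quant_method: str, group_size: Optional[int]
) -> tuple[bool, Optional[str]]:
    method = quant_method.upper()
    if "MXFP4" not in method and "QUARK" not in method:
        return False, "petit_mxfp4 requires an MXFP4 (or Quark-MXFP4) checkpoint"
    if group_size is not None and group_size != 32:
        # MXFP4 uses 32-element blocks with shared e8m0 scales.
        return False, f"petit_mxfp4 expects group_size=32, got {group_size}"
    return True, None
```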
Force-pushed d5531f8 to 76ebe92
BowenBao left a comment:
Looks good overall, thank you!
cc @kkHuang-amd, @HaiShaw
```python
if not candidates:
    return True

return any(_is_quark_mxfp4_layer_quant_config(cfg) for cfg in candidates)
```
```diff
-return any(_is_quark_mxfp4_layer_quant_config(cfg) for cfg in candidates)
+return all(_is_quark_mxfp4_layer_quant_config(cfg) for cfg in candidates)
```
Will this be safer in case the model has mixed-precision support? E.g., some other layers with weights quantized as fp8.
I changed this check from any(...) to all(...), so we only treat it as Quark-MXFP4 compatible when all discovered quant configs match MXFP4 weight requirements. This is safer for potential mixed-precision configs.
```python
and weight_quant.get("is_dynamic") is False
and input_quant.get("is_dynamic") is True
and weight_quant.get("scale_format") == "e8m0"
and input_quant.get("scale_format") == "e8m0"
```
I think the input_quant check could be relaxed altogether, since the models you listed all have w4a4 quant but your kernel is w4a16.
Perhaps a warning_once that input_quant, if set in the config, is ignored under Petit would suffice.
Agreed. I relaxed the input_tensors check for petit_mxfp4. Since this path is w4a16, input quant config is not used by the kernel. We now validate weight quant config only, and emit a warning_once when input_tensors quant config is present but ignored under petit.
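A minimal sketch of the warn-and-ignore behavior (the exact logging helper in SGLang may differ; this uses a plain module-level guard instead of a warning_once utility):

```python
import logging

logger = logging.getLogger(__name__)
_warned_ignored_input_quant = False


def _warn_input_quant_ignored(input_quant) -> None:
    # petit_mxfp4 runs w4a16: activations stay 16-bit, so any input_tensors
    # quant config found in the checkpoint is not used by the kernel.
    global _warned_ignored_input_quant
    if input_quant is not None and not _warned_ignored_input_quant:
        logger.warning(
            "input_tensors quant config present but ignored under petit "
            "(kernel is w4a16)"
        )
        _warned_ignored_input_quant = True
```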
Force-pushed 59c6bf2 to d927e19
Hi @fengli1702, this PR pins petit_kernel==0.0.3, which requires Python >= 3.12, but the AMD SGLang daily image uses Python 3.10. This will cause a dependency-install error for all AMD CI runs. Please revert first. https://github.com/sgl-project/sglang/actions/runs/24541573222
petit_kernel 0.0.2 has both cp310 and cp312 wheels: https://pypi.org/project/petit-kernel/0.0.2/#files
petit_kernel 0.0.3 only has cp312: https://pypi.org/project/petit-kernel/0.0.3/#files
Revert "feat: Support MXFP4 quantized dense models on AMD CDNA2/CDNA3 GPUs (sgl-project#19143)" — This reverts commit 2cc52d8.
Thanks for pointing it out. I just published the cp310 wheel. Please retrigger the CI. Thanks!
Motivation
This PR leverages Petit to efficiently emulate FP4 on AMD CDNA2/CDNA3 GPUs, enabling SGLang to run FP4-quantized models (e.g., amd/Llama-3.3-70B-Instruct-MXFP4-Preview, quantized by Quark) on AMD GPUs that do not natively support FP4.
Modifications
- Add a new quantization scheme: petit_mxfp4.
- Targets FP4/MXFP4 checkpoints such as the Quark-quantized amd/Llama-3.3-70B-Instruct-MXFP4-Preview.
- Integrates with the SGLang quantization pipeline so users can load and serve FP4 models by setting quantization="petit_mxfp4"; see the launch sketch below.
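For example, serving could look like this (a hedged sketch; flag spellings follow SGLang's usual launch_server arguments):

```bash
python3 -m sglang.launch_server \
    --model-path amd/Llama-3.3-70B-Instruct-MXFP4-Preview \
    --quantization petit_mxfp4 \
    --tp-size 8
```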
Benchmarks
Environment
Accuracy Benchmarks
Model: amd--Llama-3.3-70B-Instruct (BF16 baseline) vs amd--Llama-3.3-70B-Instruct-MXFP4-Preview (quantized)
Hardware: 8x MI300X
Setup: TP=8, max_model_len=4096, mem_fraction_static=0.9, batch_size=128
Throughput Benchmarks
Settings
```bash
export SGLANG_USE_AITER=1
python3 -m sglang.bench_offline_throughput --model-path /models --tensor-parallel-size 8 --num-prompts {10,64}
```
BF16 Baseline
With 10 prompts:
With 64 prompts:
FP4 (MXFP4 with Petit)
With 10 prompts:
With 64 prompts:
How to Reproduce
Smoke Test (Python API)
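A minimal smoke test using SGLang's offline Engine API might look like this (paths and sampling parameters are illustrative; quantization and mem_fraction_static are assumed to pass through as server args):

```python
import sglang as sgl

llm = sgl.Engine(
    model_path="/path/to/amd--Llama-3.3-70B-Instruct-MXFP4-Preview",
    tp_size=8,
    quantization="petit_mxfp4",
    mem_fraction_static=0.9,
)
outputs = llm.generate(
    ["The capital of France is"],
    {"temperature": 0.0, "max_new_tokens": 32},
)
print(outputs[0]["text"])
llm.shutdown()
```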
Offline Throughput Benchmark
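Mirroring the Settings command above, pointed at the MXFP4 checkpoint (a sketch; --quantization is assumed to be forwarded to the engine):

```bash
export SGLANG_USE_AITER=1
python3 -m sglang.bench_offline_throughput \
    --model-path /path/to/amd--Llama-3.3-70B-Instruct-MXFP4-Preview \
    --quantization petit_mxfp4 \
    --tensor-parallel-size 8 \
    --num-prompts 64
```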
Accuracy Evaluation (MMLU)
```bash
PYTHONPATH=/path/to/sglang/python:${PYTHONPATH:-} \
python3 -m lm_eval \
    --tasks mmlu \
    --model sglang \
    --model_args pretrained=/path/to/amd--Llama-3.3-70B-Instruct-MXFP4-Preview,tp_size=8,max_model_len=4096,mem_fraction_static=0.9,trust_remote_code=True,quantization=petit_mxfp4 \
    --num_fewshot 5 \
    --batch_size 128
```
Checklist