
feat: Support MXFP4 quantized dense models on AMD CDNA2/CDNA3 GPUs#19143

Merged
HaiShaw merged 2 commits into sgl-project:main from fengli1702:petit-mxfp4
Apr 16, 2026

Conversation

@fengli1702
Contributor

Motivation

This PR leverages Petit to efficiently emulate FP4 on AMD CDNA2 / CDNA3 GPUs, enabling SGLang to run FP4-quantized models (e.g., amd--Llama-3.3-70B-Instruct-MXFP4-Preview quantized by ModelOpt) on AMD GPUs that do not natively support FP4.

Modifications

  • Add a new quantization scheme: petit_mxfp4.
  • Targets FP4 / MXFP4 models quantized by ModelOpt.
  • Integrates with the SGLang quantization pipeline so users can load and serve FP4 models by setting quantization="petit_mxfp4" (see the registry sketch below).
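
For orientation, a minimal sketch of how the new scheme could plug into SGLang's quantization registry; the import path and class name follow this PR's file list and changelog, so treat it as illustrative rather than the exact diff.

# Sketch only: registering petit_mxfp4 alongside the existing schemes in
# python/sglang/srt/layers/quantization/__init__.py (per this PR's changelog).
from sglang.srt.layers.quantization.petit_mxfp4 import PetitMxfp4Config

QUANTIZATION_CONFIGS = {
    # ... existing entries (fp8, modelopt_fp4, petit_nvfp4, ...) ...
    "petit_mxfp4": PetitMxfp4Config,
}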

Benchmarks

Environment

  • Hardware: 8x MI300X GPUs
  • Python: 3.12.12
  • Model: amd--Llama-3.3-70B-Instruct (BF16 baseline) & amd--Llama-3.3-70B-Instruct-MXFP4-Preview (quantized)
  • Dependencies:
    • petit_kernel: 0.0.3

Accuracy Benchmarks

Model: amd--Llama-3.3-70B-Instruct (BF16 baseline) vs amd--Llama-3.3-70B-Instruct-MXFP4-Preview (quantized)
Hardware: 8x MI300X
Setup: TP=8, max_model_len=4096, mem_fraction_static=0.9, batch_size=128

| Task | BF16 | FP4 (petit_mxfp4) |
| --- | --- | --- |
| MMLU | 82.22 | 81.29 |
| GSM8K_COT (8-shot, strict-match) | 95.38 | 93.78 |
| ARC Challenge (0-shot) | 92.96 | 92.79 |
| IFEVAL (0-shot, (inst_level_strict_acc + prompt_level_strict_acc) / 2) | 90.74 | 89.10 |

Throughput Benchmarks

Settings

export SGLANG_USE_AITER=1
python3 -m sglang.bench_offline_throughput --model-path /models --tensor-parallel-size 8 --num-prompts {10,64}

BF16 Baseline

With 10 prompts:

====== Offline Throughput Benchmark Result =======
Backend: engine
Successful requests: 10
Benchmark duration (s): 5.36
Total input tokens: 1960
Total generated tokens: 2774
Last generation throughput (tok/s): 109.69
Request throughput (req/s): 1.87
Input token throughput (tok/s): 365.68
Output token throughput (tok/s): 517.55
Total token throughput (tok/s): 883.24

With 64 prompts:

====== Offline Throughput Benchmark Result =======
Backend: engine
Successful requests: 64
Benchmark duration (s): 18.00
Total input tokens: 19441
Total generated tokens: 13499
Last generation throughput (tok/s): 109.57
Request throughput (req/s): 3.56
Input token throughput (tok/s): 1080.30
Output token throughput (tok/s): 750.11
Total token throughput (tok/s): 1830.41

FP4 (MXFP4 with Petit)

With 10 prompts:

====== Offline Throughput Benchmark Result =======
Backend: engine
Successful requests: 10
Benchmark duration (s): 5.50
Total input tokens: 1960
Total generated tokens: 2774
Last generation throughput (tok/s): 130.88
Request throughput (req/s): 1.82
Input token throughput (tok/s): 356.06
Output token throughput (tok/s): 503.93
Total token throughput (tok/s): 859.99

With 64 prompts:

====== Offline Throughput Benchmark Result =======
Backend: engine
Successful requests: 64
Benchmark duration (s): 23.80
Total input tokens: 19441
Total generated tokens: 13499
Last generation throughput (tok/s): 130.87
Request throughput (req/s): 2.69
Input token throughput (tok/s): 816.74
Output token throughput (tok/s): 567.11
Total token throughput (tok/s): 1383.84

How to Reproduce

Smoke Test (Python API)

import sglang as sgl

engine = sgl.Engine(
    model_path="/path/to/amd--Llama-3.3-70B-Instruct-MXFP4-Preview",
    tp_size=8,
    context_length=4096,
    mem_fraction_static=0.9,
    trust_remote_code=True,
    attention_backend="triton",
    quantization="petit_mxfp4",
)

print(engine.generate("hello"))
engine.shutdown()

Offline Throughput Benchmark

export SGLANG_USE_AITER=1

# MXFP4 (10 prompts)
python3 -m sglang.bench_offline_throughput \
    --model-path /path/to/amd--Llama-3.3-70B-Instruct-MXFP4-Preview \
    --tensor-parallel-size 8 \
    --num-prompts 10

# MXFP4 (64 prompts)
python3 -m sglang.bench_offline_throughput \
    --model-path /path/to/amd--Llama-3.3-70B-Instruct-MXFP4-Preview \
    --tensor-parallel-size 8 \
    --num-prompts 64

Accuracy Evaluation (MMLU)

PYTHONPATH=/path/to/sglang/python:${PYTHONPATH:-} \
python3 -m lm_eval \
  --tasks mmlu \
  --model sglang \
  --model_args pretrained=/path/to/amd--Llama-3.3-70B-Instruct-MXFP4-Preview,tp_size=8,max_model_len=4096,mem_fraction_static=0.9,trust_remote_code=True,quantization=petit_mxfp4 \
  --num_fewshot 5 \
  --batch_size 128

@github-actions github-actions Bot added the dependencies (Pull requests that update a dependency file) label Feb 22, 2026
@gemini-code-assist
Contributor

Summary of Changes

Hello @fengli1702, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances SGLang's compatibility with AMD GPUs by introducing support for MXFP4 quantized models. By integrating the Petit kernel, it allows users to run FP4-quantized models, such as those from ModelOpt, on AMD CDNA2/CDNA3 architectures, which previously lacked native FP4 capabilities. This change broadens the range of hardware where SGLang can efficiently deploy advanced language models, providing a crucial performance uplift for AMD users.

Highlights

  • New Quantization Scheme: Introduced a new petit_mxfp4 quantization scheme to support MXFP4 quantized models, specifically targeting AMD CDNA2/CDNA3 GPUs.
  • AMD GPU Compatibility: Enabled efficient emulation of FP4 on AMD CDNA2/CDNA3 GPUs by leveraging the Petit kernel, allowing SGLang to run FP4-quantized models on hardware that lacks native FP4 support.
  • Petit Kernel Integration: Integrated the petit_mxfp4 scheme into the SGLang quantization pipeline, allowing users to load and serve FP4 models by setting quantization="petit_mxfp4".
  • Updated Petit Configuration: Modified the existing PetitNvFp4Config to be more general, now supporting both NVFP4 (NVIDIA) and MXFP4 (AMD) formats, and added dedicated utility functions for MXFP4 processing.


Changelog
  • python/pyproject_other.toml
    • Updated the petit_kernel dependency version from 0.0.2 to 0.0.3.
  • python/sglang/srt/configs/model_config.py
    • Added petit_mxfp4 to the list of allowed quantization methods.
    • Included petit_mxfp4 in the compatible_quantization_methods dictionary, associating it with mxfp4 and modelopt.
  • python/sglang/srt/layers/quantization/__init__.py
    • Imported the new PetitMxfp4Config.
    • Added petit_mxfp4 to the QUANTIZATION_CONFIGS mapping.
  • python/sglang/srt/layers/quantization/petit.py
    • Updated the docstring for PetitNvFp4Config to indicate support for both NVFP4 and MXFP4.
    • Added a quant_format parameter to the PetitNvFp4Config constructor, defaulting to nvfp4.
    • Modified the from_config method to detect and set the quant_format based on quant_method and _is_hip.
    • Updated error messages to refer to NVFP4/MXFP4 quantization.
    • Expanded the docstring for PetitNvFp4LinearMethod to describe MXFP4 weight structure and support for AMD GPUs.
  • python/sglang/srt/layers/quantization/petit_mxfp4.py
    • Added a new file defining PetitMxfp4Config for MXFP4 linear inference on ROCm.
    • Implemented PetitMxfp4LinearMethod to handle weight creation and application for MXFP4 weights with Petit kernel execution.
    • Included class methods for get_name, get_supported_act_dtypes, get_min_capability, get_config_filenames, from_config, override_quantization_method, and is_petit_mxfp4_compatible.
  • python/sglang/srt/layers/quantization/petit_utils.py
    • Refactored Petit kernel imports to handle ImportError gracefully for all Petit functions.
    • Introduced _PETIT_INSTALL_ERROR constant for consistent error messages.
    • Added _require_petit_kernel function to check for Petit installation before kernel calls.
    • Modified _check_petit_nvfp4_supported to use case-insensitive matching for quant_method.
    • Added new functions _check_petit_mxfp4_supported and verify_petit_mxfp4_supported for MXFP4 specific checks.
    • Added prepare_mxfp4_layer_for_petit to repack MXFP4 weights and process scales for Petit.
    • Added apply_petit_mxfp4_linear for performing linear operations with MXFP4 weights using the Petit kernel.
  • python/sglang/srt/server_args.py
    • Added petit_mxfp4 to the list of allowed quantization methods in QUANTIZATION_METHODS.


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request adds support for MXFP4 quantized models on AMD GPUs by leveraging the Petit library. The changes introduce a new quantization scheme petit_mxfp4 and modify the existing petit_nvfp4 logic. While the overall approach is sound, there are some critical issues in python/sglang/srt/layers/quantization/petit.py where the attempt to handle both nvfp4 and mxfp4 formats is incomplete and contains bugs. Additionally, there's a minor code duplication issue that could be addressed for better maintainability. My review provides specific feedback on these points.

Comment on lines +103 to +108
def is_layer_excluded(self, prefix: str, exclude_modules: List[str]) -> bool:
    for pattern in exclude_modules:
        regex_str = pattern.replace(".", r"\.").replace("*", r".*")
        if re.fullmatch(regex_str, prefix):
            return True
    return False
Contributor


Severity: medium

This is_layer_excluded method is also present in PetitNvFp4Config in petit.py. To avoid code duplication, consider moving this helper function to a common utility file or to the base QuantizationConfig class if it's generally applicable.

Contributor Author

@fengli1702 fengli1702 Feb 22, 2026


I removed the duplication by extracting layer-pattern exclusion into a shared helper (is_layer_excluded_by_patterns) and reused it in both petit.py and petit_mxfp4.py.
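
For reference, a minimal sketch of what such a shared helper could look like, assuming the name is_layer_excluded_by_patterns from the comment above and the same glob-to-regex logic as the quoted method:

import re
from typing import List

def is_layer_excluded_by_patterns(prefix: str, exclude_modules: List[str]) -> bool:
    # Treat each exclude entry as a glob-style pattern: escape dots, map '*'
    # wildcards onto the regex '.*', and require a full match on the prefix.
    for pattern in exclude_modules:
        regex_str = pattern.replace(".", r"\.").replace("*", r".*")
        if re.fullmatch(regex_str, prefix):
            return True
    return False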

@fengli1702 fengli1702 force-pushed the petit-mxfp4 branch 2 times, most recently from 2651b2f to 7c88c83 on February 22, 2026 06:15

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 600bb8594e


Comment on lines +87 to +88
if cls.is_petit_mxfp4_compatible(hf_quant_cfg):
    return cls.get_name()


P1: Restrict petit_mxfp4 auto-selection to explicit opt-in

This override unconditionally returns petit_mxfp4 for any ROCm checkpoint whose quant config mentions MXFP4, which hijacks existing MXFP4 flows that were using mxfp4 (including MoE). In this commit, PetitMxfp4Config.get_quant_method only handles LinearBase, while MXFP4 MoE handling is implemented in mxfp4.py, so auto-switching all MXFP4 models here can drop the MoE quantization path and break model loading/inference for those checkpoints; this should be gated by user intent (for example, only when user_quant == "petit_mxfp4").


Contributor Author


I gated PetitMxfp4Config.override_quantization_method to only auto-select when the user explicitly sets quantization=petit_mxfp4, preserving existing MXFP4/MoE flows.
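
Roughly, the gating could look like the following sketch; the hook name and helpers come from the quoted snippet and changelog, and the exact signature in the PR may differ:

class PetitMxfp4Config:  # sketch of the relevant hook only
    @classmethod
    def override_quantization_method(cls, hf_quant_cfg, user_quant):
        # Only auto-select petit_mxfp4 when the user explicitly asked for it,
        # so existing MXFP4 / MoE flows keep their current quantization path.
        if user_quant != "petit_mxfp4":
            return None
        if cls.is_petit_mxfp4_compatible(hf_quant_cfg):
            return cls.get_name()
        return None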

Comment on lines +88 to +89
is_checkpoint_nvfp4_serialized = "NVFP4" in quant_method or (
    is_mxfp4 and _is_hip


P2: Normalize NVFP4 case before setting serialized-checkpoint flag

After making NVFP4 support checks case-insensitive, this line still uses a case-sensitive substring test, so configs with quant_algo: "nvfp4" now pass validation but are marked as non-serialized (False) and later fail with the dynamic-quantization error path. Use a case-insensitive check here to keep validation and runtime behavior consistent.


Contributor Author


I made NVFP4 serialized-checkpoint detection case-insensitive in petit.py so lowercase nvfp4 configs stay consistent between validation and runtime behavior.
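
In other words, the serialized-checkpoint flag now normalizes case before the substring test, roughly:

# Sketch: case-insensitive detection so quant_algo values "nvfp4" and "NVFP4"
# both mark the checkpoint as serialized.
is_checkpoint_nvfp4_serialized = "NVFP4" in quant_method.upper() or (
    is_mxfp4 and _is_hip
)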

Collaborator

@BowenBao BowenBao left a comment


Thank you for the contribution! I'm not sure whether https://huggingface.co/amd/Llama-3.3-70B-Instruct-MXFP4-Preview is the model mentioned in the PR description; it is quantized by Quark.

If not, could you check whether it can be supported by petit_mxfp4? Thank you.

"modelopt_fp8": ["modelopt"],
"modelopt_fp4": ["modelopt"],
"petit_nvfp4": ["modelopt"],
"petit_mxfp4": ["mxfp4", "modelopt"],
Collaborator


Probably only "quark" is the compatible quant method; I'm not sure there are any ModelOpt MXFP4 models.
The "mxfp4" quant method supports MoE only at the moment, AFAIK.

Contributor Author


Thanks for the review. Yes, the target model is amd/Llama-3.3-70B-Instruct-MXFP4-Preview: it is an MXFP4 checkpoint, but its quant_method field is quark.

I updated petit_mxfp4 compatibility to include quark (and removed modelopt from this path), so the mapping now matches the methods actually supported by this flow.

"""Config class for Petit FP4."""
"""Config class for Petit FP4.

Supports both NVFP4 (for NVIDIA GPUs) and MXFP4 (for AMD GPUs).
Collaborator


If this also supports MXFP4, should the config class be renamed, e.g. PetitNvFp4Config -> PetitFp4Config?

What's the relation of this to python/sglang/srt/layers/quantization/petit_mxfp4.py? Both seem to support MXFP4.

Contributor Author


I refactored this to make the boundary explicit:

  • petit_nvfp4.py: NVFP4-only implementation (PetitNvFp4Config).
  • petit_mxfp4.py: MXFP4-only implementation.
  • petit.py: now only a backward-compatible import shim for NVFP4 symbols.

So there is no longer functional overlap between petit.py and petit_mxfp4.py.

def _check_petit_mxfp4_supported(
    quant_method: str, group_size: Optional[int]
) -> tuple[bool, Optional[str]]:
    if "MXFP4" not in quant_method.upper():
Collaborator


Does petit_mxfp4 support running a Quark model? The quant_method would show up as "quark" in that case.

Contributor Author


Yes. petit_mxfp4 now supports the quant_method="quark" path.

I updated the support check to accept Quark-MXFP4 and added config validation for Quark MXFP4 characteristics. If quant_method is quark but the config is not MXFP4-compatible, it fails with a clear error message.
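
For illustration, a sketch of what the relaxed support check might look like; the helper name follows _check_petit_mxfp4_supported quoted above, the MXFP4 group size of 32 is the standard MX block size, and the PR's actual Quark validation is more detailed:

from typing import Optional, Tuple

def _check_petit_mxfp4_supported(
    quant_method: str, group_size: Optional[int]
) -> Tuple[bool, Optional[str]]:
    # Sketch only: accept explicit MXFP4 configs as well as Quark checkpoints,
    # whose quant_method field reads "quark" even though the weights are MXFP4.
    method = quant_method.upper()
    if "MXFP4" not in method and "QUARK" not in method:
        return False, "petit_mxfp4 requires an MXFP4 (or Quark MXFP4) checkpoint."
    if group_size is not None and group_size != 32:
        return False, "petit_mxfp4 expects the MXFP4 group size of 32."
    return True, None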

@github-actions github-actions Bot added the blackwell (SM100/SM120) label Apr 3, 2026
@fengli1702 fengli1702 force-pushed the petit-mxfp4 branch 2 times, most recently from d5531f8 to 76ebe92 on April 3, 2026 16:00
Collaborator

@BowenBao BowenBao left a comment


Looks good overall, thank you!

cc @kkHuang-amd , @HaiShaw

if not candidates:
    return True

return any(_is_quark_mxfp4_layer_quant_config(cfg) for cfg in candidates)
Collaborator


Suggested change
- return any(_is_quark_mxfp4_layer_quant_config(cfg) for cfg in candidates)
+ return all(_is_quark_mxfp4_layer_quant_config(cfg) for cfg in candidates)

Would this be safer in case the model has mixed-precision support, e.g., some other layers with weights quantized as FP8?

Contributor Author


I changed this check from any(...) to all(...), so we only treat it as Quark-MXFP4 compatible when all discovered quant configs match MXFP4 weight requirements. This is safer for potential mixed-precision configs.

and weight_quant.get("is_dynamic") is False
and input_quant.get("is_dynamic") is True
and weight_quant.get("scale_format") == "e8m0"
and input_quant.get("scale_format") == "e8m0"
Collaborator


I think the input_quant check could be relaxed altogether, since the models you listed all use w4a4 quantization while your kernel is w4a16.

Perhaps a warning_once that input_quant, if set in the config, is ignored under Petit would suffice.

Contributor Author


Agreed. I relaxed the input_tensors check for petit_mxfp4. Since this path is w4a16, input quant config is not used by the kernel. We now validate weight quant config only, and emit a warning_once when input_tensors quant config is present but ignored under petit.
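
A minimal sketch of that behavior, using plain stdlib logging with a module-level flag to approximate warning_once (SGLang's actual logging helper may differ):

import logging

logger = logging.getLogger(__name__)
_warned_input_quant_ignored = False

def _maybe_warn_input_quant_ignored(layer_quant_cfg: dict) -> None:
    # Sketch only: the Petit MXFP4 path is weight-only (w4a16), so any
    # input_tensors quant config in the checkpoint is ignored; warn once
    # instead of rejecting the config.
    global _warned_input_quant_ignored
    if layer_quant_cfg.get("input_tensors") and not _warned_input_quant_ignored:
        logger.warning(
            "Quark config specifies input_tensors quantization, but petit_mxfp4 "
            "runs w4a16; the input quantization config is ignored."
        )
        _warned_input_quant_ignored = True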

@fengli1702 fengli1702 force-pushed the petit-mxfp4 branch 2 times, most recently from 59c6bf2 to d927e19 on April 4, 2026 04:39
@HaiShaw HaiShaw merged commit 2cc52d8 into sgl-project:main Apr 16, 2026
8 of 14 checks passed
@bingxche
Collaborator

Hi @fengli1702, this PR pins petit_kernel==0.0.3, which requires Python 3.12+, but the AMD SGLang daily image uses Python 3.10. This breaks dependency installation for all AMD CI runs. Please revert first.

https://github.com/sgl-project/sglang/actions/runs/24541573222

cc @HaiShaw @yctseng0211

@yctseng0211
Collaborator

yctseng0211 commented Apr 17, 2026

yctseng0211 added a commit to yctseng0211/sglang that referenced this pull request Apr 17, 2026
@haohui
Contributor

haohui commented Apr 17, 2026

Thanks for pointing it out. Just published the cp310 wheel. Please retrigger the CI. Thanks

yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026
zhangying098 pushed a commit to zhangying098/sglang that referenced this pull request Apr 23, 2026

Labels

blackwell (SM100/SM120), dependencies (Pull requests that update a dependency file)
