
Support FP4 gemm (1/2) #3899

Merged
zhyncs merged 1 commit into sgl-project:main from trevor-m:fp4-upstream on Mar 25, 2025

Conversation

@trevor-m (Collaborator)

Motivation

This PR adds support for modelopt FP4-quantized models.
Tested with an FP4-quantized Llama 3.1 model.

This work was adapted from the following - thanks @pavanimajety @kaixih @kushanam!
vllm-project/vllm#12784
vllm-project/vllm#13571
vllm-project/vllm#12520

Modifications

Adds two operations to sgl-kernel:

  • scaled_fp4_quant - quantizes bf16 or fp16 input to fp4 and returns the input scale in a block-interleaved format (see the sketch after this list)
  • cutlass_scaled_fp4_mm - performs the fp4 GEMM

Adds the modelopt_fp4 quantization method.
Adds ModelOptFp4Config and ModelOptFp4LinearMethod to utilize the new fp4 kernels for linear layers.
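
For context on what scaled_fp4_quant computes: the modelopt NVFP4 recipe, as I understand it, stores values in the 4-bit e2m1 format (representable magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6) with one FP8 (e4m3) scale per 16-element block plus a global FP32 scale. Below is a minimal host-side reference sketch of the per-block quantize-dequantize round trip; the helper names are illustrative, and the real kernel additionally packs two codes per byte and swizzles the scales into the block-interleaved layout the GEMM expects.

#include <algorithm>
#include <cmath>

// Nearest representable e2m1 magnitude; the sign is handled by the caller.
static float nearest_e2m1(float x) {
  const float vals[8] = {0.f, 0.5f, 1.f, 1.5f, 2.f, 3.f, 4.f, 6.f};
  float best = vals[0];
  for (float v : vals)
    if (std::fabs(x - v) < std::fabs(x - best)) best = v;
  return best;
}

// Quantize-dequantize one 16-element block: pick the scale so the block's
// absolute maximum maps to 6.0 (the e2m1 maximum), then round every element
// to the nearest representable value.
static void quantize_block_fp4(const float in[16], float out[16], float* scale) {
  float amax = 0.f;
  for (int i = 0; i < 16; ++i) amax = std::max(amax, std::fabs(in[i]));
  *scale = (amax > 0.f) ? amax / 6.f : 1.f;
  for (int i = 0; i < 16; ++i) {
    const float sign = (in[i] < 0.f) ? -1.f : 1.f;
    out[i] = sign * nearest_e2m1(std::fabs(in[i]) / *scale) * (*scale);
  }
}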


@trevor-m force-pushed the fp4-upstream branch 2 times, most recently from 0d97c39 to a62cfd8 on February 28, 2025 at 22:44
@trevor-m changed the title from "Support FP4 gemm and FP4 checkpoints" to "Support FP4 gemm (1/2)" on Feb 28, 2025
@yiakwy-xpu-ml-framework-team (Contributor) commented Mar 1, 2025

@trevor-m Great effort! I am working on the FP4 type. My suggestion is to keep fp4 as uint4_t only, so that both platforms (AMD/NV) can benefit from this function:

// Proposed portable fp4 container: raw 4-bit storage plus a union for
// bit-level reinterpretation on either vendor's hardware.
struct float4 {
  uint4_t i4data;       // packed 4-bit payload
  union {
    float fval;         // view as a 32-bit float
    uint32_t i32val;    // view as a raw 32-bit word
    uint8_t i8val[4];   // view as four bytes
    uint4_t i4val[8];   // view as eight 4-bit lanes
  } val;
...
};

Under this common definition, we can then implement an interface of CUDA device functions, cast_fp8_to_fp4 and cast_fp4_to_fp8, and vice versa.

The kernel should then consist simply of basic bit-wise operations, or platform intrinsics for acceleration.
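
For illustration, one possible bit-wise sketch of that interface, assuming e4m3 on the fp8 side: cast_fp4_to_fp8 is exact, since every e2m1 value is representable in e4m3; cast_fp8_to_fp4 exploits the fact that positive e4m3 bit patterns are ordered like their values, so nearest-value rounding reduces to comparisons against the encoded midpoints between adjacent e2m1 magnitudes. Midpoint ties and NaN handling are left as implementation choices here.

#include <cstdint>

// e2m1 layout: s ee m (bias 1). e4m3 layout: s eeee mmm (bias 7).
__device__ uint8_t cast_fp4_to_fp8(uint8_t fp4) {
  const uint8_t s = (fp4 >> 3) & 0x1;
  const uint8_t e = (fp4 >> 1) & 0x3;
  const uint8_t m = fp4 & 0x1;
  // e2m1 subnormals encode 0.0 and 0.5; everything else rebiases exp 1 -> 7.
  const uint8_t mag = (e == 0) ? (m ? 0x30 : 0x00)
                               : (uint8_t)(((e + 6) << 3) | (m << 2));
  return (uint8_t)((s << 7) | mag);
}

__device__ uint8_t cast_fp8_to_fp4(uint8_t fp8) {
  // e4m3 encodings of the midpoints between adjacent e2m1 magnitudes
  // {0, 0.5, 1, 1.5, 2, 3, 4, 6}: i.e. 0.25, 0.75, 1.25, 1.75, 2.5, 3.5, 5.0.
  const uint8_t mid[7] = {0x28, 0x34, 0x3A, 0x3E, 0x42, 0x46, 0x4A};
  const uint8_t s = fp8 >> 7;
  const uint8_t mag = fp8 & 0x7F;
  uint8_t code = 0;
#pragma unroll
  for (int i = 0; i < 7; ++i) code += (mag >= mid[i]);  // ties round up here
  return (uint8_t)((s << 3) | code);  // magnitudes above 6.0 saturate to 6.0
}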

The kernel you added contains a lot of PTX code, which is hard to reuse.

That PTX should be viewed as NV intrinsics equivalent to a bit-wise implementation, and it only works on the B200 platform.

My view is that we should support fp4 on all platforms, not just B200.

Could we co-design the kernel together? I can help implement the bit-wise CUDA kernels.

(Contributor)

This only works on B200, which will make it very hard for the SGLang team to verify.

I suggest implementing the FP4 CUDA kernel without NV-intrinsics acceleration first.

(Collaborator)

@yiakwy-xpu-ml-framework-team I believe this feature is specific to Blackwell and is being added accordingly. Creating a generic kernel is beyond the scope of this PR, but it would be welcome if someone sees inherent value in doing so.

@kushanam (Collaborator)

@pavanimajety @kaixih to review.

Review thread on sgl-kernel/src/sgl-kernel/csrc/quantization/fp4/nvfp4_scaled_mm_kernels.cu (outdated)
@pavanimajety (Collaborator) left a comment

Reviewed the integrations and tests, LGTM.

Review thread on sgl-kernel/src/sgl-kernel/csrc/quantization/fp4/nvfp4_scaled_mm_kernels.cu (outdated)
@trevor-m force-pushed the fp4-upstream branch 5 times, most recently from aad28b3 to bfa56b0 on March 18, 2025 at 01:07
@zhyncs (Collaborator) commented Mar 22, 2025

Hi @kushanam @elfiegg, could you please review and verify this PR? Thanks!

Fix NaN issue by using getCurrentCUDAStream(); apply rounding patch from trtllm (not needed for the NaN fix) (see the sketch after these notes)

Add fp4 unit tests

Fix error spacing
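
On the getCurrentCUDAStream() fix: launching a kernel on the default stream while PyTorch has queued the producing ops on its current stream lets the kernel race ahead and read unfinished data, which can surface as NaNs. A minimal sketch of the correct pattern, using the standard ATen API (the kernel itself is illustrative, not this PR's code):

#include <ATen/cuda/CUDAContext.h>
#include <torch/extension.h>

// Illustrative elementwise kernel.
__global__ void copy_kernel(const float* in, float* out, int n) {
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = in[i];
}

// Launch on PyTorch's current stream so this kernel is ordered after the
// ops that produced `in`; launching on the default stream instead can race
// with them and read garbage.
void launch_copy(const torch::Tensor& in, torch::Tensor& out) {
  const cudaStream_t stream = at::cuda::getCurrentCUDAStream();
  const int n = static_cast<int>(in.numel());
  const int threads = 256;
  const int blocks = (n + threads - 1) / threads;
  copy_kernel<<<blocks, threads, 0, stream>>>(
      in.data_ptr<float>(), out.data_ptr<float>(), n);
}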
