
Support FP4 gemm (1/2) #3899

Merged
zhyncs merged 1 commit into sgl-project:main from trevor-m:fp4-upstream on Mar 25, 2025

Conversation

@trevor-m (Collaborator)

Motivation

This PR adds support for modelopt FP4-quantized models.
Tested with an FP4-quantized Llama 3.1 model.

This work was adapted from the following - thanks @pavanimajety @kaixih @kushanam!
vllm-project/vllm#12784
vllm-project/vllm#13571
vllm-project/vllm#12520

Modifications

Adds two operations to sgl-kernel:

  • scaled_fp4_quant - quantizes bf16 or fp16 input to fp4 and returns the input scale in a block-interleaved format (see the sketch after this list)
  • cutlass_scaled_fp4_mm - performs the fp4 GEMM

Adds the modelopt_fp4 quantization method.
Adds ModelOptFp4Config and ModelOptFp4LinearMethod to utilize the new fp4 kernels for linear layers.
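
For context on what scaled_fp4_quant computes: the modelopt NVFP4 recipe, as I understand it, stores values in the 4-bit e2m1 format (representable magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6) with one FP8 (e4m3) scale per 16-element block plus a global FP32 scale. Below is a minimal host-side reference sketch of the per-block quantize-dequantize round trip; the helper names are illustrative, and the real kernel additionally packs two codes per byte and swizzles the scales into the block-interleaved layout the GEMM expects.

#include <algorithm>
#include <cmath>

// Nearest representable e2m1 magnitude; the sign is handled by the caller.
static float nearest_e2m1(float x) {
  const float vals[8] = {0.f, 0.5f, 1.f, 1.5f, 2.f, 3.f, 4.f, 6.f};
  float best = vals[0];
  for (float v : vals)
    if (std::fabs(x - v) < std::fabs(x - best)) best = v;
  return best;
}

// Quantize-dequantize one 16-element block: pick the scale so the block's
// absolute maximum maps to 6.0 (the e2m1 maximum), then round every element
// to the nearest representable value.
static void quantize_block_fp4(const float in[16], float out[16], float* scale) {
  float amax = 0.f;
  for (int i = 0; i < 16; ++i) amax = std::max(amax, std::fabs(in[i]));
  *scale = (amax > 0.f) ? amax / 6.f : 1.f;
  for (int i = 0; i < 16; ++i) {
    const float sign = (in[i] < 0.f) ? -1.f : 1.f;
    out[i] = sign * nearest_e2m1(std::fabs(in[i]) / *scale) * (*scale);
  }
}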


@trevor-m force-pushed the fp4-upstream branch 2 times, most recently from 0d97c39 to a62cfd8 on February 28, 2025 at 22:44
@trevor-m changed the title from "Support FP4 gemm and FP4 checkpoints" to "Support FP4 gemm (1/2)" on Feb 28, 2025
@yiakwy-xpu-ml-framework-team (Contributor) commented Mar 1, 2025

@trevor-m Great effort! I am working on the FP4 type. My suggestion is to keep fp4 as uint4_t only, so that both platforms (AMD/NV) can benefit from this function:

// Proposed portable fp4 container: raw 4-bit storage plus a union for
// bit-level reinterpretation on either vendor's hardware.
struct float4 {
  uint4_t i4data;       // packed 4-bit payload
  union {
    float fval;         // view as a 32-bit float
    uint32_t i32val;    // view as a raw 32-bit word
    uint8_t i8val[4];   // view as four bytes
    uint4_t i4val[8];   // view as eight 4-bit lanes
  } val;
...
};

Under this common definition, we can then implement an interface of CUDA device functions, cast_fp8_to_fp4 and cast_fp4_to_fp8, and vice versa.

The kernel should then consist simply of basic bit-wise operations, or platform intrinsics for acceleration.
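
For illustration, one possible bit-wise sketch of that interface, assuming e4m3 on the fp8 side: cast_fp4_to_fp8 is exact, since every e2m1 value is representable in e4m3; cast_fp8_to_fp4 exploits the fact that positive e4m3 bit patterns are ordered like their values, so nearest-value rounding reduces to comparisons against the encoded midpoints between adjacent e2m1 magnitudes. Midpoint ties and NaN handling are left as implementation choices here.

#include <cstdint>

// e2m1 layout: s ee m (bias 1). e4m3 layout: s eeee mmm (bias 7).
__device__ uint8_t cast_fp4_to_fp8(uint8_t fp4) {
  const uint8_t s = (fp4 >> 3) & 0x1;
  const uint8_t e = (fp4 >> 1) & 0x3;
  const uint8_t m = fp4 & 0x1;
  // e2m1 subnormals encode 0.0 and 0.5; everything else rebiases exp 1 -> 7.
  const uint8_t mag = (e == 0) ? (m ? 0x30 : 0x00)
                               : (uint8_t)(((e + 6) << 3) | (m << 2));
  return (uint8_t)((s << 7) | mag);
}

__device__ uint8_t cast_fp8_to_fp4(uint8_t fp8) {
  // e4m3 encodings of the midpoints between adjacent e2m1 magnitudes
  // {0, 0.5, 1, 1.5, 2, 3, 4, 6}: i.e. 0.25, 0.75, 1.25, 1.75, 2.5, 3.5, 5.0.
  const uint8_t mid[7] = {0x28, 0x34, 0x3A, 0x3E, 0x42, 0x46, 0x4A};
  const uint8_t s = fp8 >> 7;
  const uint8_t mag = fp8 & 0x7F;
  uint8_t code = 0;
#pragma unroll
  for (int i = 0; i < 7; ++i) code += (mag >= mid[i]);  // ties round up here
  return (uint8_t)((s << 3) | code);  // magnitudes above 6.0 saturate to 6.0
}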

The kernel you added contains a lot of PTX code, which is hard to reuse.

That PTX should be viewed as NV intrinsics equivalent to a bit-wise implementation, and it only works on the B200 platform.

My view is that we should support fp4 on all platforms, not just B200.

Could we co-design the kernel together? I can help implement the bit-wise CUDA kernels.

(Contributor)

This only works on B200, which will make it very hard for the SGLang team to verify.

I suggest implementing the FP4 CUDA kernel without NV-intrinsics acceleration first.

(Collaborator)

@yiakwy-xpu-ml-framework-team I believe this feature is specific to Blackwell and is being added accordingly. Creating a generic kernel is beyond the scope of this PR, but it would be welcome if someone sees inherent value in doing so.

@kushanam (Collaborator)

@pavanimajety @kaixih to review.

Review thread on sgl-kernel/src/sgl-kernel/csrc/quantization/fp4/nvfp4_scaled_mm_kernels.cu (outdated)
@pavanimajety (Collaborator) left a comment

Reviewed the integrations and tests, LGTM.

Review thread on sgl-kernel/src/sgl-kernel/csrc/quantization/fp4/nvfp4_scaled_mm_kernels.cu (outdated)
@trevor-m force-pushed the fp4-upstream branch 5 times, most recently from aad28b3 to bfa56b0 on March 18, 2025 at 01:07
@zhyncs (Collaborator) commented Mar 22, 2025

Hi @kushanam @elfiegg, could you please review and verify this PR? Thanks!

Fix NaN issue by using getCurrentCUDAStream(); apply rounding patch from trtllm (not needed for the NaN fix) (see the sketch after these notes)

Add fp4 unit tests

Fix error spacing
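
On the getCurrentCUDAStream() fix: launching a kernel on the default stream while PyTorch has queued the producing ops on its current stream lets the kernel race ahead and read unfinished data, which can surface as NaNs. A minimal sketch of the correct pattern, using the standard ATen API (the kernel itself is illustrative, not this PR's code):

#include <ATen/cuda/CUDAContext.h>
#include <torch/extension.h>

// Illustrative elementwise kernel.
__global__ void copy_kernel(const float* in, float* out, int n) {
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = in[i];
}

// Launch on PyTorch's current stream so this kernel is ordered after the
// ops that produced `in`; launching on the default stream instead can race
// with them and read garbage.
void launch_copy(const torch::Tensor& in, torch::Tensor& out) {
  const cudaStream_t stream = at::cuda::getCurrentCUDAStream();
  const int n = static_cast<int>(in.numel());
  const int threads = 256;
  const int blocks = (n + threads - 1) / threads;
  copy_kernel<<<blocks, threads, 0, stream>>>(
      in.data_ptr<float>(), out.data_ptr<float>(), n);
}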
