Support FP4 gemm (1/2)#3899
Conversation
@trevor-m Great effort! I am also working on the FP4 type. My suggestion is to keep FP4 represented as uint4_t so that both platforms (AMD and NVIDIA) can benefit from this work. Under that common definition, we can then implement an interface of CUDA device functions cast_fp8_to_fp4 and cast_fp4_to_fp8 (and vice versa). The kernel should consist of basic bit-wise operations, with platform intrinsics used only for acceleration. The kernel added here contains too much PTX to be reusable: that PTX should be treated as an NVIDIA-intrinsic equivalent of the bit-wise implementation, and it only works on the B200 platform. My point is that we need to support FP4 on all platforms, not just B200. Could we co-design the kernel together? I can help implement the bit-wise CUDA kernels.
This only works on B200, which will be very hard for the SGLang team to verify.
I suggest implementing the FP4 CUDA kernel without NV-intrinsic acceleration first.
@yiakwy-xpu-ml-framework-team I believe this feature is specific to Blackwell and is being added accordingly. Creating a generic kernel is beyond the scope of this PR, but it would be welcome if someone sees inherent value in doing so.
@pavanimajety @kaixih to review.
pavanimajety
left a comment
Reviewed the integrations and tests, LGTM.
- Fix NaN issue by using getCurrentCUDAStream()
- Apply rounding patch from trtllm (not needed for fixing the NaN issue)
- Add FP4 unit tests
- Fix error spacing
Motivation
This PR adds support for modelopt FP4 quantized models.
Tested using an FP4-quantized Llama 3.1 model.
This work was adapted from the following - thanks @pavanimajety @kaixih @kushanam!
vllm-project/vllm#12784
vllm-project/vllm#13571
vllm-project/vllm#12520
Modifications
Adds two operations to sgl-kernel:
- Adds the modelopt_fp4 quantization method
- Adds ModelOptFp4Config and ModelOptFp4LinearMethod to utilize the new FP4 kernels for linear layers
Checklist