[Kernel] Add NVFP4 MoE CUTLASS support for SM120 #29242
vllm-bot merged 13 commits into vllm-project:main
Conversation
Signed-off-by: mgoin <mgoin64@gmail.com>
Code Review
This pull request adds support for NVFP4 MoE kernels on SM120 architecture using CUTLASS. The changes include adding the new kernel file to the build system, implementing the SM120-specific kernel, and creating a dispatcher to select the correct kernel based on the SM version.
The implementation for the SM120 kernel introduces significant code duplication with the existing SM100 kernel. I've left a comment suggesting a refactoring to improve maintainability by using templates to abstract away the architecture-specific details, similar to patterns seen elsewhere in the codebase. Other changes look good.
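A hedged host-side sketch of what such an SM-based dispatch could look like. The names here (`Arch`, `classify_sm`) are illustrative stand-ins, not vLLM's actual symbols:

```cpp
#include <cassert>

// Illustrative sketch only: map a device's compute capability to the kernel
// family a dispatcher would pick. Arch and classify_sm are hypothetical names.
enum class Arch { SM100, SM120, Unsupported };

inline Arch classify_sm(int sm_major, int sm_minor) {
  (void)sm_minor;                          // minor version unused in this sketch
  if (sm_major == 10) return Arch::SM100;  // data-center Blackwell (e.g. B200)
  if (sm_major == 12) return Arch::SM120;  // RTX 50xx / RTX PRO 6000 Blackwell
  return Arch::Unsupported;
}
```

Keeping the classification separate from the kernel bodies is what would let the architecture-specific details live behind a template parameter, as the review suggests.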
Debugging some things here locally with trace output enabled: I wonder if we're going to end up needing some of the fixes from CUTLASS 4.3.0, such as https://github.com/NVIDIA/cutlass/pull/2789/files#diff-17feae83c50669a768a86d30b7bd9c50feb773c10c38b85b253f15e7462283d0 or https://github.com/NVIDIA/cutlass/pull/2789/files#diff-b743e12813d0ca6abe40a6711013c612bb15f7affeba65c7d7d4392d7e3d406d
So, after some digging around, I got this seemingly working by adjusting
Wow, I couldn't figure out the gemm initialization failure for several hours yesterday. Even CUTLASS debug logs were giving me nothing. If that fixes the issue for me, thank you so much.
I tried explicit template instantiation and that did not work. I tried static inline template and non-templated variants as well. Here's the gist with what seems to work for me - https://gist.github.com/bbrowning/fb0aac4970a881eba26384d262cd41c9. That's entirely untested on anything but a DGX Spark and the model RedHatAI/Qwen3-30B-A3B-NVFP4. Perhaps a macro could reduce code duplication there.
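Picking up the macro idea from the comment above, a hedged host-only sketch of how a macro could stamp out thin non-templated wrappers around one templated implementation. The types and entry-point names here are stand-ins, not the real CUTLASS arch tags or vLLM functions:

```cpp
#include <string>

// Stand-ins for cutlass::arch::Sm100 / Sm120 tags (assumption for this sketch).
struct Sm100Tag { static constexpr const char* name = "sm100"; };
struct Sm120Tag { static constexpr const char* name = "sm120"; };

// One templated implementation; real code would instantiate the CUTLASS
// grouped GEMM with ArchTag here instead of returning a string.
template <typename ArchTag>
std::string run_fp4_moe() {  // hypothetical entry point
  return std::string("launched ") + ArchTag::name;
}

// The macro stamps non-templated wrappers, sidestepping the explicit
// template instantiation trouble reported in the thread.
#define DEFINE_FP4_MOE_ENTRY(suffix, tag) \
  std::string run_fp4_moe_##suffix() { return run_fp4_moe<tag>(); }

DEFINE_FP4_MOE_ENTRY(sm100, Sm100Tag)
DEFINE_FP4_MOE_ENTRY(sm120, Sm120Tag)
```

Whether this plays well with CUTLASS's actual kernel configuration types would still need checking on real hardware.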
I don't know what to really expect performance-wise, but this feels like it's taking advantage of the new NVFP4 kernels? The output token/s is higher than I expected... |
Documentation preview: https://vllm--29242.org.readthedocs.build/en/29242/
This pull request has merge conflicts that must be resolved before it can be merged.
I pulled the latest changes here onto my SM121 and was able to load RedHatAI/Qwen3-30B-A3B-NVFP4 and RedHatAI/Llama-4-Scout-17B-16E-Instruct-NVFP4 (both NVFP4 MoE models) successfully without having to make any other local changes. Thanks!
```cpp
fusion_args.alpha_ptr_array =
    reinterpret_cast<float**>(alpha_ptrs.data_ptr());
fusion_args.dAlpha = {_0{}, _0{}, 1};
fusion_args.beta = 0.0f;
```
Does the sm100 version need beta set to 0 also?
Doesn't seem to be so; the sm120 kernel is a bit unique, as seen from the bf16 issue
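For context on why `beta` matters at all: the standard GEMM epilogue computes D = alpha * acc + beta * C, so setting beta to zero drops the C-tensor term entirely. A scalar model of that (just the arithmetic, not the CUTLASS epilogue machinery):

```cpp
// Scalar model of the GEMM epilogue D = alpha * acc + beta * C.
// With beta == 0 the source tensor C never contributes, so its contents
// (even uninitialized memory) cannot affect the output on that path.
inline float epilogue(float alpha, float acc, float beta, float c) {
  return alpha * acc + beta * c;
}
```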
bnellnm left a comment
LGTM. I'm not a cutlass expert but it looks close enough to the sm100 version.
```cpp
"Failed to initialize GEMM: status=", (int)status,
" workspace_size=", workspace_size, " num_experts=", num_experts,
" M=", M, " N=", N, " K=", K);
```
nit: Would it be useful to add a similar detailed message to the SM100 kernel?
I want to improve this across the board for our cutlass kernels, so will do in a followup
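A hedged sketch of how such a detailed message could be assembled uniformly across kernels. `detail_msg` is a hypothetical helper written for this illustration, not an existing vLLM utility:

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: concatenate arbitrary key/value parts into one
// diagnostic string, mirroring the detailed message in the snippet above.
template <typename... Parts>
std::string detail_msg(Parts&&... parts) {
  std::ostringstream os;
  (os << ... << parts);  // C++17 fold expression over operator<<
  return os.str();
}
```

In the real code this would feed into the existing TORCH_CHECK-style assertion rather than returning a string.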
lgtm also!
Purpose
Expand `csrc/quantization/fp4/nvfp4_experts_quant.cu` and `csrc/quantization/fp4/nvfp4_blockwise_moe_kernel.cu` to build for SM120 (RTX 50xx and RTX PRO 6000 Blackwell) so we can support NVFP4 CUTLASS MoE on the platform. Thanks to @AichenF in sgl-project/sglang#11737 for the extension to the SM100 kernel.
Hoping to address #29030, #29141
Test Plan
Eval on RTX 5090
Test Result
Main:
Fails to run because various kernels are not built for SM120 even though the CUTLASS path is taken
PR:
Test compressed-tensors checkpoint https://huggingface.co/RedHatAI/Qwen3-30B-A3B-NVFP4
Test modelopt checkpoint https://huggingface.co/nvidia/Qwen3-30B-A3B-NVFP4
Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model.