Bug Description
vLLM v0.15.0 fails to run NVFP4-quantized MoE models on RTX Blackwell GPUs (compute capability 12.0, e.g., RTX PRO 6000 Blackwell Workstation Edition). The NVFP4 MoE backend selection code checks only for SM9.0 (Hopper) and the SM10.x family (data center Blackwell B100/B200), not for SM12.0 (RTX Blackwell).
Error Message
ValueError: NvFp4 MoE backend 'FLASHINFER_CUTLASS' does not support the
deployment configuration since kernel does not support current device.
Root Cause
The device capability checks in the NVFP4 MoE backend selection code use is_device_capability_family(100), which matches only SM10.x:
# Current code - only checks family 100 (SM10.x)
current_platform.is_device_capability_family(100)
RTX Blackwell GPUs have compute capability 12.0 (SM120), for which the two capability checks return:
is_device_capability_family(100) → False (120 // 10 = 12 ≠ 100 // 10 = 10)
has_device_capability(100) → True (120 >= 100)
RTX Blackwell (SM12.0) shares the same native FP4/FP8 tensor core capabilities as data center Blackwell (SM10.0), so these kernels should work on both families.
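The arithmetic above can be reproduced without a GPU. The following is a minimal standalone sketch, assuming the two checks behave as described in this issue (integer division by 10 for the family check, `>=` for the minimum check). The function names mirror the vLLM ones, but they are re-implemented here for illustration and take the device capability as an explicit argument; nothing is imported from vLLM.

```python
# Standalone model of the two capability checks; NOT the vLLM implementation.
# Capabilities are encoded as integers: major * 10 + minor (12.0 -> 120).


def is_device_capability_family(device_capability: int, family: int) -> bool:
    """True when the device shares the major-version family (e.g. 10.x)."""
    return device_capability // 10 == family // 10


def has_device_capability(device_capability: int, minimum: int) -> bool:
    """True when the device capability is at least `minimum`."""
    return device_capability >= minimum


# Illustrative device table (labels are for this sketch only).
DEVICES = {
    "Hopper SM9.0": 90,
    "Blackwell SM10.0": 100,
    "RTX Blackwell SM12.0": 120,
}


def check_table() -> dict:
    """Map each device label to (family(100) result, has_capability(100) result)."""
    return {
        name: (
            is_device_capability_family(cap, 100),
            has_device_capability(cap, 100),
        )
        for name, cap in DEVICES.items()
    }


for name, (in_family, at_least) in check_table().items():
    print(f"{name}: family(100)={in_family}, has_capability(100)={at_least}")
```

Under these assumptions, SM12.0 passes the minimum check but fails the family check, which is consistent with the backend rejecting the device.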
Affected Files
vllm/model_executor/layers/fused_moe/flashinfer_cutlass_moe.py - _supports_current_device(), _supports_quant_scheme()
vllm/model_executor/layers/fused_moe/flashinfer_cutedsl_moe.py - _supports_current_device()
vllm/model_executor/layers/fused_moe/flashinfer_trtllm_moe.py - _supports_current_device()
vllm/model_executor/layers/quantization/utils/flashinfer_fp4_moe.py - _supports_current_device(), is_flashinfer_fp4_cutedsl_moe_available()
Proposed Fix
Add is_device_capability_family(120) checks alongside the existing family(100) checks:
# Fix - check both SM10.x and SM12.x families
current_platform.is_device_capability_family(100) or current_platform.is_device_capability_family(120)
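One way to avoid repeating the `or` chain in each affected file would be a small shared helper. The sketch below is only an illustration of that idea: `FP4_MOE_FAMILIES` and `supports_nvfp4_moe_family` are hypothetical names that do not exist in vLLM, and the family check is re-implemented locally (assuming the integer-division behavior described above) so the snippet runs without a GPU.

```python
# Hypothetical helper sketch; names below are NOT part of vLLM.
# Capabilities are encoded as integers: major * 10 + minor (12.0 -> 120).

# Assumption: SM10.x and SM12.x are the Blackwell families to accept.
FP4_MOE_FAMILIES = (100, 120)


def is_device_capability_family(device_capability: int, family: int) -> bool:
    """Local stand-in for the family check: same major-version family."""
    return device_capability // 10 == family // 10


def supports_nvfp4_moe_family(device_capability: int) -> bool:
    """True when the device belongs to any accepted Blackwell family."""
    return any(
        is_device_capability_family(device_capability, family)
        for family in FP4_MOE_FAMILIES
    )


for cap in (90, 100, 120):
    print(cap, supports_nvfp4_moe_family(cap))
```

This helper covers only the Blackwell families; the existing SM9.0 (Hopper) check mentioned in the bug description would remain a separate condition.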
Environment
- GPU: NVIDIA RTX PRO 6000 Blackwell Workstation Edition (SM12.0)
- vLLM Version: v0.15.0
- Model: MiniMax-M2.1-NVFP4
- CUDA: 13.0.2
- Driver: 580.126.09
Steps to Reproduce
- Run vLLM v0.15.0 on an RTX Blackwell GPU (SM12.0)
- Load any NVFP4 quantized MoE model (e.g., MiniMax-M2.1-NVFP4)
- Observe the error during model initialization
Additional Context
This worked in v0.14.0 because the stricter device capability family checks were introduced in v0.15.0 (commit 42135d6).