[Bug] NVFP4 MoE kernels fail on RTX Blackwell (SM12.0) - device capability family check missing SM120 #33416

@renehonig

Bug Description

vLLM v0.15.0 fails to run NVFP4 quantized MoE models on RTX Blackwell GPUs (compute capability 12.0, e.g., RTX PRO 6000 Blackwell Workstation Edition). The NVFP4 MoE backend selection code only checks for SM9.0 (Hopper) and SM10.x family (data center Blackwell B100/B200), but not SM12.0 (RTX Blackwell).

Error Message

ValueError: NvFp4 MoE backend 'FLASHINFER_CUTLASS' does not support the 
deployment configuration since kernel does not support current device.

Root Cause

The device capability checks in the NVFP4 MoE backend selection code use is_device_capability_family(100) which only matches SM10.x:

# Current code - only checks family 100 (SM10.x)
current_platform.is_device_capability_family(100)

RTX Blackwell GPUs have compute capability 12.0 (SM120), for which the two capability checks return:

  • is_device_capability_family(100) → False (120 // 10 = 12 ≠ 100 // 10 = 10)
  • has_device_capability(100) → True (120 >= 100)
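
The arithmetic behind those two results can be reproduced in isolation. The following is a minimal sketch that assumes the family check compares integer-divided capability values, as the `120 // 10` note above suggests. `is_family` and `has_capability` are hypothetical stand-ins written for illustration, not vLLM's actual implementations.

```python
def is_family(device_capability: int, family: int) -> bool:
    """Hypothetical stand-in: True when both values share the same major family."""
    return device_capability // 10 == family // 10


def has_capability(device_capability: int, minimum: int) -> bool:
    """Hypothetical stand-in: True when the device meets a minimum capability."""
    return device_capability >= minimum


SM120 = 120  # RTX Blackwell, compute capability 12.0

# The family check rejects SM120 because 12 != 10 ...
family_match = is_family(SM120, 100)
# ... while the minimum-capability check accepts it because 120 >= 100.
minimum_match = has_capability(SM120, 100)

print(f"is_family(120, 100) = {family_match}")
print(f"has_capability(120, 100) = {minimum_match}")
```

Under these assumptions, any code path gated only on the family check skips SM12.0 even though the device exceeds the minimum capability.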

RTX Blackwell (SM12.0) shares the same native FP4/FP8 tensor core capabilities as data center Blackwell (SM10.0), so these kernels should work on both families.

Affected Files

  • vllm/model_executor/layers/fused_moe/flashinfer_cutlass_moe.py - _supports_current_device(), _supports_quant_scheme()
  • vllm/model_executor/layers/fused_moe/flashinfer_cutedsl_moe.py - _supports_current_device()
  • vllm/model_executor/layers/fused_moe/flashinfer_trtllm_moe.py - _supports_current_device()
  • vllm/model_executor/layers/quantization/utils/flashinfer_fp4_moe.py - _supports_current_device(), is_flashinfer_fp4_cutedsl_moe_available()

Proposed Fix

Add is_device_capability_family(120) checks alongside the existing family(100) checks:

# Fix - check both SM10.x and SM12.x families
current_platform.is_device_capability_family(100) or current_platform.is_device_capability_family(120)
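
Because the same condition appears in several files, one possible shape for the fix is a shared helper that lists the accepted families once. This is only a sketch: `NVFP4_MOE_FAMILIES`, `supports_nvfp4_moe`, and `FakePlatform` are hypothetical names introduced here, and `FakePlatform` merely imitates the family check described above so the example runs without a GPU. It covers only the family(100) checks discussed in this issue, not the separate SM9.0 path.

```python
from dataclasses import dataclass

# Families the proposed fix would accept: SM10.x and SM12.x.
NVFP4_MOE_FAMILIES = (100, 120)


@dataclass(frozen=True)
class FakePlatform:
    """Hypothetical stand-in for current_platform, for illustration only."""

    capability: int  # e.g. 120 for SM12.0

    def is_device_capability_family(self, family: int) -> bool:
        # Assumed semantics, following the 120 // 10 arithmetic in this issue.
        return self.capability // 10 == family // 10


def supports_nvfp4_moe(platform) -> bool:
    """Return True if the platform belongs to any accepted capability family."""
    return any(
        platform.is_device_capability_family(family)
        for family in NVFP4_MOE_FAMILIES
    )


for cap in (90, 100, 120):
    print(cap, supports_nvfp4_moe(FakePlatform(cap)))
```

Keeping the accepted families in one tuple would mean a future family needs a single edit instead of one per backend file.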

Environment

  • GPU: NVIDIA RTX PRO 6000 Blackwell Workstation Edition (SM12.0)
  • vLLM Version: v0.15.0
  • Model: MiniMax-M2.1-NVFP4
  • CUDA: 13.0.2
  • Driver: 580.126.09

Steps to Reproduce

  1. Run vLLM v0.15.0 on an RTX Blackwell GPU (SM12.0)
  2. Load any NVFP4 quantized MoE model (e.g., MiniMax-M2.1-NVFP4)
  3. Observe the error during model initialization

Additional Context

This worked in v0.14.0 because the stricter device capability family checks were introduced in v0.15.0 (commit 42135d6).
