
Commit fc80fbd

Fix CUDA graph capture dtype mismatch (#17)
* Fix dtype mismatch in rotary embedding with FP8 KV cache

  When using FP8 KV cache quantization (e.g., with ModelOpt FP8 models), the query and key tensors may have different dtypes during CUDA graph capture. The query tensor remains in bfloat16 for computation, while the key tensor might need to be in FP8 format for KV cache storage.

  The issue was in DeepseekScalingRotaryEmbedding.forward_native(), which captured only the query's dtype and then converted both query and key to that same dtype. This caused a dtype mismatch error during CUDA graph capture: "query and key must have the same dtype".

  The fix preserves the original dtypes of the query and key tensors separately, ensuring they maintain their intended dtypes after the rotary position embedding computation. This resolves the CUDA graph capture failure with Qwen3MoE and other models using FP8 KV cache quantization.

* Fix FA4 dtype mismatch with FP8 KV cache

  When using FlashAttention 4 (FA4) with FP8 KV cache quantization, there was a dtype mismatch between the query tensor (bfloat16) and the cached key/value tensors (FP8). FA4 requires all input tensors (q, k, v) to have the same dtype.

  The previous code only converted the query to FP8 when NOT using FA4 (fa_impl_ver != 4). This was based on the assumption that FA4 does not support FP8, but FA4 can work with FP8 tensors as long as all tensors have matching dtypes. The key difference is that FA4 does not support the descale parameters used for on-the-fly dequantization (unlike FA3). So we:

  1. Convert the query to FP8 to match the KV cache dtype for both FA3 and FA4.
  2. Only set k_descale/v_descale for FA3 (FA4 does not support them).

  This resolves the "query and key must have the same dtype" error when using FP8 KV cache with FA4.

---------

Co-authored-by: Cursor Agent <cursoragent@cursor.com>
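
For context, a minimal sketch of the dtype situation the message above describes (illustrative shapes, not sglang code; requires a PyTorch build with float8_e4m3fn support):

import torch

# The query stays in the compute dtype; the KV cache stores FP8 values.
q = torch.randn(1, 8, 128, dtype=torch.bfloat16)
k = torch.randn(1, 8, 128, dtype=torch.bfloat16).to(torch.float8_e4m3fn)

# Bug pattern: capturing only q.dtype and casting both rotary outputs to it
# silently turns the FP8 key back into bfloat16.
dtype = q.dtype
assert k.to(dtype).dtype != k.dtype

# Fix pattern: remember each tensor's own dtype and restore it separately.
q_dtype, k_dtype = q.dtype, k.dtype
q_out, k_out = q.float().to(q_dtype), k.float().to(k_dtype)
assert (q_out.dtype, k_out.dtype) == (torch.bfloat16, torch.float8_e4m3fn)
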
1 parent df08f34 commit fc80fbd

2 files changed

Lines changed: 11 additions & 8 deletions


python/sglang/srt/layers/attention/flashattention_backend.py

Lines changed: 8 additions & 6 deletions
@@ -693,16 +693,18 @@ def forward_extend(
         # only use kv scaling if: 1) fp8 kv is explicitly enabled, 2) RadixAttention
         # has corresponding quantization method so that layer.k_scale is not None,
         # 3) layer.head_dim <= 256 since fa3 kernel require fp16 and bf16 data type in this case,
-        # 4) fa_impl_ver != 4 since fa4 does not currently support fp8 queries and keys.
+        # 4) fa_impl_ver != 4 since fa4 does not support descale parameters (but FA4 can work with FP8 if all tensors have matching dtypes).
         if (
             self.kv_cache_dtype_str != "auto"
             and layer.head_dim <= 256
-            and self.fa_impl_ver != 4
         ):
-            if layer.k_scale is not None:
-                descale_shape = (forward_batch.batch_size, layer.tp_k_head_num)
-                k_descale = layer.k_scale.expand(descale_shape)
-                v_descale = layer.v_scale.expand(descale_shape)
+            if self.fa_impl_ver != 4:
+                # For FA3, use descale parameters for on-the-fly dequantization
+                if layer.k_scale is not None:
+                    descale_shape = (forward_batch.batch_size, layer.tp_k_head_num)
+                    k_descale = layer.k_scale.expand(descale_shape)
+                    v_descale = layer.v_scale.expand(descale_shape)
+            # Convert query to FP8 to match KV cache dtype (required for FA4, optional for FA3)
             q = q.to(self.kv_cache_dtype)
             q_rope = q_rope.to(self.kv_cache_dtype) if q_rope is not None else None
             k_rope = k_rope.to(self.kv_cache_dtype) if k_rope is not None else None
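
Read as one piece, the new control flow in the hunk above amounts to the following (a minimal sketch, not the sglang code itself; the helper name and flat argument list are hypothetical):

import torch

def prepare_fp8_attn_inputs(q, k_scale, v_scale, tp_k_head_num,
                            batch_size, kv_cache_dtype, fa_impl_ver):
    # FA3 only: keep descale factors so the kernel can dequantize the
    # FP8 KV cache on the fly; FA4 does not accept descale parameters.
    k_descale = v_descale = None
    if fa_impl_ver != 4:
        if k_scale is not None:
            descale_shape = (batch_size, tp_k_head_num)
            k_descale = k_scale.expand(descale_shape)
            v_descale = v_scale.expand(descale_shape)
    # For both FA3 and FA4, cast the query to the KV cache dtype so that
    # q, k, v dtypes match (FA4 requires this; FA3 tolerates it).
    q = q.to(kv_cache_dtype)
    return q, k_descale, v_descale

With fa_impl_ver=4 the descale outputs stay None and only the query cast happens, which is what removes the "query and key must have the same dtype" error under FA4.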

python/sglang/srt/layers/rotary_embedding.py

Lines changed: 3 additions & 2 deletions
@@ -816,7 +816,8 @@ def forward_native(
         offsets: Optional[torch.Tensor] = None,
     ) -> Tuple[torch.Tensor, torch.Tensor]:
         """PyTorch-native implementation equivalent to forward()."""
-        dtype = query.dtype
+        query_dtype = query.dtype
+        key_dtype = key.dtype
         query_rot = query[..., : self.rotary_dim]
         key_rot = key[..., : self.rotary_dim]
         if self.rotary_dim < self.head_size:
@@ -847,7 +848,7 @@ def forward_native(
         else:
             query = query_rot
             key = key_rot
-        return query.to(dtype), key.to(dtype)
+        return query.to(query_dtype), key.to(key_dtype)
 
     def forward_npu(
         self,
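
A standalone sketch of the pattern this diff adopts (a plain GPT-NeoX-style rotation, not the DeepSeek scaling variant, and the function name is illustrative): the math runs in float32, and each output is cast back to that tensor's own original dtype.

import torch

def apply_rope_preserving_dtypes(q, k, cos, sin):
    # Remember each tensor's own dtype rather than only the query's.
    q_dtype, k_dtype = q.dtype, k.dtype

    def rotate_half(x):
        x1, x2 = x.chunk(2, dim=-1)
        return torch.cat((-x2, x1), dim=-1)

    # Do the rotation in float32 for accuracy.
    qf, kf = q.float(), k.float()
    q_rot = qf * cos + rotate_half(qf) * sin
    k_rot = kf * cos + rotate_half(kf) * sin

    # Restore the original dtypes separately, e.g. bf16 query + FP8 key.
    return q_rot.to(q_dtype), k_rot.to(k_dtype)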
