Conversation
📝 Walkthrough

The project version was incremented from 0.6.7 to 0.6.8 in the version.txt file. This is a standard version bump with no functional code changes.
Estimated code review effort: 🎯 1 (Trivial) | ⏱️ ~1 minute
🚥 Pre-merge checks: ✅ 3 passed.
**PR Review: Version Bump to 0.6.8**

The change itself is correct: a single-line edit to `version.txt` (0.6.7 to 0.6.8). This is the right place for the version string (it is read by `build_backend.py`), and the patch increment is consistent with the project's versioning scheme.

**Open issues labeled v0.6.8.** Two bugs remain open at the time of review, and both are labeled for this milestone:

- **Issue 3029**: `test_trtllm_gen_attention.py` fails with an `AssertionError` on GB200 (a head_dim=256 precision regression, suspected to have been introduced by PR 2988). Risk: medium, since this is a correctness regression in a shipped attention path. PR 2988 already skipped xqa + head_dim=256 for a precision issue, which suggests the underlying problem is broader than that skip. Recommendation: fix the precision issue, or explicitly skip trtllm-gen + head_dim=256 with a tracking comment before releasing (a sketch of such a guard follows below).
- **Issue 3030**: `test_prefill_delta_rule` OOM-kills nvcc on H100 during JIT compilation (suspected to have been introduced by PR 2908, which doubled the kernel variants from 32 to 64 and created excessive parallel nvcc memory pressure; exit code 137). Risk: low to medium, affecting CI reliability and user JIT builds on memory-constrained machines. Recommendation: the suggested fix (capping `MAX_JOBS` in CI) is low-risk and easy to land before the release tag (a sketch follows below).

**Minor notes.** No changelog entry is updated, but past version-bump PRs suggest that is intentional for this repo. Auto-merge is disabled; it is worth verifying CI is fully green before enabling it.

Overall the mechanical change is clean. The main question is whether issue 3029 (correctness) and issue 3030 (OOM / CI reliability) are considered blocking for the 0.6.8 release.
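For issue 3029, a minimal sketch of the "skip with tracking comment" option. The parametrization and test name are illustrative placeholders, not copied from the real `test_trtllm_gen_attention.py`:

```python
import pytest

# Hypothetical guard for the failing configuration tracked in issue 3029.
@pytest.mark.parametrize("head_dim", [64, 128, 256])
def test_trtllm_gen_attention(head_dim):
    if head_dim == 256:
        # TODO(issue 3029): trtllm-gen head_dim=256 precision regression on GB200.
        pytest.skip("trtllm-gen head_dim=256 precision regression, tracked in issue 3029")
    ...
```

For issue 3030, a minimal sketch of capping JIT build parallelism. This assumes the JIT compilation path honors a `MAX_JOBS` environment variable read at build time, as the recommendation implies; the value of 4 and the idea of setting it process-locally are illustrative, not a measured or repo-confirmed configuration:

```python
import os

# Cap the number of parallel nvcc invocations before any JIT compilation is
# triggered, so memory-constrained runners don't OOM-kill nvcc (exit code 137).
# "4" is an illustrative guess; tune to the runner's available memory.
os.environ.setdefault("MAX_JOBS", "4")

import flashinfer  # JIT builds triggered after this point see the capped value
```

In CI this would more naturally be set as an environment variable on the job itself; the snippet only shows the same cap applied from Python for local reproduction.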
/bot run
H100: `=========== 1 failed, 1462 passed, 580 skipped in 363.49s (0:06:03) ============`
The one failure is already tracked in issue #3030.
Description
Bump version to 0.6.8 for release.
Related Issues (Gated-by PRs)
https://github.com/flashinfer-ai/flashinfer/issues?q=is%3Aopen+label%3Av0.6.8
Reviewer Notes
API changes review
API changes since v0.6.7.post3
prefill.py
The `BatchPrefillWithPagedKVCacheWrapper.run()` and `trtllm_batch_context_with_kv_cache()` overload stubs fall outside the grep window above:

```
$ git diff v0.6.7.post3..main -- 'flashinfer/prefill.py' | grep -B5 -A10 'kv_block_scales|kv_cache_sf'
        if backend == "cudnn":
@@ -2098,9 +2104,7 @@ class BatchPrefillWithPagedKVCacheWrapper:
        enable_pdl: Optional[bool] = None,
        window_left: Optional[int] = None,
        sinks: Optional[torch.Tensor] = None,
-        kv_block_scales: Optional[
-            Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]
-        ] = None,
+        kv_cache_sf: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
        skip_softmax_threshold_scale_factor: Optional[float] = None,
    ) -> torch.Tensor: ...
@@ -2118,9 +2122,7 @@ class BatchPrefillWithPagedKVCacheWrapper:
        enable_pdl: Optional[bool] = None,
        window_left: Optional[int] = None,
        sinks: Optional[torch.Tensor] = None,
-        kv_block_scales: Optional[
-            Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]
-        ] = None,
+        kv_cache_sf: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
        skip_softmax_threshold_scale_factor: Optional[float] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]: ...
@@ -2139,9 +2141,7 @@ class BatchPrefillWithPagedKVCacheWrapper:
        enable_pdl: Optional[bool] = None,
        window_left: Optional[int] = None,
        sinks: Optional[torch.Tensor] = None,
-        kv_block_scales: Optional[
-            Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]
-        ] = None,
+        kv_cache_sf: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
        skip_softmax_threshold_scale_factor: Optional[float] = None,
    ) -> Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]:
        r"""Compute batch prefill/append attention between query and paged kv-cache.
@@ -2181,6 +2181,21 @@ class BatchPrefillWithPagedKVCacheWrapper:
        enable_pdl : bool
            Whether to enable Programmatic Dependent Launch (PDL). See https://docs.nvidia.com/cuda/cuda-c-programming-guide/#programmatic-dependent-launch-and-synchronization
            Only supported for >= sm90, and currently only for FA2 and CUDA core decode.
+        kv_cache_sf : Optional[Tuple[torch.Tensor, torch.Tensor]]
+            Per-block scale factors for NVFP4 KV cache, as a tuple of ``(k_scales, v_scales)``.
+            Scale tensors must follow the same :attr:`kv_layout` as the KV cache:
+
+            * **HND**: ``[num_pages, num_kv_heads, page_size, head_dim // 16]``
+            * **NHD**: ``[num_pages, page_size, num_kv_heads, head_dim // 16]``
+
+            Both tensors have dtype ``torch.float8_e4m3fn``. ``k_scales`` uses a linear
+            (row-major) layout, while ``v_scales`` must use TRT-LLM's 4-token interleaved
+            layout within each ``[page_size, head_dim // 16]`` tile. Use
+            :func:`flashinfer.fp4_quantization.nvfp4_quantize_paged_kv_cache` to produce
--
        Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]
@@ -2212,14 +2227,22 @@ class BatchPrefillWithPagedKVCacheWrapper:
                    f"where total_tokens = qo_indptr[-1]."
                )
-        # Unpack kv_block_scales
+        if (
+            k_cache.dtype == torch.uint8 or v_cache.dtype == torch.uint8
+        ) and kv_cache_sf is None:
+            raise ValueError("kv_cache_sf must be provided for NVFP4 KV cache.")
        key_block_scales = None
        value_block_scales = None
-        if kv_block_scales is not None:
-            if isinstance(kv_block_scales, tuple):
-                key_block_scales, value_block_scales = kv_block_scales
-            else:
-                key_block_scales, value_block_scales = kv_block_scales.unbind(dim=1)
+        if kv_cache_sf is not None:
+            if (
+                not isinstance(kv_cache_sf, (tuple, list))
+                or len(kv_cache_sf) != 2
+                or not all(torch.is_tensor(x) for x in kv_cache_sf)
+            ):
+                raise TypeError(
+                    "kv_cache_sf must be a tuple/list of two tensors: (k_scales, v_scales)."
+                )
+            key_block_scales, value_block_scales = kv_cache_sf
        o_dtype = self._cached_o_data_type
        if out is not None and out.dtype != o_dtype:
@@ -2265,7 +2288,7 @@ class BatchPrefillWithPagedKVCacheWrapper:
        # For NVFP4 KV (uint8 packed), v_cache last dim is head_dim//2;
        # use q's head_dim for output instead
-        out_head_dim = q.shape[-1] if kv_block_scales is not None else v_cache.shape[-1]
+        out_head_dim = q.shape[-1] if kv_cache_sf is not None else v_cache.shape[-1]
        if out is None:
            # Use cached output data type if available (for FP8 attention with FP16 output)
            out_dtype = getattr(self, "_cached_o_data_type", None) or q.dtype
@@ -2355,7 +2378,19 @@ class BatchPrefillWithPagedKVCacheWrapper:
            enable_pdl,
        ]
        if self._jit_module is not None:
-            run_args.extend(list(args))
+            run_args.extend(
+                prepare_jit_additional_args(
--
    attention_sinks: Optional[torch.Tensor] = None,
@@ -3731,9 +3781,7 @@ def trtllm_batch_context_with_kv_cache(
    kv_layout: str = "HND",
    enable_pdl: Optional[bool] = None,
    sinks: Optional[List[torch.Tensor]] = None,
-    kv_block_scales: Optional[
-        Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]
-    ] = None,
+    kv_cache_sf: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
    skip_softmax_threshold_scale_factor: Optional[float] = None,
    uses_shared_paged_kv_idx: bool = True,
 ) -> Union[torch.Tensor, FP4Tensor]:
@@ -3800,11 +3848,21 @@ def trtllm_batch_context_with_kv_cache(
        data copy overhead. Use ``HND`` for better performance.
    sinks : Optional[List[torch.Tensor]] = None
        additional value per head in the denominator of the softmax.
-    kv_block_scales : Optional[Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]] = None
-        Per-block scale factors for NVFP4 KV cache. Either a tuple of (k_scales, v_scales) or
-        a single tensor with shape ``[num_pages, 2, ...]`` that will be unbound along dim=1.
-        Each scale tensor has shape ``[num_pages, num_kv_heads, page_size, head_dim // 16]``
-        in HND layout, with dtype ``torch.float8_e4m3fn``.
+    kv_cache_sf : Optional[Tuple[torch.Tensor, torch.Tensor]] = None
+        Per-block scale factors for NVFP4 KV cache, as a tuple of ``(k_scales, v_scales)``.
+        Scale tensors must follow the same :attr:`kv_layout` as the KV cache:
+
+        * **HND**: ``[num_pages, num_kv_heads, page_size, head_dim // 16]``
+        * **NHD**: ``[num_pages, page_size, num_kv_heads, head_dim // 16]``
+
+        Both tensors have dtype ``torch.float8_e4m3fn``. ``k_scales`` uses a linear
+        (row-major) layout, while ``v_scales`` must use TRT-LLM's 4-token interleaved
+        layout within each ``[page_size, head_dim // 16]`` tile. Use
+        :func:`flashinfer.fp4_quantization.nvfp4_quantize_paged_kv_cache` to produce
--
@@ -3845,20 +3903,22 @@ def trtllm_batch_context_with_kv_cache(
    # it doesn't change underlying storage
    k_cache, v_cache = kv_cache.unbind(dim=1)
-    # Unpack kv_block_scales
+    if (
+        k_cache.dtype == torch.uint8 or v_cache.dtype == torch.uint8
+    ) and kv_cache_sf is None:
+        raise ValueError("kv_cache_sf must be provided for NVFP4 KV cache.")
    key_block_scales = None
    value_block_scales = None
-    if kv_block_scales is not None:
-        if isinstance(kv_block_scales, tuple):
-            key_block_scales, value_block_scales = kv_block_scales
-        else:
-            if kv_block_scales.shape[1] == 1:
-                key_block_scales, value_block_scales = kv_block_scales, kv_block_scales
-            else:
-                assert kv_block_scales.shape[1] == 2, (
-                    "When kv_block_scales is a single tensor, the second dimension must be 1 or 2"
-                )
-                key_block_scales, value_block_scales = kv_block_scales.unbind(dim=1)
+    if kv_cache_sf is not None:
+        if (
+            not isinstance(kv_cache_sf, (tuple, list))
+            or len(kv_cache_sf) != 2
+            or not all(torch.is_tensor(x) for x in kv_cache_sf)
+        ):
+            raise TypeError(
+                "kv_cache_sf must be a tuple/list of two tensors: (k_scales, v_scales)."
+            )
+        key_block_scales, value_block_scales = kv_cache_sf
    # Convert NHD layout to HND if necessary
    if kv_layout == "NHD":
```
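The practical takeaway from the diff above for callers: the `kv_block_scales` keyword is renamed to `kv_cache_sf`, only the explicit `(k_scales, v_scales)` tuple form is still accepted (packed single tensors are no longer unbound automatically), and the argument is now required whenever the KV cache is NVFP4 (uint8-packed). A call-site sketch under those assumptions; the surrounding arguments are elided and the tensor names are illustrative placeholders, not the full real signatures:

```python
# Scale factors as described in the docstring above: dtype float8_e4m3fn,
# shape [num_pages, num_kv_heads, page_size, head_dim // 16] in HND layout,
# with v_scales additionally in TRT-LLM's 4-token interleaved layout.
k_scales, v_scales = quantized_kv_scales  # illustrative placeholder pair

# Before (v0.6.7.post3): a tuple or a packed [num_pages, 2, ...] tensor was accepted.
#   out = wrapper.run(q, paged_kv_cache, kv_block_scales=(k_scales, v_scales))
#   out = wrapper.run(q, paged_kv_cache, kv_block_scales=packed_scales)

# After this change: only the (k_scales, v_scales) tuple is accepted, and omitting
# it for a uint8 (NVFP4) KV cache now raises a ValueError.
out = wrapper.run(q, paged_kv_cache, kv_cache_sf=(k_scales, v_scales))
```

The same rename applies to the `trtllm_batch_context_with_kv_cache()` function shown in the second half of the diff.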