Conversation
No actionable comments were generated in the recent review. 🎉

📝 Walkthrough: The project version in … was bumped to 0.6.9.

Estimated code review effort: 🎯 1 (Trivial) | ⏱️ ~2 minutes

🚥 Pre-merge checks: ✅ 5 passed
Code Review: Bump version to 0.6.9

Overview
Single-line change to …

Versioning Convention
Per … The …

Observations
…

Summary
Change is correct, minimal, and follows project conventions. Main pre-merge checklist: all v0.6.9 gated PRs merged and tag …
to bot-run after #3158 then merge
CI is grinding on many irrelevant problems, little gain in waiting... (release-v0.6.9 branch)
Description
Bump version to 0.6.9 for release.
Related Issues (Gated-by PRs)
https://github.com/flashinfer-ai/flashinfer/issues?q=is%3Aopen+label%3Av0.6.9
Reviewer Notes
API changes review
API changes since v0.6.8.post1
```diff
$ git diff v0.6.8.post1..main -- "*.py" | grep -B5 -A20 "@flashinfer_api"
...
         output = moe.run(x, x_sf, topk_ids, topk_weights, w1, w1_sf, ...)
     """

-    @supported_compute_capability([100, 103])
+    @supported_compute_capability([100, 103, 120, 121])
     @flashinfer_api
     def __init__(
         self,
@@ -388,7 +436,19 @@ class CuteDslMoEWrapper:
         self.device = device
         self.enable_pdl = enable_pdl

-        # Pre-allocated buffers
+        # Detect SM120 for architecture-specific dispatch
+        major, minor = torch.cuda.get_device_capability(device)
+        self._is_sm120 = major == 12
+        if self._is_sm120:
+            from ...jit.cpp_ext import get_cuda_version
+
+            if get_cuda_version().major < 13:
+                raise ValueError(
+                    "SM120 CuTe DSL fused MoE requires CUDA 13 or later. "
+                    f"Current CUDA version: {get_cuda_version()}."
+                )
+
+        # Pre-allocated buffers (SM100 path)
--
 )

-@supported_compute_capability([100, 103])
+@supported_compute_capability([100, 103, 120, 121])
 @flashinfer_api
 def cute_dsl_fused_moe_nvfp4(
     x: torch.Tensor,
@@ -712,7 +869,7 @@ def cute_dsl_fused_moe_nvfp4(
 ) -> torch.Tensor:
     """Run fused MoE computation using CuteDSL NVFP4 kernels.

-    Supported architectures: SM100, SM103.
+    Supported architectures: SM100, SM103, SM120, SM121.

     This is the simple functional API. For CUDA graph support, use
     `CuteDslMoEWrapper` instead.
@@ -723,8 +880,12 @@ def cute_dsl_fused_moe_nvfp4(
     ...     output = cute_dsl_fused_moe_nvfp4(...)

     Args:
-        x: Input tensor, NVFP4 quantized [num_tokens, hidden_size // 2].
-        x_sf: Scale factors for x.
+        x: Input tensor. On SM100/SM103: NVFP4 quantized
+            [num_tokens, hidden_size // 2]. On SM120/SM121: bf16
+            activations [num_tokens, hidden_size] (kernel fuses
```
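The diff shows `@supported_compute_capability` only at its call sites; FlashInfer's actual implementation is not part of this diff. As a point of reference for what the gate enforces, here is a minimal illustrative sketch, assuming capabilities are encoded as `major * 10 + minor` (so `103` = SM10.3, `121` = SM12.1), which is how the lists above read:

```python
import functools

import torch


def supported_compute_capability(archs):
    """Illustrative gate: reject devices whose SM version is not in `archs`.

    `archs` holds compute capabilities encoded as major * 10 + minor,
    matching the [100, 103, 120, 121] lists in the diff above. This is a
    sketch of the idea, not FlashInfer's implementation.
    """

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            major, minor = torch.cuda.get_device_capability()
            sm = major * 10 + minor
            if sm not in archs:
                raise ValueError(
                    f"{fn.__name__} supports SM {sorted(archs)}, got SM{sm}"
                )
            return fn(*args, **kwargs)

        return wrapper

    return decorator
```

Checking at call time rather than import time is what lets the same wheel be installed on mixed-architecture fleets and fail only on the unsupported device.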
**Summary of API changes:**

- `CuteDslMoEWrapper.__init__` / `cute_dsl_fused_moe_nvfp4`: `@supported_compute_capability` widened from `[100, 103]` to `[100, 103, 120, 121]` (SM120 Blackwell support). **No signature change** — backward-compatible (see the dispatch sketch below).
- `gated_delta_rule_decode_pretranspose`: New optional parameter `output_state_indices: Optional[torch.Tensor] = None`. **Backward-compatible** (new param with default).
- Internal: tactic pre-filtering in `core.py` for SM89→SM120 occupancy. No API surface change.
- **No breaking changes detected.**

Summary by CodeRabbit

- **Chores**
  - Version update to 0.6.9 (patch release)
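Because the input contract for `x` now differs by architecture, a caller supporting both generations needs a branch like the one below. This is a hedged sketch, not the definitive calling convention: the docstring diff above is cut off mid-sentence, so passing `None` for the scale factors on SM120/SM121 is an assumption, the import path is assumed, and `quantize_nvfp4` is a hypothetical placeholder for whatever quantization path the caller already uses, not a FlashInfer API.

```python
import torch

from flashinfer import cute_dsl_fused_moe_nvfp4  # import path assumed


def moe_forward(x_bf16, topk_ids, topk_weights, w1, w1_sf, **kwargs):
    """Dispatch on SM version per the v0.6.9 docstring: NVFP4-quantized
    input on SM100/SM103, raw bf16 input on SM120/SM121 (the SM120 kernel
    fuses quantization, per the truncated docstring above)."""
    major, minor = torch.cuda.get_device_capability(x_bf16.device)
    sm = major * 10 + minor
    if sm in (120, 121):
        # [num_tokens, hidden_size] bf16 activations go in directly;
        # whether the scale-factor argument may be None is an assumption,
        # since the diff is cut off before that part of the docstring.
        return cute_dsl_fused_moe_nvfp4(
            x_bf16, None, topk_ids, topk_weights, w1, w1_sf, **kwargs
        )
    # SM100/SM103: pre-quantize to NVFP4, [num_tokens, hidden_size // 2].
    x_q, x_sf = quantize_nvfp4(x_bf16)  # hypothetical helper, not FlashInfer API
    return cute_dsl_fused_moe_nvfp4(
        x_q, x_sf, topk_ids, topk_weights, w1, w1_sf, **kwargs
    )
```

The positional order of `topk_ids` / `topk_weights` mirrors the `moe.run(...)` example visible in the diff; treat it as a sketch of the dispatch pattern rather than the exact signature.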