[WIP] Refactor: simplify torch -> cute-dsl boilerplate and enable tvm-ffi for cute-dsl kernels #2279
Conversation
Note: CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

This change converts two RMSNorm FP4-quant kernels from pointer-based host/kernel bindings to TVM-FFI tensor-based invocation, updating host signatures, compilation scaffolding (fake/symbolic tensors and M), and runtime launch flows with swizzle-aware scale handling.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Host as Host Code
    participant TVM as TVM-FFI
    participant Compiler as Kernel Compiler
    participant CUDA as CUDA Runtime
    rect rgb(245,250,255)
        Note over Host,Compiler: Compile-time (fake/symbolic tensors)
        Host->>Host: Create symbolic M & fake cute.Tensor fixtures + fake stream
        Host->>Compiler: Invoke compiler with TVM-FFI enabled and fake tensors
        Compiler->>TVM: Register tensor signatures for TVM-FFI
        Compiler->>CUDA: Emit compiled kernel artifact
    end
    rect rgb(245,255,245)
        Note over Host,CUDA: Runtime (real tensor passing)
        Host->>Host: Prepare real tensors (mX, mW, mY, mS[, mR, mGlobalScale])<br/>flatten/contiguate scale if swizzled
        Host->>TVM: Pass real tensors + stream via TVM-FFI
        TVM->>CUDA: Launch kernel with tensor-backed inputs
        CUDA->>CUDA: Execute kernel on device memory
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Summary of Changes

Hello @yzh119, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request refactors the existing cute-dsl kernels, specifically for RMSNorm FP4 quantization, to leverage TVM-FFI. This integration aims to simplify the interaction between PyTorch tensors and the underlying CUDA kernels by allowing direct tensor passing, thereby reducing boilerplate code and improving the overall efficiency and developer experience when working with CUTLASS's cute-dsl.
Code Review
This pull request refactors the cute-dsl kernels to leverage tvm-ffi, which is a significant improvement. By enabling tvm-ffi, the code is simplified by allowing torch.Tensor objects to be passed directly to the kernels, removing the boilerplate for manual pointer creation and management. The changes in add_rmsnorm_fp4quant.py and rmsnorm_fp4quant.py are consistent and correctly use cute.runtime.make_fake_compact_tensor with symbolic dimensions for compilation. My review includes a couple of suggestions to correct misleading comments for better code clarity. Overall, this is a great change that improves maintainability.
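For context, here is a minimal sketch of the compile-time half of that pattern. Only `cute.sym_int`, `cute.runtime.make_fake_compact_tensor`, `cute.compile`, and the `--enable-tvm-ffi` option are named in this PR; the dtype, `HIDDEN_SIZE`, and the exact call signatures are assumptions for illustration:

```python
import cutlass
import cutlass.cute as cute

HIDDEN_SIZE = 4096  # illustrative; the real kernels take this from the input shape

# Symbolic batch dimension: one compiled artifact then serves any runtime M.
sym_m = cute.sym_int()

# Fake tensors carry only dtype/shape metadata -- no device memory is allocated.
x_fake = cute.runtime.make_fake_compact_tensor(cutlass.BFloat16, (sym_m, HIDDEN_SIZE))
w_fake = cute.runtime.make_fake_compact_tensor(cutlass.BFloat16, (HIDDEN_SIZE,))

# Compile once against the fake tensors with TVM-FFI enabled; real torch tensors
# are substituted at call time. (kernel_host_fn stands in for the @cute.jit host
# entry point; how the --enable-tvm-ffi option is passed is simplified here.)
# compiled = cute.compile(kernel_host_fn, x_fake, w_fake, options="--enable-tvm-ffi")
```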
```python
# Scale factor tensor layout depends on swizzle mode
if is_sf_swizzled_layout:
    # For swizzled mode, use 1D layout - the swizzle pattern is computed in kernel
    # Size is: num_m_tiles * num_k_tiles * 512, which is independent of M
```
The comment incorrectly states that the swizzled size is independent of M. The number of M-tiles (num_m_tiles) is derived from M (the batch dimension), so the total swizzled size is dependent on M. The implementation correctly uses a symbolic integer for this dynamic size, but the comment is misleading and should be corrected for clarity.
Suggested change:

```diff
-    # Size is: num_m_tiles * num_k_tiles * 512, which is independent of M
+    # Size is `num_m_tiles * num_k_tiles * 512`, which depends on the `M` dimension.
```
```python
# Scale factor tensor layout depends on swizzle mode
if is_sf_swizzled_layout:
    # For swizzled mode, use 1D layout - the swizzle pattern is computed in kernel
    # Size is: num_m_tiles * num_k_tiles * 512, which is independent of M
```
This comment is misleading. The swizzled size is dependent on M because num_m_tiles is calculated based on M. While the code correctly uses a symbolic size, the comment should be updated to reflect this dependency to avoid confusion.
Suggested change:

```diff
-    # Size is: num_m_tiles * num_k_tiles * 512, which is independent of M
+    # Size is `num_m_tiles * num_k_tiles * 512`, which depends on the `M` dimension.
```
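To make the dependency concrete, a small arithmetic sketch (the 128-row M-tile and the 512 factor are quoted from the comments in this thread; treating `num_k_tiles` as a free parameter is an assumption, since in the kernels it is fixed by the hidden size rather than by M):

```python
import math

def swizzled_scale_size(M: int, num_k_tiles: int) -> int:
    """Total swizzled scale-factor buffer size: num_m_tiles * num_k_tiles * 512."""
    num_m_tiles = math.ceil(M / 128)  # grows with the batch dimension M
    return num_m_tiles * num_k_tiles * 512

print(swizzled_scale_size(128, num_k_tiles=2))  # 1024 -- one M-tile
print(swizzled_scale_size(256, num_k_tiles=2))  # 2048 -- two M-tiles, so size tracks M
```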
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
flashinfer/cute_dsl/rmsnorm_fp4quant.py (1)
1658-1661: Outdated section header. The section header on line 1659 says "Pointer-based Compilation" but the code now uses tensor-based TVM-FFI compilation. This should be updated for consistency.

Suggested fix:

```diff
 # =============================================================================
-# PyTorch API Functions - Streamlined with Pointer-based Compilation
+# PyTorch API Functions - Streamlined with TVM-FFI Tensor Compilation
 # =============================================================================
```
🧹 Nitpick comments (2)
flashinfer/cute_dsl/rmsnorm_fp4quant.py (1)
1706-1713: Minor: Misleading comment about M-independence. The comment states the swizzled size "is independent of M", but `num_m_tiles = ceil(M / 128)`, so the size actually depends on M. The implementation using a separate symbolic variable is correct, but the comment is confusing.

Suggested fix:

```diff
 if is_sf_swizzled_layout:
     # For swizzled mode, use 1D layout - the swizzle pattern is computed in kernel
-    # Size is: num_m_tiles * num_k_tiles * 512, which is independent of M
-    # Use a separate symbolic variable for this size
+    # Size is: num_m_tiles * num_k_tiles * 512
+    # Use a separate symbolic variable since this has different shape semantics
     sym_swizzled_size = cute.sym_int()
```

flashinfer/cute_dsl/add_rmsnorm_fp4quant.py (1)
39-671: Consider extracting shared intrinsics and utilities to a common module. Both `rmsnorm_fp4quant.py` and `add_rmsnorm_fp4quant.py` share substantial duplicate code (~800+ lines):

- PTX intrinsics (`set_block_rank`, `store_shared_remote`, `ld_global_v4_u32`, etc.)
- Half2/BFloat2 SIMD intrinsics
- FP8/UE8M0 conversion intrinsics
- Reduction utilities (`warp_reduce`, `block_reduce`, `cluster_reduce`)
- `get_sm_version` function

Extracting these to a shared module (e.g., `flashinfer/cute_dsl/intrinsics.py`) would reduce maintenance burden and ensure consistency; a sketch of the resulting layout follows this list.
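As an illustration of the proposed layout, a minimal sketch (the module path and function names come from the list above; the bodies are elided placeholders rather than the real PTX intrinsics, and `get_sm_version`'s signature is assumed):

```python
# flashinfer/cute_dsl/intrinsics.py (hypothetical shared module)
"""Shared PTX intrinsics and reduction utilities for the cute-dsl kernels."""


def warp_reduce(val, op):
    """Reduce val across the lanes of a warp with binary operator op (body elided)."""
    ...


def block_reduce(val, op):
    """Block-wide reduction built on top of warp_reduce (body elided)."""
    ...


def get_sm_version(major: int, minor: int) -> int:
    """Fold a compute capability such as (10, 0) into the integer 100."""
    return major * 10 + minor


# rmsnorm_fp4quant.py and add_rmsnorm_fp4quant.py would then drop their local
# copies and import the shared definitions:
# from flashinfer.cute_dsl.intrinsics import warp_reduce, block_reduce, get_sm_version
```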
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- `flashinfer/cute_dsl/add_rmsnorm_fp4quant.py`
- `flashinfer/cute_dsl/rmsnorm_fp4quant.py`
🧰 Additional context used
📓 Path-based instructions (1)
flashinfer/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
`flashinfer/**/*.py`: Use the `@functools.cache` decorator on Python API functions to implement module-level caching and avoid recompilation.
Use the `@flashinfer_api` decorator for debugging API calls; enable via the `FLASHINFER_LOGLEVEL` environment variable (0=off, 1=basic, 3=detailed, 5=with stats).
Files:
- `flashinfer/cute_dsl/add_rmsnorm_fp4quant.py`
- `flashinfer/cute_dsl/rmsnorm_fp4quant.py`
🧠 Learnings (2)
📓 Common learnings
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to csrc/**/*.cu : Framework bindings and PyTorch tensor handling should be implemented in `csrc/` via TVM-FFI, not in `include/` headers
📚 Learning: 2025-12-30T09:34:39.900Z
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to csrc/**/*.cu : Framework bindings and PyTorch tensor handling should be implemented in `csrc/` via TVM-FFI, not in `include/` headers
Applied to files:
flashinfer/cute_dsl/add_rmsnorm_fp4quant.py
🧬 Code graph analysis (2)
flashinfer/cute_dsl/add_rmsnorm_fp4quant.py (2)

- `csrc/tvm_ffi_utils.h` (1): `Tensor` (304-306)
- `include/flashinfer/trtllm/fused_moe/runner.h` (1): `hidden_size` (265-265)

flashinfer/cute_dsl/rmsnorm_fp4quant.py (2)

- `csrc/tvm_ffi_utils.h` (1): `Tensor` (304-306)
- `include/flashinfer/trtllm/fused_moe/runner.h` (1): `hidden_size` (265-265)
🔇 Additional comments (5)
flashinfer/cute_dsl/rmsnorm_fp4quant.py (3)

1001-1018: LGTM - Kernel interface refactored to tensor-based API. The signature change from pointer-based to tensor-based inputs aligns with the TVM-FFI refactoring objective. The docstrings accurately describe the expected tensor shapes and layouts.

1739-1760: LGTM - Runtime tensor API correctly handles tensor passing. The `tensor_api` closure appropriately handles the scale tensor layout (flatten for swizzled, contiguous for non-swizzled). The caller `rmsnorm_fp4quant` ensures input tensors are contiguous before passing them.

1726-1737: LGTM - TVM-FFI compilation setup. The compilation correctly uses a fake stream with `use_tvm_ffi_env_stream=True` to capture the environment stream at runtime, and enables TVM-FFI with the `--enable-tvm-ffi` option.

flashinfer/cute_dsl/add_rmsnorm_fp4quant.py (2)

2103-2171: LGTM - TVM-FFI compilation setup consistent with rmsnorm_fp4quant.py. The fake tensor creation and compilation approach is consistent with the pattern in `rmsnorm_fp4quant.py`. The implementation correctly uses symbolic dimensions and TVM-FFI options. Note: the same minor comment about "independent of M" at line 2141 applies here as well.

2173-2196: LGTM - Runtime tensor API correctly handles tensor passing. The `tensor_api` closure appropriately handles the scale tensor layout and matches the pattern from `rmsnorm_fp4quant.py`; see the sketch below for the overall pattern.
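As a reference for the pattern these comments describe, here is an illustrative runtime wrapper mirroring the `tensor_api` closure. The flatten-vs-contiguous scale handling and the environment-stream behavior come from this review; the wrapper name, argument order, and the `compiled_kernel` callable are simplified assumptions:

```python
import torch


def make_tensor_api(compiled_kernel, is_sf_swizzled_layout: bool):
    """Build a runtime entry point that forwards torch tensors via TVM-FFI."""

    def tensor_api(x: torch.Tensor, w: torch.Tensor,
                   y: torch.Tensor, scale: torch.Tensor) -> None:
        # Swizzled scale factors are passed as a flat 1D buffer; otherwise we
        # only guarantee a contiguous layout, matching the review notes above.
        s = scale.flatten() if is_sf_swizzled_layout else scale.contiguous()
        # Because the kernel was compiled with use_tvm_ffi_env_stream=True, the
        # launch picks up the caller's current CUDA stream; tensors cross the
        # boundary via DLPack with no manual pointer plumbing.
        compiled_kernel(x, w, y, s)

    return tensor_api
```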
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
…er into cute-dsl-refactor
Actionable comments posted: 2
♻️ Duplicate comments (2)
flashinfer/cute_dsl/rmsnorm_fp4quant.py (1)
1733-1734: Comment is misleading about M-independence. The swizzled size depends on `M` because `num_m_tiles = (batch_size + 127) // 128` is computed from the batch dimension. While the code correctly uses a symbolic variable, the comment should be corrected.

flashinfer/cute_dsl/add_rmsnorm_fp4quant.py (1)

2182-2183: Comment incorrectly claims M-independence. Same issue as in `rmsnorm_fp4quant.py`: the swizzled size depends on `M` via `num_m_tiles`.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- `flashinfer/cute_dsl/add_rmsnorm_fp4quant.py`
- `flashinfer/cute_dsl/rmsnorm_fp4quant.py`
🧰 Additional context used
📓 Path-based instructions (1)
flashinfer/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
`flashinfer/**/*.py`: Use the `@functools.cache` decorator on Python API functions to implement module-level caching and avoid recompilation.
Use the `@flashinfer_api` decorator for debugging API calls; enable via the `FLASHINFER_LOGLEVEL` environment variable (0=off, 1=basic, 3=detailed, 5=with stats).
Files:
- `flashinfer/cute_dsl/rmsnorm_fp4quant.py`
- `flashinfer/cute_dsl/add_rmsnorm_fp4quant.py`
🧠 Learnings (2)
📓 Common learnings
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to csrc/**/*.cu : Framework bindings and PyTorch tensor handling should be implemented in `csrc/` via TVM-FFI, not in `include/` headers
📚 Learning: 2025-12-30T09:34:39.900Z
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to csrc/**/*.cu : Framework bindings and PyTorch tensor handling should be implemented in `csrc/` via TVM-FFI, not in `include/` headers
Applied to files:
flashinfer/cute_dsl/add_rmsnorm_fp4quant.py
🧬 Code graph analysis (1)
flashinfer/cute_dsl/add_rmsnorm_fp4quant.py (2)
- `csrc/tvm_ffi_utils.h` (1): `Tensor` (304-306)
- `include/flashinfer/trtllm/fused_moe/runner.h` (1): `hidden_size` (265-265)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Deploy Docs
🔇 Additional comments (4)
flashinfer/cute_dsl/rmsnorm_fp4quant.py (2)
1001-1041: LGTM! Clean transition to TVM-FFI tensor-based interface. The signature change from pointer-based to tensor-based inputs is well-structured. The docstring accurately describes the tensor shapes and the TVM-FFI approach.

1765-1788: Verify the runtime API aligns with the corrected compilation signature. Once the missing `global_scale_fake` tensor is added to the `cute.compile` call, ensure this `tensor_api` function continues to pass tensors in the correct order matching the kernel's `__call__` signature.

flashinfer/cute_dsl/add_rmsnorm_fp4quant.py (2)

1005-1027: LGTM! Docstring correctly describes tensor inputs. The signature change to tensor-based inputs is well-structured. The docstring accurately describes tensor shapes without the previously flagged incorrect claim about in-place mR updates.

2215-2240: Verify tensor_api aligns with corrected compilation after fix. Once the missing `global_scale_fake` tensor is added to the `cute.compile` call, this `tensor_api` function should correctly pass tensors in the expected order.
/bot run
[FAILED] Pipeline #41059115: 12/20 passed
cc @tqchen on the failure, looks similar to what we observed on FA4.
The cuteDSL-related ARM failure should be resolved by cuteDSL 4.3.4.
Referenced PR: Update minimal version requirement of nvidia-cutlass-dsl to 4.3.4, which should resolve the arm issue in #2279.
/bot run
Hi @bkryu, the cu129 unittest on GB300 failed. Do you think it's relevant?
The failure was unrelated. I relaunched the test. Will keep an eye on it and then approve.
[SUCCESS] Pipeline #41205461: 8/20 passed
bkryu left a comment:
The failed GB300 cu129 unit test passed after retry. LGTM.
📌 Description
cute-dsl has supported compiling with tvm-ffi since the 4.3 release (https://docs.nvidia.com/cutlass/latest/media/docs/pythonDSL/cute_dsl_general/compile_with_tvm_ffi.html), which allows users to pass torch tensors directly with negligible DLPack conversion cost, without the need to manually create cute tensors from cute pointers.
In this PR we refactored the existing cute-dsl kernels to enable tvm-ffi and simplify the torch -> cute-dsl boilerplate.
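Schematically, the removed boilerplate looks like the following before/after; `make_ptr`, the dtypes, and the argument lists are illustrative stand-ins rather than lines copied from the diff:

```python
import torch
import cutlass
import cutlass.cute as cute
from cutlass.cute.runtime import make_ptr


def launch_before(compiled, x: torch.Tensor, y: torch.Tensor, stream) -> None:
    # Before: wrap each torch.Tensor's raw pointer into a cute pointer by hand.
    x_ptr = make_ptr(cutlass.BFloat16, x.data_ptr(), cute.AddressSpace.gmem,
                     assumed_align=16)
    y_ptr = make_ptr(cutlass.BFloat16, y.data_ptr(), cute.AddressSpace.gmem,
                     assumed_align=16)
    compiled(x_ptr, y_ptr, x.shape[0], stream)


def launch_after(compiled, x: torch.Tensor, y: torch.Tensor) -> None:
    # After: a TVM-FFI-compiled kernel takes torch tensors directly; DLPack
    # handles the conversion and the current CUDA stream is used implicitly.
    compiled(x, y)
```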
🔍 Related Issues
🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.
✅ Pre-commit Checks
- [x] I have installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files` and fixed any reported issues.

🧪 Tests

- [x] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

Reviewer Notes