
fix: Fix cute dsl moe failure with nvidia-cutlass-dsl >= 4.4.0#2735

Merged
nv-yunzheq merged 6 commits into flashinfer-ai:main from nv-yunzheq:fix_cutlass_incompatible
Mar 16, 2026

Conversation

@nv-yunzheq
Collaborator

@nv-yunzheq nv-yunzheq commented Mar 9, 2026

📌 Description

Fix issue #2693
Remove the patch when the cute dsl version is 4.4.0 or newer.

🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

  • Chores
    • Updated Cutlass compatibility layer with conditional patching logic for different Cutlass versions.
    • Revised synchronization object creation approach to support newer Cutlass versions.
    • Updated version threshold references in internal comments.

@coderabbitai
Contributor

coderabbitai Bot commented Mar 9, 2026

📝 Walkthrough

Walkthrough

Adds version-gated monkey-patches for older Cutlass versions and replaces synchronization object creation for TCGen05Mma in newer Cutlass (>=4.4.0) to restore compatibility with both legacy and latest Cutlass releases.

Changes

Cohort / File(s) Summary
Version-gated Monkey-Patches
blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py, blockscaled_contiguous_grouped_gemm_finalize_fusion.py
Wraps existing Cutlass monkey-patches with version checks (if not hasattr(cutlass, "__version__")), ensuring patches apply only to older Cutlass versions lacking a __version__ attribute. Updated TODO comment from version threshold 4.4 to 4.3.x support status.
Custom Pipeline Sync Object Replacement
custom_pipeline.py
Replaces PipelineAsync._make_sync_object() calls with direct MbarrierArray instantiation for TCGen05Mma consumers and producers across multiple pipeline classes (PipelineTmaUmma, PipelineUmmaAsync, PipelineCpAsyncUmma), bypassing unsupported method in Cutlass >=4.4.0. Adds import for MbarrierArray and updates inline comments with version compatibility rationale.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related issues

Suggested labels

v0.6.3

Suggested reviewers

  • nvmbreughe
  • aleozlx
  • djmmoss
  • jimmyzho
  • cyx-6

Poem

🐰 Cutlass versions dance and drift,
Old and new, a compatibility gift!
Patches guard the ancient past,
While MbarrierArray holds steadfast,
Version checks align the way,
Making both versions work today!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Docstring Coverage — ⚠️ Warning: Docstring coverage is 50.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Description check — ✅ Passed: The PR description addresses the issue and intent but lacks detail in the Related Issues section and Reviewer Notes. The description is functional but incomplete.
Title check — ✅ Passed: The title clearly and specifically describes the main change: fixing the MoE failure with nvidia-cutlass-dsl >= 4.4.0 by applying version-aware patches.


@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses compatibility issues with nvidia-cutlass-dsl versions 4.4.0 and newer, which were causing failures in cute dsl moe operations. The changes involve adapting the codebase to the updated cutlass API by making certain workarounds conditional on the cutlass version and directly utilizing new synchronization primitives. This ensures robust operation of FlashInfer's fused MoE kernels regardless of the cutlass library version installed.

Highlights

  • Conditional Monkey-Patching: The monkey-patching for PersistentTileSchedulerParams and _get_cluster_work_idx_with_fastdivmod in cutlass.utils is now conditionally applied only for nvidia-cutlass-dsl versions older than 4.4.0. This prevents conflicts with newer cutlass versions that have integrated these features.
  • Updated Pipeline Synchronization: The custom_pipeline.py now directly uses MbarrierArray for creating pipeline synchronization objects (both full and empty) when dealing with TCGen05Mma agents. This change specifically addresses compatibility with nvidia-cutlass-dsl versions 4.4.0 and newer, where PipelineAsync._make_sync_object no longer handles these cases.
  • Improved Version Compatibility: The overall changes ensure that FlashInfer's fused MoE kernels maintain compatibility and correct functionality across different versions of nvidia-cutlass-dsl, specifically handling the transition to version 4.4.0 and beyond.


Changelog
  • flashinfer/fused_moe/cute_dsl/blackwell/blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py
    • Added a blank line to imports.
    • Updated a TODO comment to reflect support for cutlass 4.3.x.
    • Wrapped monkey-patching logic in a conditional block, applying it only if cutlass.__version__ is not present (indicating a version older than 4.4.0).
  • flashinfer/fused_moe/cute_dsl/blackwell/blockscaled_contiguous_grouped_gemm_finalize_fusion.py
    • Added a blank line to imports.
    • Updated a TODO comment to reflect support for cutlass 4.3.x.
    • Wrapped monkey-patching logic in a conditional block, applying it only if cutlass.__version__ is not present (indicating a version older than 4.4.0).
  • flashinfer/fused_moe/cute_dsl/blackwell/custom_pipeline.py
    • Imported MbarrierArray from cutlass.pipeline.helpers.
    • Modified create function to directly instantiate MbarrierArray for sync_object_empty (consumer) when TCGen05Mma is involved, to support cutlass >= 4.4.0.
    • Modified create function to directly instantiate MbarrierArray for sync_object_full (producer) when TCGen05Mma is involved, to support cutlass >= 4.4.0.
    • Modified another create function to directly instantiate MbarrierArray for sync_object_empty (consumer) when TCGen05Mma is involved, to support cutlass >= 4.4.0.
Activity
  • No human activity has been recorded on this pull request yet.

@nv-yunzheq
Collaborator Author

/bot run

@flashinfer-bot
Collaborator

GitLab MR !395 has been created, and the CI pipeline #45750903 is currently running. I'll report back once the pipeline job completes.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request addresses compatibility issues with nvidia-cutlass-dsl >= 4.4.0. The changes correctly make monkey-patches conditional based on the cutlass version, ensuring they only apply to older versions. However, in flashinfer/fused_moe/cute_dsl/blackwell/custom_pipeline.py, the fix for newer versions is applied unconditionally. This could potentially break compatibility with older versions of cutlass-dsl. I've suggested making these changes conditional as well to maintain broader compatibility, which seems to be the intent of this PR.

Comment on lines +199 to 205
        # Directly create MbarrierArray for TCGen05Mma consumer, since
        # PipelineAsync._make_sync_object does not handle TCGen05Mma in cutlass >= 4.4.0.
        sync_object_empty = MbarrierArray(
            barrier_storage=barrier_storage.align(min_align=8) + num_stages,
            num_stages=num_stages,
            agent=consumer,
        )
Contributor


high

The change to directly use MbarrierArray fixes the issue for cutlass-dsl >= 4.4.0. However, this is an unconditional change. If MbarrierArray is not available or behaves differently in older versions of cutlass-dsl, this could break backward compatibility. The other files in this PR use a version check (hasattr(cutlass, "__version__")) to apply changes conditionally. To ensure compatibility with both older and newer versions, it would be safer to wrap this change in a version check.

        if hasattr(cutlass, "__version__"):
            # Directly create MbarrierArray for TCGen05Mma consumer, since
            # PipelineAsync._make_sync_object does not handle TCGen05Mma in cutlass >= 4.4.0.
            sync_object_empty = MbarrierArray(
                barrier_storage=barrier_storage.align(min_align=8) + num_stages,
                num_stages=num_stages,
                agent=consumer,
            )
        else:
            sync_object_empty = PipelineAsync._make_sync_object(
                barrier_storage.align(min_align=8) + num_stages, num_stages, consumer
            )

Comment on lines +345 to 351
        # Directly create MbarrierArray for TCGen05Mma producer, since
        # PipelineAsync._make_sync_object does not handle TCGen05Mma in cutlass >= 4.4.0.
        sync_object_full = MbarrierArray(
            barrier_storage=barrier_storage.align(min_align=8),
            num_stages=num_stages,
            agent=producer,
        )
Contributor


high

Similar to the previous change, this direct use of MbarrierArray is unconditional. To ensure backward compatibility with older cutlass-dsl versions, it's safer to use a version check, as done in other files in this PR.

        if hasattr(cutlass, "__version__"):
            # Directly create MbarrierArray for TCGen05Mma producer, since
            # PipelineAsync._make_sync_object does not handle TCGen05Mma in cutlass >= 4.4.0.
            sync_object_full = MbarrierArray(
                barrier_storage=barrier_storage.align(min_align=8),
                num_stages=num_stages,
                agent=producer,
            )
        else:
            sync_object_full = PipelineAsync._make_sync_object(
                barrier_storage.align(min_align=8), num_stages, producer
            )

Comment on lines +509 to 515
        # Directly create MbarrierArray for TCGen05Mma consumer, since
        # PipelineAsync._make_sync_object does not handle TCGen05Mma in cutlass >= 4.4.0.
        sync_object_empty = MbarrierArray(
            barrier_storage=barrier_storage.align(min_align=8) + num_stages,
            num_stages=num_stages,
            agent=consumer,
        )
Contributor


high

Again, this unconditional change to use MbarrierArray might break compatibility with older cutlass-dsl versions. A conditional check based on the cutlass version would be a more robust solution, consistent with the approach in other files.

        if hasattr(cutlass, "__version__"):
            # Directly create MbarrierArray for TCGen05Mma consumer, since
            # PipelineAsync._make_sync_object does not handle TCGen05Mma in cutlass >= 4.4.0.
            sync_object_empty = MbarrierArray(
                barrier_storage=barrier_storage.align(min_align=8) + num_stages,
                num_stages=num_stages,
                agent=consumer,
            )
        else:
            sync_object_empty = PipelineAsync._make_sync_object(
                barrier_storage.align(min_align=8) + num_stages, num_stages, consumer
            )

Contributor

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
flashinfer/fused_moe/cute_dsl/blackwell/blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py (1)

305-312: Feature-detect the scheduler API instead of cutlass.__version__.

CUTLASS 4.2.1 documents PersistentTileSchedulerParams.__init__ without swizzle_size / raster_along_m, while 4.3.3 and 4.3.4 already document both parameters. That makes hasattr(cutlass, "__version__") a brittle proxy for the capability you actually need here. Checking the constructor signature or the specific scheduler attributes would make this gate line up with the supported API surface. (docs.nvidia.com)
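The feature detection the reviewer suggests could look roughly like the sketch below. The class here is a hypothetical stand-in with the newer constructor signature, since the real `cutlass.utils.PersistentTileSchedulerParams` lives in `nvidia-cutlass-dsl`; only the `inspect.signature` gating pattern is the point.

```python
import inspect

# Hypothetical stand-in for cutlass.utils.PersistentTileSchedulerParams,
# modeling a release whose constructor already accepts the new parameters.
class PersistentTileSchedulerParams:
    def __init__(self, problem_shape, swizzle_size=1, raster_along_m=False):
        self.problem_shape = problem_shape

# Gate on the API surface itself rather than on __version__:
# check whether the constructor accepts the parameters the patch would add.
params = inspect.signature(PersistentTileSchedulerParams.__init__).parameters
has_new_api = {"swizzle_size", "raster_along_m"} <= params.keys()

if not has_new_api:
    # Only here would the monkey-patches
    # (e.g. hooked_PersistentTileSchedulerParams_init) be installed.
    pass
```

Compared with `hasattr(cutlass, "__version__")`, this ties the gate directly to the capability being patched, so it stays correct even if a 4.3.x point release backports the new parameters.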

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@flashinfer/fused_moe/cute_dsl/blackwell/blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py`
around lines 305 - 312, The current gate uses hasattr(cutlass, "__version__")
which is brittle; instead feature-detect the scheduler API by inspecting the
PersistentTileSchedulerParams constructor signature for the new parameters
(e.g., "swizzle_size" and "raster_along_m") or by checking for the presence of
the exact scheduler attributes/methods (e.g.,
cutlass.utils.PersistentTileSchedulerParams.__init__ signature and
cutlass.utils.StaticPersistentTileScheduler._get_cluster_work_idx_with_fastdivmod
or a FastDivmod helper). Replace the hasattr(cutlass, "__version__") check with
a conditional that uses
inspect.signature(cutlass.utils.PersistentTileSchedulerParams.__init__) to test
for the parameter names or uses hasattr to detect the new API surface, and only
apply the monkey-patches (hooked_PersistentTileSchedulerParams_init and
hooked_get_cluster_work_idx_with_fastdivmod) when those features are missing.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7222970a-5a70-4d18-8f7b-7473d5db5aea

📥 Commits

Reviewing files that changed from the base of the PR and between 2bb3e9e and bd69d03.

📒 Files selected for processing (3)
  • flashinfer/fused_moe/cute_dsl/blackwell/blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py
  • flashinfer/fused_moe/cute_dsl/blackwell/blockscaled_contiguous_grouped_gemm_finalize_fusion.py
  • flashinfer/fused_moe/cute_dsl/blackwell/custom_pipeline.py

@flashinfer-bot
Collaborator

[SUCCESS] Pipeline #45750903: 10/20 passed

@nv-yunzheq nv-yunzheq enabled auto-merge (squash) March 12, 2026 16:59
@yongwww
Member

yongwww commented Mar 13, 2026

I cancelled the PR test because CI won't pass before #2781 lands; please re-trigger the test after that PR is merged.

@nv-yunzheq nv-yunzheq changed the title Fix cute dsl moe failure with nvidia-cutlass-dsl >= 4.4.0 fix: Fix cute dsl moe failure with nvidia-cutlass-dsl >= 4.4.0 Mar 16, 2026
@nv-yunzheq nv-yunzheq merged commit a5e5cae into flashinfer-ai:main Mar 16, 2026
28 of 33 checks passed
@nv-yunzheq nv-yunzheq deleted the fix_cutlass_incompatible branch March 16, 2026 20:54
