
fix: Fix cute dsl moe failure with nvidia-cutlass-dsl >= 4.4.0#2735

Merged
nv-yunzheq merged 6 commits into flashinfer-ai:main from nv-yunzheq:fix_cutlass_incompatible
Mar 16, 2026

Conversation

@nv-yunzheq
Collaborator

@nv-yunzheq nv-yunzheq commented Mar 9, 2026

📌 Description

Fix issue #2693
Remove the patch when the cute dsl version is 4.4.0 or newer.

🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

  • Chores
    • Updated Cutlass compatibility layer with conditional patching logic for different Cutlass versions.
    • Revised synchronization object creation approach to support newer Cutlass versions.
    • Updated version threshold references in internal comments.

@coderabbitai
Contributor

coderabbitai Bot commented Mar 9, 2026

📝 Walkthrough

Walkthrough

Adds version-gated monkey-patches for older Cutlass versions and replaces synchronization object creation for TCGen05Mma in newer Cutlass (>=4.4.0) to restore compatibility with both legacy and latest Cutlass releases.

Changes

Cohort / File(s) Summary
Version-gated Monkey-Patches
blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py, blockscaled_contiguous_grouped_gemm_finalize_fusion.py
Wraps existing Cutlass monkey-patches with version checks (if not hasattr(cutlass, "__version__")), ensuring patches apply only to older Cutlass versions lacking a __version__ attribute. Updated TODO comment from version threshold 4.4 to 4.3.x support status.
Custom Pipeline Sync Object Replacement
custom_pipeline.py
Replaces PipelineAsync._make_sync_object() calls with direct MbarrierArray instantiation for TCGen05Mma consumers and producers across multiple pipeline classes (PipelineTmaUmma, PipelineUmmaAsync, PipelineCpAsyncUmma), bypassing unsupported method in Cutlass >=4.4.0. Adds import for MbarrierArray and updates inline comments with version compatibility rationale.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related issues

Suggested labels

v0.6.3

Suggested reviewers

  • nvmbreughe
  • aleozlx
  • djmmoss
  • jimmyzho
  • cyx-6

Poem

🐰 Cutlass versions dance and drift,
Old and new, a compatibility gift!
Patches guard the ancient past,
While MbarrierArray holds steadfast,
Version checks align the way,
Making both versions work today!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Docstring Coverage — ⚠️ Warning: Docstring coverage is 50.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Description check — ✅ Passed: The PR description addresses the issue and intent but lacks detail in the Related Issues section and Reviewer Notes. The description is functional but incomplete.
Title check — ✅ Passed: The title clearly and specifically describes the main change: fixing the MoE failure with nvidia-cutlass-dsl >= 4.4.0 by applying version-aware patches.


@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses compatibility issues with nvidia-cutlass-dsl versions 4.4.0 and newer, which were causing failures in cute dsl moe operations. The changes involve adapting the codebase to the updated cutlass API by making certain workarounds conditional on the cutlass version and directly utilizing new synchronization primitives. This ensures robust operation of FlashInfer's fused MoE kernels regardless of the cutlass library version installed.

Highlights

  • Conditional Monkey-Patching: The monkey-patching for PersistentTileSchedulerParams and _get_cluster_work_idx_with_fastdivmod in cutlass.utils is now conditionally applied only for nvidia-cutlass-dsl versions older than 4.4.0. This prevents conflicts with newer cutlass versions that have integrated these features.
  • Updated Pipeline Synchronization: The custom_pipeline.py now directly uses MbarrierArray for creating pipeline synchronization objects (both full and empty) when dealing with TCGen05Mma agents. This change specifically addresses compatibility with nvidia-cutlass-dsl versions 4.4.0 and newer, where PipelineAsync._make_sync_object no longer handles these cases.
  • Improved Version Compatibility: The overall changes ensure that FlashInfer's fused MoE kernels maintain compatibility and correct functionality across different versions of nvidia-cutlass-dsl, specifically handling the transition to version 4.4.0 and beyond.


Changelog
  • flashinfer/fused_moe/cute_dsl/blackwell/blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py
    • Added a blank line to imports.
    • Updated a TODO comment to reflect support for cutlass 4.3.x.
    • Wrapped monkey-patching logic in a conditional block, applying it only if cutlass.__version__ is not present (indicating a version older than 4.4.0).
  • flashinfer/fused_moe/cute_dsl/blackwell/blockscaled_contiguous_grouped_gemm_finalize_fusion.py
    • Added a blank line to imports.
    • Updated a TODO comment to reflect support for cutlass 4.3.x.
    • Wrapped monkey-patching logic in a conditional block, applying it only if cutlass.__version__ is not present (indicating a version older than 4.4.0).
  • flashinfer/fused_moe/cute_dsl/blackwell/custom_pipeline.py
    • Imported MbarrierArray from cutlass.pipeline.helpers.
    • Modified create function to directly instantiate MbarrierArray for sync_object_empty (consumer) when TCGen05Mma is involved, to support cutlass >= 4.4.0.
    • Modified create function to directly instantiate MbarrierArray for sync_object_full (producer) when TCGen05Mma is involved, to support cutlass >= 4.4.0.
    • Modified another create function to directly instantiate MbarrierArray for sync_object_empty (consumer) when TCGen05Mma is involved, to support cutlass >= 4.4.0.
Activity
  • No human activity has been recorded on this pull request yet.

@nv-yunzheq
Collaborator Author

/bot run

@flashinfer-bot
Collaborator

GitLab MR !395 has been created, and the CI pipeline #45750903 is currently running. I'll report back once the pipeline job completes.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request addresses compatibility issues with nvidia-cutlass-dsl >= 4.4.0. The changes correctly make monkey-patches conditional based on the cutlass version, ensuring they only apply to older versions. However, in flashinfer/fused_moe/cute_dsl/blackwell/custom_pipeline.py, the fix for newer versions is applied unconditionally. This could potentially break compatibility with older versions of cutlass-dsl. I've suggested making these changes conditional as well to maintain broader compatibility, which seems to be the intent of this PR.

Comment on lines +199 to 205
        # Directly create MbarrierArray for TCGen05Mma consumer, since
        # PipelineAsync._make_sync_object does not handle TCGen05Mma in cutlass >= 4.4.0.
        sync_object_empty = MbarrierArray(
            barrier_storage=barrier_storage.align(min_align=8) + num_stages,
            num_stages=num_stages,
            agent=consumer,
        )
Contributor


high

The change to directly use MbarrierArray fixes the issue for cutlass-dsl >= 4.4.0. However, this is an unconditional change. If MbarrierArray is not available or behaves differently in older versions of cutlass-dsl, this could break backward compatibility. The other files in this PR use a version check (hasattr(cutlass, "__version__")) to apply changes conditionally. To ensure compatibility with both older and newer versions, it would be safer to wrap this change in a version check.

        if hasattr(cutlass, "__version__"):
            # Directly create MbarrierArray for TCGen05Mma consumer, since
            # PipelineAsync._make_sync_object does not handle TCGen05Mma in cutlass >= 4.4.0.
            sync_object_empty = MbarrierArray(
                barrier_storage=barrier_storage.align(min_align=8) + num_stages,
                num_stages=num_stages,
                agent=consumer,
            )
        else:
            sync_object_empty = PipelineAsync._make_sync_object(
                barrier_storage.align(min_align=8) + num_stages, num_stages, consumer
            )

Comment on lines +345 to 351
        # Directly create MbarrierArray for TCGen05Mma producer, since
        # PipelineAsync._make_sync_object does not handle TCGen05Mma in cutlass >= 4.4.0.
        sync_object_full = MbarrierArray(
            barrier_storage=barrier_storage.align(min_align=8),
            num_stages=num_stages,
            agent=producer,
        )
Contributor


high

Similar to the previous change, this direct use of MbarrierArray is unconditional. To ensure backward compatibility with older cutlass-dsl versions, it's safer to use a version check, as done in other files in this PR.

        if hasattr(cutlass, "__version__"):
            # Directly create MbarrierArray for TCGen05Mma producer, since
            # PipelineAsync._make_sync_object does not handle TCGen05Mma in cutlass >= 4.4.0.
            sync_object_full = MbarrierArray(
                barrier_storage=barrier_storage.align(min_align=8),
                num_stages=num_stages,
                agent=producer,
            )
        else:
            sync_object_full = PipelineAsync._make_sync_object(
                barrier_storage.align(min_align=8), num_stages, producer
            )

Comment on lines +509 to 515
        # Directly create MbarrierArray for TCGen05Mma consumer, since
        # PipelineAsync._make_sync_object does not handle TCGen05Mma in cutlass >= 4.4.0.
        sync_object_empty = MbarrierArray(
            barrier_storage=barrier_storage.align(min_align=8) + num_stages,
            num_stages=num_stages,
            agent=consumer,
        )
Contributor


high

Again, this unconditional change to use MbarrierArray might break compatibility with older cutlass-dsl versions. A conditional check based on the cutlass version would be a more robust solution, consistent with the approach in other files.

        if hasattr(cutlass, "__version__"):
            # Directly create MbarrierArray for TCGen05Mma consumer, since
            # PipelineAsync._make_sync_object does not handle TCGen05Mma in cutlass >= 4.4.0.
            sync_object_empty = MbarrierArray(
                barrier_storage=barrier_storage.align(min_align=8) + num_stages,
                num_stages=num_stages,
                agent=consumer,
            )
        else:
            sync_object_empty = PipelineAsync._make_sync_object(
                barrier_storage.align(min_align=8) + num_stages, num_stages, consumer
            )

Contributor

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
flashinfer/fused_moe/cute_dsl/blackwell/blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py (1)

305-312: Feature-detect the scheduler API instead of cutlass.__version__.

CUTLASS 4.2.1 documents PersistentTileSchedulerParams.__init__ without swizzle_size / raster_along_m, while 4.3.3 and 4.3.4 already document both parameters. That makes hasattr(cutlass, "__version__") a brittle proxy for the capability you actually need here. Checking the constructor signature or the specific scheduler attributes would make this gate line up with the supported API surface. (docs.nvidia.com)
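The feature detection the reviewer suggests could look roughly like the sketch below. The class here is a hypothetical stand-in with the newer constructor signature, since the real `cutlass.utils.PersistentTileSchedulerParams` lives in `nvidia-cutlass-dsl`; only the `inspect.signature` gating pattern is the point.

```python
import inspect

# Hypothetical stand-in for cutlass.utils.PersistentTileSchedulerParams,
# modeling a release whose constructor already accepts the new parameters.
class PersistentTileSchedulerParams:
    def __init__(self, problem_shape, swizzle_size=1, raster_along_m=False):
        self.problem_shape = problem_shape

# Gate on the API surface itself rather than on __version__:
# check whether the constructor accepts the parameters the patch would add.
params = inspect.signature(PersistentTileSchedulerParams.__init__).parameters
has_new_api = {"swizzle_size", "raster_along_m"} <= params.keys()

if not has_new_api:
    # Only here would the monkey-patches
    # (e.g. hooked_PersistentTileSchedulerParams_init) be installed.
    pass
```

Compared with `hasattr(cutlass, "__version__")`, this ties the gate directly to the capability being patched, so it stays correct even if a 4.3.x point release backports the new parameters.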

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@flashinfer/fused_moe/cute_dsl/blackwell/blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py`
around lines 305 - 312, The current gate uses hasattr(cutlass, "__version__")
which is brittle; instead feature-detect the scheduler API by inspecting the
PersistentTileSchedulerParams constructor signature for the new parameters
(e.g., "swizzle_size" and "raster_along_m") or by checking for the presence of
the exact scheduler attributes/methods (e.g.,
cutlass.utils.PersistentTileSchedulerParams.__init__ signature and
cutlass.utils.StaticPersistentTileScheduler._get_cluster_work_idx_with_fastdivmod
or a FastDivmod helper). Replace the hasattr(cutlass, "__version__") check with
a conditional that uses
inspect.signature(cutlass.utils.PersistentTileSchedulerParams.__init__) to test
for the parameter names or uses hasattr to detect the new API surface, and only
apply the monkey-patches (hooked_PersistentTileSchedulerParams_init and
hooked_get_cluster_work_idx_with_fastdivmod) when those features are missing.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7222970a-5a70-4d18-8f7b-7473d5db5aea

📥 Commits

Reviewing files that changed from the base of the PR and between 2bb3e9e and bd69d03.

📒 Files selected for processing (3)
  • flashinfer/fused_moe/cute_dsl/blackwell/blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py
  • flashinfer/fused_moe/cute_dsl/blackwell/blockscaled_contiguous_grouped_gemm_finalize_fusion.py
  • flashinfer/fused_moe/cute_dsl/blackwell/custom_pipeline.py

@flashinfer-bot
Collaborator

[SUCCESS] Pipeline #45750903: 10/20 passed

@nv-yunzheq nv-yunzheq enabled auto-merge (squash) March 12, 2026 16:59
@yongwww
Member

yongwww commented Mar 13, 2026

I cancelled the PR test because CI won't pass before #2781 lands; please re-trigger the test after that PR is merged.

@nv-yunzheq nv-yunzheq changed the title Fix cute dsl moe failure with nvidia-cutlass-dsl >= 4.4.0 fix: Fix cute dsl moe failure with nvidia-cutlass-dsl >= 4.4.0 Mar 16, 2026
@nv-yunzheq nv-yunzheq merged commit a5e5cae into flashinfer-ai:main Mar 16, 2026
28 of 33 checks passed
@nv-yunzheq nv-yunzheq deleted the fix_cutlass_incompatible branch March 16, 2026 20:54
