Support output signals for overlapping for cutedsl gemm #1677

Merged
zhyncs merged 90 commits into flashinfer-ai:main from fzyzcjy:feat/expert_based_overlap
Sep 15, 2025

Conversation

@fzyzcjy
Collaborator

@fzyzcjy fzyzcjy commented Sep 14, 2025

📌 Description

sglang main now contains the prerequisite code, so I can extract the related parts into sglang/flashinfer/deepep, as shown here

🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Summary of Changes

Hello @fzyzcjy, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the cutedsl gemm implementation by introducing a robust output signaling mechanism. This feature allows the kernel to communicate the completion status of individual batched operations (referred to as "experts") through a dedicated dst_signals array. By integrating new low-level DSL operations for atomic updates and byte manipulation, and carefully adjusting the kernel's synchronization logic, the change provides a fine-grained way to manage dependencies in overlapping computations. This is particularly beneficial for optimizing performance in scenarios requiring asynchronous processing of multiple tasks, such as in Mixture-of-Experts architectures.
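The signaling idea described above can be sketched in plain Python. This is only a host-side simulation under stated assumptions: the real kernel performs an atomic add with release semantics in GPU global memory, and the exact value it writes per expert is not given in this summary. Here the signal is assumed to count finished tiles, and `run_grouped_work` and `expert_is_done` are hypothetical names used for illustration.

```python
import threading


def run_grouped_work(tiles_per_expert):
    """Simulate workers that bump a per-expert counter after each finished tile.

    Assumption: the signal for an expert counts its finished tiles.
    """
    num_experts = len(tiles_per_expert)
    dst_signals = [0] * num_experts  # stand-in for the dst_signals array
    lock = threading.Lock()          # stand-in for a hardware atomic add

    def finish_tile(expert):
        # The tile's output is assumed to be fully written before this point,
        # which is what the release ordering guarantees in the real kernel.
        with lock:
            dst_signals[expert] += 1

    threads = [
        threading.Thread(target=finish_tile, args=(expert,))
        for expert, num_tiles in enumerate(tiles_per_expert)
        for _ in range(num_tiles)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return dst_signals


def expert_is_done(dst_signals, expert, expected):
    """A consumer can start on an expert once its signal reaches the expected count."""
    return dst_signals[expert] == expected


signals = run_grouped_work([3, 0, 5])
```

A downstream stage would check `expert_is_done(signals, expert, expected)` per expert instead of waiting for the whole grouped GEMM, which is what makes the overlap fine-grained.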

Highlights

  • Atomic Signal Mechanism: Implemented a new mechanism to signal completion of work for individual batches (experts) using an optional cute.Pointer to dst_signals, enabling fine-grained synchronization.
  • New DSL Operations: Introduced with_byte and read_byte for Uint64 manipulation, and atomic_add_release_global for atomic increments in global memory, leveraging LLVM inline assembly.
  • Scheduler Integration: Modified MaskedSchedulerParams and MaskedScheduler to accept and manage dst_signals, updating dsm_pending_packed and dsm_counter to track signal states for each batch.
  • Kernel Synchronization Logic: Adjusted the kernel's c_pipeline.producer_acquire() and producer_tail() calls to conditionally wait for writes (read=False) when dst_signals are enabled, ensuring proper synchronization before signaling.
  • Python API Extension: Extended the grouped_gemm_nt_masked function and related internal functions to accept an optional dst_signals tensor, allowing users to enable and utilize this signaling feature.
  • Comprehensive Testing: Added new test cases to test_cute_dsl_blockscaled_gemm.py to verify the correctness of the dst_signals functionality, including assertions on the final signal values.
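As a rough illustration of the byte helpers named in the highlights, the sketch below emulates them on ordinary Python integers. The names `with_byte` and `read_byte` come from the summary, but the argument order, the byte numbering (0 as the least significant byte), and the pure-Python form are assumptions; the PR's versions are DSL operations built on LLVM inline assembly.

```python
MASK64 = (1 << 64) - 1  # keep results within an unsigned 64-bit range


def read_byte(value, index):
    """Return byte `index` (0 = least significant, by assumption) of a 64-bit value."""
    assert 0 <= index < 8
    return (value >> (8 * index)) & 0xFF


def with_byte(value, index, byte):
    """Return `value` with byte `index` replaced by `byte`; other bytes are unchanged."""
    assert 0 <= index < 8 and 0 <= byte <= 0xFF
    shift = 8 * index
    cleared = value & ~(0xFF << shift) & MASK64  # zero out the target byte
    return cleared | (byte << shift)


packed = with_byte(0, 3, 0xAB)       # place 0xAB in byte 3
packed = with_byte(packed, 0, 0x07)  # place 0x07 in byte 0
```

Packing several small counters into one 64-bit word this way lets a single register hold per-batch state, which is one plausible reading of how a field such as dsm_pending_packed is used.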

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces support for output signals in the cutedsl GEMM kernel, aimed at enabling better overlapping of computations. The changes are comprehensive, affecting the scheduler, the kernel implementation, and the public-facing API. New helper functions for byte manipulation and atomic operations are added. The core logic modification resides in the kernel's epilogue, which now includes conditional synchronization and atomic signaling. The associated tests have been updated to validate this new functionality. My review focuses on enhancing the maintainability of the complex new kernel logic by suggesting refactoring to reduce code duplication and simplify conditional structures. Overall, the changes appear logically sound and the new feature is well-tested.

Comment thread flashinfer/cute_dsl/blockscaled_gemm.py
Comment thread flashinfer/cute_dsl/blockscaled_gemm.py
@zhyncs zhyncs merged commit 79fe3cd into flashinfer-ai:main Sep 15, 2025
2 checks passed