[Dev] Add permute/unpermute fusion with dispatch/combine in Hybrid-EP#4073

Merged: yaox12 merged 11 commits into NVIDIA:dev from Autumn1998:tongliu_permute_comm_fusion on Apr 17, 2026

@Autumn1998

@Autumn1998 Autumn1998 commented Mar 31, 2026

Contributor

What does this PR do?

This PR introduces a new feature: fusing permute/unpermute with dispatch/combine into a single kernel.
The feature is provided by Hybrid-EP.
Related PR in main: #4089
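
For readers unfamiliar with the terms, here is a rough illustration of what permute and unpermute mean in MoE token routing. This is a simplified top-1 sketch in plain Python, not the Hybrid-EP kernel: the function names and the list-based layout are assumptions made for illustration only. Permute groups tokens by destination expert; unpermute restores the original token order.

```python
def permute(tokens, expert_ids):
    """Group tokens so that tokens routed to the same expert are contiguous.

    Returns the reordered tokens and the source index of each output slot.
    """
    # sorted() is stable, so tokens for one expert keep their relative order.
    order = sorted(range(len(tokens)), key=lambda i: expert_ids[i])
    permuted = [tokens[i] for i in order]
    return permuted, order


def unpermute(permuted, order):
    """Scatter permuted tokens back to their original positions."""
    restored = [None] * len(permuted)
    for dst, src in enumerate(order):
        restored[src] = permuted[dst]
    return restored


tokens = ["t0", "t1", "t2", "t3"]
expert_ids = [1, 0, 1, 0]  # hypothetical top-1 routing decision per token

permuted, order = permute(tokens, expert_ids)
restored = unpermute(permuted, order)
```

Without fusion, these reorderings would be separate steps around the dispatch/combine communication; the PR description says the fused path performs them in one kernel.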

⚠️ For major changes (either in lines of code or in its impact), please make sure to first share a design doc with the team. If you're unsure what's the best way to do so, contact the @mcore-oncall.

Contribution process

Pre-checks

  • I have added relevant unit tests
  • I have added relevant functional tests
  • I have added proper typing to my code (Typing guidelines)
  • I have added relevant documentation
  • I have run the autoformatter.sh on my PR

Code review

Feel free to message or comment the @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!

All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.

Step 1: Mark PR as "Ready for Review"

  1. When your PR is ready, click Ready for Review.
  2. An oncall reviewer is auto-assigned and expert reviewers are notified based on your changes.
    • Some PRs may jump straight to step 2. This is determined by .github/CODEOWNERS.

⚠️ Only mark as ready once merge-conflicts are resolved and the CI is passing.
Final Review might get declined if these requirements are not fulfilled.

Step 2: Final Review

For PRs that change megatron/core, once all expert reviewers have approved, the Final Review label is applied automatically and final reviewers are assigned.

For PRs outside megatron/core, this step is skipped.

Step 3: Approved

Once all required reviewers have approved, the Approved label is applied automatically.

Merge

Any member of mcore-engineers will be able to merge your PR.

For MRs into `dev` branch

The proposed review process for the `dev` branch is under active discussion.

MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

@Autumn1998 Autumn1998 requested review from a team as code owners March 31, 2026 08:25
@copy-pr-bot

copy-pr-bot Bot commented Mar 31, 2026


This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Comment thread megatron/core/transformer/transformer_config.py Outdated
Comment thread megatron/core/transformer/moe/fused_a2a.py Outdated
@Autumn1998 Autumn1998 force-pushed the tongliu_permute_comm_fusion branch from 548af5e to 75407d6 Compare April 1, 2026 04:14
@Autumn1998 Autumn1998 mentioned this pull request Apr 1, 2026
5 tasks
block interleaved format. Instead of interpreting the input tensor
as a concatenation of gates and linear units, it will be
interpreted as alternating blocks of gates and linear units.
moe_hybridep_num_blocks_permute: int = 96

Contributor

I am trying to understand why this is restricted to some blocks rather than all of them. Is a block an MoE block here? Also, why 96?

Contributor Author

The 'block' here refers to a CUDA thread block. In fuse mode, it can be thought of as how many SMs are used by the permute operation. 96 is a setting we tested that delivers reasonably good performance. However, after thinking about it more, it would be better to set this to None, because Hybrid EP will automatically choose a good value when None is passed in.
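
A minimal sketch of the suggested default, assuming the field becomes optional. The class `HybridEPPermuteConfig` and the helper `resolve_num_blocks` are hypothetical stand-ins, not the real `TransformerConfig` or any Hybrid-EP API; `auto_choice` stands in for whatever value Hybrid-EP would pick internally when it receives None.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HybridEPPermuteConfig:  # hypothetical stand-in for the real config class
    # None means "let Hybrid-EP choose"; an int pins the CUDA thread block count.
    moe_hybridep_num_blocks_permute: Optional[int] = None


def resolve_num_blocks(cfg: HybridEPPermuteConfig, auto_choice: int) -> int:
    """Return the explicit block count, or fall back to the automatic choice.

    `auto_choice` is an assumption: it models the library's own selection.
    """
    value = cfg.moe_hybridep_num_blocks_permute
    if value is None:
        return auto_choice
    if value <= 0:
        raise ValueError("moe_hybridep_num_blocks_permute must be positive")
    return value


default_cfg = HybridEPPermuteConfig()
pinned_cfg = HybridEPPermuteConfig(moe_hybridep_num_blocks_permute=96)
```

With this shape, users who have tuned a value such as 96 can still pin it, while everyone else gets the automatic selection.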

@gautham-kollu

Contributor

@Autumn1998 Can we add this to an existing functional test in MLM that is marked "mr-github-slim", so that it is protected against regressions?

@ko3n1g Is "mr-github-slim" the right place for this, so that it runs as part of each PR?

Consider Megatron-LM/tests/test_utils/recipes/h100/moe.yaml
cc: @ko3n1g for any other thoughts on testing

block interleaved format. Instead of interpreting the input tensor
as a concatenation of gates and linear units, it will be
interpreted as alternating blocks of gates and linear units.
moe_hybridep_num_blocks_permute: int = 96

Contributor

Maybe a nit, but this is a little unclear to me. Could we add a short note on what exactly these blocks are and how to select a value for them?

Contributor Author

This is actually the number of CUDA thread blocks used by the permute operation. In the fuse path, it corresponds to the number of SMs used. In the non-fuse path, there can be cases where multiple blocks are scheduled on the same SM.
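
The relationship between thread blocks and SMs described above can be sketched as simple arithmetic. This is a rough model under stated assumptions, not a description of the actual scheduler: `max_resident_blocks_per_sm` is a hypothetical parameter for how many blocks of the kernel can share one SM, and the 132-SM figure is only an example device size.

```python
import math


def sms_used_bounds(num_blocks: int, num_sms: int, max_resident_blocks_per_sm: int):
    """Return (lower, upper) bounds on how many SMs the blocks can occupy.

    With max_resident_blocks_per_sm == 1 (the fused-path assumption) the
    bounds coincide: each block occupies its own SM.
    """
    # Spread out as much as possible: at most one SM per block.
    upper = min(num_blocks, num_sms)
    # Packed as tightly as possible: several blocks share each SM.
    lower = min(num_sms, math.ceil(num_blocks / max_resident_blocks_per_sm))
    return lower, upper


fused = sms_used_bounds(96, 132, max_resident_blocks_per_sm=1)
non_fused = sms_used_bounds(96, 132, max_resident_blocks_per_sm=4)
```

Under these assumptions, 96 blocks in the fused path occupy exactly 96 SMs, while in the non-fused path the same 96 blocks could land on anywhere between 24 and 96 SMs.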

Comment thread megatron/core/transformer/moe/fused_a2a.py Outdated
@Autumn1998 Autumn1998 force-pushed the tongliu_permute_comm_fusion branch from 75407d6 to 2b56c07 Compare April 7, 2026 10:27

Contributor

Can this assertion be removed with this PR?

Contributor Author

sure

@Autumn1998

Contributor Author

/ok to test 4c5a244

@svcnvidia-nemo-ci svcnvidia-nemo-ci added this to the Core 0.16 milestone Apr 9, 2026
@Autumn1998

Contributor Author

/ok to test db328d0

@Autumn1998

Contributor Author

/ok to test 6304273

@Autumn1998

Contributor Author

/ok to test d65baf8

@Victarry Victarry changed the title Add permute/unpermute fusion with dispatch/combine in Hybrid-EP [Dev] Add permute/unpermute fusion with dispatch/combine in Hybrid-EP Apr 15, 2026
@Autumn1998

Autumn1998 commented Apr 17, 2026

Contributor Author

/ok to test ce3f875

@yaox12 yaox12 enabled auto-merge April 17, 2026 07:28
@yaox12 yaox12 added this pull request to the merge queue Apr 17, 2026
@svcnvidia-nemo-ci


🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/24558203417

Merged via the queue into NVIDIA:dev with commit f2a40ef Apr 17, 2026
158 of 177 checks passed

7 participants