[Dev] Add permute/unpermute fusion with dispatch/combine in Hybrid-EP by Autumn1998 · Pull Request #4073 · NVIDIA/Megatron-LM

Autumn1998 · 2026-03-31T08:25:47Z

What does this PR do ?

This PR introduces a new feature: fusing permute/unpermute with dispatch/combine into a single kernel.
The feature is provided by Hybrid-EP.
Related PR in main: #4089
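
For readers unfamiliar with the two operations being fused, here is a minimal pure-Python sketch of what an unfused permute/unpermute pair does conceptually. It is not the Hybrid-EP kernel or its API: the function names `permute` and `unpermute`, the string tokens, and the top-1 routing (one expert per token) are all assumptions made for illustration.

```python
from typing import List, Tuple


def permute(tokens: List[str], expert_ids: List[int]) -> Tuple[List[str], List[int]]:
    """Group tokens by destination expert (hypothetical reference version).

    Returns the reordered tokens and, for each output row, the index of the
    input row it came from. Python's sort is stable, so tokens routed to the
    same expert keep their original relative order.
    """
    order = sorted(range(len(tokens)), key=lambda i: expert_ids[i])
    return [tokens[i] for i in order], order


def unpermute(permuted: List[str], order: List[int]) -> List[str]:
    """Scatter rows back to their original positions, inverting permute()."""
    restored: List[str] = [""] * len(permuted)
    for dst, src in enumerate(order):
        restored[src] = permuted[dst]
    return restored


tokens = ["t0", "t1", "t2", "t3"]
expert_ids = [1, 0, 1, 0]  # assumed top-1 routing: one expert per token

permuted, order = permute(tokens, expert_ids)
assert permuted == ["t1", "t3", "t0", "t2"]
assert unpermute(permuted, order) == tokens
```

In an unfused pipeline these reorderings run as separate steps around the dispatch/combine communication; the PR description says the fused path folds them into the same kernel.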

⚠️ For major changes (either in lines of code or in its impact), please make sure to first share a design doc with the team. If you're unsure what's the best way to do so, contact the @mcore-oncall.

Contribution process

Pre-checks

I have added relevant unit tests
I have added relevant functional tests
I have added proper typing to my code (see the typing guidelines)
I have added relevant documentation
I have run the autoformatter.sh on my PR

Code review

Feel free to message or comment the @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!

All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.

Step 1: Mark PR as "Ready for Review"

When your PR is ready, click Ready for Review.
An oncall reviewer is auto-assigned and expert reviewers are notified based on your changes.
- Some PRs may jump straight to step 2. This is determined by .github/CODEOWNERS.

⚠️ Only mark as ready once merge-conflicts are resolved and the CI is passing.
Final Review might get declined if these requirements are not fulfilled.

Step 2: Final Review

For PRs that change megatron/core, once all expert reviewers have approved, the Final Review label is applied automatically and final reviewers are assigned.

For PRs outside megatron/core, this step is skipped.

Step 3: Approved

Once all required reviewers have approved, the Approved label is applied automatically.

Merge

Any member of mcore-engineers will be able to merge your PR.

For MRs into `dev` branch

The proposed review process for `dev` branch is under active discussion.

MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

copy-pr-bot · 2026-03-31T08:25:52Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

gautham-kollu · 2026-04-05T07:02:08Z

-    block interleaved format. Instead of interpreting the input tensor
-    as a concatenation of gates and linear units, it will be
-    interpreted as alternating blocks of gates and linear units.
+    moe_hybridep_num_blocks_permute: int = 96


I am trying to understand why this is restricted to some blocks rather than all of them. Is a block an MoE block here? Also, why 96?

Autumn1998 (reply, Apr 7, 2026): The "block" here refers to a CUDA thread block. In fuse mode, it can be thought of as the number of SMs used by the permute operation. 96 is a setting we tested that delivers reasonably good performance. On reflection, though, it would be better to set this to None, because Hybrid-EP automatically chooses a good value when None is passed in.
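
The suggestion in this reply could look roughly like the sketch below. Only the field name `moe_hybridep_num_blocks_permute` comes from the diff above; the `HybridEPPermuteConfig` class, the validation, and the `None` default are assumptions for illustration, not the code that was merged.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HybridEPPermuteConfig:
    """Hypothetical stand-in for the relevant part of the transformer config."""

    # Number of CUDA thread blocks used by the permute operation.
    # None is assumed to mean "let Hybrid-EP choose a value", as described
    # in the reply above; 96 was the value reported as tested.
    moe_hybridep_num_blocks_permute: Optional[int] = None

    def __post_init__(self) -> None:
        n = self.moe_hybridep_num_blocks_permute
        if n is not None and n <= 0:
            raise ValueError(
                "moe_hybridep_num_blocks_permute must be a positive integer or None"
            )


auto_cfg = HybridEPPermuteConfig()  # defer the choice to Hybrid-EP
pinned_cfg = HybridEPPermuteConfig(moe_hybridep_num_blocks_permute=96)
```

Using `Optional[int]` keeps the tested value available as an explicit override without hard-coding it as the default.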

gautham-kollu · 2026-04-05T07:05:47Z

@Autumn1998 Can we add this to an existing functional test in MLM that is marked "mr-github-slim", so that it is protected against regressions?

@ko3n1g Is "mr-github-slim" the right place, so that it runs as part of each PR?

Consider Megatron-LM/tests/test_utils/recipes/h100/moe.yaml
cc: @ko3n1g for any other thoughts on testing

yashaswikarnati · 2026-04-05T14:47:12Z

-    block interleaved format. Instead of interpreting the input tensor
-    as a concatenation of gates and linear units, it will be
-    interpreted as alternating blocks of gates and linear units.
+    moe_hybridep_num_blocks_permute: int = 96


Maybe a nit, but this is a little unclear to me. Could we add a short note on what exactly these blocks are and how to select a value?

Autumn1998 (reply, Apr 7, 2026): This is the number of CUDA thread blocks used by the permute operation. In the fuse path, it corresponds to the number of SMs used. In the non-fuse path, multiple blocks can be scheduled on the same SM.

vasunvidia · 2026-04-08T17:07:14Z

Can this assertion be removed with this PR?

Autumn1998 · 2026-04-09T02:24:24Z

/ok to test 4c5a244

Autumn1998 · 2026-04-13T07:30:54Z

/ok to test db328d0

Autumn1998 · 2026-04-13T08:31:30Z

/ok to test 6304273

Autumn1998 · 2026-04-13T08:33:56Z

/ok to test d65baf8

Autumn1998 · 2026-04-17T02:22:25Z

/ok to test ce3f875

svcnvidia-nemo-ci · 2026-04-17T09:31:55Z

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/24558203417

Autumn1998 requested review from a team as code owners March 31, 2026 08:25

Victarry reviewed Mar 31, 2026

Comment thread megatron/core/transformer/transformer_config.py Outdated

Comment thread megatron/core/transformer/moe/fused_a2a.py Outdated

Autumn1998 force-pushed the tongliu_permute_comm_fusion branch from 548af5e to 75407d6 Compare April 1, 2026 04:14

Autumn1998 mentioned this pull request Apr 1, 2026

add permute fusion into hybrid ep #4089

Merged

5 tasks

gautham-kollu requested a review from yashaswikarnati April 5, 2026 06:59

gautham-kollu reviewed Apr 5, 2026

yashaswikarnati reviewed Apr 5, 2026

Comment thread megatron/core/transformer/moe/fused_a2a.py Outdated

Victarry mentioned this pull request Apr 7, 2026

[ROADMAP][Updated on April 07] Megatron Core MoE Roadmap #1729

Open

48 tasks

Autumn1998 added 2 commits April 7, 2026 03:25

add permute fusion into hybrid ep

8aa0c01

add fix

2b56c07

Autumn1998 force-pushed the tongliu_permute_comm_fusion branch from 75407d6 to 2b56c07 Compare April 7, 2026 10:27

vasunvidia reviewed Apr 8, 2026

rm assert for hybrid-ep

4c5a244

svcnvidia-nemo-ci added this to the Core 0.16 milestone Apr 9, 2026

fix CI

db328d0

format

6304273

copy-pr-bot Bot had a problem deploying to test April 13, 2026 08:32 Error

Merge branch 'dev' into tongliu_permute_comm_fusion

d65baf8

copy-pr-bot Bot temporarily deployed to test April 13, 2026 08:34 Inactive

Victarry changed the title ~~Add permute/unpermute fusion with dispatch/combine in Hybrid-EP~~ [Dev] Add permute/unpermute fusion with dispatch/combine in Hybrid-EP Apr 15, 2026

Autumn1998 added 2 commits April 15, 2026 19:53

fix mamba CI

abd2d8e

add ut for permute fusion

fc65799

Autumn1998 added 3 commits April 15, 2026 19:58

Merge branch 'tongliu_permute_comm_fusion' of github.com:Autumn1998/M…

cc95a11

…egatron-LM into tongliu_permute_comm_fusion

add ut

e193c30

fix mamba CI

ce3f875

copy-pr-bot Bot temporarily deployed to test April 16, 2026 11:15 Inactive

yaox12 approved these changes Apr 17, 2026

yaox12 enabled auto-merge April 17, 2026 07:28

yaox12 added this pull request to the merge queue Apr 17, 2026

Merged via the queue into NVIDIA:dev with commit f2a40ef Apr 17, 2026
158 of 177 checks passed

Victarry mentioned this pull request May 15, 2026

[ROADMAP][2026 Q2] Megatron Core MoE Roadmap #4815

Open

71 tasks

7 participants
