Support GEMM + Swiglu fused MLP by ksivaman · Pull Request #3890 · NVIDIA/Megatron-LM

ksivaman · 2026-03-16T19:19:07Z

What does this PR do?

This PR adds support for a fused GEMM + SwiGLU MLP, implemented through Transformer Engine sequential ops.
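For readers unfamiliar with the fusion, here is a minimal pure-Python sketch of what "GEMM + SwiGLU" computes. It is not the Transformer Engine implementation and uses none of its APIs. The function names, the nested-list tensors, and the layout of the FC1 output (gate half first, linear half second) are assumptions made for illustration. The unfused path stores the full FC1 output before applying SwiGLU, while the fused-style path applies the activation as each output pair is produced.

```python
import math


def silu(x: float) -> float:
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def matmul(a, b):
    """Plain GEMM on nested lists: (m x k) @ (k x n) -> (m x n)."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[p] * b[p][j] for p in range(inner)) for j in range(cols)]
            for row in a]


def mlp_unfused(x, w_fc1, w_fc2):
    """Reference path: FC1 GEMM, then SwiGLU as a separate step, then FC2 GEMM."""
    h = matmul(x, w_fc1)  # tokens x (2 * ffn), fully materialized
    ffn = len(h[0]) // 2
    # Assumed layout: columns [0, ffn) are the gate, [ffn, 2*ffn) the linear branch.
    act = [[silu(row[j]) * row[ffn + j] for j in range(ffn)] for row in h]
    return matmul(act, w_fc2)


def mlp_fused(x, w_fc1, w_fc2):
    """Fused-style path: SwiGLU is applied as each gate/linear pair is computed,
    so the (2 * ffn)-wide FC1 output is never stored."""
    hidden, ffn = len(w_fc1), len(w_fc1[0]) // 2
    out = []
    for row in x:
        act = []
        for j in range(ffn):
            gate = sum(row[p] * w_fc1[p][j] for p in range(hidden))
            up = sum(row[p] * w_fc1[p][ffn + j] for p in range(hidden))
            act.append(silu(gate) * up)
        out.append([sum(act[j] * w_fc2[j][c] for j in range(ffn))
                    for c in range(len(w_fc2[0]))])
    return out


if __name__ == "__main__":
    x = [[1.0, -2.0], [0.5, 0.25]]            # 2 tokens, hidden size 2
    w_fc1 = [[0.1, -0.3, 0.2, 0.4, 0.0, -0.1],
             [0.5, 0.2, -0.4, 0.3, 0.1, 0.2]]  # hidden x (2 * ffn), ffn = 3
    w_fc2 = [[0.3, -0.2], [0.1, 0.4], [-0.5, 0.2]]  # ffn x hidden
    ref, fused = mlp_unfused(x, w_fc1, w_fc2), mlp_fused(x, w_fc1, w_fc2)
    for r_row, f_row in zip(ref, fused):
        assert all(math.isclose(r, f, rel_tol=1e-12, abs_tol=1e-12)
                   for r, f in zip(r_row, f_row))
    print("fused and unfused outputs match")
```

A numerics test for a real fused kernel would follow the same pattern, comparing against an unfused reference, but with tolerances chosen for the precision in use.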

Contribution process

Pre-checks

- I have added relevant unit tests
- I have added relevant functional tests
- I have added proper typing to my code (see the Typing guidelines)
- I have added relevant documentation
- I have run autoformatter.sh on my PR

Code review

Feel free to message @mcore-oncall, or mention them in a comment, to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!

All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.

Step 1: Mark PR as "Ready for Review"

When your PR is ready, click Ready for Review.
An oncall reviewer is auto-assigned and expert reviewers are notified based on your changes.
- Some PRs may jump straight to step 2. This is determined by .github/CODEOWNERS.

⚠️ Only mark as ready once merge-conflicts are resolved and the CI is passing.
Final Review might get declined if these requirements are not fulfilled.

Step 2: Final Review

For PRs that change megatron/core, once all expert reviewers have approved, the Final Review label is applied automatically and final reviewers are assigned.

For PRs outside megatron/core, this step is skipped.

Step 3: Approved

Once all required reviewers have approved, the Approved label is applied automatically.

Merge

Any member of mcore-engineers will be able to merge your PR.

For MRs into the `dev` branch

The proposed review process for the `dev` branch is under active discussion.

MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

copy-pr-bot · 2026-03-16T19:19:11Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

ko3n1g · 2026-03-16T21:14:49Z

/ok to test 86d4097

svcnvidia-nemo-ci · 2026-03-16T21:15:09Z

❌ Cherry-pick to main failed

The cherry-pick encountered conflicts and could not be completed automatically.

Next steps:

- Manually create a PR with these changes to main
- Resolve any conflicts

yaox12 · 2026-03-17T16:37:44Z

/claude review

Signed-off-by: ksivamani <ksivamani@nvidia.com>

yaox12 · 2026-03-18T13:53:15Z

Can you add unit tests for the numerics of the fusion, and the remapping of parameter keys?

ksivaman · 2026-03-18T18:48:15Z

@yaox12 The unit tests for both checkpoint loading and the fusion numerics are included in NVIDIA/TransformerEngine#2769, so I think adding them here separately would be duplication. Unless you mean an e2e test?

yaox12

LGTM. UTs are covered in TE.

yaox12 · 2026-03-19T14:36:01Z

/ok to test 14f63ad

svcnvidia-nemo-ci · 2026-03-19T14:36:25Z

❌ Cherry-pick to main failed

The cherry-pick encountered conflicts and could not be completed automatically.

Next steps:

- Manually create a PR with these changes to main
- Resolve any conflicts

Signed-off-by: Xin Yao <xiny@nvidia.com>

svcnvidia-nemo-ci · 2026-03-19T22:48:13Z

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/23320590254

svcnvidia-nemo-ci · 2026-03-20T00:58:39Z

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/23324389643

svcnvidia-nemo-ci · 2026-03-20T02:01:19Z

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/23325909832

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by: ksivamani <ksivamani@nvidia.com>

…r hunk

The `_cached_param_buffer_shards_grad_enabled` field, its read site in `start_param_sync()`, and the `with torch.no_grad()` wrap around the coalescing manager all originated in NVIDIA#3890 on the dev branch. The dev sync merge `79aeecfe0` (Mar 25 2026) explicitly removed the read site and the no_grad wrap during conflict resolution when it pulled in the layerwise-optimizer code from main; only the field init survived as an orphan in `__init__`.

The active logic was deliberately dropped, no regression was reported on dev or main in the intervening months, and zhongbozhu flagged this exact block on this PR (r3211212707) noting it was removed in dev. For a PR targeting main, resurrecting a hunk that was specifically dropped during a merge, without a fresh repro proving main needs it, is the wrong default.

Remove all three pieces (the orphan init, the read site, the no_grad wrap) so this file matches main's shape except for the changes that are genuinely part of this PR's scope.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Support GEMM + Swiglu fused MLP

86d4097

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

ksivaman requested review from a team as code owners March 16, 2026 19:19

ko3n1g added the mirror-to-main label Mar 16, 2026

svcnvidia-nemo-ci added this to the Core 0.16 milestone Mar 16, 2026

QiZhangNV reviewed Mar 17, 2026

Comment thread megatron/core/transformer/moe/experts.py

QiZhangNV reviewed Mar 17, 2026

Comment thread megatron/core/transformer/moe/experts.py

claude Bot reviewed Mar 17, 2026

Comment thread megatron/core/transformer/moe/experts.py Outdated

claude Bot reviewed Mar 17, 2026

Comment thread megatron/core/transformer/moe/experts.py

claude Bot reviewed Mar 17, 2026

Comment thread megatron/core/transformer/transformer_config.py

ksivaman added 2 commits March 17, 2026 12:23

Merge branch 'dev' into ksivaman/fused_grouped_mlp_mxfp8

11e3359

Review comments

3981f36

Signed-off-by: ksivamani <ksivamani@nvidia.com>

ksivaman commented Mar 18, 2026

Comment thread megatron/core/transformer/moe/experts.py Outdated

ksivaman added 2 commits March 18, 2026 09:07

Update megatron/core/transformer/moe/experts.py

f73a50b

Merge branch 'dev' into ksivaman/fused_grouped_mlp_mxfp8

80c7874

erhoo82 mentioned this pull request Mar 18, 2026

[feature] enable improved Group Linear module for MoE NVIDIA-NeMo/Megatron-Bridge#2885

Closed

Merge branch 'dev' into ksivaman/fused_grouped_mlp_mxfp8

e21d73b

yaox12 approved these changes Mar 19, 2026

Merge branch 'dev' into ksivaman/fused_grouped_mlp_mxfp8

14f63ad

yaox12 enabled auto-merge March 19, 2026 14:36

format

d49c517

Signed-off-by: Xin Yao <xiny@nvidia.com>

copy-pr-bot Bot temporarily deployed to test March 19, 2026 19:48 Inactive

yaox12 added this pull request to the merge queue Mar 19, 2026

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Mar 19, 2026

yaox12 added this pull request to the merge queue Mar 20, 2026

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Mar 20, 2026

yaox12 added this pull request to the merge queue Mar 20, 2026

Merged via the queue into NVIDIA:dev with commit c72c459 Mar 20, 2026
75 of 80 checks passed

ksivaman mentioned this pull request Mar 20, 2026

Support GEMM + Swiglu fused MLP #3971

Closed

5 tasks

copy-pr-bot Bot pushed a commit that referenced this pull request Mar 20, 2026

Support GEMM + Swiglu fused MLP (#3890)

ba650c7

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by: ksivamani <ksivamani@nvidia.com>

nanz-nv mentioned this pull request Mar 30, 2026

[Dev] Paged Stashing #2690

Merged

zhongbozhu mentioned this pull request Mar 31, 2026

[Dev] Skip routed expert padding for graph-safe MoE #4071

Merged

5 tasks

skyw mentioned this pull request Mar 31, 2026

Merge emerging-optimizers change from dev to main #4060

Closed

5 tasks

Victarry mentioned this pull request Apr 7, 2026

[ROADMAP][Updated on April 07] Megatron Core MoE Roadmap #1729

Open

48 tasks

FDecaYed mentioned this pull request Apr 30, 2026

chore: nightly sync main into dev (28_04_2026) #4505

Closed

5 tasks

Connor-XY mentioned this pull request May 5, 2026

Combine GEMM + SwiGLU fused MLP PRs (3890, 4071, 4095, 4219, 4311, 4324) → main #4636

Merged

3 tasks

Connor-XY pushed a commit to Connor-XY/Megatron-LM that referenced this pull request May 5, 2026

Support GEMM + Swiglu fused MLP (NVIDIA#3890)

f93126f

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by: ksivamani <ksivamani@nvidia.com>

Connor-XY pushed a commit to Connor-XY/Megatron-LM that referenced this pull request May 5, 2026

Support GEMM + Swiglu fused MLP (NVIDIA#3890)

2bab0ee

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by: ksivamani <ksivamani@nvidia.com>

Connor-XY pushed a commit to Connor-XY/Megatron-LM that referenced this pull request May 7, 2026

Support GEMM + Swiglu fused MLP (NVIDIA#3890)

c6b2a3d

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by: ksivamani <ksivamani@nvidia.com>

Connor-XY pushed a commit to Connor-XY/Megatron-LM that referenced this pull request May 8, 2026

Support GEMM + Swiglu fused MLP (NVIDIA#3890)

e099b9f

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by: ksivamani <ksivamani@nvidia.com>

Connor-XY pushed a commit to Connor-XY/Megatron-LM that referenced this pull request May 11, 2026

Support GEMM + Swiglu fused MLP (NVIDIA#3890)

213c3e7

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by: ksivamani <ksivamani@nvidia.com>

Connor-XY pushed a commit to Connor-XY/Megatron-LM that referenced this pull request May 12, 2026

Support GEMM + Swiglu fused MLP (NVIDIA#3890)

3beebba

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by: ksivamani <ksivamani@nvidia.com>

Connor-XY pushed a commit to Connor-XY/Megatron-LM that referenced this pull request May 12, 2026

Support GEMM + Swiglu fused MLP (NVIDIA#3890)

02688e4

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by: ksivamani <ksivamani@nvidia.com>

Connor-XY pushed a commit to Connor-XY/Megatron-LM that referenced this pull request May 12, 2026

Support GEMM + Swiglu fused MLP (NVIDIA#3890)

b67de9c

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by: ksivamani <ksivamani@nvidia.com>

Connor-XY pushed a commit to Connor-XY/Megatron-LM that referenced this pull request May 13, 2026

Support GEMM + Swiglu fused MLP (NVIDIA#3890)

2e618b3

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by: ksivamani <ksivamani@nvidia.com>

Victarry mentioned this pull request May 15, 2026

[ROADMAP][2026 Q2] Megatron Core MoE Roadmap #4815

Open

71 tasks

6 participants