Add TEFusedDenseMLP for Dense+Grouped GEMM fusion on SM100+ by sraman-rgb · Pull Request #4318 · NVIDIA/Megatron-LM

sraman-rgb · 2026-04-15T14:32:46Z

What does this PR do ?

Adds a TEFusedDenseMLP class that runs the dense MLP GEMMs through the grouped GEMM path with a single group (group=1).
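
To illustrate why a grouped GEMM with one group can stand in for a dense GEMM, here is a minimal pure-Python sketch. The helpers `matmul` and `grouped_gemm` are hypothetical names written for this example; they are not the Transformer Engine kernels and only show the arithmetic equivalence.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain dense matrix product a @ b."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def grouped_gemm(x: Matrix, weights: List[Matrix], group_sizes: List[int]) -> Matrix:
    """Multiply consecutive row blocks of x by their own weight matrix.

    group_sizes[i] rows of x are multiplied by weights[i].
    """
    assert len(weights) == len(group_sizes)
    assert sum(group_sizes) == len(x)
    out: Matrix = []
    start = 0
    for w, size in zip(weights, group_sizes):
        out.extend(matmul(x[start:start + size], w))
        start += size
    return out


x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
w = [[1.0, 0.0, -1.0], [0.5, 2.0, 1.0]]

dense = matmul(x, w)
# One group that covers every row: the grouped product reduces to the dense one.
grouped = grouped_gemm(x, [w], [len(x)])
assert grouped == dense
```

With a single group there is only one weight matrix and one row block, so the loop body runs once and computes the ordinary dense product.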
⚠️ For major changes (either in lines of code or in its impact), please make sure to first share a design doc with the team. If you're unsure what's the best way to do so, contact the @mcore-oncall.

Contribution process

Pre-checks

I have added relevant unit tests
I have added relevant functional tests
I have added proper typing to my code (Typing guidelines)
I have added relevant documentation
I have run the autoformatter.sh on my PR

Code review

Feel free to message or comment the @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!

All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.

Step 1: Mark PR as "Ready for Review"

When your PR is ready, click Ready for Review.
An oncall reviewer is auto-assigned and expert reviewers are notified based on your changes.
- Some PRs may jump straight to step 2. This is determined by .github/CODEOWNERS.

⚠️ Only mark as ready once merge-conflicts are resolved and the CI is passing.
Final Review might get declined if these requirements are not fulfilled.

Step 2: Final Review

For PRs that change megatron/core, once all expert reviewers have approved, the Final Review label is applied automatically and final reviewers are assigned.

For PRs outside megatron/core, this step is skipped.

Step 3: Approved

Once all required reviewers have approved, the Approved label is applied automatically.

Merge

Any member of mcore-engineers will be able to merge your PR.

For MRs into `dev` branch

The proposed review process for `dev` branch is under active discussion.

MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

copy-pr-bot · 2026-04-15T14:32:50Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

sraman-rgb · 2026-04-16T18:56:35Z

/ok to test 77bbe7d

gautham-kollu · 2026-04-16T18:56:35Z

/ok to test 77bbe7d

yaox12 · 2026-04-17T07:59:45Z

/claude review

claude

Review Summary

The new TEFusedDenseMLP class looks well-structured overall, with good comments explaining the norm-outside-autocast rationale. Two issues:

- Bug: silent bias drop. See inline comment. If a model has add_bias_linear=True, both FC1 and FC2 biases are silently ignored. The parent TEFusedMLP handles these; this subclass does not. Either add a validation rejecting bias configs or add Bias ops to the fused pipeline.
- No tests. This introduces a new module with non-trivial logic (recipe selection, norm separation, GroupedLinear wiring, bias handling). Consider adding at least a unit test that instantiates TEFusedDenseMLP and verifies the __init__ validation rejects non-SwiGLU activations, and an integration-style test that runs a forward pass and checks output shape/numerics against TEFusedMLP (where applicable).

…tivation_func ref, add tests

- Use self.config.activation_func (not self.activation_func) to always compare the config function regardless of use_te_activation_func setting
- Add RuntimeError for TE < 2.14.0 (GroupedLinear/ScaledSwiGLU ops requirement, consistent with TEGroupedMLP._is_fused_impl_supported in experts.py)
- Add ValueError for add_bias_linear=True: GroupedLinear ops are bias=False and the CuTeGEMM fusion pattern has no room for a bias op; fail loudly instead of silently dropping bias parameters
- Add tests/unit_tests/transformer/test_te_fused_dense_mlp_spec.py covering instantiation, wrong-activation, gated_linear_unit=False, and bias errors
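
The constructor checks described in this commit could look roughly like the sketch below. It is not the code from the PR: `DenseMLPConfig`, `validate_fused_dense_config`, `parse_version`, and the `silu` stand-in are hypothetical names for illustration, and using ValueError for the activation check is an assumption. Only the TE 2.14.0 floor, the RuntimeError for old TE versions, and the ValueError for add_bias_linear=True come from the commit message.

```python
import math
import re
from dataclasses import dataclass
from typing import Callable, Tuple

MIN_TE_VERSION = (2, 14, 0)  # version floor quoted in the commit message


def silu(x: float) -> float:
    """Stand-in for the SiLU activation a real config would reference."""
    return x / (1.0 + math.exp(-x))


@dataclass
class DenseMLPConfig:
    """Hypothetical subset of the config fields that the checks read."""
    activation_func: Callable = silu
    gated_linear_unit: bool = True
    add_bias_linear: bool = False


def parse_version(version: str) -> Tuple[int, ...]:
    """Simplified parser: keep the first three numeric components."""
    return tuple(int(p) for p in re.findall(r"\d+", version)[:3])


def validate_fused_dense_config(config: DenseMLPConfig, te_version: str) -> None:
    if parse_version(te_version) < MIN_TE_VERSION:
        raise RuntimeError(f"fused dense MLP needs TE >= 2.14.0, found {te_version}")
    # Compare the function stored on the config, not a possibly replaced attribute.
    if not config.gated_linear_unit or config.activation_func is not silu:
        raise ValueError("fused dense MLP supports only SwiGLU")  # error type assumed
    if config.add_bias_linear:
        # Fail loudly instead of silently dropping the FC1/FC2 bias parameters.
        raise ValueError("fused dense MLP does not support add_bias_linear=True")


def raises(exc_type, config: DenseMLPConfig, te_version: str) -> bool:
    """Small helper for the usage checks below."""
    try:
        validate_fused_dense_config(config, te_version)
    except exc_type:
        return True
    return False


validate_fused_dense_config(DenseMLPConfig(), "2.14.0")  # accepted
assert raises(ValueError, DenseMLPConfig(add_bias_linear=True), "2.14.0")
assert raises(RuntimeError, DenseMLPConfig(), "2.13.0")
```

Rejecting the unsupported configuration at construction time surfaces the problem before any training step runs, which is the behaviour the first review asked for.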

yaox12 · 2026-04-17T14:24:44Z

/ok to test 08870c5

yaox12 · 2026-04-17T14:24:57Z

/claude review

claude

Review

One bug found — _norm_seq is stored directly on self which causes nn.Module to register it as a submodule, leading to duplicate parameters in state_dict() and parameters(). See inline comment for the fix (wrap in a tuple, same pattern the base class uses for _fused_impl).
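
The registration behaviour can be reproduced with plain PyTorch. The sketch below uses torch.nn.Sequential and torch.nn.LayerNorm as stand-ins for the Transformer Engine op containers, and the class names `Registered` and `Wrapped` are made up for the example; it only shows how nn.Module treats a module-valued attribute versus one hidden inside a tuple.

```python
from torch import nn


class Registered(nn.Module):
    """Stores the helper container directly, so nn.Module registers it."""

    def __init__(self) -> None:
        super().__init__()
        self.norm = nn.LayerNorm(4)
        # Assigning an nn.Module attribute adds it to self._modules.
        self._norm_seq = nn.Sequential(self.norm)


class Wrapped(nn.Module):
    """Hides the helper container in a tuple, so it is not registered."""

    def __init__(self) -> None:
        super().__init__()
        self.norm = nn.LayerNorm(4)
        # A tuple is an ordinary attribute; nn.Module does not look inside it.
        self._norm_seq = (nn.Sequential(self.norm),)


registered_keys = sorted(Registered().state_dict().keys())
wrapped_keys = sorted(Wrapped().state_dict().keys())

# The shared norm weights appear under two prefixes when the container is registered.
assert registered_keys == [
    "_norm_seq.0.bias",
    "_norm_seq.0.weight",
    "norm.bias",
    "norm.weight",
]
assert wrapped_keys == ["norm.bias", "norm.weight"]
```

The tuple still holds a reference to the container, so it can be called as `self._norm_seq[0](x)` without the extra state_dict entries.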

Test coverage note

The tests cover constructor validation (wrong activation, bias, GLU flags), which is good. However, _make_fused_impl() and forward() have no test coverage. A forward-pass smoke test (even one that just checks output shape and requires_grad) would catch regressions in the lazy initialization, norm separation, and FP8 autocast logic. Understandable if this requires SM100+ hardware, but worth adding as a skippable test if possible.
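
A skippable smoke test of the kind suggested here might be structured as follows. This is a sketch under assumptions: it uses unittest rather than the repository's test setup, the helpers `cuda_compute_capability`, `has_sm100`, and `run_smoke_tests` are hypothetical, and torch.nn.Linear is a placeholder where the real TEFusedDenseMLP construction would go.

```python
import unittest


def cuda_compute_capability():
    """Return (major, minor) of the current CUDA device, or None if unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability()


def has_sm100() -> bool:
    """True only when a GPU with compute capability 10.0 or newer is present."""
    cap = cuda_compute_capability()
    return cap is not None and cap >= (10, 0)


@unittest.skipUnless(has_sm100(), "requires an SM100+ GPU")
class ForwardSmokeTest(unittest.TestCase):
    def test_output_shape_and_grad(self) -> None:
        import torch

        # Placeholder module: swap in the real fused MLP construction here.
        module = torch.nn.Linear(16, 16).cuda()
        x = torch.randn(4, 16, device="cuda", requires_grad=True)
        y = module(x)
        self.assertEqual(y.shape, x.shape)
        self.assertTrue(y.requires_grad)


def run_smoke_tests() -> unittest.TestResult:
    """Run the smoke test; a skip still counts as a successful run."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ForwardSmokeTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)


result = run_smoke_tests()
assert result.wasSuccessful()
```

On machines without the required hardware the test is reported as skipped rather than failed, so it can live in the regular unit test suite.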

…istration

Storing _norm_seq as a bare te.pytorch.ops.Sequential caused nn.Module to register it as a submodule, duplicating shared norm weights in state_dict and parameters(). Wrap in a tuple like _fused_impl. Also fix black formatting in test file and add test for the submodule registration invariant.

sraman-rgb · 2026-04-17T17:45:29Z

/ok to test c52fb49

sraman-rgb · 2026-04-17T17:53:04Z

/ok to test d63b03b

gautham-kollu · 2026-04-17T20:30:33Z

/ok to test 1895ffc

svcnvidia-nemo-ci · 2026-04-20T01:11:58Z

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/24643858174

Co-authored-by: Siddhartha Raman S <sraman@login-lyris02.lyris.clusters.nvidia.com> Co-authored-by: Xin Yao <xiny@nvidia.com>

) (NVIDIA#4786) Co-authored-by: Siddhartha Raman S <sraman@login-lyris02.lyris.clusters.nvidia.com> Co-authored-by: Xin Yao <xiny@nvidia.com> Co-authored-by: gautham-kollu <gkollu@nvidia.com> Co-authored-by: Siddhartha Raman S <sraman@login-lyris01.lyris.clusters.nvidia.com>

* origin/main: (50 commits)
  Drain predecessor reduce-scatter at dispatch time (NVIDIA#4940)
  ci: Add allow_failure flag to gpt and moe recipes that are failing in nightlies (NVIDIA#4905)
  fix(tests): initialize num_microbatches calculator in vision cudagraph tests (NVIDIA#4986)
  test: re-enable test_pp2_create_cudagraphs_first_stage on TE 2.15+ (NVIDIA#4985)
  ci: Add support for MBridge job gating based on PR labels (NVIDIA#4926)
  test(ci): re-enable 8experts2parallel_multi_dist_optimizer_instances_1node (NVIDIA#4984)
  test: re-enable paged stashing MoE tests (NVIDIA#4978)
  Fix elastification unwrap_model import (NVIDIA#4972)
  Avoid offsetting functional test master port (NVIDIA#4973)
  test: enable NVTE_CUTEDSL_FUSED_GROUPED_MLP via pytest fixture (NVIDIA#4931)
  chore(beep boop 🤖): Bump (main) (2026-05-25)
  test(release): add release goldens for deepseekv3/nemotron3 and set tp2pp2 exit-interval (NVIDIA#4932)
  Fix `get_batch` return order to ignore BlendedDataset provenance fields (NVIDIA#4952)
  ci: restore perf test torchrun logs (NVIDIA#4951)
  Various training utils (NVIDIA#4872)
  ci: Update training script paths in BERT and T5 (NVIDIA#4939)
  [MXFP8/FP4-param-gather] Post processing after forced param AG in eval (NVIDIA#4562)
  Fix mxfp8 param gather numerical issue when DP overlap is off (NVIDIA#4800)
  Add TEFusedDenseMLP for Dense+Grouped GEMM fusion on SM100+ (NVIDIA#4318) (NVIDIA#4786)
  Fix paged stashing test submodules lookup (NVIDIA#4925)
  ...

# Conflicts:
#	megatron/training/training.py

Add TEFusedDenseMLP for Dense+Grouped GEMM fusion on SM100+

66b683e

Siddhartha Raman S added 2 commits April 15, 2026 07:36

Remove use_grouped_gemm_for_dense flag; use env var only

388b9f6

Add dense_grouped_gemm: TransformerConfig field and spec function arg

578b0cc

gautham-kollu requested a review from santhnm2 April 15, 2026 15:51

sraman-rgb marked this pull request as ready for review April 15, 2026 15:53

sraman-rgb requested review from a team as code owners April 15, 2026 15:53

svcnvidia-nemo-ci added the complexity: medium label Apr 15, 2026

Wire dense_grouped_gemm from config through gpt_builders

d085b51

santhnm2 approved these changes Apr 16, 2026

Merge branch 'dev' into Dense_Grouped_GEMM_dev

77bbe7d

sraman-rgb requested review from a team as code owners April 16, 2026 15:04

svcnvidia-nemo-ci added this to the Core 0.16 milestone Apr 16, 2026

copy-pr-bot Bot temporarily deployed to test April 16, 2026 18:57 Inactive

yaox12 removed request for a team April 17, 2026 07:21

yaox12 reviewed Apr 17, 2026

Comment thread megatron/core/extensions/transformer_engine.py Outdated

Comment thread megatron/core/extensions/transformer_engine.py Outdated

yaox12 reviewed Apr 17, 2026

Comment thread megatron/core/extensions/transformer_engine.py

claude Bot reviewed Apr 17, 2026

Comment thread megatron/core/extensions/transformer_engine.py

claude Bot reviewed Apr 17, 2026

Siddhartha Raman S and others added 2 commits April 17, 2026 07:04

Merge branch 'dev' into Dense_Grouped_GEMM_dev

08870c5

claude Bot reviewed Apr 17, 2026

Comment thread megatron/core/extensions/transformer_engine.py Outdated

claude Bot reviewed Apr 17, 2026

fix: sort imports in test_te_fused_dense_mlp_spec.py (isort)

d63b03b

copy-pr-bot Bot temporarily deployed to test April 17, 2026 17:54 Inactive

fix(test): add dense_grouped_gemm to Mamba MoE golden config

1895ffc

copy-pr-bot Bot temporarily deployed to test April 17, 2026 20:31 Inactive

yaox12 enabled auto-merge April 20, 2026 01:11

yaox12 approved these changes Apr 20, 2026

yaox12 added this pull request to the merge queue Apr 20, 2026

Merged via the queue into NVIDIA:dev with commit be3b874 Apr 20, 2026
60 of 62 checks passed

svcnvidia-nemo-ci mentioned this pull request May 12, 2026

chore: nightly sync main into dev (12_05_2026) #4744

Closed

FDecaYed pushed a commit that referenced this pull request May 20, 2026

Add TEFusedDenseMLP for Dense+Grouped GEMM fusion on SM100+ (#4318)

8bd72cd

Co-authored-by: Siddhartha Raman S <sraman@login-lyris02.lyris.clusters.nvidia.com> Co-authored-by: Xin Yao <xiny@nvidia.com>

svcnvidia-nemo-ci mentioned this pull request May 27, 2026

chore: nightly sync main into dev (27_05_2026) #5029

Closed

5 participants