Skip to content

Add TEFusedDenseMLP for Dense+Grouped GEMM fusion on SM100+#4318

Merged
yaox12 merged 10 commits into
NVIDIA:devfrom
sraman-rgb:Dense_Grouped_GEMM_dev
Apr 20, 2026
Merged

Add TEFusedDenseMLP for Dense+Grouped GEMM fusion on SM100+#4318
yaox12 merged 10 commits into
NVIDIA:devfrom
sraman-rgb:Dense_Grouped_GEMM_dev

Conversation

@sraman-rgb

Copy link
Copy Markdown
Contributor

What does this PR do ?

TEFusedDenseMLP class addition to ensure that dense GEMM is achieved from groupedGEMM with group=1
⚠️ For major changes (either in lines of code or in its impact), please make sure to first share a design doc with the team. If you're unsure what's the best way to do so, contact the @mcore-oncall.

Contribution process

Pre-checks

  • I have added relevant unit tests
  • I have added relevant functional tests
  • I have added proper typing to my code Typing guidelines
  • I have added relevant documentation
  • I have run the autoformatter.sh on my PR

Code review

Feel free to message or comment the @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!

All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.

Step 1: Mark PR as "Ready for Review"

  1. When your PR is ready, click Ready for Review.
  2. An oncall reviewer is auto-assigned and expert reviewers are notified based on your changes.
    • Some PRs may jump straight to step 2. This is determined by .github/CODEOWNERS.

⚠️ Only mark as ready once merge-conflicts are resolved and the CI is passing.
Final Review might get declined if these requirements are not fulfilled.

Step 2: Final Review

For PRs that change megatron/core, once all expert reviewers have approved, the Final Review label is applied automatically and final reviewers are assigned.

For PRs outside megatron/core, this step is skipped.

Step 3: Approved

Once all required reviewers have approved, the Approved label is applied automatically.

Merge

Any member of mcore-engineers will be able to merge your PR.

For MRs into `dev` branch The proposed review process for `dev` branch is under active discussion.

MRs are mergable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

@copy-pr-bot

copy-pr-bot Bot commented Apr 15, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@gautham-kollu gautham-kollu requested a review from santhnm2 April 15, 2026 15:51
@sraman-rgb sraman-rgb marked this pull request as ready for review April 15, 2026 15:53
@sraman-rgb sraman-rgb requested review from a team as code owners April 15, 2026 15:53
@sraman-rgb sraman-rgb requested review from a team as code owners April 16, 2026 15:04
@sraman-rgb

Copy link
Copy Markdown
Contributor Author

/ok to test 77bbe7d

@gautham-kollu

Copy link
Copy Markdown
Contributor

/ok to test 77bbe7d

@svcnvidia-nemo-ci svcnvidia-nemo-ci added this to the Core 0.16 milestone Apr 16, 2026
@yaox12 yaox12 removed request for a team April 17, 2026 07:21
Comment thread megatron/core/extensions/transformer_engine.py Outdated
Comment thread megatron/core/extensions/transformer_engine.py Outdated
Comment thread megatron/core/extensions/transformer_engine.py
@yaox12

yaox12 commented Apr 17, 2026

Copy link
Copy Markdown
Member

/claude review

Comment thread megatron/core/extensions/transformer_engine.py

@claude claude Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Summary

The new TEFusedDenseMLP class looks well-structured overall, with good comments explaining the norm-outside-autocast rationale. Two issues:

  1. Bug: Silent bias drop — See inline comment. If a model has add_bias_linear=True, both FC1 and FC2 biases are silently ignored. The parent TEFusedMLP handles these; this subclass does not. Either add a validation rejecting bias configs or add Bias ops to the fused pipeline.

  2. No tests — This introduces a new module with non-trivial logic (recipe selection, norm separation, GroupedLinear wiring, bias handling). Consider adding at least a unit test that instantiates TEFusedDenseMLP and verifies the __init__ validation rejects non-SwiGLU activations, and an integration-style test that runs a forward pass and checks output shape/numerics against TEFusedMLP (where applicable).

Siddhartha Raman S and others added 2 commits April 17, 2026 07:04
…tivation_func ref, add tests

- Use self.config.activation_func (not self.activation_func) to always compare
  the config function regardless of use_te_activation_func setting
- Add RuntimeError for TE < 2.14.0 (GroupedLinear/ScaledSwiGLU ops requirement,
  consistent with TEGroupedMLP._is_fused_impl_supported in experts.py)
- Add ValueError for add_bias_linear=True: GroupedLinear ops are bias=False
  and the CuTeGEMM fusion pattern has no room for a bias op; fail loudly
  instead of silently dropping bias parameters
- Add tests/unit_tests/transformer/test_te_fused_dense_mlp_spec.py covering
  instantiation, wrong-activation, gated_linear_unit=False, and bias errors
@yaox12

yaox12 commented Apr 17, 2026

Copy link
Copy Markdown
Member

/ok to test 08870c5

@yaox12

yaox12 commented Apr 17, 2026

Copy link
Copy Markdown
Member

/claude review

Comment thread megatron/core/extensions/transformer_engine.py Outdated

@claude claude Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review

One bug found — _norm_seq is stored directly on self which causes nn.Module to register it as a submodule, leading to duplicate parameters in state_dict() and parameters(). See inline comment for the fix (wrap in a tuple, same pattern the base class uses for _fused_impl).

Test coverage note

The tests cover constructor validation (wrong activation, bias, GLU flags), which is good. However, _make_fused_impl() and forward() have no test coverage. A forward-pass smoke test (even one that just checks output shape and requires_grad) would catch regressions in the lazy initialization, norm separation, and FP8 autocast logic. Understandable if this requires SM100+ hardware, but worth adding as a skippable test if possible.

…istration

Storing _norm_seq as a bare te.pytorch.ops.Sequential caused nn.Module to
register it as a submodule, duplicating shared norm weights in state_dict
and parameters(). Wrap in a tuple like _fused_impl. Also fix black formatting
in test file and add test for the submodule registration invariant.
@sraman-rgb

Copy link
Copy Markdown
Contributor Author

/ok to test c52fb49

@sraman-rgb

Copy link
Copy Markdown
Contributor Author

/ok to test d63b03b

@gautham-kollu

Copy link
Copy Markdown
Contributor

/ok to test 1895ffc

@yaox12 yaox12 added this pull request to the merge queue Apr 20, 2026
@svcnvidia-nemo-ci

Copy link
Copy Markdown

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/24643858174

Merged via the queue into NVIDIA:dev with commit be3b874 Apr 20, 2026
60 of 62 checks passed
FDecaYed pushed a commit that referenced this pull request May 20, 2026
Co-authored-by: Siddhartha Raman S <sraman@login-lyris02.lyris.clusters.nvidia.com>
Co-authored-by: Xin Yao <xiny@nvidia.com>
pull Bot pushed a commit to codeJRV/Megatron-LM that referenced this pull request May 22, 2026
) (NVIDIA#4786)

Co-authored-by: Siddhartha Raman S <sraman@login-lyris02.lyris.clusters.nvidia.com>
Co-authored-by: Xin Yao <xiny@nvidia.com>
Co-authored-by: gautham-kollu <gkollu@nvidia.com>
Co-authored-by: Siddhartha Raman S <sraman@login-lyris01.lyris.clusters.nvidia.com>
santhnm2 pushed a commit to santhnm2/Megatron-LM that referenced this pull request May 26, 2026
) (NVIDIA#4786)

Co-authored-by: Siddhartha Raman S <sraman@login-lyris02.lyris.clusters.nvidia.com>
Co-authored-by: Xin Yao <xiny@nvidia.com>
Co-authored-by: gautham-kollu <gkollu@nvidia.com>
Co-authored-by: Siddhartha Raman S <sraman@login-lyris01.lyris.clusters.nvidia.com>
Victarry added a commit to yanring/Megatron-LM that referenced this pull request May 27, 2026
* origin/main: (50 commits)
  Drain predecessor reduce-scatter at dispatch time (NVIDIA#4940)
  ci: Add allow_failure flag to gpt and moe recipes that are failing in nightlies (NVIDIA#4905)
  fix(tests): initialize num_microbatches calculator in vision cudagraph tests (NVIDIA#4986)
  test: re-enable test_pp2_create_cudagraphs_first_stage on TE 2.15+ (NVIDIA#4985)
  ci: Add support for MBridge job gating based on PR labels  (NVIDIA#4926)
  test(ci): re-enable 8experts2parallel_multi_dist_optimizer_instances_1node (NVIDIA#4984)
  test: re-enable paged stashing MoE tests (NVIDIA#4978)
  Fix elastification unwrap_model import (NVIDIA#4972)
  Avoid offsetting functional test master port (NVIDIA#4973)
  test: enable NVTE_CUTEDSL_FUSED_GROUPED_MLP via pytest fixture (NVIDIA#4931)
  chore(beep boop 🤖): Bump  (main) (2026-05-25)
  test(release): add release goldens for deepseekv3/nemotron3 and set tp2pp2 exit-interval (NVIDIA#4932)
  Fix `get_batch` return order to ignore BlendedDataset provenance fields (NVIDIA#4952)
  ci: restore perf test torchrun logs (NVIDIA#4951)
  Various training utils (NVIDIA#4872)
  ci: Update training script paths in BERT and T5 (NVIDIA#4939)
  [MXFP8/FP4-param-gather] Post processing after forced param AG in eval (NVIDIA#4562)
  Fix mxfp8 param gather numerical issue when DP overlap is off (NVIDIA#4800)
  Add TEFusedDenseMLP for Dense+Grouped GEMM fusion on SM100+ (NVIDIA#4318) (NVIDIA#4786)
  Fix paged stashing test submodules lookup (NVIDIA#4925)
  ...

# Conflicts:
#	megatron/training/training.py
janEbert pushed a commit to janEbert/Megatron-LM that referenced this pull request Jun 2, 2026
) (NVIDIA#4786)

Co-authored-by: Siddhartha Raman S <sraman@login-lyris02.lyris.clusters.nvidia.com>
Co-authored-by: Xin Yao <xiny@nvidia.com>
Co-authored-by: gautham-kollu <gkollu@nvidia.com>
Co-authored-by: Siddhartha Raman S <sraman@login-lyris01.lyris.clusters.nvidia.com>
mathemakitten pushed a commit to mathemakitten/Megatron-LM that referenced this pull request Jun 12, 2026
) (NVIDIA#4786)

Co-authored-by: Siddhartha Raman S <sraman@login-lyris02.lyris.clusters.nvidia.com>
Co-authored-by: Xin Yao <xiny@nvidia.com>
Co-authored-by: gautham-kollu <gkollu@nvidia.com>
Co-authored-by: Siddhartha Raman S <sraman@login-lyris01.lyris.clusters.nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants