Add TEFusedDenseMLP for Dense+Grouped GEMM fusion on SM100+#4318
Conversation
|
/ok to test 77bbe7d |
|
/ok to test 77bbe7d |
|
/claude review |
There was a problem hiding this comment.
Review Summary
The new TEFusedDenseMLP class looks well-structured overall, with good comments explaining the norm-outside-autocast rationale. Two issues:
-
Bug: Silent bias drop — See inline comment. If a model has
add_bias_linear=True, both FC1 and FC2 biases are silently ignored. The parentTEFusedMLPhandles these; this subclass does not. Either add a validation rejecting bias configs or addBiasops to the fused pipeline. -
No tests — This introduces a new module with non-trivial logic (recipe selection, norm separation, GroupedLinear wiring, bias handling). Consider adding at least a unit test that instantiates
TEFusedDenseMLPand verifies the__init__validation rejects non-SwiGLU activations, and an integration-style test that runs a forward pass and checks output shape/numerics againstTEFusedMLP(where applicable).
…tivation_func ref, add tests - Use self.config.activation_func (not self.activation_func) to always compare the config function regardless of use_te_activation_func setting - Add RuntimeError for TE < 2.14.0 (GroupedLinear/ScaledSwiGLU ops requirement, consistent with TEGroupedMLP._is_fused_impl_supported in experts.py) - Add ValueError for add_bias_linear=True: GroupedLinear ops are bias=False and the CuTeGEMM fusion pattern has no room for a bias op; fail loudly instead of silently dropping bias parameters - Add tests/unit_tests/transformer/test_te_fused_dense_mlp_spec.py covering instantiation, wrong-activation, gated_linear_unit=False, and bias errors
|
/ok to test 08870c5 |
|
/claude review |
There was a problem hiding this comment.
Review
One bug found — _norm_seq is stored directly on self which causes nn.Module to register it as a submodule, leading to duplicate parameters in state_dict() and parameters(). See inline comment for the fix (wrap in a tuple, same pattern the base class uses for _fused_impl).
Test coverage note
The tests cover constructor validation (wrong activation, bias, GLU flags), which is good. However, _make_fused_impl() and forward() have no test coverage. A forward-pass smoke test (even one that just checks output shape and requires_grad) would catch regressions in the lazy initialization, norm separation, and FP8 autocast logic. Understandable if this requires SM100+ hardware, but worth adding as a skippable test if possible.
…istration Storing _norm_seq as a bare te.pytorch.ops.Sequential caused nn.Module to register it as a submodule, duplicating shared norm weights in state_dict and parameters(). Wrap in a tuple like _fused_impl. Also fix black formatting in test file and add test for the submodule registration invariant.
|
/ok to test c52fb49 |
|
/ok to test d63b03b |
|
/ok to test 1895ffc |
|
🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/24643858174 |
Co-authored-by: Siddhartha Raman S <sraman@login-lyris02.lyris.clusters.nvidia.com> Co-authored-by: Xin Yao <xiny@nvidia.com>
) (NVIDIA#4786) Co-authored-by: Siddhartha Raman S <sraman@login-lyris02.lyris.clusters.nvidia.com> Co-authored-by: Xin Yao <xiny@nvidia.com> Co-authored-by: gautham-kollu <gkollu@nvidia.com> Co-authored-by: Siddhartha Raman S <sraman@login-lyris01.lyris.clusters.nvidia.com>
) (NVIDIA#4786) Co-authored-by: Siddhartha Raman S <sraman@login-lyris02.lyris.clusters.nvidia.com> Co-authored-by: Xin Yao <xiny@nvidia.com> Co-authored-by: gautham-kollu <gkollu@nvidia.com> Co-authored-by: Siddhartha Raman S <sraman@login-lyris01.lyris.clusters.nvidia.com>
* origin/main: (50 commits) Drain predecessor reduce-scatter at dispatch time (NVIDIA#4940) ci: Add allow_failure flag to gpt and moe recipes that are failing in nightlies (NVIDIA#4905) fix(tests): initialize num_microbatches calculator in vision cudagraph tests (NVIDIA#4986) test: re-enable test_pp2_create_cudagraphs_first_stage on TE 2.15+ (NVIDIA#4985) ci: Add support for MBridge job gating based on PR labels (NVIDIA#4926) test(ci): re-enable 8experts2parallel_multi_dist_optimizer_instances_1node (NVIDIA#4984) test: re-enable paged stashing MoE tests (NVIDIA#4978) Fix elastification unwrap_model import (NVIDIA#4972) Avoid offsetting functional test master port (NVIDIA#4973) test: enable NVTE_CUTEDSL_FUSED_GROUPED_MLP via pytest fixture (NVIDIA#4931) chore(beep boop 🤖): Bump (main) (2026-05-25) test(release): add release goldens for deepseekv3/nemotron3 and set tp2pp2 exit-interval (NVIDIA#4932) Fix `get_batch` return order to ignore BlendedDataset provenance fields (NVIDIA#4952) ci: restore perf test torchrun logs (NVIDIA#4951) Various training utils (NVIDIA#4872) ci: Update training script paths in BERT and T5 (NVIDIA#4939) [MXFP8/FP4-param-gather] Post processing after forced param AG in eval (NVIDIA#4562) Fix mxfp8 param gather numerical issue when DP overlap is off (NVIDIA#4800) Add TEFusedDenseMLP for Dense+Grouped GEMM fusion on SM100+ (NVIDIA#4318) (NVIDIA#4786) Fix paged stashing test submodules lookup (NVIDIA#4925) ... # Conflicts: # megatron/training/training.py
) (NVIDIA#4786) Co-authored-by: Siddhartha Raman S <sraman@login-lyris02.lyris.clusters.nvidia.com> Co-authored-by: Xin Yao <xiny@nvidia.com> Co-authored-by: gautham-kollu <gkollu@nvidia.com> Co-authored-by: Siddhartha Raman S <sraman@login-lyris01.lyris.clusters.nvidia.com>
) (NVIDIA#4786) Co-authored-by: Siddhartha Raman S <sraman@login-lyris02.lyris.clusters.nvidia.com> Co-authored-by: Xin Yao <xiny@nvidia.com> Co-authored-by: gautham-kollu <gkollu@nvidia.com> Co-authored-by: Siddhartha Raman S <sraman@login-lyris01.lyris.clusters.nvidia.com>
What does this PR do ?
TEFusedDenseMLP class addition to ensure that dense GEMM is achieved from groupedGEMM with group=1
⚠️ For major changes (either in lines of code or in its impact), please make sure to first share a design doc with the team. If you're unsure what's the best way to do so, contact the @mcore-oncall.
Contribution process
Pre-checks
Code review
Feel free to message or comment the @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!
All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.
Step 1: Mark PR as "Ready for Review"
.github/CODEOWNERS.Final Review might get declined if these requirements are not fulfilled.
Step 2: Final Review
For PRs that change
megatron/core, once all expert reviewers have approved, theFinal Reviewlabel is applied automatically and final reviewers are assigned.For PRs outside
megatron/core, this step is skipped.Step 3: Approved
Once all required reviewers have approved, the
Approvedlabel is applied automatically.Merge
Any member of mcore-engineers will be able to merge your PR.
For MRs into `dev` branch
The proposed review process for `dev` branch is under active discussion.MRs are mergable after one approval by either
eharper@nvidia.comorzijiey@nvidia.com.