Adding code for Flextron by sheliang-nv · Pull Request #4429 · NVIDIA/Megatron-LM

sheliang-nv · 2026-04-22T17:00:08Z

What does this PR do?

This PR lands Flextron (also known as Nemotron Elastic / Star Elastic) into Megatron-LM. Flextron is a post-training method that converts a single parent LLM into a nested family of submodels at different parameter budgets — all produced from one training run, all sharing a single checkpoint. A learnable router maps a user-specified budget to per-axis architectural decisions (embedding width, attention heads, Mamba heads, MoE experts, FFN channels); smaller submodels are strict subsets of larger ones via importance-ranked contiguous slicing, and all variants are trained jointly with knowledge distillation from the frozen parent.
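The nesting property described above can be illustrated with a minimal sketch. It assumes a per-channel importance score is already available; the function names and the scores are hypothetical and are not the Flextron API. Channels are ranked once by importance, and each budget keeps a contiguous prefix of that ranking, so every smaller submodel is a subset of every larger one.

```python
from typing import List


def rank_by_importance(scores: List[float]) -> List[int]:
    """Return channel indices sorted from most to least important."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def nested_slices(scores: List[float], widths: List[int]) -> List[List[int]]:
    """For each width, keep the first `width` channels of the importance ranking."""
    order = rank_by_importance(scores)
    return [order[:w] for w in widths]


# Hypothetical importance scores for 6 FFN channels.
scores = [0.1, 0.9, 0.4, 0.7, 0.2, 0.8]
small, medium, full = nested_slices(scores, widths=[2, 4, 6])

# Each smaller submodel is a subset of the next larger one.
assert set(small) <= set(medium) <= set(full)
```

Because the slices share a prefix of one ranking, a single set of weights can serve all budgets, which is consistent with the single-checkpoint claim in the description.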

Flextron has been used to produce the elastic variants shipped with Nemotron Nano v2 (12B → 9B + 6B) and Nemotron Nano v3 (30B/3.6A MoE → 23B/2.8A + 12B/2.0A). Until now the implementation has lived on private dev branches. This PR consolidates that work into main so it can be open-sourced and maintained alongside the rest of the Megatron-LM post-training surface.

⚠️ For major changes (either in lines of code or in its impact), please make sure to first share a design doc with the team. If you're unsure what's the best way to do so, contact the @mcore-oncall.

Files at a glance

megatron/elastification/ — new module (manager, hooks, router, budget math, config).
pretrain_mamba_flex.py — training entry point with per-microbatch budget sampling.
megatron/core/distributed/finalize_model_grads.py — all-reduces router grads across PP ranks, gated on config.flextron.
megatron/post_training/model_builder.py — teacher-config overrides so KD teachers don't carry the router.
tests/unit_tests/elastification/ — 10 test files.
tests/functional_tests/test_cases/hybrid/hybrid_flextron_nightly_*/ + tests/test_utils/recipes/h100/flextron.yaml — nightly functional test.

Contribution process

Pre-checks

I have added relevant unit tests
I have added relevant functional tests
I have added proper typing to my code (see the Typing guidelines)
I have added relevant documentation
I have run the autoformatter.sh on my PR

Code review

Feel free to message or comment the @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!

All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.

Step 1: Mark PR as "Ready for Review"

When your PR is ready, click Ready for Review.
An oncall reviewer is auto-assigned and expert reviewers are notified based on your changes.
- Some PRs may jump straight to step 2. This is determined by .github/CODEOWNERS.

⚠️ Only mark as ready once merge-conflicts are resolved and the CI is passing.
Final Review might get declined if these requirements are not fulfilled.

Step 2: Final Review

For PRs that change megatron/core, once all expert reviewers have approved, the Final Review label is applied automatically and final reviewers are assigned.

For PRs outside megatron/core, this step is skipped.

Step 3: Approved

Once all required reviewers have approved, the Approved label is applied automatically.

Merge

Any member of mcore-engineers will be able to merge your PR.

For MRs into `dev` branch

The proposed review process for `dev` branch is under active discussion.

MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

copy-pr-bot · 2026-04-22T17:00:12Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

github-actions · 2026-04-22T17:00:18Z

This PR has been automatically converted to draft because all PRs must start as drafts.

When you are ready for review, click Ready for Review to begin the review process. This will:

- Add the oncall reviewer (optional reviewer)
- Add required review teams based on your changes

See the contribution guide for more details.

Phlip79

MambaModel has been renamed to HybridModel as of #4099. Can you please update this PR accordingly?

sheliang-nv · 2026-04-22T20:21:21Z

/ok to test 892ca2f

sheliang-nv · 2026-04-22T20:44:56Z

/ok to test 8b9ffae

sheliang-nv · 2026-04-22T22:43:45Z

/ok to test 044965d

ChenhanYu

@kevalmorabia97 and @AAnoosheh to review.

claude · 2026-04-30T20:16:58Z

+    # Mamba
+    def mamba_params(mamba_nheads):
+        d_inner = mamba_nheads * mamba_d_head
+        ngroups = 8


Bug: ngroups is hardcoded to 8, but the actual model uses config.mamba_num_groups which is configurable. If someone sets mamba_num_groups to a value other than 8, the parameter count estimation (and therefore the budget loss) will be silently wrong.

This should be passed in as a parameter from the caller, which has access to config.mamba_num_groups. The same hardcoded value appears in the mamba_in_proj computation on line 121.

Suggested change

-        ngroups = 8
+        ngroups = 8  # TODO: pass mamba_num_groups from config instead of hardcoding
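One way to resolve the TODO is to thread the group count through as an argument. The sketch below is a hypothetical, self-contained version: `mamba_in_proj_params`, `hidden_size`, and `d_state` are illustrative names, and the in_proj width follows the usual Mamba-2 layout (z, x, B, C, dt), which is an assumption rather than a copy of the PR's formula.

```python
def mamba_in_proj_params(
    hidden_size: int,
    mamba_nheads: int,
    mamba_d_head: int,
    d_state: int,
    ngroups: int,
) -> int:
    """Estimate the in_proj weight count for a Mamba-2 style mixer (bias ignored)."""
    d_inner = mamba_nheads * mamba_d_head
    # z and x take d_inner each, B and C take ngroups * d_state each,
    # and dt takes one value per head.
    out_features = 2 * d_inner + 2 * ngroups * d_state + mamba_nheads
    return hidden_size * out_features


# The caller would pass config.mamba_num_groups here instead of a literal 8.
with_8_groups = mamba_in_proj_params(1024, 16, 64, 128, ngroups=8)
with_4_groups = mamba_in_proj_params(1024, 16, 64, 128, ngroups=4)
```

With these illustrative sizes the two estimates differ by about one million parameters for a single layer, which shows how a hardcoded group count could skew the budget loss.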

claude · 2026-04-30T20:17:00Z

+
+import random
+
+import numpy as np


Nit: numpy is imported but never used in this file.

Suggested change

-import numpy as np

claude · 2026-04-30T20:17:07Z

+def _allreduce_router_grads(model: List[torch.nn.Module], config: TransformerConfig):
+    """
+    All-reduce router grads.
+
+    Reduce grads across all the pp stages to ensure that parameters of the router stay in sync.
+    """
+
+    if parallel_state.get_pipeline_model_parallel_world_size() > 1:
+        grads_dict: Dict[str, List[torch.Tensor]] = {}
+        for model_chunk in model:
+            for name, param in get_attr_wrapped_model(model_chunk, 'named_parameters')():
+                if param.requires_grad and getattr(param, 'flextron_router_pp_sync', False):
+                    grad = param.main_grad
+                    if name in grads_dict:
+                        # Add all the virtual PP rank's gradients to
+                        # the first local virtual PP rank.
+                        grads_dict[name][0].add_(grad)
+                        # Append to the end for later update after cross-rank reduce.
+                        grads_dict[name].append(grad)
+                    else:
+                        grads_dict[name] = [grad]
+
+        if grads_dict:
+            # All-reduce the gradient on the first VPP rank.
+            grads = [param_grad[0] for _, param_grad in grads_dict.items()]
+            coalesced = _flatten_dense_tensors(grads)
+            torch.distributed.all_reduce(
+                coalesced, group=parallel_state.get_pipeline_model_parallel_group()
+            )
+            for buf, synced in zip(grads, _unflatten_dense_tensors(coalesced, grads)):
+                buf.copy_(synced)
+
+            # Update the gradients on other VPP ranks.
+            for grads in grads_dict.values():
+                for grad in grads[1:]:
+                    grad.copy_(grads[0])
+


This new function modifies a core distributed file but has no unit test coverage. A test verifying the all-reduce behavior (especially the VPP gradient aggregation logic in lines 291-313) would help prevent regressions, since bugs here would silently produce incorrect router gradients across pipeline stages.
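A test along these lines could exercise the aggregation logic without a real process group. The sketch below is a simulation under stated assumptions: gradients are plain Python lists, each simulated PP rank holds two virtual-PP chunks of one router parameter, and the cross-rank all-reduce is replaced by an elementwise sum. The helper names are hypothetical and do not exist in Megatron-LM.

```python
from typing import Dict, List

Grad = List[float]


def local_vpp_accumulate(chunks: List[Dict[str, Grad]]) -> Dict[str, List[Grad]]:
    """Sum each parameter's grads from all local VPP chunks into the first chunk's buffer."""
    grads_dict: Dict[str, List[Grad]] = {}
    for chunk in chunks:
        for name, grad in chunk.items():
            if name in grads_dict:
                first = grads_dict[name][0]
                for i, g in enumerate(grad):
                    first[i] += g  # in-place update, mirrors add_()
                grads_dict[name].append(grad)
            else:
                grads_dict[name] = [grad]
    return grads_dict


def simulated_all_reduce(per_rank: List[Dict[str, List[Grad]]]) -> None:
    """Stand-in for a SUM all-reduce over PP ranks, then a copy to the other VPP chunks."""
    for name in per_rank[0]:
        width = len(per_rank[0][name][0])
        total = [sum(rank[name][0][i] for rank in per_rank) for i in range(width)]
        for rank in per_rank:
            for buf in rank[name]:
                buf[:] = total  # in-place copy, mirrors copy_()


# Two simulated PP ranks, two VPP chunks each, one router parameter.
rank0 = [{"router.w": [1.0, 2.0]}, {"router.w": [10.0, 20.0]}]
rank1 = [{"router.w": [100.0, 200.0]}, {"router.w": [1000.0, 2000.0]}]
state = [local_vpp_accumulate(rank0), local_vpp_accumulate(rank1)]
simulated_all_reduce(state)
```

After the simulated reduce, every chunk on every rank should hold the same summed gradient; a real test would assert the same property on `param.main_grad` across an actual pipeline-parallel group.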

claude · 2026-04-30T20:17:19Z

Missing test coverage for elasticity hook managers

The PR adds dedicated unit tests for FlextronMambaElasticityManager, FlextronStackElasticityManager, and FlextronTransformerLayerElasticityManager, but four other managers have no unit tests:

FlextronMoEElasticityManager — output masking for MoE layers
FlextronGroupedMLPElasticityManager — multi-hook MLP masking with FC1 intermediate masking and expert-TP-aware splitting
FlextronAttentionElasticityManager — QKV scaling and embedding masking for attention
FlextronTopKRouterElasticityManager — replaces the routing method on TopKRouter with a custom topk_softmax_with_capacity that applies expert masking

These contain non-trivial logic (especially the TopKRouter replacement and GroupedMLP's expert-tensor-parallel mask splitting). Consider adding targeted tests for at least the TopKRouter and GroupedMLP managers.
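A targeted test for the router manager might check that masked experts are never selected. The following is a minimal sketch of top-k routing with an expert mask, written in plain Python. It is not the PR's topk_softmax_with_capacity; the function name and the choice to apply softmax over the selected logits are assumptions made for illustration.

```python
import math
from typing import List, Tuple


def masked_topk_softmax(
    logits: List[float], expert_mask: List[bool], k: int
) -> Tuple[List[int], List[float]]:
    """Pick the top-k active experts and normalize their logits with softmax."""
    active = [i for i, keep in enumerate(expert_mask) if keep]
    if k > len(active):
        raise ValueError("k exceeds the number of active experts")
    chosen = sorted(active, key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in chosen)  # subtract the max for numerical stability
    exps = [math.exp(logits[i] - peak) for i in chosen]
    total = sum(exps)
    return chosen, [e / total for e in exps]


# Expert 0 has the largest logit but is masked out by the smaller budget.
logits = [5.0, 1.0, 3.0, 2.0]
mask = [False, True, True, True]
experts, probs = masked_topk_softmax(logits, mask, k=2)
```

A unit test built on this idea would assert that the masked expert index never appears in the selection and that the returned probabilities still sum to one.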

sheliang-nv · 2026-04-30T20:54:44Z

/claude review

claude

LGTM

deepakn94 · 2026-04-30T21:59:54Z

@@ -0,0 +1,210 @@
+# Copyright (c) 2024, NVIDIA CORPORATION.  All rights reserved.


Nit: make copyright year 2026.

deepakn94 · 2026-04-30T22:00:12Z

@@ -0,0 +1,542 @@
+# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.


Nit: copyright year 2026.

sheliang-nv · 2026-04-30T22:51:21Z

Link to passing internal functional test: https://gitlab-master.nvidia.com/ADLR/megatron-lm/-/jobs/309002773

sheliang-nv · 2026-04-30T23:26:53Z

/ok to test 3292cf7

sheliang-nv · 2026-04-30T23:50:02Z

/ok to test 7e4c52e

svcnvidia-nemo-ci · 2026-05-01T00:43:48Z

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/25196645789

svcnvidia-nemo-ci · 2026-05-01T17:35:32Z

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/25225179430

sheliang-nv added 4 commits April 21, 2026 02:35

Added flextron code, training/eval example scripts, functional and unit testing

fa17762

Removed flextron examples

cb56a0f

Added Flextron overrides to load_teacher_model_config

4c18286

Added all reduce Flextron router grads for PP

c38d1f1

sheliang-nv requested review from a team as code owners April 22, 2026 17:00

svcnvidia-nemo-ci marked this pull request as draft April 22, 2026 17:00

sheliang-nv marked this pull request as ready for review April 22, 2026 17:08

svcnvidia-nemo-ci requested a review from a team April 22, 2026 17:09

sheliang-nv assigned JRD971000 Apr 22, 2026

svcnvidia-nemo-ci added the complexity: high label Apr 22, 2026

Phlip79 requested changes Apr 22, 2026

sheliang-nv added 2 commits April 22, 2026 11:47

Merge remote-tracking branch 'main/main' into shel/flex_merge

3b8f6ff

Sync with upstream main: adopt Hybrid* naming and new pretrain() entry

892ca2f

svcnvidia-nemo-ci added this to the Core 0.16 milestone Apr 22, 2026

Applied linter fixes

8b9ffae

copy-pr-bot Bot temporarily deployed to test April 22, 2026 20:45 Inactive

Merge branch 'main' into shel/flex_merge

044965d

copy-pr-bot Bot temporarily deployed to test April 22, 2026 22:44 Inactive

Phlip79 removed the request for review from a team April 23, 2026 00:37

ChenhanYu requested review from ChenhanYu and kevalmorabia97 April 24, 2026 03:40

ChenhanYu requested changes Apr 24, 2026

claude Bot reviewed Apr 30, 2026

Add unit tests for FlextronTopKRouter and FlextronGroupedMLP elasticity managers

3292cf7

claude Bot approved these changes Apr 30, 2026

deepakn94 approved these changes Apr 30, 2026

svcnvidia-nemo-ci added the Approved label and removed the Final Review label Apr 30, 2026

deepakn94 reviewed Apr 30, 2026

svcnvidia-nemo-ci added the Final Review label and removed the Approved label Apr 30, 2026

ChenhanYu approved these changes Apr 30, 2026

sheliang-nv enabled auto-merge April 30, 2026 23:26

Autoformat

7e4c52e

copy-pr-bot Bot temporarily deployed to test April 30, 2026 23:50 Inactive

sheliang-nv added this pull request to the merge queue May 1, 2026

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 1, 2026

sheliang-nv added this pull request to the merge queue May 1, 2026

Merged via the queue into NVIDIA:main with commit 2d862fe May 1, 2026
65 of 67 checks passed

sheliang-nv deleted the shel/flex_merge branch May 1, 2026 18:41

sbhavani mentioned this pull request May 26, 2026

[ROADMAP][2026 Q2] Megatron Core Roadmap #4997

Open

10 participants
