Adding code for Flextron #4429
|
This PR has been automatically converted to draft because all PRs must start as drafts. When you are ready for review, click Ready for Review to begin the review process.
See the contribution guide for more details. |
|
/ok to test 892ca2f |
|
/ok to test 8b9ffae |
|
/ok to test 044965d |
| # Mamba | ||
| def mamba_params(mamba_nheads): | ||
| d_inner = mamba_nheads * mamba_d_head | ||
| ngroups = 8 |
Bug: ngroups is hardcoded to 8, but the actual model uses config.mamba_num_groups which is configurable. If someone sets mamba_num_groups to a value other than 8, the parameter count estimation (and therefore the budget loss) will be silently wrong.
This should be passed in as a parameter from the caller, which has access to config.mamba_num_groups. The same hardcoded value appears in the mamba_in_proj computation on line 121.
| ngroups = 8 | |
| ngroups = 8 # TODO: pass mamba_num_groups from config instead of hardcoding |
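A minimal sketch of the fix suggested above, with the group count passed in by the caller instead of hardcoded. This is not the PR's code: the function name, `hidden_size`, and `d_state` are hypothetical, and the column layout assumes a Mamba-2 style input projection (z and x, then B and C per group, then one dt per head).

```python
def mamba_in_proj_params(hidden_size, mamba_nheads, mamba_d_head, d_state, ngroups):
    """Weight count of a Mamba-2 style input projection (bias ignored).

    Hypothetical helper: `ngroups` is meant to come from
    config.mamba_num_groups rather than a constant inside the function.
    """
    d_inner = mamba_nheads * mamba_d_head
    # Output columns: z and x (2 * d_inner), B and C (2 * ngroups * d_state),
    # and dt (one value per head).
    out_features = 2 * d_inner + 2 * ngroups * d_state + mamba_nheads
    return hidden_size * out_features


def hardcoding_error(hidden_size, mamba_nheads, mamba_d_head, d_state, ngroups):
    """Difference between an estimate that assumes 8 groups and the real count."""
    assumed = mamba_in_proj_params(hidden_size, mamba_nheads, mamba_d_head, d_state, 8)
    actual = mamba_in_proj_params(hidden_size, mamba_nheads, mamba_d_head, d_state, ngroups)
    return assumed - actual
```

Under these assumptions the error grows linearly with `hidden_size * d_state` for every group the hardcoded value is off by, which is why a wrong group count would skew the budget loss without raising any error.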
|
|
||
| import random | ||
|
|
||
| import numpy as np |
Nit: numpy is imported but never used in this file.
| import numpy as np |
| def _allreduce_router_grads(model: List[torch.nn.Module], config: TransformerConfig): | ||
| """ | ||
| All-reduce router grads. | ||
|
|
||
| Reduce grads across all the pp stages to ensure that parameters of the router stay in sync. | ||
| """ | ||
|
|
||
| if parallel_state.get_pipeline_model_parallel_world_size() > 1: | ||
| grads_dict: Dict[str, List[torch.Tensor]] = {} | ||
| for model_chunk in model: | ||
| for name, param in get_attr_wrapped_model(model_chunk, 'named_parameters')(): | ||
| if param.requires_grad and getattr(param, 'flextron_router_pp_sync', False): | ||
| grad = param.main_grad | ||
| if name in grads_dict: | ||
| # Add all the virtual PP rank's gradients to | ||
| # the first local virtual PP rank. | ||
| grads_dict[name][0].add_(grad) | ||
| # Append to the end for later update after cross-rank reduce. | ||
| grads_dict[name].append(grad) | ||
| else: | ||
| grads_dict[name] = [grad] | ||
|
|
||
| if grads_dict: | ||
| # All-reduce the gradient on the first VPP rank. | ||
| grads = [param_grad[0] for _, param_grad in grads_dict.items()] | ||
| coalesced = _flatten_dense_tensors(grads) | ||
| torch.distributed.all_reduce( | ||
| coalesced, group=parallel_state.get_pipeline_model_parallel_group() | ||
| ) | ||
| for buf, synced in zip(grads, _unflatten_dense_tensors(coalesced, grads)): | ||
| buf.copy_(synced) | ||
|
|
||
| # Update the gradients on other VPP ranks. | ||
| for grads in grads_dict.values(): | ||
| for grad in grads[1:]: | ||
| grad.copy_(grads[0]) | ||
|
|
This new function modifies a core distributed file but has no unit test coverage. A test verifying the all-reduce behavior (especially the VPP gradient aggregation logic in lines 291-313) would help prevent regressions, since bugs here would silently produce incorrect router gradients across pipeline stages.
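One way to start on such a test is a single-process model of the three steps in the function: fold the virtual-pipeline (VPP) chunks into the first local buffer, sum that buffer across pipeline ranks, then copy the result back to the other chunks. The sketch below is a pure-Python simulation, not a test against the real `_allreduce_router_grads`: plain lists stand in for `main_grad` tensors, the nested-list layout of `ranks` is hypothetical, and it assumes every rank holds the same parameter names.

```python
from typing import Dict, List

Grad = List[float]  # stand-in for a gradient tensor


def sync_router_grads(ranks: List[List[Dict[str, Grad]]]) -> None:
    """Simulate the router-grad sync in place.

    ranks[r][c][name] is the gradient of parameter `name` on pipeline
    rank r, virtual-pipeline chunk c.
    """
    # Step 1: on each rank, add every later chunk's gradient into the
    # first chunk's buffer and remember the later buffers for step 3.
    per_rank: List[Dict[str, List[Grad]]] = []
    for chunks in ranks:
        grads_dict: Dict[str, List[Grad]] = {}
        for chunk in chunks:
            for name, grad in chunk.items():
                if name in grads_dict:
                    first = grads_dict[name][0]
                    for i, value in enumerate(grad):
                        first[i] += value
                    grads_dict[name].append(grad)
                else:
                    grads_dict[name] = [grad]
        per_rank.append(grads_dict)

    # Step 2: emulate a sum all-reduce of the first buffers across ranks.
    for name in per_rank[0]:
        length = len(per_rank[0][name][0])
        total = [sum(d[name][0][i] for d in per_rank) for i in range(length)]
        for d in per_rank:
            d[name][0][:] = total

    # Step 3: copy the reduced gradient to the remaining local chunks.
    for d in per_rank:
        for grads in d.values():
            for grad in grads[1:]:
                grad[:] = grads[0]
```

The property worth asserting in a real test is the same one this simulation checks: after the sync, every copy of a router parameter on every rank and chunk holds the sum of all original gradients.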
Missing test coverage for elasticity hook managers
The PR adds dedicated unit tests for
These contain non-trivial logic (especially the TopKRouter replacement and GroupedMLP's expert-tensor-parallel mask splitting). Consider adding targeted tests for at least the TopKRouter and GroupedMLP managers. |
|
/claude review |
| @@ -0,0 +1,210 @@ | |||
| # Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved. | |||
Nit: make copyright year 2026.
| @@ -0,0 +1,542 @@ | |||
| # Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. | |||
Nit: copyright year 2026.
|
https://gitlab-master.nvidia.com/ADLR/megatron-lm/-/jobs/309002773 |
|
/ok to test 3292cf7 |
|
/ok to test 7e4c52e |
|
🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/25196645789 |
|
🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/25225179430 |
What does this PR do?
This PR lands Flextron (also known as Nemotron Elastic / Star Elastic) into Megatron-LM. Flextron is a post-training method that converts a single parent LLM into a nested family of submodels at different parameter budgets — all produced from one training run, all sharing a single checkpoint. A learnable router maps a user-specified budget to per-axis architectural decisions (embedding width, attention heads, Mamba heads, MoE experts, FFN channels); smaller submodels are strict subsets of larger ones via importance-ranked contiguous slicing, and all variants are trained jointly with knowledge distillation from the frozen parent.
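The nesting property described above can be illustrated with a toy example. This is a sketch under stated assumptions, not the Flextron implementation: the helper names are hypothetical, the importance scores are made up, and strings stand in for weight rows. Units are sorted once by importance, and each submodel then takes a contiguous prefix, so a smaller width is always contained in a larger one.

```python
from typing import Dict, List, Sequence


def importance_order(scores: Sequence[float]) -> List[int]:
    """Indices of units sorted from most to least important."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def permute_rows(rows: Sequence[str], order: Sequence[int]) -> List[str]:
    """Reorder units once so the most important ones come first."""
    return [rows[i] for i in order]


def nested_submodels(rows: Sequence[str], widths: Sequence[int]) -> Dict[int, List[str]]:
    """Each width keeps a contiguous prefix of the reordered units."""
    return {w: list(rows[:w]) for w in widths}


# Toy data: four units (for example FFN channels) with made-up scores.
scores = [0.1, 0.9, 0.4, 0.7]
rows = ["r0", "r1", "r2", "r3"]
ranked = permute_rows(rows, importance_order(scores))
family = nested_submodels(ranked, widths=[2, 3, 4])
```

Because every submodel is a prefix of the same reordered tensor, the whole family can be stored as the largest model plus a width per budget, which is consistent with the single shared checkpoint the description mentions.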
Flextron has been used to produce the elastic variants shipped with Nemotron Nano v2 (12B → 9B + 6B) and Nemotron Nano v3 (30B/3.6A MoE → 23B/2.8A + 12B/2.0A). Until now the implementation has lived on private dev branches. This PR consolidates that work into `main` so it can be open-sourced and maintained alongside the rest of the Megatron-LM post-training surface.
Files at a glance
- `megatron/elastification/`: new module (manager, hooks, router, budget math, config).
- `pretrain_mamba_flex.py`: training entry point with per-microbatch budget sampling.
- `megatron/core/distributed/finalize_model_grads.py`: all-reduces router grads across PP ranks, gated on `config.flextron`.
- `megatron/post_training/model_builder.py`: teacher-config overrides so KD teachers don't carry the router.
- `tests/unit_tests/elastification/`: 10 test files.
- `tests/functional_tests/test_cases/hybrid/hybrid_flextron_nightly_*/` + `tests/test_utils/recipes/h100/flextron.yaml`: nightly functional test.
Contribution process
Pre-checks
Code review
Feel free to message or comment the @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!
All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.
Step 1: Mark PR as "Ready for Review"
.github/CODEOWNERS.
Final Review might get declined if these requirements are not fulfilled.
Step 2: Final Review
For PRs that change `megatron/core`, once all expert reviewers have approved, the `Final Review` label is applied automatically and final reviewers are assigned.
For PRs outside `megatron/core`, this step is skipped.
Step 3: Approved
Once all required reviewers have approved, the `Approved` label is applied automatically.
Merge
Any member of mcore-engineers will be able to merge your PR.
For MRs into `dev` branch
The proposed review process for `dev` branch is under active discussion.
MRs are mergeable after one approval by either `eharper@nvidia.com` or `zijiey@nvidia.com`.