μP: Maximal Update Parameterization by plugyawn · Pull Request #3058 · NVIDIA/Megatron-LM

plugyawn · 2026-01-23T20:25:27Z

What does this PR do ?

Adds support for Maximal Update Parameterization (μP) for optimal hyperparameter transfer across model widths.
Addresses issue #2824 opened by @sbhavani.

The idea is to train several deep, narrow models (i.e., with a reduced hidden_size) to recover the optimal HPs, and then transfer those HPs to wide models (i.e., with a large hidden_size).

Automatic initialization scaling (σ / √(width_mult) for hidden layers) and automatic LR scaling (lr / width_mult for hidden layers; Adam only, not SGD) are also implemented. Embedding/output layers use the base LR (no scaling), as in the original TP-V paper.
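As a rough illustration of the two rules above, here is a minimal sketch. It assumes `width_mult = hidden_size / base_hidden_size`; the function name `mup_hidden_scaling` and the example values for σ and LR are hypothetical and are not taken from the PR's code.

```python
import math


def mup_hidden_scaling(base_std, base_lr, hidden_size, base_hidden_size):
    """Return (init_std, adam_lr) for hidden-layer weights under width-muP.

    Hypothetical helper for illustration only; not the PR's API.
    """
    # Assumption: width_mult is the ratio of target width to base width.
    width_mult = hidden_size / base_hidden_size
    init_std = base_std / math.sqrt(width_mult)  # sigma / sqrt(width_mult)
    adam_lr = base_lr / width_mult               # lr / width_mult (Adam only)
    return init_std, adam_lr


# Example: base model of width 128, target model of width 512 (width_mult = 4).
std, lr = mup_hidden_scaling(
    base_std=0.02, base_lr=1e-3, hidden_size=512, base_hidden_size=128
)
print(std, lr)
```

Embedding and output layers would keep `base_lr` unchanged, per the description above.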

References:

https://arxiv.org/abs/2203.03466

Tagging the @mcore-oncall

Functional tests and documentation in progress, unit tests added.
Some doubts: in the param_and_grad_buffer, I added an is_embedding_parameter even though there is already an is_embedding_or_output_parameter. In TP-V, the fan-in of the output layer is interpreted as infinite-width, unlike the embedding layer, whose fan-in is the fixed vocabulary (both according to the paper and mutransformers). There seem to be conflicting views about the case of tied embeddings (where the embedding and output layer share weights); see this discussion.

Some plots:

The following plots show the current behavior. In the first image, please note that MuP is on a different Y-axis scale than SP. In the second, I believe training longer would make it much clearer that MuP shares an optimal LR across widths (it's currently at 500 steps). I only have access to an A100 at the moment, so these are character-level transformers trained on enwiki8.

Experiment details:

| Parameter | Default | Paper Reference |
|---|---|---|
| `--widths` | 128,256,512,1024,2048,4096,8192 | MuP paper Fig. 1 |
| `--base-hidden-size` | 128 | Base model for `width_mult` |
| `--num-layers` | 4 | Transformer depth |
| `--lr-sweep-steps` | 500 | Steps per (width, LR) run |
| `--num-seeds` | 3 | - |
| `--seq-len` (for LR sweep) | 128 | - |
| `--batch_size` | 4 | - |
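To make the size of this sweep concrete, the sketch below enumerates one run per (width, LR, seed) combination using the defaults from the table. The LR grid is an assumption: the PR does not state its bounds, so 11 log2-spaced values are used purely as a placeholder.

```python
import itertools

widths = [128, 256, 512, 1024, 2048, 4096, 8192]  # --widths default
base_hidden_size = 128                            # --base-hidden-size default
num_seeds = 3                                     # --num-seeds default

# Hypothetical LR grid (11 values, log2-spaced); the real bounds are not given.
lrs = [2.0 ** e for e in range(-14, -3)]

runs = [
    {"width": w, "width_mult": w // base_hidden_size, "lr": lr, "seed": s}
    for w, lr, s in itertools.product(widths, lrs, range(num_seeds))
]
print(len(runs), "runs of 500 steps each")
```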

Plotting code is available on the feature/mup-implementation branch of my fork.

⚠️ For major changes (either in lines of code or in its impact), please make sure to first share a design doc with the team. If you're unsure what's the best way to do so, contact the @mcore-oncall.

Contribution process

flowchart LR
    A[Pre-checks] --> B[PR Tests]
    subgraph Code Review/Approval
        C1[Expert Review] --> C2[Final Review]
    end
    B --> C1
    C2 --> D[Merge]

Pre-checks

I want this PR in a versioned release and have added the appropriate Milestone (e.g., Core 0.8)
I have added relevant unit tests
I have added relevant functional tests [incoming]
I have added proper typing to my code Typing guidelines
I have added relevant documentation
I have run the autoformatter.sh on my PR

Code review

The following process is enforced via the CODEOWNERS file for changes into megatron/core. For changes outside of megatron/core, it is up to the PR author whether or not to tag the Final Reviewer team.

For MRs into `main` branch

Feel free to message or comment the @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!

(Step 1): Add PR label `Expert Review`

(Step 2): Collect the expert reviewers reviews

Attach the Expert Review label when your PR is ready for review.
GitHub auto-assigns expert reviewers based on your changes. They will get notified and pick up your PR soon.

⚠️ Only proceed to the next step once all reviewers have approved, merge conflicts are resolved, and the CI is passing.
Final Review might get declined if these requirements are not fulfilled.

(Step 3): Final Review

Add Final Review label
GitHub auto-assigns final reviewers based on your changes. They will get notified and pick up your PR soon.

(Optional Step 4): Cherry-pick into release branch

If this PR also needs to be merged into core_r* release branches, after this PR has been merged, select Cherry-pick to open a new PR into the release branch.

For MRs into `dev` branch

The proposed review process for `dev` branch is under active discussion.

MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

Merging your PR

Any member of core-adlr and core-nemo will be able to merge your PR.

copy-pr-bot · 2026-01-23T20:25:32Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

plugyawn · 2026-01-23T20:31:43Z

Note that this is just width-MuP (the original paper). There's also a newer depth-MuP, which would remove the need to train "skinny" low-width/high-depth models for transfer. A newer paper, Complete(d)-P, also exists; I have not gone through it entirely.

plugyawn · 2026-01-24T06:27:00Z

Hi! @sbhavani, could you take a look?

sbhavani · 2026-01-24T17:24:22Z

@plugyawn thanks for the contribution! Please bear with us as this will take some time to review since it touches a lot of areas in core

plugyawn · 2026-01-26T04:47:22Z

Thank you, @sbhavani and the team!

KyroChi

Overall I think this is a good start. I highly recommend you put these changes through Claude or ChatGPT to fix several docstring inconsistencies where the content of the docstring does not match the actual behavior of the code.

Some other things that may be problems:

You don't actually plot the minima in the figure of val loss vs lr on the right. Sometimes if not properly implemented you can in fact see the optimum shifting, but we won't be able to tell from this plot since we don't see the minima.
You should plot normalized logits in the coordinate checks since unnormalized logits can hide subtle bugs, like an m^{1/4} dependency or something. When normalized we expect roughly horizontal lines.
The output multipliers are not automatically set. This seems potentially dangerous, as the default behavior will have outputs which are m times larger than we would expect for muP. I left a comment about this. I want to make sure it was set for your experiments that you plot above. I couldn't confirm from my brief perusal of your fork.
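One way to quantify "roughly horizontal" in a coordinate check is to fit the slope of the logit scale against width on log-log axes. The sketch below is a generic least-squares fit, not code from the PR; `width_exponent` is a hypothetical helper and the two input series are synthetic.

```python
import math


def width_exponent(widths, values):
    """Least-squares slope of log(values) against log(widths).

    A slope near 0 means the quantity is width-independent; a slope near
    0.25 would indicate an m^{1/4}-style dependency.
    """
    xs = [math.log(w) for w in widths]
    ys = [math.log(v) for v in values]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    cov = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    var = sum((x - x_mean) ** 2 for x in xs)
    return cov / var


widths = [128, 256, 512, 1024, 2048]
flat = [1.0 for _ in widths]                 # synthetic: no width dependence
leaky = [(w / 128) ** 0.25 for w in widths]  # synthetic: m^{1/4} dependence

print(width_exponent(widths, flat), width_exponent(widths, leaky))
```

In practice `values` would be the measured normalized logit magnitudes per width at a fixed training step.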

@Skylion007 🫡

plugyawn · 2026-01-29T14:16:33Z

> You don't actually plot the minima in the figure of val loss vs lr on the right. Sometimes if not properly implemented you can in fact see the optimum shifting, but we won't be able to tell from this plot since we don't see the minima.

Plotted with the optimum marked. The LR axis is on a log scale, so the optima per width are:

| Width | SP opt LR | MuP opt LR |
|---|---|---|
| 128 | 9.77e-04 | 9.77e-04 |
| 256 | 4.88e-04 | 9.77e-04 |
| 512 | 2.44e-04 | 9.77e-04 |
| 1024 | 1.22e-04 | 9.77e-04 |
| 2048 | 6.10e-05 | 1.95e-03 |

Also, this sweep uses 16 LRs instead of 11, since the minimum wasn't clear.

The shift at width 2048 is worrying, but the minimum loss looks close by.
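A quick arithmetic check on the table, using only the numbers reported above: the SP optimum halves with each doubling of width, so width × LR stays roughly constant, while the MuP optimum is unchanged up to width 1024 and moves by a factor of about two at 2048.

```python
# Optimal LRs copied from the table above.
sp_opt_lr = {128: 9.77e-04, 256: 4.88e-04, 512: 2.44e-04, 1024: 1.22e-04, 2048: 6.10e-05}
mup_opt_lr = {128: 9.77e-04, 256: 9.77e-04, 512: 9.77e-04, 1024: 9.77e-04, 2048: 1.95e-03}

# Under SP, width * optimal LR is roughly constant (optimum scales like 1/width).
sp_products = {w: w * lr for w, lr in sp_opt_lr.items()}

# Under MuP, the optimum relative to the base width stays at 1 until 2048.
mup_ratio_to_base = {w: lr / mup_opt_lr[128] for w, lr in mup_opt_lr.items()}

for w in sp_opt_lr:
    print(w, round(sp_products[w], 4), round(mup_ratio_to_base[w], 2))
```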

Getting back with coordinate checks soon.

plugyawn · 2026-01-29T17:30:20Z

Normalized logits. T is gradient steps after training.

KyroChi · 2026-01-29T20:01:45Z

I don't think that the optimum shift for the 2048 model is concerning for two reasons:

The loss is almost identical, we expect mup to only hold on average, which means that occasionally the empirical optimum will shift a little bit between runs. The curvature of the lr vs. loss curves usually decreases as the model size increases which only adds to this issue.
The 2048 models are almost certainly undertrained which will ALWAYS favor a larger learning rate.

Regarding this latter point, transfer is really only expected to occur at fixed TPP (tokens per parameter) or something, but due to the asymptotic properties of mup we can usually just get away with optimally training our largest model and overtraining smaller models to demonstrate mu-transfer. You see only 4 × 128 × 500 = 256,000 tokens during training, which is ~1 TPP for the smallest model and ~0.003 TPP for the largest model 😝 In my experience you usually need at least 2 TPP to get good transfer plots, so in some sense this is already better than I would expect!
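The token count follows directly from the sweep defaults (batch size 4, sequence length 128, 500 steps). The sketch below reproduces it and gives a rough tokens-per-parameter estimate. The parameter count uses the common 12 · L · d² non-embedding approximation, which is an assumption here; the exact TPP depends on how parameters are counted, so the estimate lands in the same ballpark as the figures quoted above rather than matching them exactly.

```python
def tokens_seen(batch_size, seq_len, steps):
    """Total tokens processed during one training run."""
    return batch_size * seq_len * steps


def approx_params(num_layers, hidden_size):
    """Rough non-embedding parameter count.

    Assumption: about 12 * L * d^2 parameters for a standard transformer stack.
    """
    return 12 * num_layers * hidden_size ** 2


tokens = tokens_seen(batch_size=4, seq_len=128, steps=500)
print("tokens:", tokens)

for width in (128, 2048):
    tpp = tokens / approx_params(num_layers=4, hidden_size=width)
    print(f"width={width}: ~{tpp:.4f} tokens per parameter")
```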

Regarding the residual power dependency in the logits plot: this could be because Megatron's default optimizer Adam eps is 10^{-8}, which is actually rather high for mup. Since your models are pretty small you can probably get away with setting this to 10^{-12} or even 10^{-15} and see if the coordinate check flattens. This is a quirk of all mup implementations unfortunately and is unlikely to indicate a bug IMO. See also this paper.
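A small numeric sketch of why the Adam eps can matter. The comment does not spell out the mechanism, so this rests on the assumption that the issue is eps becoming comparable to the gradient scale: on the first Adam step the bias-corrected update is g / (|g| + eps), so once |g| is on the order of eps the update is visibly damped. `first_adam_step` is a hypothetical helper.

```python
def first_adam_step(grad, eps):
    """Magnitude of the first bias-corrected Adam update, in units of the LR."""
    m_hat = grad          # bias-corrected first moment after one step
    v_hat = grad * grad   # bias-corrected second moment after one step
    return abs(m_hat) / (v_hat ** 0.5 + eps)


for grad in (1e-4, 1e-8):
    for eps in (1e-8, 1e-12):
        print(f"grad={grad:.0e} eps={eps:.0e} step={first_adam_step(grad, eps):.4f}")
```

With a gradient entry of 1e-8 and eps of 1e-8 the step is roughly halved, while eps of 1e-12 leaves it essentially at full size.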

BoxiangW · 2026-01-29T20:09:34Z

Hi @plugyawn, thanks for your contribution here! I was actually implementing this on my branch, but since you have all the experiment results ready, we can try to merge your PR!

plugyawn · 2026-01-30T04:45:38Z

> I don't think that the optimum shift for the 2048 model is concerning for two reasons:
>
> The loss is almost identical, we expect mup to only hold on average, which means that occasionally the empirical optimum will shift a little bit between runs. The curvature of the lr vs. loss curves usually decreases as the model size increases which only adds to this issue.
>
> The 2048 models are almost certainly undertrained which will ALWAYS favor a larger learning rate.

That makes sense! Thank you!

> Regarding this latter point, transfer is really only expected to occur at fixed TPP or something, but due to the asymptotic properties of mup we can usually just get away with optimally training our largest model and overtraining smaller models to demonstrate mu-transfer. You see only 4 × 128 × 500 = 256,000 tokens during training, which is ~1 TPP for the smallest model and ~0.003 TPP for the largest model 😝 In my experience you usually need at least 2 TPP to get good transfer plots, so in some sense this is already better than I would expect!

I did not know we had good heuristics for when transfer happens! That makes sense!
I had some intuition, of course, that it must take some time to saturate, but I thought 256,000 should be close to enough... given the smallness of the vocabulary?

I was also reminded of https://arxiv.org/abs/2501.16975 and their scaling laws over vocabulary (min. loss decreases with log vocab), although that might be unrelated.

> Regarding the residual power dependency in the logits plot: this could be because Megatron's default optimizer Adam eps is 10^{-8}, which is actually rather high for mup. Since your models are pretty small you can probably get away with setting this to 10^{-12} or even 10^{-15} and see if the coordinate check flattens. This is a quirk of all mup implementations unfortunately and is unlikely to indicate a bug IMO. See also this paper.

That makes sense!

Hahaha and also thanks this sent me into a rabbit hole, learned quite a few things!

janEbert · 2026-02-26T14:20:42Z

Thank you for the endless endurance and spirit! Let's get this merged. :)

janEbert · 2026-02-26T14:22:19Z

/ok to test a66ae5b

janEbert · 2026-02-26T14:35:12Z

Think you just need to rebase onto main and apply tools/autoformat.sh.

plugyawn · 2026-02-26T17:04:35Z

Reran the plots for SGD as well, to be sure:

Autoformat's done, too!

> Thank you for the endless endurance and spirit! Let's get this merged. :)

It was very fun!

BoxiangW · 2026-02-26T18:25:43Z

/ok to test b1cd2d8

BoxiangW · 2026-02-26T18:26:17Z

Just triggered one last CI run; it will be merged if everything passes. Thanks, @plugyawn!

svcnvidia-nemo-ci · 2026-02-26T22:04:32Z

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/22463189540

…-NeMo#3058) Apply per-parameter-class LR/eps scaling in setup_optimizer when use_mup=True on the model config. Mirrors the get_mup_config_overrides call added to MCore's setup_model_and_optimizer in NVIDIA/Megatron-LM#3058. The μP config fields (use_mup, mup_base_hidden_size, mup_width_mult, etc.) are already present via MCoreTransformerConfig inheritance — no model config changes needed. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Apply per-parameter-class LR/eps scaling in setup_optimizer when use_mup=True on the model config. Mirrors the get_mup_config_overrides call added to MCore's setup_model_and_optimizer in NVIDIA/Megatron-LM#3058. The μP config fields (use_mup, mup_base_hidden_size, mup_width_mult, etc.) are already present via MCoreTransformerConfig inheritance — no model config changes needed. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Akash Mehra <akamehra@nvidia.com>
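To illustrate what "per-parameter-class LR scaling" could look like, here is a minimal sketch in plain Python. It is not the actual `get_mup_config_overrides` or `setup_optimizer` code, whose signatures are not shown in this thread: the function `build_mup_lr_groups`, the class labels, and the parameter names are all hypothetical. The eps scaling mentioned in the commit message is omitted because its rule is not stated here.

```python
def build_mup_lr_groups(param_classes, base_lr, width_mult):
    """Group parameter names by class and attach a per-class Adam LR.

    param_classes maps a parameter name to "hidden", "embedding", or "output".
    Hypothetical helper for illustration; not Megatron's API.
    """
    lr_by_class = {
        "hidden": base_lr / width_mult,  # hidden layers: lr / width_mult
        "embedding": base_lr,            # embedding: base LR, no scaling
        "output": base_lr,               # output layer: base LR, no scaling
    }
    groups = {}
    for name, cls in param_classes.items():
        group = groups.setdefault(cls, {"params": [], "lr": lr_by_class[cls]})
        group["params"].append(name)
    return list(groups.values())


# Hypothetical parameter names, for illustration only.
example = {
    "embedding.weight": "embedding",
    "layers.0.attn.qkv.weight": "hidden",
    "layers.0.mlp.fc1.weight": "hidden",
    "output_layer.weight": "output",
}
groups = build_mup_lr_groups(example, base_lr=1e-3, width_mult=8)
for g in groups:
    print(g["lr"], g["params"])
```

The resulting list has the same shape as the per-parameter-group dictionaries that optimizers such as `torch.optim.Adam` accept, with names standing in for the actual tensors.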

…A#3058) New files: - tests/unit_tests/transformer/test_mup.py

These test files import from existing modules that are modified in Phase 2: - test_rmsnorm_residual_fusion.py: imports TEFusedResidualRMSNorm (added in NVIDIA#3384) - test_mup.py: imports get_mup_config_overrides (added in NVIDIA#3058) - test_multimodule_schedules.py: imports MultiModuleProcessGroupCollection (added in NVIDIA#3129) They will be re-added in Phase 2 when the corresponding code changes land. Made-with: Cursor

plugyawn requested review from a team as code owners January 23, 2026 20:25

ko3n1g requested a review from a team January 23, 2026 20:25

github-actions Bot added the community-request label Jan 23, 2026

plugyawn changed the title ~~μP: Maximal Update Parameterization [Draft]~~ μP: Maximal Update Parameterization Jan 23, 2026

plugyawn force-pushed the feature/mup branch from d4b66db to aeca7cc Compare January 23, 2026 20:56

pilot7747 mentioned this pull request Jan 27, 2026

Implement muP NVIDIA-NeMo/Megatron-Bridge#2080

Open

chtruong814 added the needs-follow-up Issue needs follow-up label Jan 28, 2026

KyroChi reviewed Jan 28, 2026

Comment thread megatron/core/transformer/transformer_config.py

plugyawn force-pushed the feature/mup branch from aeca7cc to 963e49e Compare January 29, 2026 08:04

chtruong814 removed the needs-follow-up Issue needs follow-up label Jan 29, 2026

BoxiangW reviewed Jan 29, 2026

Comment thread megatron/core/optimizer/__init__.py Outdated

BoxiangW reviewed Jan 29, 2026

Comment thread megatron/core/transformer/transformer_config.py

BoxiangW reviewed Jan 29, 2026

Comment thread megatron/core/distributed/fsdp/src/megatron_fsdp/param_and_grad_buffer.py

BoxiangW reviewed Jan 29, 2026

Comment thread megatron/core/transformer/transformer_config.py Outdated

BoxiangW mentioned this pull request Jan 29, 2026

Feature Request: μP (Maximal Update Parameterization) #2824

Closed

janEbert reviewed Jan 30, 2026

Comment thread megatron/core/transformer/transformer_config.py

chtruong814 added the needs-follow-up Issue needs follow-up label Feb 1, 2026

fix(mup): implement Table-8 SGD LR scaling with decoupled precedence

a66ae5b

janEbert approved these changes Feb 26, 2026

plugyawn added 2 commits February 26, 2026 22:13

Merge branch 'main' into feature/mup

37a0cb3

style: run tools/autoformat.sh after rebasing main

b1cd2d8

jaredcasper approved these changes Feb 26, 2026

BoxiangW enabled auto-merge February 26, 2026 18:25

BoxiangW assigned plugyawn Feb 26, 2026

copy-pr-bot Bot temporarily deployed to test February 26, 2026 18:26 Inactive

BoxiangW added this pull request to the merge queue Feb 26, 2026

Merged via the queue into NVIDIA:main with commit 310082a Feb 26, 2026
51 checks passed

BoxiangW pushed a commit to BoxiangW/Megatron-LM that referenced this pull request Mar 4, 2026

μP: Maximal Update Parameterization (NVIDIA#3058)

bafed09

ilml added a commit to ilml/Megatron-LM that referenced this pull request Mar 20, 2026

Add new files from 310082a μP: Maximal Update Parameterization (NVIDI…

3ef2dcb

…A#3058) New files: - tests/unit_tests/transformer/test_mup.py

sbhavani mentioned this pull request Mar 23, 2026

[ROADMAP][2026 Q1] Megatron Core Roadmap #4003

Open

plugyawn mentioned this pull request Apr 1, 2026

Hyperparameter Transfer beyond MuP #4088

Open

plugyawn mentioned this pull request Apr 19, 2026

General scaling policy for HP transfer + Depth MuP recipe #4381

Open

5 tasks

yangbofun pushed a commit to xlm-research/Megatron-LM that referenced this pull request May 22, 2026

μP: Maximal Update Parameterization (NVIDIA#3058)

af0b977

sbhavani mentioned this pull request May 26, 2026

[ROADMAP][2026 Q2] Megatron Core Roadmap #4997

Open

This was referenced May 26, 2026

[2/4] Hyperparameter Transfer: add canonical width-MuP recipe #4998

Draft

[1/4] Hyperparameter Transfer: add scaling policy infrastructure #4829

Open

11 participants
