μP: Maximal Update Parameterization #3058
Conversation
|
Note that this is just width-MuP (the original paper). There's also a newer depth-MuP (which would help with not having to train "skinny" low-width/high-depth models for transfer). A newer paper, Complete(d)-P, also exists, but I have not gone through it entirely.
|
Hi! @sbhavani, could you take a look? |
|
@plugyawn thanks for the contribution! Please bear with us as this will take some time to review since it touches a lot of areas in core |
|
Thank you, @sbhavani and the team! |
KyroChi
left a comment
Overall I think this is a good start. I highly recommend you put these changes through Claude or ChatGPT to fix several docstring inconsistencies where the content of the docstring does not match the actual behavior of the code.
Some other potential problems:
- You don't actually plot the minima in the figure of val loss vs lr on the right. Sometimes if not properly implemented you can in fact see the optimum shifting, but we won't be able to tell from this plot since we don't see the minima.
- You should plot normalized logits in the coordinate checks since unnormalized logits can hide subtle bugs, like an m^{1/4} dependency or something. When normalized we expect roughly horizontal lines.
- The output multipliers are not automatically set. This seems potentially dangerous, as the default behavior will have outputs which are m times larger than we would expect for muP. I left a comment about this. I want to make sure it was set for your experiments that you plot above. I couldn't confirm from my brief perusal of your fork.
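A numeric complement to eyeballing the coordinate-check plot, sketched under assumptions: fit the log-log slope of a per-width activation statistic (for example mean absolute logit) against width. A slope near 0 corresponds to the "roughly horizontal lines" mentioned above, while something like 0.25 would point to an m^{1/4} dependency. The helper name `loglog_slope`, the width list, and the measurements below are all hypothetical and synthetic; they are not taken from this PR's experiments.

```python
import math


def loglog_slope(widths, values):
    """Least-squares slope of log(values) against log(widths)."""
    xs = [math.log(w) for w in widths]
    ys = [math.log(v) for v in values]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    cov = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    var = sum((x - x_mean) ** 2 for x in xs)
    return cov / var


# Synthetic measurements only (assumed widths, made-up statistics).
widths = [128, 256, 512, 1024, 2048]
flat = [0.5 for _ in widths]                               # width-independent
leaky = [0.5 * (w / widths[0]) ** 0.25 for w in widths]    # hidden m^{1/4} growth

print("flat slope :", round(loglog_slope(widths, flat), 3))
print("leaky slope:", round(loglog_slope(widths, leaky), 3))
```

Reporting the fitted slope next to each curve makes a small residual exponent harder to miss than a visual check alone.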
|
I don't think that the optimum shift for the 2048 model is concerning for two reasons:
Regarding this latter point, transfer is really only expected to occur at a fixed tokens-per-parameter (TPP) ratio or something similar, but due to the asymptotic properties of muP we can usually get away with optimally training our largest model and overtraining the smaller models to demonstrate mu-transfer. You see only 4 × 128 × 500 = 256,000 tokens during training, which is ~1 TPP for the smallest model and ~0.003 TPP for the largest model 😝 In my experience you usually need at least 2 TPP to get good transfer plots, so in some sense this is already better than I would expect!
Regarding the residual power dependency in the logits plot: this could be because the default Adam eps in Megatron's optimizer is 10^{-8}, which is actually rather high for muP. Since your models are pretty small you can probably get away with setting this to 10^{-12} or even 10^{-15} and seeing if the coordinate check flattens. This is a quirk of all muP implementations, unfortunately, and is unlikely to indicate a bug IMO. See also this paper.
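One way to see why a "rather high" Adam eps could matter is the toy calculation below. It is an illustration under stated assumptions, not a description of Megatron's optimizer: for a constant gradient of magnitude g, bias-corrected Adam's first moment tends to g and its second moment to g², so the normalized step tends to g / (g + eps). The function name `adam_step_fraction` and the gradient magnitudes are hypothetical.

```python
def adam_step_fraction(grad_magnitude, eps):
    """Steady-state |m_hat| / (sqrt(v_hat) + eps) for a constant gradient.

    With a constant gradient g, m_hat -> g and v_hat -> g**2, so the
    normalized step size is g / (g + eps). A value of 1.0 means eps is
    negligible; smaller values mean eps is damping the update.
    """
    return grad_magnitude / (grad_magnitude + eps)


# Hypothetical gradient magnitudes; the eps values are the ones discussed above.
for g in (1e-4, 1e-6, 1e-8):
    for eps in (1e-8, 1e-12, 1e-15):
        frac = adam_step_fraction(g, eps)
        print(f"|g|={g:.0e}  eps={eps:.0e}  step fraction={frac:.6f}")
```

When gradient magnitudes approach eps, the step is no longer scale-invariant, which is one plausible source of a residual width dependence in a coordinate check.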
|
Hi @plugyawn, thanks for your contribution here. I was actually implementing this on my branch, but since you have all the experiment results ready, we can try to merge your PR!
That makes sense! Thank you!
I was also reminded of https://arxiv.org/abs/2501.16975 and their scaling laws over vocabulary (min. loss decreases with log vocab), although that might be unrelated.
That makes sense! Hahaha and also thanks this sent me into a rabbit hole, learned quite a few things! |
|
Thank you for the endless endurance and spirit! Let's get this merged. :) |
|
/ok to test a66ae5b |
|
Think you just need to rebase onto |
|
/ok to test b1cd2d8 |
|
Just triggered one last CI run; it will be merged if everything passes. Thanks, @plugyawn!
|
🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/22463189540 |
…-NeMo#3058) Apply per-parameter-class LR/eps scaling in setup_optimizer when use_mup=True on the model config. Mirrors the get_mup_config_overrides call added to MCore's setup_model_and_optimizer in NVIDIA/Megatron-LM#3058. The μP config fields (use_mup, mup_base_hidden_size, mup_width_mult, etc.) are already present via MCoreTransformerConfig inheritance; no model config changes needed. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Akash Mehra <akamehra@nvidia.com>
…A#3058) New files: - tests/unit_tests/transformer/test_mup.py
These test files import from existing modules that are modified in Phase 2:
- test_rmsnorm_residual_fusion.py: imports TEFusedResidualRMSNorm (added in NVIDIA#3384)
- test_mup.py: imports get_mup_config_overrides (added in NVIDIA#3058)
- test_multimodule_schedules.py: imports MultiModuleProcessGroupCollection (added in NVIDIA#3129)
They will be re-added in Phase 2 when the corresponding code changes land.
Made-with: Cursor



What does this PR do ?
Adds support for Maximal Update Parameterization (μP) for optimal hyperparameter transfer across model widths.
Addresses issue #2824 opened by @sbhavani.
The idea is to train multiple high-depth, low-width models (i.e., reduced hidden_size) to recover optimal HPs, and then transfer those HPs to high-width models (i.e., large hidden_size).
Automatic initialization scaling (σ / √(width_mult) for hidden layers) and automatic LR scaling (lr / width_mult for hidden layers; Adam only, not SGD) are also implemented. Embedding/output layers use the base LR (no scaling), as in the original TP-V paper.
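The scaling rules above can be sketched as a small standalone function. This is a minimal illustration, not the code in this PR: the names `mup_scaling` and `MuPScaling` and the parameter-class strings are hypothetical, and leaving the embedding/output init std at its base value is an assumption made only for this sketch (the description above only states that their LR is unscaled). Output multipliers are not modeled here.

```python
import math
from dataclasses import dataclass


@dataclass
class MuPScaling:
    init_std: float
    lr: float


def mup_scaling(param_class, base_std, base_lr, hidden_size, base_hidden_size):
    """Width-muP scaling for Adam, following the rules stated in the description.

    param_class is one of "hidden", "embedding", "output" (hypothetical labels).
    """
    width_mult = hidden_size / base_hidden_size
    if param_class == "hidden":
        # Hidden layers: std = sigma / sqrt(width_mult), lr = lr / width_mult.
        return MuPScaling(base_std / math.sqrt(width_mult), base_lr / width_mult)
    if param_class in ("embedding", "output"):
        # Base LR, no scaling. Keeping base_std here is an assumption of this sketch.
        return MuPScaling(base_std, base_lr)
    raise ValueError(f"unknown parameter class: {param_class!r}")


# Example: base width 256, target width 1024, so width_mult = 4.
hidden = mup_scaling("hidden", base_std=0.02, base_lr=1e-3,
                     hidden_size=1024, base_hidden_size=256)
embed = mup_scaling("embedding", base_std=0.02, base_lr=1e-3,
                    hidden_size=1024, base_hidden_size=256)
print(hidden)
print(embed)
```

At width_mult = 1 every class falls back to the base values, so the base model is unchanged by the parameterization.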
References:
Tagging the @mcore-oncall
Functional tests and documentation in progress, unit tests added.
Some doubts: in the param_and_grad_buffer, I added an `is_embedding_parameter` even though there exists an `is_embedding_or_output_parameter`. In TP-V, the fan-in of the output layer is interpreted as infinite-width, unlike the embedding layer, which has the fixed vocabulary (both according to the paper and mutransformers). There seem to be conflicts about the case of tied embeddings (the embedding and output layer share weights; see this discussion).
Some plots:
The following plots show the current behavior. In the first image, please note that MuP is on a different Y-axis scale than SP. In the second, I believe training longer would make it much clearer that the MuP models share an optimal LR (it's currently at 500 steps). I only have access to an A100 at the moment, so these are character-level transformers trained on enwik8.
Experiment details:
--widths, --base-hidden-size, width_mult, --num-layers, --lr-sweep-steps, --num-seeds, --seq-len (for LR sweep), --batch_size
Plotting code can be accessed under `feature/mup-implementation` on my fork of the code.
Contribution process
flowchart LR
    A[Pre-checks] --> B[PR Tests]
    subgraph Code Review/Approval
        C1[Expert Review] --> C2[Final Review]
    end
    B --> C1
    C2 --> D[Merge]
Pre-checks
Core 0.8)
Code review
The following process is enforced via the CODEOWNERS file for changes into megatron/core. For changes outside of megatron/core, it is up to the PR author whether or not to tag the Final Reviewer team.
For MRs into `main` branch
Feel free to message or comment the @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!
(Step 1): Add PR label Expert Review
(Step 2): Collect the expert reviewers reviews
Expert Review label when your PR is ready for review.
Final Review might get declined if these requirements are not fulfilled.
(Step 3): Final Review
Final Review label
(Optional Step 4): Cherry-pick into release branch
If this PR also needs to be merged into core_r* release branches, after this PR has been merged, select Cherry-pick to open a new PR into the release branch.
For MRs into `dev` branch
The proposed review process for `dev` branch is under active discussion.
MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.
Merging your PR
Any member of core-adlr and core-nemo will be able to merge your PR.