[main] feat(moe): Support apply wd to qk layernorm for Qwen3-Next (4/4) by yuzhongw-nvidia · Pull Request #2753 · NVIDIA/Megatron-LM

yuzhongw-nvidia · 2025-12-24T06:04:36Z

What does this PR do ?

Qwen3-Next functionality PRs.

Contribution process

flowchart LR
    A[Pre-checks] --> B[PR Tests]
    subgraph Code Review/Approval
        C1[Expert Review] --> C2[Final Review]
    end
    B --> C1
    C2 --> D[Merge]

Pre-checks

I want this PR in a versioned release and have added the appropriate Milestone (e.g., Core 0.8)
I have added relevant unit tests
I have added relevant functional tests
I have added proper typing to my code (see the Typing guidelines)
I have added relevant documentation
I have run the autoformatter.sh on my PR

Code review

The following process is enforced via the CODEOWNERS file for changes into megatron/core. For changes outside of megatron/core, it is up to the PR author whether or not to tag the Final Reviewer team.

For MRs into `main` branch

(Step 1): Add PR label `Expert Review`

(Step 2): Collect the expert reviewers' reviews

Attach the Expert Review label when your PR is ready for review.
GitHub auto-assigns expert reviewers based on your changes. They will get notified and pick up your PR soon.

⚠️ Only proceed to the next step once all reviewers have approved, merge conflicts are resolved, and CI is passing.
Final Review might get declined if these requirements are not fulfilled.

(Step 3): Final Review

Add Final Review label
GitHub auto-assigns final reviewers based on your changes. They will get notified and pick up your PR soon.

(Optional Step 4): Cherry-pick into release branch

If this PR also needs to be merged into core_r* release branches, after this PR has been merged, select Cherry-pick to open a new PR into the release branch.

For MRs into `dev` branch

The proposed review process for `dev` branch is under active discussion.

MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

Merging your PR

Any member of core-adlr and core-nemo will be able to merge your PR.

copy-pr-bot · 2025-12-24T06:04:39Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

yuzhongw-nvidia · 2025-12-24T06:13:02Z

/ok to test 5e4fd18

yuzhongw-nvidia · 2025-12-24T06:59:09Z

/ok to test 5afcfdc

yuzhongw-nvidia · 2025-12-24T07:34:14Z

/ok to test 3a29a80

yuzhongw-nvidia · 2026-01-06T05:29:38Z

/ok to test a34da99

yuzhongw-nvidia · 2026-01-14T06:49:28Z

/ok to test 11c9076

Phlip79 · 2026-01-14T20:07:03Z

/ok to test ac0fb11

deepakn94 · 2026-01-15T05:31:04Z

                    )
            combined_override[key] = value
+
+    # Overrides that force overrides.


What does this mean?

deepakn94 · 2026-01-15T05:35:44Z

    end_wd: float
    wd_mult: float

+    _force_override: bool = False


Why does this have an underscore at the front?

#2968 would get rid of this part of the logic, but the answer is that the current implementation adds wd overrides twice: once setting the qk_layernorm layers to wd=0, then later setting them to wd=1. Setting things twice breaks the old logic, so the workaround was to add a "force override" concept. The leading underscore tells the optimizer loop not to treat this key as a setting to override in the parameter group. I think the proposal I added in that PR accomplishes the goal in a cleaner way.
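For readers following the thread, the underscore convention described above can be sketched as follows. This is a minimal illustration, not the code from this PR: `WdOverride` and `param_group_kwargs` are hypothetical names, and only `wd_mult` and `_force_override` come from the diff.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class WdOverride:
    # Hypothetical stand-in for the override record shown in the diff.
    wd_mult: float
    # Control flag only; the leading underscore marks it as "not an optimizer setting".
    _force_override: bool = False


def param_group_kwargs(override):
    """Return only the public fields; underscore-prefixed fields are skipped."""
    return {
        f.name: getattr(override, f.name)
        for f in fields(override)
        if not f.name.startswith("_")
    }


# An override that re-enables weight decay and is allowed to replace an earlier one.
unskip = WdOverride(wd_mult=1.0, _force_override=True)
group_kwargs = param_group_kwargs(unskip)
print(group_kwargs)
```

The flag stays readable by the merging logic while never reaching the parameter group itself.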

deepakn94 · 2026-01-15T05:39:41Z

                       help='Dropout probability for hidden state transformer.')
    group.add_argument('--weight-decay', type=float, default=0.01,
                       help='Weight decay coefficient for L2 regularization.')
+    group.add_argument('--apply-wd-to-qk-layernorm', action='store_true',


I'm not sure I understand what this option does. In particular, "as a special case"? What's the general case then?

Generally, all len==1 parameters (e.g., layernorm weights) and bias terms get added to the wd=0 group. This option says "do not add the q or k layernorm weights to the wd=0 group; leave them as wd=1".
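The rule described above can be sketched in plain Python. This is an illustration under stated assumptions, not the Megatron-LM implementation: "len==1" is read as a one-dimensional shape, `FakeParam`, `is_qk_layernorm`, and `split_by_weight_decay` are hypothetical helpers, and the parameter names are made up for the example. Only the `--apply-wd-to-qk-layernorm` flag comes from the diff.

```python
import argparse
from dataclasses import dataclass


@dataclass
class FakeParam:
    # Minimal stand-in for a tensor; only the shape matters here.
    shape: tuple


def is_qk_layernorm(name):
    # Assumed naming convention for the q/k layernorm weights.
    return "q_layernorm" in name or "k_layernorm" in name


def split_by_weight_decay(named_params, apply_wd_to_qk_layernorm=False):
    """Return (decay_names, no_decay_names) following the rule described above."""
    decay, no_decay = [], []
    for name, param in named_params:
        # General case: biases and 1-D parameters skip weight decay.
        skip = name.endswith(".bias") or len(param.shape) == 1
        # Special case: keep weight decay on the q/k layernorm weights.
        if skip and apply_wd_to_qk_layernorm and is_qk_layernorm(name):
            skip = False
        (no_decay if skip else decay).append(name)
    return decay, no_decay


parser = argparse.ArgumentParser()
parser.add_argument("--apply-wd-to-qk-layernorm", action="store_true")
args = parser.parse_args(["--apply-wd-to-qk-layernorm"])

named_params = [
    ("layers.0.self_attention.linear_qkv.weight", FakeParam((3072, 1024))),
    ("layers.0.self_attention.linear_qkv.bias", FakeParam((3072,))),
    ("layers.0.self_attention.q_layernorm.weight", FakeParam((128,))),
    ("layers.0.self_attention.k_layernorm.weight", FakeParam((128,))),
    ("layers.0.input_layernorm.weight", FakeParam((1024,))),
]

default_decay, default_no_decay = split_by_weight_decay(named_params)
flag_decay, flag_no_decay = split_by_weight_decay(
    named_params, args.apply_wd_to_qk_layernorm
)
print(default_decay)
print(flag_decay)
```

With the flag set, only the two q/k layernorm weights move from the wd=0 group to the decayed group; the other layernorm weight and the bias stay where they were.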

jstjohn

I'm nervous about how this force_override approach will interact with multiple different kinds of merging. For example, it seems to work well when only wd is overridden, but what if we also want to override LR (e.g., decoupled LR) and that partially overlaps with the parameters that need to be wd-unskipped?

Here is an alternative approach, please feel free to cherry pick the commit over: #2968, (e57b2f5)

The new design works by adding a new kind of predicate that handles the (param, name) tuple. That is sufficient for modifying the weight decay skip rule with your filter for qk_layernorm. @FDecaYed came up with this idea, which would have simplified his matching rule in Megatron Bridge for this same problem.
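A rough sketch of the predicate idea, for illustration only. The name `ParamWithName` appears in the commit title, but its fields, the `resolve_overrides` helper, the `lr_mult` key, and the parameter names below are all assumptions rather than the API added in #2968.

```python
from types import SimpleNamespace
from typing import NamedTuple


class ParamWithName(NamedTuple):
    # Hypothetical shape of the (param, name) pair handed to each predicate.
    param: object
    name: str


def is_qk_layernorm(name):
    # Assumed naming convention for the q/k layernorm weights.
    return "q_layernorm" in name or "k_layernorm" in name


def resolve_overrides(pwn, rules):
    """Merge the overrides of every rule whose predicate matches."""
    merged = {}
    for predicate, override in rules:
        if predicate(pwn):
            merged.update(override)
    return merged


rules = [
    # Weight decay skip rule, with the q/k layernorm filter folded into the predicate.
    (
        lambda p: (p.name.endswith(".bias") or len(p.param.shape) == 1)
        and not is_qk_layernorm(p.name),
        {"wd_mult": 0.0},
    ),
    # Hypothetical, independent learning-rate rule that partially overlaps the first.
    (lambda p: "output_layer" in p.name, {"lr_mult": 0.1}),
]


def make(name, shape):
    # Build a ParamWithName around a shape-only stand-in for a tensor.
    return ParamWithName(SimpleNamespace(shape=shape), name)


qk = resolve_overrides(make("self_attention.q_layernorm.weight", (128,)), rules)
ln = resolve_overrides(make("input_layernorm.weight", (1024,)), rules)
out_bias = resolve_overrides(make("output_layer.bias", (32000,)), rules)
print(qk, ln, out_bias)
```

Because the exception lives inside the predicate, each parameter is matched once per rule and nothing has to be set and then forcibly reset, so a separate LR rule can overlap without special handling.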

yuzhongw-nvidia · 2026-01-16T04:11:38Z

> I'm nervous about how this force_override approach will interact with multiple different kinds of merging. For example, it seems to work well when only wd is overridden, but what if we also want to override LR (e.g., decoupled LR) and that partially overlaps with the parameters that need to be wd-unskipped?
>
> Here is an alternative approach, please feel free to cherry pick the commit over: #2968, (e57b2f5)
>
> The new design works by adding a new kind of predicate that handles the (param, name) tuple. That is sufficient for modifying the weight decay skip rule with your filter for qk_layernorm. @FDecaYed came up with this idea, which would have simplified his matching rule in Megatron Bridge for this same problem.

Thanks @jstjohn and @FDecaYed for your help. Your implementation is much cleaner, so I cherry-picked your changes.

Hi @deepakn94, could you please take a look at the current version?

yuzhongw-nvidia · 2026-01-16T04:11:54Z

/ok to test fc64403

jstjohn

Thank you!

deepakn94

Looks great, thank you.

deepakn94 · 2026-01-16T19:01:42Z

/ok to test 2ababd2

chtruong814 · 2026-01-16T22:57:51Z

Fast merging since the functional tests on main were passing. We had some issues with newer tests we were onboarding. This one should have merged earlier.

yuzhongw-nvidia requested review from a team as code owners December 24, 2025 06:04

github-actions Bot requested a review from Phlip79 December 24, 2025 06:04

This was referenced Dec 24, 2025

[main] feat(moe): Support gated delta net for Qwen3-Next (1/4) #1989

Merged

[main] feat(moe): Support moe shared expert gate for Qwen3-Next (2/4) #2751

Merged

[main] feat(moe): Support attention output gate for Qwen3-Next (3/4) #2752

Merged

yuzhongw-nvidia requested a review from a team December 24, 2025 06:11

copy-pr-bot Bot temporarily deployed to nemo-ci December 24, 2025 06:13 Inactive

ko3n1g added this to the Core 0.16 milestone Dec 24, 2025

copy-pr-bot Bot had a problem deploying to nemo-ci December 24, 2025 06:13 Failure

copy-pr-bot Bot temporarily deployed to nemo-ci December 24, 2025 06:13 Inactive

yuzhongw-nvidia force-pushed the qwen3next_wd branch from 5e4fd18 to 5afcfdc Compare December 24, 2025 06:58

copy-pr-bot Bot temporarily deployed to nemo-ci December 24, 2025 06:59 Inactive

copy-pr-bot Bot had a problem deploying to nemo-ci December 24, 2025 06:59 Failure

copy-pr-bot Bot temporarily deployed to nemo-ci December 24, 2025 06:59 Inactive

copy-pr-bot Bot temporarily deployed to test December 24, 2025 07:00 Inactive

yuzhongw-nvidia force-pushed the qwen3next_wd branch from 5afcfdc to 3a29a80 Compare December 24, 2025 07:24

copy-pr-bot Bot temporarily deployed to nemo-ci December 24, 2025 07:34 Inactive

copy-pr-bot Bot had a problem deploying to nemo-ci December 24, 2025 07:34 Failure

copy-pr-bot Bot temporarily deployed to nemo-ci December 24, 2025 07:34 Inactive

yuzhongw-nvidia force-pushed the qwen3next_wd branch from 3a29a80 to a34da99 Compare January 6, 2026 05:29

copy-pr-bot Bot temporarily deployed to nemo-ci January 6, 2026 05:29 Inactive

yuzhongw-nvidia force-pushed the qwen3next_wd branch from 56b10b1 to 11c9076 Compare January 14, 2026 06:45

copy-pr-bot Bot temporarily deployed to nemo-ci January 14, 2026 06:49 Inactive

copy-pr-bot Bot had a problem deploying to nemo-ci January 14, 2026 06:49 Failure

copy-pr-bot Bot temporarily deployed to nemo-ci January 14, 2026 06:49 Inactive

copy-pr-bot Bot temporarily deployed to test January 14, 2026 06:50 Inactive

copy-pr-bot Bot temporarily deployed to nemo-ci January 14, 2026 20:07 Inactive

copy-pr-bot Bot had a problem deploying to nemo-ci January 14, 2026 20:07 Failure

copy-pr-bot Bot temporarily deployed to nemo-ci January 14, 2026 20:07 Inactive

copy-pr-bot Bot temporarily deployed to test January 14, 2026 20:07 Inactive

Phlip79 requested a review from deepakn94 January 14, 2026 21:30

deepakn94 reviewed Jan 15, 2026

Comment thread megatron/training/arguments.py Outdated

deepakn94 requested a review from jstjohn January 15, 2026 05:41

jstjohn requested changes Jan 15, 2026

yuzhongw-nvidia and others added 2 commits January 15, 2026 20:05

apply_wd_to_qk_layernorm

85e7e62

Adding support for ParamWithName lambdas for matching, and using to s…

fc64403

…upport Qwen3 weight decay Signed-off-by: John St. John <jstjohn@nvidia.com>

jstjohn approved these changes Jan 16, 2026

deepakn94 approved these changes Jan 16, 2026

Merge branch 'main' into qwen3next_wd

2ababd2

jstjohn mentioned this pull request Jan 21, 2026

Refactor the optimizer override function so that users can swap in their own NVIDIA-NeMo/Megatron-Bridge#2010

Merged

5 tasks
