[main] feat(moe): Support apply wd to qk layernorm for Qwen3-Next (4/4)#2753
Conversation
|
/ok to test 5e4fd18 |
5e4fd18 to
5afcfdc
Compare
|
/ok to test 5afcfdc |
5afcfdc to
3a29a80
Compare
|
/ok to test 3a29a80 |
3a29a80 to
a34da99
Compare
|
/ok to test a34da99 |
56b10b1 to
11c9076
Compare
|
/ok to test 11c9076 |
|
/ok to test ac0fb11 |
| ) | ||
| combined_override[key] = value | ||
|
|
||
| # Overrides that force overrides. |
| end_wd: float | ||
| wd_mult: float | ||
|
|
||
| _force_override: bool = False |
There was a problem hiding this comment.
Why does this have an underscore at the front?
There was a problem hiding this comment.
#2968 would get rid of this part of the logic, but the answer is that this current implementation adds wd overrides twice, once setting those qk_layernorm layers to wd=0, then later setting them to wd=1. Setting things twice breaks the old logic, so the work-around was to add a "force override" concept. The underscore at the beginning tells the optimizer loop to not include this key as a thing to override in the parameter group. I think the proposal I added in the PR above accomplishes the goal in a cleaner way.
| help='Dropout probability for hidden state transformer.') | ||
| group.add_argument('--weight-decay', type=float, default=0.01, | ||
| help='Weight decay coefficient for L2 regularization.') | ||
| group.add_argument('--apply-wd-to-qk-layernorm', action='store_true', |
There was a problem hiding this comment.
I'm not sure I understand what this option does. In particular, "as a special case"? What's the general case then?
There was a problem hiding this comment.
Generally all len==1 (eg layernorm weights) or bias terms get added to the wd=0 group. This says "do not add the q or k layernorm weights to the wd=0 group, leave them as wd=1".
There was a problem hiding this comment.
I'm nervous about how this approach of a force_override will interact with multiple different kinds of merging. For example it seems to work well for the case of only overriding wd, but what if we also want to override LR (eg decoupled lr) and that has partial overlap with parameters that need to be wd unskipped?
Here is an alternative approach, please feel free to cherry pick the commit over: #2968, (e57b2f5)
The new design works by adding a new kind of predictate that handles the tuple of param,name. That is sufficient for modifying the weight decay skip rule with your filter for qk_layernorm. @FDecaYed came up with this idea which would have simplified his matching rule in megatron bridge for this same problem.
…upport Qwen3 weight decay Signed-off-by: John St. John <jstjohn@nvidia.com>
Thanks @jstjohn and @FDecaYed for your help. Your implementation is much cleaner, so I cherry-pick your changes. Hi @deepakn94 , could you please help take a look about the current version? |
|
/ok to test fc64403 |
deepakn94
left a comment
There was a problem hiding this comment.
Looks great, thank you.
|
/ok to test 2ababd2 |
|
fast merging since the functional tests on main were passing. We had some issues with newer tests we were onboarding. This one shoudl have merged earlier. |
What does this PR do ?
MR to dev.
Design doc
Qwen3-Next functionality PRs.
Contribution process
flowchart LR A[Pre-checks] --> B[PR Tests] subgraph Code Review/Approval C1[Expert Review] --> C2[Final Review] end B --> C1 C2 --> D[Merge]Pre-checks
Core 0.8)Code review
The following process is enforced via the CODEOWNERS file for changes into
megatron/core. For changes outside ofmegatron/core, it is up to the PR author whether or not to tag the Final Reviewer team.For MRs into `main` branch
(Step 1): Add PR label
Expert Review(Step 2): Collect the expert reviewers reviews
Expert Reviewlabel when your PR is ready for review.Final Review might get declined if these requirements are not fulfilled.
(Step 3): Final Review
Final Reviewlabel(Optional Step 4): Cherry-pick into release branch
If this PR also needs to be merged into
core_r*release branches, after this PR has been merged, selectCherry-pickto open a new PR into the release branch.For MRs into `dev` branch
The proposed review process for `dev` branch is under active discussion.MRs are mergable after one approval by either
eharper@nvidia.comorzijiey@nvidia.com.Merging your PR
Any member of core-adlr and
core-nemowill be able to merge your PR.