[dev] [DeepSeek-v4] Part 2: Hash MoE and SwiGLU clamp #4481
Conversation
Victarry
left a comment
Generally looks good to me. Co-reviewed with AI, please take a look~
activation_func_clamp_value: Optional[float] = None
"""Clamp the output of the linear_fc1 in the activation function. Only used when activation_func
[SUGGESTION] The docstring now claims activation_func_clamp_value works for quick_gelu or swiglu, but in this PR clamp is only wired through the weighted SwiGLU path (weighted_bias_swiglu_impl -> WeightedSwiGLUFunction). The dense path bias_swiglu_impl (used by non-MoE / non-token-weighted SwiGLU MLP) does not accept clamp_value and silently ignores it.
So a user setting activation_func_clamp_value on a model with dense SwiGLU MLP layers will see zero effect, with no warning.
Suggestion: either
- extend `bias_swiglu_impl` / `BiasSwiGLUFunction` / `SwiGLUFunction` to also accept and respect `clamp_value`, or
- narrow the docstring to "weighted SwiGLU (MoE) only" and add a runtime check that warns or asserts when `activation_func_clamp_value > 0` is set on a dense-SwiGLU configuration.
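To make the concern concrete, here is a minimal pure-Python sketch of a clamped SwiGLU plus the kind of runtime check suggested above. It is not the Megatron-LM implementation: the names `swiglu` and `check_clamp_config`, the `uses_weighted_swiglu` argument, and the symmetric clamp to `[-clamp_value, clamp_value]` on both halves of the `linear_fc1` output are all assumptions made for illustration.

```python
import math
import warnings
from typing import List, Optional


def silu(x: float) -> float:
    """Numerically stable SiLU: x * sigmoid(x)."""
    if x >= 0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)
    return x * e / (1.0 + e)


def swiglu(fc1_out: List[float], clamp_value: Optional[float] = None) -> List[float]:
    """SwiGLU over a flat fc1 output: first half is the gate, second half is the up projection.

    ASSUMPTION: when clamp_value is set, both halves are clamped to
    [-clamp_value, clamp_value] before the activation. The real kernel may differ.
    """
    if len(fc1_out) % 2 != 0:
        raise ValueError("fc1 output must have an even number of elements")
    half = len(fc1_out) // 2
    gate, up = fc1_out[:half], fc1_out[half:]
    if clamp_value is not None:
        gate = [max(-clamp_value, min(clamp_value, g)) for g in gate]
        up = [max(-clamp_value, min(clamp_value, u)) for u in up]
    return [silu(g) * u for g, u in zip(gate, up)]


def check_clamp_config(clamp_value: Optional[float], uses_weighted_swiglu: bool) -> bool:
    """Hypothetical guard: warn if a clamp is configured on a path that would ignore it."""
    if clamp_value is not None and not uses_weighted_swiglu:
        warnings.warn(
            "activation_func_clamp_value is set, but this SwiGLU path does not apply it",
            stacklevel=2,
        )
        return False
    return True
```

A check like this turns the silent no-op into a visible warning. Replacing `warnings.warn` with an `assert` would make the misconfiguration fail fast instead.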
fixed the docstring
/ok to test df71a39
/ok to test c51d461
LGTM
/ok to test 608a1b2
/ok to test dfadf2e
/ok to test d6a9445
🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/25151919314
### PR Category
[Train]

Most of the code is copied from the Megatron-LM dev branch, which differs from the main branch and release versions.

Megatron-LM PRs:
- DeepSeek-V4: NVIDIA#4458, NVIDIA#4481, NVIDIA#4518
- mHC: NVIDIA#2943

### PR Types
[New features]

### PR Description
Add the DeepSeek V4 model to FlagScale and Megatron-FL.

Supported:
1. CSA and HCA
2. Hash Router
3. mHC
4. Engram (optional)

Unsupported:
1. Sqrtsoftplus router score function. ✅
2. mHC recompute. ✅
3. Overlap_grad_reduce and overlap_param_gather when Zero 1. ✅
4. Any infra optimizations.

### NOTE:
This is only a draft PR; please review it and give suggestions, such as:
1. File structure.
   - All modules are moved into Megatron-FL

### Next plan:
1. Distributed training. ✅
3. Muon optimizer with Zero 1 adaptation. 🚧
4. Low precision is out of scope of this PR, limited by resources.
5. Maybe context parallel for sparse attention.
6. Further suggestions are welcome.

---------

Co-authored-by: Hongxiao Bai <hongxiaob@nvidia.com>
Co-authored-by: Yuzhong Wang <yuzhongw@nvidia.com>
### PR Category
[Train]

Most of the code is copied from the Megatron-LM dev branch, which differs from the main branch and release versions.

Megatron-LM PRs:
- DeepSeek-V4: NVIDIA/Megatron-LM#4458, NVIDIA/Megatron-LM#4481, NVIDIA/Megatron-LM#4518
- mHC: NVIDIA/Megatron-LM#2943

### PR Types
[New features]

### PR Description
Add the DeepSeek V4 model to FlagScale and Megatron-FL.

Supported:
1. CSA and HCA
2. Hash Router
3. mHC
4. Engram (optional)

Unsupported:
1. Sqrtsoftplus router score function. ✅
2. mHC recompute. ✅
3. Overlap_grad_reduce and overlap_param_gather when Zero 1. ✅
4. Any infra optimizations.

### NOTE:
This is only a draft PR; please review it and give suggestions, such as:
1. File structure.
   - **All modules are moved to Megatron-FL. Only model_builder is left in FlagScale.**
   - Delete Engram-related CI or not?

### Next plan:
1. Distributed training. ✅
3. Muon optimizer with Zero 1 adaptation. 😢
4. Low precision is out of scope of this PR, limited by resources.
5. Maybe context parallel for sparse attention.
6. Further suggestions are welcome.

---------

Co-authored-by: zhaoyingli <86812880+zhaoyinglia@users.noreply.github.com>
What does this PR do ?
We will create several PRs to functionally support DeepSeek-v4 training. This is the second one.
Add DeepSeek-v4 Hash MoE and SwiGLU clamp.
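The description does not spell out the routing rule, so the following is only a generic illustration of hash-based expert routing, not the code in this PR. It assumes that a hash router picks experts from the token id alone, with no learned scores; the function name `hash_route`, the multiplicative hash constant, and the "consecutive experts" rule for top-k are all hypothetical.

```python
from typing import List


def hash_route(token_ids: List[int], num_experts: int, top_k: int) -> List[List[int]]:
    """Assign each token to top_k distinct experts using only its token id.

    ASSUMPTION: a fixed multiplicative hash picks the first expert and the
    remaining top_k - 1 experts are the next indices modulo num_experts.
    A real implementation may use a precomputed lookup table instead.
    """
    if not 0 < top_k <= num_experts:
        raise ValueError("top_k must be in [1, num_experts]")
    routes = []
    for tid in token_ids:
        start = (tid * 2654435761) % num_experts  # Knuth-style multiplicative hash
        routes.append([(start + j) % num_experts for j in range(top_k)])
    return routes


# The same token id always maps to the same experts, regardless of context.
routes = hash_route([5, 5, 9], num_experts=8, top_k=2)
```

Because the assignment depends only on the token id, it is deterministic and needs no router weights, which is the property that distinguishes it from a learned top-k router.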
- `--moe-n-hash-layers`.
- `--activation-func-clamp-value`.
Issue tracking
For PRs from open-source community contributors:
Linked issue:
Contribution process
Pre-checks
Code review
Feel free to message or comment the @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!
All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.
Step 1: Mark PR as "Ready for Review"
`.github/CODEOWNERS`.
Final Review might get declined if these requirements are not fulfilled.
Step 2: Final Review
For PRs that change `megatron/core`, once all expert reviewers have approved, the `Final Review` label is applied automatically and final reviewers are assigned.
For PRs outside `megatron/core`, this step is skipped.
Step 3: Approved
Once all required reviewers have approved, the `Approved` label is applied automatically.
Merge
Any member of mcore-engineers will be able to merge your PR.
For MRs into `dev` branch
The proposed review process for the `dev` branch is under active discussion.
MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.