[dev] [DeepSeek-v4] Part 2: Hash MoE and SwiGLU clamp by hxbai · Pull Request #4481 · NVIDIA/Megatron-LM

hxbai · 2026-04-27T15:07:46Z

What does this PR do ?

We will create several PRs to functionally support DeepSeek-v4 training. This is the second one.

Add DeepSeek-v4 Hash MoE and SwiGLU clamp.

Add new argument --moe-n-hash-layers.
Add SwiGLU support to --activation-func-clamp-value.
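The description does not spell out how Hash MoE picks experts. As a rough illustration of the general idea behind hash routing (the expert is a fixed function of the token ID instead of the output of a learned gate), here is a minimal Python sketch. The name `hash_route`, the use of SHA-256, and the salt-based top-k selection are all assumptions made for illustration; none of them is taken from this PR or from Megatron-LM. It is also an assumption that --moe-n-hash-layers selects how many MoE layers use this kind of routing.

```python
import hashlib
from typing import List


def hash_route(token_id: int, num_experts: int, top_k: int = 1) -> List[int]:
    """Map a token ID to top_k distinct experts with a fixed hash (hypothetical helper).

    There is no learned gate here: the same token ID always lands on the
    same experts, which is the defining property of hash routing.
    """
    if not 0 < top_k <= num_experts:
        raise ValueError("top_k must be in [1, num_experts]")
    experts: List[int] = []
    salt = 0
    while len(experts) < top_k:
        # Re-hash with a new salt until top_k distinct experts are found.
        digest = hashlib.sha256(f"{token_id}:{salt}".encode()).digest()
        expert = int.from_bytes(digest[:8], "big") % num_experts
        if expert not in experts:
            experts.append(expert)
        salt += 1
    return experts


if __name__ == "__main__":
    # The assignment is deterministic across calls and processes.
    print(hash_route(token_id=1234, num_experts=8, top_k=2))
```

Because the mapping is static, a real implementation could precompute it once as a lookup table indexed by token ID; the sketch recomputes the hash on every call only to stay short.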

copy-pr-bot · 2026-04-27T15:07:50Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Victarry

Generally looks good to me. Co-reviewed with AI, please take a look~

Victarry · 2026-04-28T09:30:44Z


    activation_func_clamp_value: Optional[float] = None
    """Clamp the output of the linear_fc1 in the activation function. Only used when activation_func


[SUGGESTION] The docstring now claims activation_func_clamp_value works for quick_gelu or swiglu, but in this PR clamp is only wired through the weighted SwiGLU path (weighted_bias_swiglu_impl -> WeightedSwiGLUFunction). The dense path bias_swiglu_impl (used by non-MoE / non-token-weighted SwiGLU MLP) does not accept clamp_value and silently ignores it.

So a user setting activation_func_clamp_value on a model with dense SwiGLU MLP layers will see zero effect, with no warning.
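To make concrete what is lost when the clamp is silently ignored, here is a minimal pure-Python sketch of SwiGLU with an optional clamp on the linear_fc1 output. The function names, the flat [gate half | up half] layout, and the symmetric clamp to [-clamp_value, clamp_value] on both halves are assumptions for illustration; the fused kernels in this PR may clamp differently.

```python
import math
from typing import List, Optional


def silu(x: float) -> float:
    """Numerically stable SiLU: x * sigmoid(x)."""
    if x >= 0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)
    return x * e / (1.0 + e)


def swiglu(fc1_out: List[float], clamp_value: Optional[float] = None) -> List[float]:
    """SwiGLU over a flat fc1 output laid out as [gate half | up half].

    If clamp_value is given, every fc1 output is first clamped to
    [-clamp_value, clamp_value] (an assumed, symmetric clamp).
    """
    if len(fc1_out) % 2 != 0:
        raise ValueError("fc1 output must have an even number of elements")
    if clamp_value is not None:
        fc1_out = [max(-clamp_value, min(clamp_value, v)) for v in fc1_out]
    half = len(fc1_out) // 2
    gate, up = fc1_out[:half], fc1_out[half:]
    return [silu(g) * u for g, u in zip(gate, up)]


if __name__ == "__main__":
    outlier = [20.0, 30.0]  # one gate value, one up value
    print("with clamp:   ", swiglu(outlier, clamp_value=7.0))
    print("without clamp:", swiglu(outlier))  # what a path that ignores the clamp computes
```

For inputs already inside the clamp range the two calls agree, so the difference only shows up on outlier activations, which is exactly the case the option is meant to handle.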

Suggestion: either

extend bias_swiglu_impl / BiasSwiGLUFunction / SwiGLUFunction to also accept and respect clamp_value, or

narrow the docstring to "weighted SwiGLU (MoE) only" and add a runtime check that warns or asserts when activation_func_clamp_value > 0 is set on a dense-SwiGLU configuration.
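The second option could look roughly like the sketch below. `MLPConfigSketch`, its `activation_func_name` and `is_moe_layer` fields, and `check_clamp_supported` are hypothetical stand-ins invented for this example; they are not Megatron-LM's real config class or an existing validation hook.

```python
import warnings
from dataclasses import dataclass
from typing import Optional


@dataclass
class MLPConfigSketch:
    """Hypothetical stand-in for the relevant config fields (not TransformerConfig)."""

    gated_linear_unit: bool = True
    activation_func_name: str = "silu"  # hypothetical field; "silu" + GLU means SwiGLU
    is_moe_layer: bool = False  # hypothetical field; True means the weighted (MoE) path
    activation_func_clamp_value: Optional[float] = None


def check_clamp_supported(cfg: MLPConfigSketch) -> bool:
    """Return True if the clamp would take effect; warn and return False otherwise."""
    if cfg.activation_func_clamp_value is None:
        return True  # nothing requested, nothing to check
    dense_swiglu = (
        cfg.gated_linear_unit
        and cfg.activation_func_name == "silu"
        and not cfg.is_moe_layer
    )
    if dense_swiglu:
        warnings.warn(
            "activation_func_clamp_value is set, but the dense SwiGLU path "
            "does not apply it; the value will be ignored for this layer.",
            UserWarning,
        )
        return False
    return True


if __name__ == "__main__":
    check_clamp_supported(MLPConfigSketch(activation_func_clamp_value=7.0))  # emits a warning
```

Replacing the warning with an assertion would turn the silent no-op into a hard failure at configuration time, which is stricter but also breaks mixed dense/MoE models that only want the clamp on the MoE layers.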

hxbai · Apr 29, 2026

Fixed the docstring.

hxbai · 2026-04-29T10:00:07Z

/ok to test df71a39

hxbai · 2026-04-29T10:12:20Z

/ok to test c51d461

Victarry · 2026-04-29T10:15:58Z

LGTM

hxbai · 2026-04-29T12:07:55Z

/ok to test 608a1b2

hxbai · 2026-04-29T14:53:33Z

/ok to test dfadf2e

hxbai · 2026-04-30T03:59:06Z

/ok to test d6a9445

svcnvidia-nemo-ci · 2026-04-30T06:58:02Z

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/25151919314

### PR Category

[Train]

Most of the code is copied from the Megatron-LM dev branch, which differs from the main branch and from release versions.

Megatron-LM PRs:
- DeepSeek-V4: NVIDIA#4458 NVIDIA#4481 NVIDIA#4518
- mHC: NVIDIA#2943

### PR Types

[New features]

### PR Description

Add the DeepSeek V4 model to FlagScale and Megatron-FL.

Supported:
1. CSA and HCA
2. Hash Router
3. mHC
4. Engram (optional)

Unsupported:
1. Sqrtsoftplus router score function. ✅
2. mHC recompute. ✅
3. Overlap_grad_reduce and overlap_param_gather with Zero 1. ✅
4. Any infra optimizations.

### NOTE

This is only a draft PR; please review it and give suggestions, such as:
1. File structure.
   - All modules are moved into Megatron-FL

### Next plan

1. Distributed training. ✅
3. Muon optimizer with Zero 1 adaptation. 🚧
4. Low precision is out of scope for this PR, limited by resources.
5. Maybe context parallel for sparse attention.
6. Further suggestions are welcome.

---------

Co-authored-by: Hongxiao Bai <hongxiaob@nvidia.com>
Co-authored-by: Yuzhong Wang <yuzhongw@nvidia.com>

### PR Category

[Train]

Most of the code is copied from the Megatron-LM dev branch, which differs from the main branch and from release versions.

Megatron-LM PRs:
- DeepSeek-V4: NVIDIA/Megatron-LM#4458 NVIDIA/Megatron-LM#4481 NVIDIA/Megatron-LM#4518
- mHC: NVIDIA/Megatron-LM#2943

### PR Types

[New features]

### PR Description

Add the DeepSeek V4 model to FlagScale and Megatron-FL.

Supported:
1. CSA and HCA
2. Hash Router
3. mHC
4. Engram (optional)

Unsupported:
1. Sqrtsoftplus router score function. ✅
2. mHC recompute. ✅
3. Overlap_grad_reduce and overlap_param_gather with Zero 1. ✅
4. Any infra optimizations.

### NOTE

This is only a draft PR; please review it and give suggestions, such as:
1. File structure.
   - **All modules are moved to Megatron-FL. Only model_builder is left in FlagScale.**
   - Delete Engram-related CI or not?

### Next plan

1. Distributed training. ✅
3. Muon optimizer with Zero 1 adaptation. 😢
4. Low precision is out of scope for this PR, limited by resources.
5. Maybe context parallel for sparse attention.
6. Further suggestions are welcome.

---------

Co-authored-by: zhaoyingli <86812880+zhaoyinglia@users.noreply.github.com>

hxbai self-assigned this Apr 27, 2026

hxbai added the dev branch label Apr 27, 2026

Victarry reviewed Apr 28, 2026

hxbai mentioned this pull request Apr 29, 2026

DeepSeek-V4 training support #4468

Open

3 tasks

hxbai marked this pull request as ready for review April 29, 2026 08:56

hxbai requested review from a team as code owners April 29, 2026 08:56

svcnvidia-nemo-ci added the complexity: medium label Apr 29, 2026

hxbai changed the title ~~[dev] [DeepSeek-v4] Part 2: Hash MoE, SwiGLU clamp, and new mHC contract~~ [dev] [DeepSeek-v4] Part 2: Hash MoE and SwiGLU clamp Apr 29, 2026

copy-pr-bot Bot temporarily deployed to test April 29, 2026 10:13 Inactive

Victarry approved these changes Apr 29, 2026

copy-pr-bot Bot temporarily deployed to test April 29, 2026 12:09 Inactive

hxbai added 6 commits April 29, 2026 14:50

init

4755877

minor fix

48fc2c1

fix review comments

081c4f5

remove mHC contract modification

d5a4fef

fix link and copyright

b5e196c

fix mamba moe GOLDEN_CONFIG

6693611

hxbai force-pushed the dsv4_moe branch from 608a1b2 to 6693611 Compare April 29, 2026 14:51

fix hybrid moe GOLDEN_CONFIG

dfadf2e

copy-pr-bot Bot temporarily deployed to test April 29, 2026 14:54 Inactive

hxbai added 2 commits April 30, 2026 02:45

fix hash router with mtp

6d3285d

add tests

d6a9445

copy-pr-bot Bot temporarily deployed to test April 30, 2026 03:59 Inactive

hxbai added this pull request to the merge queue Apr 30, 2026

Merged via the queue into NVIDIA:dev with commit fe729e9 Apr 30, 2026
64 of 65 checks passed

hxbai deleted the dsv4_moe branch April 30, 2026 07:34

hxbai added a commit to hxbai/Megatron-LM that referenced this pull request Apr 30, 2026

fix missing import in PR NVIDIA#4481

80a2d91

This was referenced May 7, 2026

Deepseek v4 Support flagos-ai/FlagScale#1195

Merged

Deepseek v4 Support flagos-ai/Megatron-LM-FL#38

Merged

LiJunscs pushed a commit to LiJunscs/Megatron-LM-FL that referenced this pull request May 11, 2026

fix missing import in PR NVIDIA#4481

4120327

Victarry mentioned this pull request May 15, 2026

[ROADMAP][2026 Q2] Megatron Core MoE Roadmap #4815

Open

71 tasks

This was referenced May 19, 2026

dsv4 attn #4865

Closed

[main] [DeepSeek-v4] Hash MoE and SwiGLU clamp #4866

Draft

LiJunscs pushed a commit to LiJunscs/Megatron-LM-FL that referenced this pull request May 20, 2026

fix missing import in PR NVIDIA#4481

0c7bbde

