[dev] [DeepSeek-v4] Part 3: MTP support with mHC and new mHC contract by hxbai · Pull Request #4518 · NVIDIA/Megatron-LM

hxbai · 2026-04-29T09:49:47Z

What does this PR do?

We will create several PRs to functionally support DeepSeek-v4 training. This is the third one.

Add DeepSeek-v4 MTP support with mHC and new mHC contract.

copy-pr-bot · 2026-04-29T09:49:51Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

yaox12 · 2026-05-06T03:40:54Z

/claude strict-review

claude · 2026-05-06T03:46:02Z

+            if self.config.enable_hyper_connections:
+                hc_mult = self.config.num_residual_streams
+                hc_dim = self.config.hidden_size * hc_mult
+                self.hc_head_fn = nn.Parameter(torch.randn(hc_mult, hc_dim))
+                self.hc_head_base = nn.Parameter(torch.zeros(hc_mult))
+                self.hc_head_scale = nn.Parameter(torch.ones(1))
+                nn.init.xavier_uniform_(self.hc_head_fn)
+                if self.config.sequence_parallel:
+                    setattr(self.hc_head_fn, 'sequence_parallel', True)
+                    setattr(self.hc_head_base, 'sequence_parallel', True)
+                    setattr(self.hc_head_scale, 'sequence_parallel', True)


[IMPORTANT Correctness] hc_head_fn, hc_head_base, and hc_head_scale are bare nn.Parameter objects, not nn.Module children. TransformerBlock.sharded_state_dict() has custom logic that only iterates self.named_children() (layers + final_layernorm, etc.) — it does not call self._save_to_state_dict() like the base MegatronModule.sharded_state_dict() does. As a result, these three learned parameters will be silently dropped during distributed checkpoint save and won't be restored on resume, causing training divergence after checkpoint reload.

The same parameters on MultiTokenPredictionLayer are fine because its sharded_state_dict calls super().sharded_state_dict(), which does handle bare parameters.

Fix: Add explicit handling for these parameters in TransformerBlock.sharded_state_dict(). For example, after the named_children loop, call self._save_to_state_dict(...) for the standalone parameters, or wrap them in a small nn.Module so they're picked up by named_children().
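
The failure mode described above can be reproduced outside Megatron with plain PyTorch. The following is a minimal sketch, not the actual `TransformerBlock` code: `ToyBlock` is a hypothetical stand-in, and the comprehension only imitates a state-dict walk that visits `named_children()`. It shows that such a walk never sees bare `nn.Parameter` attributes, while `named_parameters(recurse=False)` does list them, which is what an explicit handling step could iterate over.

```python
import torch
from torch import nn


class ToyBlock(nn.Module):
    """Hypothetical stand-in for a block that owns bare nn.Parameter attributes."""

    def __init__(self, n: int = 4, hidden: int = 8):
        super().__init__()
        self.final_layernorm = nn.LayerNorm(hidden)  # a real child module
        self.hc_head_fn = nn.Parameter(torch.randn(n, n * hidden))  # bare parameter
        self.hc_head_base = nn.Parameter(torch.zeros(n))  # bare parameter
        self.hc_head_scale = nn.Parameter(torch.ones(1))  # bare parameter


block = ToyBlock()

# Imitates a walk that only descends into child modules.
via_children = sorted(
    f"{child_name}.{param_name}"
    for child_name, child in block.named_children()
    for param_name, _ in child.named_parameters()
)

# Parameters owned directly by the block itself (no recursion into children).
direct = sorted(name for name, _ in block.named_parameters(recurse=False))

print(via_children)  # only the layernorm parameters
print(direct)        # the three hc_head_* parameters
```

Any save path built only on the first walk would have to add the names from the second list explicitly.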

FDecaYed · 2026-05-06

@hxbai this one makes sense. I think that, without wrapping, these parameters are not handled correctly by distributed checkpointing. They are meant to be loaded on restart, right?

claude · 2026-05-06T03:46:20Z

+            if mhc_multistream is not None:
+                mhc_chunks.append(hidden_states)
+                hidden_states_list.append(self.layers[layer_idx]._postprocess(hidden_states))


[SUGGESTION Simplification] _postprocess is called here to contract each MTP layer's multi-stream output for the loss list. This works correctly, but it means learned_output_contract + final_layernorm are computed for each MTP depth even though the multi-stream tensor was already computed in the same iteration.

Consider caching the contracted result inside _proj_and_transformer_layer (compute it but don't apply it to the returned tensor) to avoid the redundant forward through _postprocess. Not a correctness issue, but calling _postprocess externally on a "raw" tensor that the layer itself chose not to postprocess creates a subtle coupling between the caller and the layer's internal mhc_enabled flag.

claude · 2026-05-06T03:46:27Z

+def learned_output_contract(
+    hidden_states: Tensor, head_fn: Tensor, base: Tensor, scale: Tensor, n: int, eps: float
+) -> Tensor:
+    """Learned output contraction: n-stream → 1-stream via sigmoid-gated weighted sum."""
+    dtype = hidden_states.dtype
+    hidden_states = hidden_states.to(torch.float32)
+    head_fn = head_fn.to(torch.float32)
+    base = base.to(torch.float32)
+    scale = scale.to(torch.float32)
+    rsqrt = torch.rsqrt(hidden_states.square().mean(-1, keepdim=True) + eps)
+    mixes = F.linear(hidden_states, head_fn) * rsqrt
+    pre = torch.sigmoid(mixes * scale + base) + 1e-6
+    y = torch.sum(pre.unsqueeze(-1) * hidden_states.view(*hidden_states.shape[:-1], n, -1), dim=-2)
+    return y.to(dtype)


[SUGGESTION Simplification] The 1e-6 offset added to pre (line 107) prevents the sigmoid gating weights from being exactly zero. However, this also means the minimum contribution per stream is always positive (1e-6), which accumulates to n * 1e-6 over all streams. This is negligible for typical hidden sizes, but it's worth noting that eps (the RMS epsilon) and this 1e-6 serve different numerical stability purposes — consider making the sigmoid floor configurable or documenting its purpose in the docstring.

Also, head_fn has shape [n, n*h] which means the F.linear projects the full concatenated multi-stream vector down to n mixing weights. This is an O(n² * h) operation per token. For large n, this could be meaningful. A comment noting this design choice (full-rank vs. per-stream factored projection) would help future readers.
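
To make the shapes and the projection cost concrete, here is a small sketch that transcribes the function from the diff quoted above and runs it on toy inputs. The sizes (`s`, `b`, `n`, `h`) and the zeroed `head_fn` are assumptions chosen only so the result can be checked by hand: with a zero projection every gate equals `sigmoid(0) + 1e-6`, so the output is that constant times the sum of the streams.

```python
import torch
import torch.nn.functional as F
from torch import Tensor


def learned_output_contract(
    hidden_states: Tensor, head_fn: Tensor, base: Tensor, scale: Tensor, n: int, eps: float
) -> Tensor:
    """n-stream -> 1-stream contraction, transcribed from the quoted diff."""
    dtype = hidden_states.dtype
    hidden_states = hidden_states.to(torch.float32)
    head_fn, base, scale = (t.to(torch.float32) for t in (head_fn, base, scale))
    rsqrt = torch.rsqrt(hidden_states.square().mean(-1, keepdim=True) + eps)
    mixes = F.linear(hidden_states, head_fn) * rsqrt  # [s, b, n]
    pre = torch.sigmoid(mixes * scale + base) + 1e-6  # [s, b, n], strictly positive
    streams = hidden_states.view(*hidden_states.shape[:-1], n, -1)  # [s, b, n, h]
    return torch.sum(pre.unsqueeze(-1) * streams, dim=-2).to(dtype)  # [s, b, h]


s, b, n, h = 3, 2, 4, 8  # toy sizes, chosen for illustration only
x = torch.randn(s, b, n * h)
head_fn = torch.zeros(n, n * h)  # zero projection: every gate becomes sigmoid(0)
base = torch.zeros(n)
scale = torch.ones(1)

y = learned_output_contract(x, head_fn, base, scale, n=n, eps=1e-6)
expected = (0.5 + 1e-6) * x.view(s, b, n, h).sum(dim=-2)

# F.linear maps n*h inputs to n outputs: n * (n * h) multiply-adds per token.
macs_per_token = head_fn.numel()
print(y.shape, macs_per_token)
```

The multiply-add count grows as n² · h, which is the full-rank cost the comment refers to; a per-stream factored projection would be a different design and is not what the diff implements.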

hxbai · 2026-05-08

Replaced the 1e-6 with eps.

claude · 2026-05-06T03:46:31Z

+            # DSv4 reference (DS-Inf) RoPE is pure rotation (norm-preserving). Yarn's
+            # concentration factor (mscale) is NOT part of the DSv4 model contract --
+            # the model relies on Q/KV RMS-norm + unit-magnitude rotation. Force 1.0.
+            mscale = 1.0


[SUGGESTION Simplification] This mscale = 1.0 override is duplicated in three places across two files (csa.py and deepseek_v4_hybrid_attention.py ×2). All three share the identical comment and logic pattern. Consider handling this at a single point — for example, in the rotary embedding module's constructor or a config-driven flag — rather than patching the returned value at every call site. This would also prevent future call sites from forgetting the override.

hxbai · 2026-05-08

This behavior is specific to DSv4, so there is no need to factor it out into a standalone location.

claude · 2026-05-06T03:46:34Z

+        if self.mhc_enabled:
+            hc_mult = self.config.num_residual_streams
+            hc_dim = self.config.hidden_size * hc_mult
+            self.hc_head_fn = nn.Parameter(torch.randn(hc_mult, hc_dim))
+            self.hc_head_base = nn.Parameter(torch.zeros(hc_mult))
+            self.hc_head_scale = nn.Parameter(torch.ones(1))
+            nn.init.xavier_uniform_(self.hc_head_fn)
+            if self.config.sequence_parallel:
+                setattr(self.hc_head_fn, 'sequence_parallel', True)
+                setattr(self.hc_head_base, 'sequence_parallel', True)
+                setattr(self.hc_head_scale, 'sequence_parallel', True)


[SUGGESTION Simplification] This hc_head_* parameter initialization block is duplicated verbatim in TransformerBlock.__init__ (lines 388–398). Consider extracting a small helper (e.g., _init_learned_contraction_params(config) returning a ParameterDict or small nn.Module) to avoid the duplication and ensure both sites stay in sync.
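
One way the suggested helper could look is sketched below. The class name `LearnedContractionParams`, its constructor arguments, and `ToyOwner` are hypothetical and do not come from the PR; the initialization values mirror the quoted diff. Holding the three parameters in a small `nn.Module` removes the duplicated block and also makes them visible to any code that walks `named_children()`.

```python
import torch
from torch import nn


class LearnedContractionParams(nn.Module):
    """Hypothetical container for the learned output-contraction parameters."""

    def __init__(self, num_streams: int, hidden_size: int, sequence_parallel: bool = False):
        super().__init__()
        hc_dim = hidden_size * num_streams
        self.fn = nn.Parameter(torch.empty(num_streams, hc_dim))
        self.base = nn.Parameter(torch.zeros(num_streams))
        self.scale = nn.Parameter(torch.ones(1))
        nn.init.xavier_uniform_(self.fn)
        if sequence_parallel:
            # Same attribute tagging as in the quoted diff, applied in one loop.
            for param in self.parameters():
                setattr(param, 'sequence_parallel', True)


class ToyOwner(nn.Module):
    """Hypothetical owner; both real call sites would construct the helper the same way."""

    def __init__(self):
        super().__init__()
        self.hc_head = LearnedContractionParams(4, 8, sequence_parallel=True)


owner = ToyOwner()
names = sorted(name for name, _ in owner.named_parameters())
tagged = all(getattr(p, 'sequence_parallel', False) for p in owner.parameters())
print(names, tagged)
```

Note that wrapping changes the parameter names (here `hc_head.fn` rather than `hc_head_fn`), so checkpoint keys would change accordingly.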

claude

Review Summary — PR #4518: [dev] [DeepSeek-v4] Part 3: MTP support with mHC and new mHC contract

Findings: CRITICAL: 0 | IMPORTANT: 3 | SUGGESTION: 4

What this PR does

mHC + MTP integration: Enables Multi-Token Prediction layers to work with manifold-constrained Hyper-Connections (mHC). MTP layers get separate e_proj/h_proj (instead of the concatenated eh_proj) to operate per-stream, and multi-stream state flows through the MTP loop without premature contraction.
New learned output contraction: Replaces the simple stream-averaging HyperConnectionModule.output_contract with learned_output_contract — a sigmoid-gated weighted sum with learnable hc_head_fn, hc_head_base, hc_head_scale parameters. This applies to all HC models, not only HC+MTP.
DSv4 mscale fix: Forces mscale = 1.0 in the non-fused RoPE path for CSA and DSv4 hybrid attention, since DSv4 uses norm-preserving rotation without Yarn's concentration factor.
Removes the HC+MTP incompatibility validation from TransformerConfig.__post_init__.

Most impactful findings

TransformerBlock checkpoint bug (IMPORTANT): The new hc_head_fn/base/scale parameters are bare nn.Parameter objects on TransformerBlock. The block's custom sharded_state_dict() only iterates named_children() and never calls _save_to_state_dict(), so these parameters will be silently dropped from distributed checkpoints. Training will resume with randomly-initialized contraction weights. The same parameters on MultiTokenPredictionLayer are fine (its sharded_state_dict calls super()).
Breaking change for existing HC models (IMPORTANT): The contraction method change (averaging → learned) applies to all enable_hyper_connections=True models. Old checkpoints won't have the new parameters. This should be documented as a breaking change, or gated behind a flag.
hybrid_model.py tuple return not handled (IMPORTANT): TransformerBlock.forward() now returns a tuple when HC+MTP are both active, but hybrid_model.py assigns the decoder output directly to hidden_states without unpacking.
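
For the last finding, a caller that may receive either a single tensor or a tuple can normalize the result before use. The helper below is a hypothetical sketch: the name `split_decoder_output` is invented for illustration, and it assumes the first tuple element is the hidden states, which the review text does not specify.

```python
from typing import Any, Tuple


def split_decoder_output(output: Any) -> Tuple[Any, tuple]:
    """Return (hidden_states, extras) for either a bare value or a tuple.

    Assumption for illustration: when a tuple is returned, its first element
    is the hidden states and any remaining elements are auxiliary outputs.
    """
    if isinstance(output, tuple):
        return output[0], output[1:]
    return output, ()


# Strings stand in for tensors so the sketch runs without any dependencies.
hidden_only = split_decoder_output("hidden")
hidden_and_extra = split_decoder_output(("hidden", "multi_stream"))
print(hidden_only, hidden_and_extra)
```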

Overall assessment

The algorithmic design is sound — the multi-stream dataflow through the MTP loop, the per-stream projection split, and the learned contraction function are all correctly implemented. The unit tests are thorough and cover constructor shapes, block return types, and E2E forward+backward. The checkpoint save/load bug is the highest-priority issue and should be fixed before merge. The backward compatibility concern is worth discussing given this targets dev.

hxbai · 2026-05-08T11:19:50Z

/ok to test 726ab33

hxbai · 2026-05-09T01:39:50Z

/ok to test c441614

svcnvidia-nemo-ci · 2026-05-09T03:05:07Z

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/25589952024

hxbai · 2026-05-14T07:49:26Z

/ok to test 60ff377

svcnvidia-nemo-ci · 2026-05-14T09:26:46Z

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/25852541556

svcnvidia-nemo-ci · 2026-05-14T10:10:22Z

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/25854373953

### PR Category

[Train]

Most of the code is copied from the Megatron-LM dev branch, which differs from the main branch and from release versions.

Megatron-LM PRs:
- DeepSeek-V4: NVIDIA#4458, NVIDIA#4481, NVIDIA#4518
- mHC: NVIDIA#2943

### PR Types

[New features]

### PR Description

Add the DeepSeek V4 model to FlagScale and Megatron-FL.

Supported:
1. CSA and HCA
2. Hash Router
3. mHC
4. Engram (optional)

Unsupported:
1. Sqrtsoftplus router score function. ✅
2. mHC recompute. ✅
3. Overlap_grad_reduce and overlap_param_gather when Zero 1. ✅
4. Any infra optimizations.

### NOTE

This is only a draft PR; please review and give suggestions, such as:
1. File structure.
   - All modules are moved into Megatron-FL

### Next plan

1. Distributed training. ✅
3. Muon optimizer with Zero 1 adaptation. 🚧
4. Low precision is out of scope of this PR, limited by resources.
5. Maybe context parallel for sparse attention.
6. Further suggestions are welcome.

---------

Co-authored-by: Hongxiao Bai <hongxiaob@nvidia.com>
Co-authored-by: Yuzhong Wang <yuzhongw@nvidia.com>

### PR Category

[Train]

Most of the code is copied from the Megatron-LM dev branch, which differs from the main branch and from release versions.

Megatron-LM PRs:
- DeepSeek-V4: NVIDIA/Megatron-LM#4458, NVIDIA/Megatron-LM#4481, NVIDIA/Megatron-LM#4518
- mHC: NVIDIA/Megatron-LM#2943

### PR Types

[New features]

### PR Description

Add the DeepSeek V4 model to FlagScale and Megatron-FL.

Supported:
1. CSA and HCA
2. Hash Router
3. mHC
4. Engram (optional)

Unsupported:
1. Sqrtsoftplus router score function. ✅
2. mHC recompute. ✅
3. Overlap_grad_reduce and overlap_param_gather when Zero 1. ✅
4. Any infra optimizations.

### NOTE

This is only a draft PR; please review and give suggestions, such as:
1. File structure.
   - **All modules are moved to Megatron-FL. Only model_builder is left in FlagScale.**
   - Should the Engram-related CI be deleted or not?

### Next plan

1. Distributed training. ✅
3. Muon optimizer with Zero 1 adaptation. 😢
4. Low precision is out of scope of this PR, limited by resources.
5. Maybe context parallel for sparse attention.
6. Further suggestions are welcome.

---------

Co-authored-by: zhaoyingli <86812880+zhaoyinglia@users.noreply.github.com>

hxbai self-assigned this Apr 29, 2026

hxbai added the dev branch Dev branch related issues and development label Apr 29, 2026

hxbai mentioned this pull request Apr 29, 2026

DeepSeek-V4 training support #4468

Open

3 tasks

hxbai marked this pull request as ready for review April 30, 2026 06:09

hxbai requested review from a team as code owners April 30, 2026 06:09

svcnvidia-nemo-ci added the complexity: medium label Apr 30, 2026

kainzhong mentioned this pull request May 1, 2026

[Common, PyTorch] Improve mHC to match DeepSeek's implementation NVIDIA/TransformerEngine#2953

Closed

13 tasks

hxbai force-pushed the dsv4_mtp branch from 1efad33 to d9d5eaa Compare May 6, 2026 03:10

claude Bot reviewed May 6, 2026

Comment thread megatron/core/transformer/transformer_block.py

claude Bot reviewed May 6, 2026

Comment thread megatron/core/transformer/transformer_block.py

claude Bot reviewed May 6, 2026

FDecaYed approved these changes May 6, 2026

This was referenced May 7, 2026

Deepseek v4 Support flagos-ai/FlagScale#1195

Merged

Deepseek v4 Support flagos-ai/Megatron-LM-FL#38

Merged

copy-pr-bot Bot temporarily deployed to test May 8, 2026 11:21 Inactive

hxbai force-pushed the dsv4_mtp branch from 9bf4db3 to c441614 Compare May 9, 2026 01:39

copy-pr-bot Bot temporarily deployed to test May 9, 2026 01:40 Inactive

hxbai added this pull request to the merge queue May 9, 2026

hxbai removed this pull request from the merge queue due to a manual request May 9, 2026

hxbai and others added 15 commits May 14, 2026 07:48

fix new contract dtype

0c99d96

format and add tests

5b27dc9

fix mscale

ecfb6f3

fix state_dict

e1fc689

fix eps in learned_output_contract

6d80f37

fix tests

c7adacb

add yarn arg original-max-position-embeddings

61f71c8

fix mtp spec; add tflops calc

e9f95e3

add functional test

8664bd9

fix test

4ace391

fix test config

ffb96e9

fix indexer loss logging; fix ckpt

8e4f78c

add swiglu clamp to shared expert

9d2fcd6

fix test

44e12b1

update golden values due to new mhc contract

60ff377

hxbai force-pushed the dsv4_mtp branch from be7dd1b to 60ff377 Compare May 14, 2026 07:49

hxbai enabled auto-merge May 14, 2026 07:49

copy-pr-bot Bot temporarily deployed to test May 14, 2026 07:50 Inactive

hxbai added this pull request to the merge queue May 14, 2026

Merged via the queue into NVIDIA:dev with commit 2e55168 May 14, 2026
66 checks passed

hxbai deleted the dsv4_mtp branch May 14, 2026 11:31

Victarry mentioned this pull request May 15, 2026

[ROADMAP][2026 Q2] Megatron Core MoE Roadmap #4815

Open

71 tasks

hxbai mentioned this pull request May 19, 2026

[main] [DeepSeek-v4] MTP support with mHC and new mHC contract #4869

Draft

5 tasks

cuichenx mentioned this pull request Jun 3, 2026

[training] fix: Update DeepSeek-V4 FLOPs calculation NVIDIA-NeMo/Megatron-Bridge#4128

Merged

5 participants