[recipe] feat: Add DeepSeek-V4-Flash pretraining recipes by weijiac0619 · Pull Request #3893 · NVIDIA-NeMo/Megatron-Bridge

weijiac0619 · 2026-05-19T22:11:14Z

What does this PR do?

Adds pretraining recipes for DeepSeek-V4-Flash on Blackwell, plus a Slurm
launcher example, unit tests, and a functional test. The base recipe targets
TP=1 / PP=4 / EP=8 with selective activation recompute, an MTP-aware pipeline
layout, and BF16. Two variants extend the base for the two supported
optimizer + precision combinations.

Changelog

src/megatron/bridge/recipes/deepseek/deepseek_v4.py (new): three pretrain
configs and a pipeline-layout helper.
- deepseek_v4_flash_pretrain_config() — BF16 base; TP=1, PP=4, EP=8,
  selective recompute (moe_act, mhc), MTP placed on the last PP stage
  via pipeline_model_parallel_layout.
- deepseek_v4_flash_pretrain_mxfp8_config() — Adam + MXFP8 training,
  BF16 MTP / validation eval, quant_recipe selects MXFP8 for TE linears.
- deepseek_v4_flash_pretrain_muon_config() — Muon optimizer + BF16,
  non-layer-wise dispatch.
- set_deepseek_v4_pipeline_model_parallel_layout() helper builds the
  even decoder layout with MTP and loss on the last PP rank.
src/megatron/bridge/recipes/deepseek/__init__.py: re-export the new
configs and helper.
examples/models/deepseek_v4/README.md: document the new recipes and
Slurm launcher.
examples/models/deepseek_v4/slurm_pretrain.sh: Slurm sbatch script for
the new recipes.
tests/unit_tests/recipes/test_deepseek_recipes.py: extend coverage to
DSv4 configs.
tests/functional_tests/test_groups/recipes/test_deepseek_recipes_pretrain.py
(new): L0/L1 pretrain smoke for DSv4 variants.
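
The "even decoder layout with MTP and loss on the last PP rank" described for the layout helper can be pictured with a small sketch. This is not the helper from this PR: the function name, its signature, the choice to hand remainder layers to the earliest stages, and the per-stage list-of-strings representation are all assumptions made for illustration.

```python
from typing import List, Optional


def sketch_even_layout(
    num_layers: Optional[int],
    mtp_num_layers: Optional[int],
    pp_size: int,
) -> Optional[List[List[str]]]:
    """Hypothetical stand-in for a pipeline-layout builder (not the PR's code)."""
    # Bail out when the model config does not carry the fields we need.
    if num_layers is None or pp_size < 1:
        return None

    # Spread decoder layers as evenly as possible across PP stages.
    base, extra = divmod(num_layers, pp_size)
    stages = []
    for rank in range(pp_size):
        # Assumption: the first `extra` stages take one additional layer.
        count = base + (1 if rank < extra else 0)
        stages.append(["decoder"] * count)

    # Embedding goes on the first stage; MTP and loss go on the last stage.
    stages[0].insert(0, "embedding")
    stages[-1].extend(["mtp"] * (mtp_num_layers or 0))
    stages[-1].append("loss")
    return stages


layout = sketch_even_layout(num_layers=10, mtp_num_layers=1, pp_size=4)
```

With the toy values above (10 layers, 1 MTP layer, 4 stages; not the real model dimensions), the first stage starts with the embedding and the last stage ends with the MTP and loss entries.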

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI.

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation?
Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- Reviewer: Does the PR have correct import guards for all optional libraries?

copy-pr-bot · 2026-05-19T22:11:18Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

claude · 2026-05-28T23:15:27Z

+from .deepseek_v4 import (
+    deepseek_v4_flash_pretrain_muon_config,
+    deepseek_v4_flash_pretrain_mxfp8_config,
+)


set_deepseek_v4_pipeline_model_parallel_layout is a public function (no underscore prefix) but is not re-exported here or added to __all__, unlike the V3 equivalent set_deepseek_v3_pipeline_model_parallel_layout. If it's intended for user customization of pipeline layouts, it should be exported. If it's internal-only, prefix it with _.

Suggested change

)

from .deepseek_v4 import (
    deepseek_v4_flash_pretrain_muon_config,
    deepseek_v4_flash_pretrain_mxfp8_config,
    set_deepseek_v4_pipeline_model_parallel_layout,
)

claude · 2026-05-28T23:15:30Z

+            clip_grad=1.0,
+        )
+        opt_cfg.optimizer = "muon"


Nit: distributed_muon_with_cosine_annealing sets optimizer="dist_muon", then line 229 immediately overwrites it to "muon". This controls layer_wise_distributed_optimizer in optim.py (line 105). The intent (non-distributed Muon with no_shard) is correct, but calling a function named distributed_muon_* only to undo the "distributed" part is confusing. Consider adding a brief comment here explaining why it's overridden, e.g. "DSv4 Muon uses non-layer-wise optimizer dispatch".
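
A minimal sketch of what the suggested explanatory comment might look like in context. The `OptCfgStub` dataclass and `make_muon_opt_cfg` function are hypothetical stand-ins so the snippet runs on its own; in the PR the object would come from distributed_muon_with_cosine_annealing, whose signature is not shown here.

```python
from dataclasses import dataclass


@dataclass
class OptCfgStub:
    """Hypothetical stand-in for the optimizer config returned by the helper."""

    optimizer: str = "dist_muon"  # value the distributed_muon_* helper sets
    clip_grad: float = 1.0


def make_muon_opt_cfg() -> OptCfgStub:
    # Stands in for the call to the distributed_muon_* helper.
    opt_cfg = OptCfgStub(clip_grad=1.0)
    # DSv4 Muon uses non-layer-wise optimizer dispatch, so override the
    # "dist_muon" default that the helper sets.
    opt_cfg.optimizer = "muon"
    return opt_cfg


opt_cfg = make_muon_opt_cfg()
```

The point of the comment is only to record why the helper's default is undone immediately after the call.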

claude · 2026-05-28T23:17:10Z

Light Code Review - No critical bugs found. Two inline comments posted (missing init export, confusing optimizer override). Main feedback is on test coverage gaps.

Test coverage gaps:
(1) set_deepseek_v4_pipeline_model_parallel_layout is untested. The V3 equivalent has dedicated tests, but the V4 function, which has real logic (divmod layer distribution, embedding/MTP/loss placement), has none. Currently FakeModelCfg lacks num_layers and mtp_num_layers, so the function silently bails out to None.
(2) Error paths are untested. The ValueError for Muon+MXFP8 and the one for an invalid optimizer type could each be covered by a one-line pytest.raises test.
(3) Mixed-precision assertions are missing for the Muon recipe. The MXFP8 test verifies fp8_recipe and fp8_param_gather, but the Muon test does not assert anything about mixed_precision.

Suggested test cases:

No perf tests impacted.

claude · 2026-05-28T23:19:29Z

Light Code Review - No critical bugs found. Two inline comments posted (missing init export, confusing optimizer override). Main feedback is on test coverage gaps.

Test coverage gaps:
(1) set_deepseek_v4_pipeline_model_parallel_layout is untested. The V3 equivalent has dedicated tests, but the V4 function, which has real logic (divmod layer distribution, embedding/MTP/loss placement), has none: _FakeModelCfg lacks num_layers and mtp_num_layers, so the function silently bails out to None.
(2) Error paths are untested. The ValueError for Muon+MXFP8 and the one for an invalid optimizer type could each be covered by a one-line pytest.raises test.
(3) Mixed-precision assertions are missing for the Muon recipe. The MXFP8 test verifies fp8_recipe and fp8_param_gather, but the Muon test does not assert anything about mixed_precision (it should be plain BF16, no FP8).

Suggested test cases:

No perf tests impacted.
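
The error-path and mixed-precision gaps listed in the review could be covered along these lines. This is a sketch only: `build_recipe_stub` is a hypothetical stand-in that mimics the described behavior (ValueError for Muon+MXFP8 and for an unknown optimizer), and it uses the standard-library unittest module so the snippet is self-contained; in the repository's pytest suite the same checks would typically be written with pytest.raises against the real recipe functions.

```python
import unittest

_VALID_OPTIMIZERS = {"adam", "muon"}  # assumed set, for illustration only


def build_recipe_stub(optimizer: str = "adam", precision: str = "bf16") -> dict:
    """Hypothetical stand-in for a recipe builder (not the PR's API)."""
    if optimizer not in _VALID_OPTIMIZERS:
        raise ValueError(f"Unsupported optimizer: {optimizer!r}")
    if optimizer == "muon" and precision == "mxfp8":
        raise ValueError("Muon + MXFP8 is not a supported combination")
    fp8_recipe = "mxfp8" if precision == "mxfp8" else None
    return {"optimizer": optimizer, "fp8_recipe": fp8_recipe}


class RecipeErrorPathTests(unittest.TestCase):
    def test_muon_with_mxfp8_raises(self):
        with self.assertRaises(ValueError):
            build_recipe_stub("muon", "mxfp8")

    def test_invalid_optimizer_raises(self):
        with self.assertRaises(ValueError):
            build_recipe_stub("sgd")

    def test_muon_recipe_is_plain_bf16(self):
        # The Muon variant should carry no FP8 recipe.
        self.assertIsNone(build_recipe_stub("muon")["fp8_recipe"])


suite = unittest.defaultTestLoader.loadTestsFromTestCase(RecipeErrorPathTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```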

Signed-off-by: weijiac <weijiac@nvidia.com>

cuichenx · 2026-05-28T23:42:55Z

+
+CASE_NAME="${CASE_NAME:-${RECIPE_NAME}}"
+JOB_ID="${SLURM_JOB_ID:-manual}"
+OUTDIR="${OUTDIR:-${WORKSPACE}/results/${MODEL_NAME}_${CASE_NAME}_${JOB_ID}}"


Can we simplify these? Keep only the parameters that users are likely to change, and leave the others at their default values.

cuichenx · 2026-05-28T23:44:26Z

+from megatron.bridge.training.mixed_precision import bf16_mixed, bf16_with_mxfp8_mixed
+
+
+DSV4_CSA_BACKEND = Literal["unfused", "cudnn_dsa", "tilelang_official", "flashmla_official"]


I think we can remove tilelang_official and flashmla_official now; they were added while debugging forward-pass parity.
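
For illustration, the alias after dropping the two debugging backends might look like the sketch below. The `check_csa_backend` helper is hypothetical and not part of the PR; it only shows how `typing.get_args` can keep a runtime check in sync with the narrowed Literal.

```python
from typing import Literal, get_args

# Narrowed alias: only the two backends that would remain after the
# debugging-only options are removed.
DSV4_CSA_BACKEND = Literal["unfused", "cudnn_dsa"]


def check_csa_backend(name: str) -> str:
    """Hypothetical runtime check derived from the Literal alias."""
    allowed = get_args(DSV4_CSA_BACKEND)
    if name not in allowed:
        raise ValueError(f"Unknown CSA backend {name!r}; expected one of {allowed}")
    return name


backend = check_csa_backend("cudnn_dsa")
```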

yaoyu-33 · 2026-05-29T02:51:09Z

+    return cfg
+
+
+def deepseek_v4_flash_pretrain_mxfp8_config(hf_path: str = "deepseek-ai/DeepSeek-V4-Flash") -> ConfigContainer:


This seems like far too many levels, created by an agent. Use a single level, or at most two levels, of config system, and flatten as much as you can.

Signed-off-by: weijiac <weijiac@nvidia.com>

cuichenx · 2026-05-30T00:08:58Z

/ok to test 8ecd047

Signed-off-by: Chen Cui <chcui@nvidia.com>

cuichenx · 2026-05-31T06:58:28Z

/ok to test 0bfb0eb

…#3893) Signed-off-by: weijiac <weijiac@nvidia.com> Signed-off-by: Chen Cui <chcui@nvidia.com> Co-authored-by: Chen Cui <chcui@nvidia.com> Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>

cuichenx added the high-priority label May 28, 2026

cuichenx mentioned this pull request May 28, 2026

[NeMo FW 26.06 Release] MBridge v0.5.0 Roadmap #3754

Open

weijiac0619 force-pushed the chcui/dsv4-train-pr3562-pr4518 branch from 1c4ed05 to 6f7273a Compare May 28, 2026 23:09

weijiac0619 marked this pull request as ready for review May 28, 2026 23:12

weijiac0619 requested a review from cuichenx May 28, 2026 23:15

claude Bot reviewed May 28, 2026


weijiac0619 force-pushed the chcui/dsv4-train-pr3562-pr4518 branch from 6f7273a to 502890d Compare May 28, 2026 23:24

dsv4 recipes

24ccffa

Signed-off-by: weijiac <weijiac@nvidia.com>

weijiac0619 force-pushed the chcui/dsv4-train-pr3562-pr4518 branch from 502890d to 24ccffa Compare May 28, 2026 23:25

cuichenx reviewed May 28, 2026


weijiac0619 force-pushed the chcui/dsv4-train-pr3562-pr4518 branch 2 times, most recently from 9c4d972 to f0efc48 Compare May 29, 2026 00:29

yaoyu-33 reviewed May 29, 2026


weijiac0619 force-pushed the chcui/dsv4-train-pr3562-pr4518 branch from f0efc48 to 24ccffa Compare May 29, 2026 02:51

yaoyu-33 added the area:recipe (Training recipes and launch configs), feature (New capabilities, enhancements, or enablement work), needs-more-tests (Requires additional L0 and L1 test coverage before merge), and waiting-on-customer (Waiting on the original author to respond) labels May 29, 2026

comments

92a1b92

Signed-off-by: weijiac <weijiac@nvidia.com>

Meirtz mentioned this pull request May 29, 2026

[recipe] feat: Add DeepSeek-V4 training smoke configs #3923

Closed

4 tasks

yaoyu-33 reviewed May 29, 2026


Comment thread examples/models/deepseek_v4/README.md

weijiac0619 added 2 commits May 29, 2026 15:34

refactor

e9b85e4

Signed-off-by: weijiac <weijiac@nvidia.com>

clean

d2deb86

Signed-off-by: weijiac <weijiac@nvidia.com>

cuichenx changed the title ~~Chcui/dsv4 train pr3562 pr4518~~ [recipe] feat: Add DeepSeek-V4-Flash pretraining recipes May 30, 2026

cuichenx previously approved these changes May 30, 2026


cuichenx removed the needs-more-tests (Requires additional L0 and L1 test coverage before merge) label May 30, 2026

Merge branch 'main' into chcui/dsv4-train-pr3562-pr4518

8ecd047

copy-pr-bot Bot temporarily deployed to public May 30, 2026 00:09 Inactive

copy-pr-bot Bot temporarily deployed to public May 30, 2026 00:20 Inactive

copy-pr-bot Bot temporarily deployed to public May 30, 2026 00:40 Inactive

fix: split aliased import to satisfy ruff isort

0bfb0eb

Signed-off-by: Chen Cui <chcui@nvidia.com>

cuichenx dismissed their stale review via 0bfb0eb May 31, 2026 06:57

copy-pr-bot Bot temporarily deployed to public May 31, 2026 06:58 Inactive

copy-pr-bot Bot temporarily deployed to test May 31, 2026 06:59 Inactive

copy-pr-bot Bot temporarily deployed to public May 31, 2026 07:06 Inactive

copy-pr-bot Bot temporarily deployed to public May 31, 2026 07:26 Inactive

cuichenx approved these changes May 31, 2026


cuichenx merged commit 0eb1932 into main May 31, 2026
173 of 175 checks passed

cuichenx deleted the chcui/dsv4-train-pr3562-pr4518 branch May 31, 2026 20:28

Meirtz mentioned this pull request Jun 1, 2026

[model] fix: default dsa_indexer_loss_coeff to a non-None value in DeepSeek-V4 bridge #4003

Closed

cuichenx merged 6 commits into main from chcui/dsv4-train-pr3562-pr4518


3 participants
