[recipe] feat: Add DeepSeek-V4-Flash pretraining recipes #3893
from .deepseek_v4 import (
    deepseek_v4_flash_pretrain_muon_config,
    deepseek_v4_flash_pretrain_mxfp8_config,
)
set_deepseek_v4_pipeline_model_parallel_layout is a public function (no underscore prefix) but is not re-exported here or added to __all__, unlike the V3 equivalent set_deepseek_v3_pipeline_model_parallel_layout. If it's intended for user customization of pipeline layouts, it should be exported. If it's internal-only, prefix it with _.
)
from .deepseek_v4 import (
    deepseek_v4_flash_pretrain_muon_config,
    deepseek_v4_flash_pretrain_mxfp8_config,
    set_deepseek_v4_pipeline_model_parallel_layout,
)
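The export question raised above can also be checked mechanically. The following is a minimal sketch, assuming a hypothetical helper `missing_exports` and a stand-in module object built with `types.ModuleType`; it does not import the real recipes package, and the lambdas only stand in for the actual functions.

```python
import types


def missing_exports(module: types.ModuleType) -> list[str]:
    """Return public callables defined on `module` that are absent from its __all__."""
    exported = set(getattr(module, "__all__", ()))
    return sorted(
        name
        for name, value in vars(module).items()
        if callable(value) and not name.startswith("_") and name not in exported
    )


# Stand-in for the recipes package (hypothetical; the real package is not imported).
fake_pkg = types.ModuleType("fake_deepseek_recipes")
fake_pkg.deepseek_v4_flash_pretrain_muon_config = lambda: None
fake_pkg.deepseek_v4_flash_pretrain_mxfp8_config = lambda: None
fake_pkg.set_deepseek_v4_pipeline_model_parallel_layout = lambda cfg: None
fake_pkg.__all__ = [
    "deepseek_v4_flash_pretrain_muon_config",
    "deepseek_v4_flash_pretrain_mxfp8_config",
]

# Lists the public helper that was left out of __all__.
print(missing_exports(fake_pkg))
```

In a real package this check would also flag imported classes and other callables, so it would likely need an allow-list before being used as a unit test.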
        clip_grad=1.0,
    )
    opt_cfg.optimizer = "muon"
Nit: distributed_muon_with_cosine_annealing sets optimizer="dist_muon", then line 229 immediately overwrites it to "muon". This controls layer_wise_distributed_optimizer in optim.py (line 105). The intent (non-distributed Muon with no_shard) is correct, but calling a function named distributed_muon_* only to undo the "distributed" part is confusing. Consider adding a brief comment here explaining why it's overridden, e.g. "DSv4 Muon uses non-layer-wise optimizer dispatch".
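A sketch of what the requested comment could look like in context. `OptimizerConfigStub` and `dist_muon_preset_stub` are hypothetical stand-ins for the recipe's optimizer config and for the preset that selects `"dist_muon"`; only the override line and the wording of the comment come from the review.

```python
from dataclasses import dataclass


@dataclass
class OptimizerConfigStub:
    """Hypothetical stand-in for the recipe's optimizer config object."""

    optimizer: str = "adam"
    clip_grad: float = 1.0


def dist_muon_preset_stub(clip_grad: float) -> OptimizerConfigStub:
    """Stand-in for a preset that selects the distributed Muon variant."""
    return OptimizerConfigStub(optimizer="dist_muon", clip_grad=clip_grad)


def build_muon_optimizer_cfg() -> OptimizerConfigStub:
    opt_cfg = dist_muon_preset_stub(clip_grad=1.0)
    # DSv4 Muon uses non-layer-wise optimizer dispatch, so switch the preset's
    # "dist_muon" to plain "muon" while keeping its other settings.
    opt_cfg.optimizer = "muon"
    return opt_cfg
```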
Light Code Review: no critical bugs found. Two inline comments posted (missing __init__ export, confusing optimizer override). Main feedback is on test coverage gaps.
Test coverage gaps:
(1) set_deepseek_v4_pipeline_model_parallel_layout is untested. The V3 equivalent has dedicated tests, but the V4 function, which contains real logic (divmod layer distribution, embedding/MTP/loss placement), has none: _FakeModelCfg lacks num_layers and mtp_num_layers, so the function silently bails out to None.
(2) Error paths are untested. The ValueError for Muon+MXFP8 and the one for an invalid optimizer type could each be covered by a one-line pytest.raises test.
(3) Mixed-precision assertions are missing for the Muon recipe. The MXFP8 test verifies fp8_recipe and fp8_param_gather, but the Muon test asserts nothing about mixed_precision (it should be plain BF16, no FP8).
Suggested test cases:
No perf tests impacted.
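To illustrate gap (1), here is a hedged sketch of a layout test. `even_layout_sketch` is a hypothetical reimplementation written only for this example (it returns the layout instead of setting it on the config, and the stage tokens `"embedding"`, `"decoder"`, `"mtp"`, `"loss"` are assumed); it is not the PR's helper. The toy sizes (10 layers, PP=4) are chosen for readability and are not the model's real dimensions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FakeModelCfg:
    """Fake config that, unlike the one in the review, carries the layer counts."""

    num_layers: Optional[int] = None
    mtp_num_layers: Optional[int] = None
    pipeline_model_parallel_size: int = 1


def even_layout_sketch(cfg: FakeModelCfg) -> Optional[List[List[str]]]:
    """Hypothetical even split: embedding first, MTP and loss on the last stage."""
    if not cfg.num_layers or cfg.pipeline_model_parallel_size <= 1:
        return None  # mirrors the "silently bails out" path described above
    pp = cfg.pipeline_model_parallel_size
    base, extra = divmod(cfg.num_layers, pp)
    # The first `extra` stages take one additional decoder layer each.
    stages = [["decoder"] * (base + (1 if rank < extra else 0)) for rank in range(pp)]
    stages[0] = ["embedding"] + stages[0]
    stages[-1] = stages[-1] + ["mtp"] * (cfg.mtp_num_layers or 0) + ["loss"]
    return stages


def test_layout_places_embedding_mtp_and_loss():
    cfg = FakeModelCfg(num_layers=10, mtp_num_layers=1, pipeline_model_parallel_size=4)
    layout = even_layout_sketch(cfg)
    assert [stage.count("decoder") for stage in layout] == [3, 3, 2, 2]
    assert layout[0][0] == "embedding"
    assert layout[-1][-2:] == ["mtp", "loss"]


def test_layout_bails_out_without_layer_counts():
    assert even_layout_sketch(FakeModelCfg(pipeline_model_parallel_size=4)) is None
```

The second test makes the bail-out path explicit, so a fake config that lacks the layer counts can no longer pass unnoticed.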
Signed-off-by: weijiac <weijiac@nvidia.com>
CASE_NAME="${CASE_NAME:-${RECIPE_NAME}}"
JOB_ID="${SLURM_JOB_ID:-manual}"
OUTDIR="${OUTDIR:-${WORKSPACE}/results/${MODEL_NAME}_${CASE_NAME}_${JOB_ID}}"
Can we simplify these? Keep only the parameters that users are likely to change, and leave the others at their default values.
from megatron.bridge.training.mixed_precision import bf16_mixed, bf16_with_mxfp8_mixed


DSV4_CSA_BACKEND = Literal["unfused", "cudnn_dsa", "tilelang_official", "flashmla_official"]
I think we can remove tilelang_official and flashmla_official now; they were added while debugging forward-pass parity.
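A sketch of what the narrowed alias might look like after that removal, with a small runtime check derived from it. The two remaining backend names come from the diff line above; `check_csa_backend` is a hypothetical helper added only for illustration.

```python
from typing import Literal, get_args

# Narrowed alias, assuming the two debugging-only backends are dropped.
DSV4_CSA_BACKEND = Literal["unfused", "cudnn_dsa"]


def check_csa_backend(name: str) -> str:
    """Hypothetical runtime guard: reject names outside the Literal alias."""
    allowed = get_args(DSV4_CSA_BACKEND)
    if name not in allowed:
        raise ValueError(f"Unsupported CSA backend {name!r}; expected one of {allowed}")
    return name
```

Deriving the allowed values from the alias with `get_args` keeps the type hint and the runtime check from drifting apart.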
    return cfg


def deepseek_v4_flash_pretrain_mxfp8_config(hf_path: str = "deepseek-ai/DeepSeek-V4-Flash") -> ConfigContainer:
This seems like far too many levels, created by an agent. Use a single level, or at most two levels, of config system; flatten as much as you can.
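One way to read the flattening request is a two-level shape: a single base builder that owns every default, plus thin variant functions that only pick an optimizer and a precision. The sketch below uses hypothetical stub names (`RecipeStub`, `flash_pretrain_stub`, and the two variants) and plain strings in place of the real config objects; the parallelism defaults and the two `ValueError` cases are taken from the PR description and the review, the rest is assumed.

```python
from dataclasses import dataclass


@dataclass
class RecipeStub:
    """Hypothetical flat stand-in for the recipe's config container."""

    optimizer: str = "adam"
    precision: str = "bf16"
    tensor_parallel: int = 1
    pipeline_parallel: int = 4
    expert_parallel: int = 8


def flash_pretrain_stub(optimizer: str = "adam", precision: str = "bf16") -> RecipeStub:
    """Level 1: the only place where defaults and validation live."""
    if optimizer not in ("adam", "muon"):
        raise ValueError(f"Unsupported optimizer type: {optimizer!r}")
    if optimizer == "muon" and precision == "mxfp8":
        raise ValueError("Muon + MXFP8 is not a supported combination")
    return RecipeStub(optimizer=optimizer, precision=precision)


def flash_pretrain_mxfp8_stub() -> RecipeStub:
    """Level 2: Adam + MXFP8 variant, no further indirection."""
    return flash_pretrain_stub(optimizer="adam", precision="mxfp8")


def flash_pretrain_muon_stub() -> RecipeStub:
    """Level 2: Muon + BF16 variant, no further indirection."""
    return flash_pretrain_stub(optimizer="muon", precision="bf16")
```

With this shape, each error path has a single call site, so a `pytest.raises(ValueError)` test per path is enough to cover it.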
/ok to test 8ecd047
Signed-off-by: Chen Cui <chcui@nvidia.com>
/ok to test 0bfb0eb
…#3893) Signed-off-by: weijiac <weijiac@nvidia.com> Signed-off-by: Chen Cui <chcui@nvidia.com> Co-authored-by: Chen Cui <chcui@nvidia.com> Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>
What does this PR do?
Adds pretraining recipes for DeepSeek-V4-Flash on Blackwell, plus a Slurm
launcher example, unit tests, and a functional test. The base recipe targets
TP=1 / PP=4 / EP=8 with selective activation recompute, an MTP-aware pipeline
layout, and BF16. Two variants extend the base for the two supported
optimizer + precision combinations.
Changelog
- src/megatron/bridge/recipes/deepseek/deepseek_v4.py (new): three pretrain configs and a pipeline-layout helper.
  - deepseek_v4_flash_pretrain_config(): BF16 base; TP=1, PP=4, EP=8, selective recompute (moe_act, mhc), MTP placed on the last PP stage via pipeline_model_parallel_layout.
  - deepseek_v4_flash_pretrain_mxfp8_config(): Adam + MXFP8 training, BF16 MTP / validation eval, quant_recipe selects MXFP8 for TE linears.
  - deepseek_v4_flash_pretrain_muon_config(): Muon optimizer + BF16, non-layer-wise dispatch.
  - set_deepseek_v4_pipeline_model_parallel_layout() helper builds the even decoder layout with MTP and loss on the last PP rank.
- src/megatron/bridge/recipes/deepseek/__init__.py: re-export the new configs and helper.
- examples/models/deepseek_v4/README.md: document the new recipes and Slurm launcher.
- examples/models/deepseek_v4/slurm_pretrain.sh: Slurm sbatch script for the new recipes.
- tests/unit_tests/recipes/test_deepseek_recipes.py: extend coverage to DSv4 configs.
- tests/functional_tests/test_groups/recipes/test_deepseek_recipes_pretrain.py (new): L0/L1 pretrain smoke for DSv4 variants.
GitHub Actions CI
See the CI section in the Contributing doc for how to trigger the CI.
Before your PR is "Ready for review"
Pre checks: