DeepSeek V4 Bridge by weijiac0619 · Pull Request #3562 · NVIDIA-NeMo/Megatron-Bridge

weijiac0619 · 2026-04-28T23:33:10Z

Summary

Adds DeepSeek V4 bridge support using native Hugging Face DeepSeek V4 configs.
Adds FP8/MXFP4 checkpoint import/export mappings, docs/examples, and focused coverage.
Pins the MCore dev commit with DeepSeek V4 Part 3 support and refreshes uv.lock.

Validation

Full HSG EP8 import and export roundtrip completed.
pre-commit run --all-files passed.
uv lock --check and locked uv sync dry-runs passed.

Signed-off-by: weijiac <weijiac@NVIDIA.com>

copy-pr-bot · 2026-04-28T23:33:14Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Signed-off-by: weijiac <weijiac@NVIDIA.com>

- Rename linear_kv_up_proj → linear_kv_proj (MCore PR #4458)
- Restore HC head mappings (hc_head_fn/base/scale) for learned_output_contract (MCore PR #4518)
- Remove task-is-None guards in model_bridge.py (root cause fixed via allow_hf_name_mismatch)
- Add allow_hf_name_mismatch to _HCAlphaSecondaryMapping
- Handle transformers 5.x nested rope_scaling format
- Handle compress_ratios/compress_rates naming + length trim
- Explicit errors for missing config fields instead of silent fallbacks
- AutoConfig.register: re-raise non-"already registered" errors

- Remove old test scripts (dsv4_generate.py, test_dsv4_bridge_smoke.py, test_dsv4_full_import.py)
- Add new validation scripts:
  - dsv4_fresh_import_test.py: import + save ckpt + round-trip (all weights) + cosine sim
  - dsv4_fresh_generate.py: import + greedy generation with answer verification
  - dsv4_tiny_ref_vs_mg.py: tiny model layer-by-layer comparison vs official inference/model.py
- Update copyright year to 2026

float8_e8m0fnu.to(float32) already returns 2^(e-127). The old code applied an extra 2^(x-127), producing near-zero scales that zeroed all expert weights.

Also fix tiny model test init and seq_len.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
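The double-exponent bug described in this commit message can be illustrated without torch. The sketch below is a pure-Python stand-in, not the bridge's code: `e8m0_to_float` and `double_applied_scale` are hypothetical names, and the decoding rule (scale = 2^(byte - 127), with 0xFF reserved for NaN) is assumed from the E8M0 format as the commit describes it.

```python
import math


def e8m0_to_float(byte: int) -> float:
    """Decode one E8M0 exponent byte to its scale, 2 ** (byte - 127)."""
    if not 0 <= byte <= 255:
        raise ValueError("E8M0 codes are single bytes")
    if byte == 255:
        return math.nan  # assumption: 0xFF is the NaN encoding
    return math.ldexp(1.0, byte - 127)


def double_applied_scale(byte: int) -> float:
    """Reproduce the described bug: exponentiate an already-decoded scale."""
    decoded = e8m0_to_float(byte)   # already 2 ** (byte - 127)
    return 2.0 ** (decoded - 127)   # extra 2 ** (x - 127) gives roughly 2 ** -126


print(e8m0_to_float(127))          # correct scale for the bias value
print(double_applied_scale(127))   # near-zero scale produced by the old logic
```

With the extra exponentiation every scale collapses to about 2^-126, which is consistent with the commit's report that all expert weights were zeroed.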

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: weijiac <weijiac@nvidia.com>

5 MTP params (hc_head_fn/base/scale, e_proj, h_proj) are unmapped after MCore PR #4518 changed MTP from concatenated eh_proj to separate e_proj/h_proj. This guard prevents crash during fresh import. Revert when MTP mappings are fully implemented.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Active test scripts for DSv4 validation:
- dsv4_full_cosine_report.py: per-layer + sub-layer cosine sim report
- dsv4_cosine_analysis.py: capture hidden states from both official and MCore
- dsv4_last_hidden_cmp.py: post-contraction + logit comparison
- dsv4_fresh_import_save.py: fresh HF import with checkpoint save
- dsv4-bridge-handoff.md: handoff doc with results and instructions

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- dsv4_fresh_import_test.py: superseded by dsv4_fresh_import_save.py
- dsv4_tiny_ref_vs_mg.py: broken on TE GroupedLinear split, used mocked kernels

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Map the three previously-unmapped per-MTP-layer Hyper-Connection head parameters (hc_head_fn / hc_head_base / hc_head_scale) using ReplicatedMapping, mirroring the existing decoder.hc_head_* triplet.

Replace the concatenated _MTPEHProjMapping (legacy eh_proj) with two plain AutoMappings for e_proj and h_proj. After MCore #4518, the MTP layer with enable_hyper_connections=True instantiates these as separate ColumnParallelLinear modules; AutoMapping auto-detects column parallelism and shards along dim 0. The dead _MTPEHProjMapping class is removed.

Also clarify the bridge module docstring: parity is verified only for DeepSeek-V4-Flash; Flash-Base, Pro, and Pro-Base share the architecture and quant dispatch but logit parity is unmeasured. Document the two quant variants explicitly (Flash uses FP8 attn + MXFP4 experts with F8_E8M0 scales; Flash-Base/Pro/Pro-Base use uniform FP8 with F32 scales).

Verified: 4-rank EP=4 import on DeepSeek-V4-Flash-Base completes cleanly with all five previously-unmapped MTP params now resolving.

Signed-off-by: chcui <chcui@nvidia.com>
Signed-off-by: root <root@nvl72160-T13.cm.cluster>

The two `if task is None or task.megatron_module is None:` guards in build_conversion_tasks consumers were a temporary stopgap absorbing the five unmapped DSv4 MTP parameters (hc_head_fn/base/scale, e_proj, h_proj). With those mappings now in place (preceding commit), no None tasks reach the load/export loops, so revert to the standard `if task.megatron_module is None:` check. Missing mappings now fail loudly again with AttributeError, restoring the safety property the guard had bypassed.

Verified: 4-rank EP=4 import on DSv4-Flash-Base completes cleanly with no None-task interceptions.

Signed-off-by: chcui <chcui@nvidia.com>
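The difference between the two guards can be shown with a toy loop. This is only a sketch under stated assumptions: `lenient_modules` and `strict_modules` are hypothetical helpers, and the tasks are `SimpleNamespace` stand-ins rather than the bridge's real conversion task objects.

```python
from types import SimpleNamespace


def lenient_modules(tasks):
    """Stopgap behaviour: a None task (missing mapping) is skipped silently."""
    modules = []
    for task in tasks:
        if task is None or task.megatron_module is None:
            continue
        modules.append(task.megatron_module)
    return modules


def strict_modules(tasks):
    """Standard behaviour: a None task raises AttributeError on attribute access."""
    modules = []
    for task in tasks:
        if task.megatron_module is None:
            continue
        modules.append(task.megatron_module)
    return modules


tasks = [SimpleNamespace(megatron_module="decoder.weight"), None]
print(lenient_modules(tasks))  # the unmapped entry disappears without a trace
try:
    strict_modules(tasks)
except AttributeError as err:
    print("missing mapping surfaced:", err)
```

The strict form has no extra error handling; it relies on the attribute access itself failing, which is what makes a missing mapping visible.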

Add unit tests that assert the DSv4 bridge's mapping_registry contains:
- decoder.hc_head_{fn,base,scale} as ReplicatedMappings (existing)
- mtp.layers.N.hc_head_{fn,base,scale} as ReplicatedMappings (new)
- mtp.layers.N.{e,h}_proj.weight as AutoMappings (new)
- no reference to the deprecated concatenated eh_proj path anywhere

Also asserts that with num_nextn_predict_layers=0 the MTP-side mappings are absent (regression guard for environments without MTP).

mapping_registry only reads num_nextn_predict_layers from hf_config, so mocking with SimpleNamespace is sufficient; no fixtures or GPU needed.

Signed-off-by: chcui <chcui@nvidia.com>
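The mocking approach described above might look roughly like the following. This is not the PR's test code: `expected_mtp_param_names` is a hypothetical helper that only enumerates the parameter names listed in the commit message, standing in for the real mapping_registry, which is not reproduced here.

```python
from types import SimpleNamespace


def expected_mtp_param_names(hf_config):
    """List the per-MTP-layer names the commit says must be mapped (hypothetical helper)."""
    names = []
    for n in range(hf_config.num_nextn_predict_layers):
        prefix = f"mtp.layers.{n}"
        names += [f"{prefix}.hc_head_{suffix}" for suffix in ("fn", "base", "scale")]
        names += [f"{prefix}.{proj}_proj.weight" for proj in ("e", "h")]
    return names


# A SimpleNamespace is enough because only one config attribute is read.
with_mtp = SimpleNamespace(num_nextn_predict_layers=1)
without_mtp = SimpleNamespace(num_nextn_predict_layers=0)

print(expected_mtp_param_names(with_mtp))
print(expected_mtp_param_names(without_mtp))
```

Because nothing beyond a plain attribute lookup is involved, a test in this style runs on CPU with no model weights or fixtures.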

- docs/models/llm/deepseek-v4.md: variants table with parity status, architecture features (HC, CSA/DSA, hybrid attention, hash MoE, MTP), conversion / inference / parallelism guidance, MCore prerequisite list
- docs/models/llm/index.md: register the new page in the toctree
- examples/models/deepseek_v4/{README,conversion.sh,inference.sh}: real-model HF<->Megatron round-trip and generation, parameterized via WORKSPACE / MODEL_VARIANT / EP, defaulting to DeepSeek-V4-Flash-Base on 4xB200
- src/megatron/bridge/recipes/deepseek/deepseek_v4.py: deepseek_v4_pretrain_config with TP=1 (DSv4 constraint), EP=8 default, MTP=1, precision-aware optimizer with bf16 moments, alltoall dispatcher; parameterized for Flash / Pro variants
- src/megatron/bridge/recipes/deepseek/__init__.py: re-export the new recipe

Slurm launch scripts intentionally deferred until the recipe has a finalized training-config name and parallelism layout for Pro / Pro-Base.

Signed-off-by: chcui <chcui@nvidia.com>

DeepSeek-V4 layers add two mapping_proj matmuls per layer (one before attention, one before MLP) of shape hidden -> hidden * num_residual_streams, plus negligible alpha scalars and sinkhorn iterations. Without modeling this overhead, training-throughput TFLOPs/GPU readouts under-report work and overstate hardware efficiency.

Add a conditional term in transformer_flops() gated on enable_hyper_connections, contributing 3 * 2 * num_layers * 2 * H^2 * num_residual_streams * batch * seq_length FLOPs.

The CSA per-layer attention reduction (sparse top-k context instead of dense O(s^2)) is intentionally not modeled here; that would be a smaller correction in the opposite direction, and leaving it out keeps the throughput estimate on the safe (over-estimating) side.

Tests:
- test_hc_flops_increase_when_enabled: HC-on > HC-off
- test_hc_exact_overhead: matches the closed-form formula
- test_hc_scales_with_residual_streams: doubling streams doubles delta

Also add a placeholder functional-test scaffold for the toy DSv4 HF<->Megatron roundtrip. It is auto-skipped today because (a) transformers does not yet ship DeepseekV4ForCausalLM, and (b) the bridge's pinned MCore lacks the DSv4 prerequisites. Both skipif conditions auto-clear when prereqs land.

Signed-off-by: chcui <chcui@nvidia.com>
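The Hyper-Connection term from this commit message can be written out as a standalone function. The sketch below is not the code in transformer_flops(); `hc_flops_overhead` is a hypothetical name, and the reading of the leading factors (3 for forward plus backward, 2 FLOPs per multiply-accumulate, 2 mapping_proj matmuls per layer) is an assumption, since the message gives only the product.

```python
def hc_flops_overhead(num_layers, hidden_size, num_residual_streams,
                      batch_size, seq_length):
    """Closed-form HC overhead: 3 * 2 * L * 2 * H^2 * streams * batch * seq."""
    fwd_bwd_factor = 3       # assumption: forward pass plus two backward matmuls
    flops_per_mac = 2        # assumption: one multiply and one add
    projs_per_layer = 2      # from the message: one before attention, one before MLP
    return (fwd_bwd_factor * flops_per_mac * num_layers * projs_per_layer
            * hidden_size ** 2 * num_residual_streams * batch_size * seq_length)


# Illustrative sizes only; these are not DeepSeek-V4's real dimensions.
base = hc_flops_overhead(num_layers=4, hidden_size=1024, num_residual_streams=2,
                         batch_size=8, seq_length=4096)
doubled = hc_flops_overhead(num_layers=4, hidden_size=1024, num_residual_streams=4,
                            batch_size=8, seq_length=4096)
print(base, doubled, doubled == 2 * base)
```

The term is linear in num_residual_streams, which is the property that test_hc_scales_with_residual_streams checks.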

- src/megatron/bridge/recipes/deepseek/deepseek_v4.py:
  - deepseek_v4_sft_config: full SFT, defaults to deepseek-ai/DeepSeek-V4-Flash (the post-trained variant), TP=1 EP=8 for a 2-node 4xB200 layout, MTP disabled for fine-tuning, fp32 master weights, max_lr=5e-6
  - deepseek_v4_peft_config: LoRA/DoRA via default_peft_config, TP=1 EP=1 in the recipe (override via slurm to EP>=4 for Flash; the frozen base model still has to fit across ranks even though only adapters train), max_lr=1e-4
  - shared _apply_dsv4_finetune_common helper enforces TP=1, MoE settings, bf16, and disables MTP for fine-tune codepaths
- src/megatron/bridge/recipes/deepseek/__init__.py: export the new recipes
- examples/models/deepseek_v4/{slurm_sft.sh, slurm_peft.sh}: Slurm launchers modelled on examples/models/gpt_oss/, parameterized via env vars (WORKSPACE / MODEL_VARIANT / PRETRAINED_CHECKPOINT / PARALLELISM_CONFIGS) and tuned for 4-GPU GB200/B200 nodes. Both refuse TP>1 with an explicit error since DSv4 only supports TP=1.
- examples/models/deepseek_v4/README.md: updated file table to reflect the finalized recipe names and default layouts

Verified that both recipes instantiate cleanly against the local Flash HF config: SFT yields TP=1 EP=8 lr=5e-6, PEFT yields TP=1 EP=1 lr=1e-4 with LoRA scheme; MTP is None and PEFT-specific assertions hold.

Signed-off-by: chcui <chcui@nvidia.com>

cuichenx · 2026-05-19T16:58:08Z

Dev CI passed with only one known failure, in DeepSeek V3 training. The current PR does not affect training.
DSv4 tests were not skipped.
https://github.com/NVIDIA-NeMo/Megatron-Bridge/actions/runs/26068735913/job/76660085078

Signed-off-by: Chen Cui <chcui@nvidia.com>

cuichenx · 2026-05-19T17:26:21Z

/claude review

claude · 2026-05-19T17:34:41Z

Review -- DeepSeek V4 Bridge

Overall this is solid work: the bridge, FP8/MXFP4 dequant, HC alpha mapping, MTP mappings, and tests are well-structured. A few items to address:

Bugs

Unclosed file handles (3 occurrences): yaml.safe_load(open(rc)) in convert_checkpoints_multi_gpu.py (import and export paths) and hf_to_megatron_generate_text.py leaks file descriptors. Use with open(rc) as f: instead. See inline comments with suggested fixes.
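A minimal sketch of the suggested fix follows. `load_run_config` is a hypothetical helper, not a function in the PR; it takes the parser as an argument so the example runs without PyYAML. In the scripts named above, the parser would presumably be `yaml.safe_load`, which accepts an open file object.

```python
import json
import os
import tempfile


def load_run_config(path, parse):
    """Parse a config file and close the handle deterministically."""
    with open(path) as f:   # closed on exit, even if parse() raises
        return parse(f)


# Demo with JSON so the sketch has no third-party dependency.
fd, demo_path = tempfile.mkstemp(suffix=".json")
with os.fdopen(fd, "w") as out:
    out.write('{"pipeline_model_parallel_size": 2}')

config = load_run_config(demo_path, json.load)
os.remove(demo_path)
print(config)
```

A single helper of this kind would also be one way to address the duplicated read logic raised under code quality, since both scripts could call it.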

Code quality

PipelineParallelLayerLayout checked by class name string in config_utils.py is fragile. Prefer isinstance() with an import, or document why a string check is necessary (e.g., circular import avoidance).
Duplicated pipeline-layout-reading logic: The YAML-based run_config.yaml to pipeline_model_parallel_layout read block is copy-pasted across convert_checkpoints_multi_gpu.py (export path) and hf_to_megatron_generate_text.py. Consider extracting a shared helper to reduce drift risk. (Not blocking.)
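The fragility of the class-name check can be demonstrated with stand-in classes. In this sketch the `PipelineParallelLayerLayout` defined locally is a placeholder for the real class, and both checker functions are hypothetical; the actual check in config_utils.py is not reproduced.

```python
class PipelineParallelLayerLayout:
    """Local placeholder for the real layout class (assumption for illustration)."""


class CustomLayout(PipelineParallelLayerLayout):
    """A subclass that should still be treated as a layout."""


# An unrelated class that merely shares the name.
Impostor = type("PipelineParallelLayerLayout", (), {})


def is_layout_by_name(obj):
    """String check: misses subclasses and accepts name collisions."""
    return type(obj).__name__ == "PipelineParallelLayerLayout"


def is_layout(obj):
    """isinstance check: follows the class hierarchy."""
    return isinstance(obj, PipelineParallelLayerLayout)


for candidate in (PipelineParallelLayerLayout(), CustomLayout(), Impostor()):
    print(type(candidate).__name__, is_layout_by_name(candidate), is_layout(candidate))
```

If the string check exists to avoid a circular import, a function-local import followed by `isinstance` is one possible alternative, though whether that fits config_utils.py is not established here.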

Test coverage

Unit tests cover mapping registry structure, HC head, MTP e/h_proj split, and config translation well.
Functional test covers EP=2 roundtrip.
Missing coverage: No test exercises PP > 1 (the generate_pipeline_layout path), the _dequant_mxfp4 code path, or the FP8 128-tile dequant. These are the highest-risk new code paths. Consider adding at least a unit test for _dequant_mxfp4 with a small synthetic tensor.
The functional test uses bare python instead of uv run python per CLAUDE.md conventions, though this may be intentional for the CI container.
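A synthetic-tensor test for the MXFP4 path could be checked against a small reference decoder like the one below. This is an assumption-laden sketch, not the bridge's `_dequant_mxfp4`: the function names are hypothetical, the low-nibble-first packing order is assumed, and the format details (E2M1 elements, one E8M0 scale per block of 32) follow the general MXFP4 layout rather than anything stated in this PR.

```python
import math

# E2M1 magnitudes indexed by the low three bits; bit 3 is the sign.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def decode_e2m1(nibble):
    """Decode one 4-bit E2M1 code to a float."""
    magnitude = E2M1_MAGNITUDES[nibble & 0x7]
    return -magnitude if nibble & 0x8 else magnitude


def dequant_mxfp4_reference(packed, scales, block=32):
    """Dequantize packed FP4 bytes using one E8M0 scale byte per block of elements."""
    values = []
    for byte in packed:
        values.append(decode_e2m1(byte & 0xF))   # assumption: low nibble comes first
        values.append(decode_e2m1(byte >> 4))
    if len(values) != len(scales) * block:
        raise ValueError("expected one scale per block of elements")
    return [v * math.ldexp(1.0, scales[i // block] - 127)
            for i, v in enumerate(values)]


# One block of 32 elements: codes 0x1 (0.5) and 0x2 (1.0) alternate, scale 2**1.
out = dequant_mxfp4_reference(bytes([0x21] * 16), bytes([128]))
print(out[:4])
```

A unit test could feed the same bytes to the real function and compare element-wise, after confirming the packing order and block size the bridge actually uses.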

Suggested test cases

No perf tests impacted.

Signed-off-by: weijiac <weijiac@NVIDIA.com>

weijiac0619 · 2026-05-19T19:08:36Z

/ok to test 6762bd7

cuichenx · 2026-05-19T21:01:41Z

/ok to test a9b3edd

cuichenx · 2026-05-19T23:17:03Z

/ok to test fe9131e

cuichenx · 2026-05-20T05:27:35Z

Main CI passed again, apart from coverage. Merging.

dsv4 import

803adff

Signed-off-by: weijiac <weijiac@NVIDIA.com>

weijiac0619 requested a review from cuichenx April 28, 2026 23:33

weijiac0619 marked this pull request as draft April 28, 2026 23:33

weijiac0619 and others added 2 commits April 28, 2026 16:33

Merge branch 'main' into weijia_dsv4

1abfaa6

test forward

7b48341

Signed-off-by: weijiac <weijiac@NVIDIA.com>

cuichenx reviewed Apr 30, 2026

Comment thread src/megatron/bridge/models/conversion/model_bridge.py Outdated

Comment thread src/megatron/bridge/models/glm_moe_dsa/glm5_bridge.py Outdated

weijiac0619 added 4 commits May 1, 2026 12:10

Guard GlmMoeDsa import for containers with transformers <5.0

3b86c12

Tiny model test: restore all layer types, fix zero embedding init

faedf46

cuichenx reviewed May 1, 2026

Comment thread src/megatron/bridge/models/deepseek/deepseek_v4_bridge.py Outdated

Comment thread src/megatron/bridge/models/deepseek/deepseek_v4_bridge.py

weijiac0619 and others added 14 commits May 1, 2026 18:35

Enable fused RoPE for DSv4 (non-fused path broken for inverse RoPE)

d84f58f

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

clean

92adb56

fix mscale

3fe2933

Signed-off-by: weijiac <weijiac@nvidia.com>

wip

34a7bd3

Signed-off-by: weijiac <weijiac@nvidia.com>

Update handoff doc: all changes committed and pushed

65a187d

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Remove superseded/broken test scripts

43a2b5f

- dsv4_fresh_import_test.py: superseded by dsv4_fresh_import_save.py
- dsv4_tiny_ref_vs_mg.py: broken on TE GroupedLinear split, used mocked kernels

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

cuichenx mentioned this pull request May 8, 2026

[NeMo FW 26.06 Release] MBridge v0.5.0 Roadmap #3754

Open

cuichenx added 3 commits May 8, 2026 16:56

[dsv4] defer training recipes and launchers

7d7eec2

[docs] simplify DeepSeek V4 examples and parity status

1395c71

copy-pr-bot Bot temporarily deployed to public May 19, 2026 01:23 Inactive

copy-pr-bot Bot temporarily deployed to public May 19, 2026 01:37 Inactive

chore: revert temporary CI changes

efb4a83

Signed-off-by: Chen Cui <chcui@nvidia.com>

claude Bot reviewed May 19, 2026

Comment thread examples/conversion/convert_checkpoints_multi_gpu.py

claude Bot reviewed May 19, 2026

Comment thread examples/conversion/convert_checkpoints_multi_gpu.py

claude Bot reviewed May 19, 2026

Comment thread examples/conversion/hf_to_megatron_generate_text.py

claude Bot reviewed May 19, 2026

Comment thread src/megatron/bridge/training/utils/config_utils.py Outdated

yaoyu-33 reviewed May 19, 2026

Comment thread src/megatron/bridge/models/deepseek/deepseek_v4_bridge.py

yaoyu-33 reviewed May 19, 2026

Comment thread examples/models/deepseek_v4/README.md

weijiac0619 added 2 commits May 19, 2026 11:40

resolve claude comments

100d17c

Signed-off-by: weijiac <weijiac@NVIDIA.com>

yy comment

6762bd7

Signed-off-by: weijiac <weijiac@NVIDIA.com>

copy-pr-bot Bot temporarily deployed to public May 19, 2026 19:09 Inactive

copy-pr-bot Bot temporarily deployed to test May 19, 2026 19:09 Inactive

copy-pr-bot Bot temporarily deployed to public May 19, 2026 19:17 Inactive

cuichenx previously approved these changes May 19, 2026

copy-pr-bot Bot temporarily deployed to public May 19, 2026 19:33 Inactive

Merge branch 'main' into weijia_dsv4

a9b3edd

Merge branch 'main' into weijia_dsv4

fe9131e

dinhxuanvu mentioned this pull request Jun 3, 2026

_megatron_global_param_names_all_pp_ranks only gathers across PP, silently drops EP/ETP-only weights from saved HF index #4083

Closed

yaoyu-33 mentioned this pull request Jun 3, 2026

fix: gather conversion param names across WORLD with superset assertion #4130

Closed

Meirtz mentioned this pull request Jun 4, 2026

[recipe,docs,test] feat: add DeepSeek-V4-Flash SFT recipes, launcher and tests #4131

Merged
