
DeepSeek V4 Bridge #3562

Merged
cuichenx merged 54 commits into main from weijia_dsv4
May 20, 2026

@weijiac0619 weijiac0619 commented Apr 28, 2026

Contributor

Summary

  • Adds DeepSeek V4 bridge support using native Hugging Face DeepSeek V4 configs.
  • Adds FP8/MXFP4 checkpoint import/export mappings, docs/examples, and focused coverage.
  • Pins the MCore dev commit with DeepSeek V4 Part 3 support and refreshes uv.lock.

Validation

  • Full HSG EP8 import and export roundtrip completed.
  • pre-commit run --all-files passed.
  • uv lock --check and locked uv sync dry-runs passed.

Signed-off-by: weijiac <weijiac@NVIDIA.com>
@copy-pr-bot

copy-pr-bot Bot commented Apr 28, 2026


This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@weijiac0619 weijiac0619 requested a review from cuichenx April 28, 2026 23:33
@weijiac0619 weijiac0619 marked this pull request as draft April 28, 2026 23:33
weijiac0619 and others added 2 commits April 28, 2026 16:33
Signed-off-by: weijiac <weijiac@NVIDIA.com>
Comment thread src/megatron/bridge/models/conversion/model_bridge.py Outdated
Comment thread src/megatron/bridge/models/glm_moe_dsa/glm5_bridge.py Outdated
- Rename linear_kv_up_proj → linear_kv_proj (MCore PR #4458)
- Restore HC head mappings (hc_head_fn/base/scale) for learned_output_contract (MCore PR #4518)
- Remove task-is-None guards in model_bridge.py (root cause fixed via allow_hf_name_mismatch)
- Add allow_hf_name_mismatch to _HCAlphaSecondaryMapping
- Handle transformers 5.x nested rope_scaling format
- Handle compress_ratios/compress_rates naming + length trim
- Explicit errors for missing config fields instead of silent fallbacks
- AutoConfig.register: re-raise non-"already registered" errors
- Remove old test scripts (dsv4_generate.py, test_dsv4_bridge_smoke.py, test_dsv4_full_import.py)
- Add new validation scripts:
  - dsv4_fresh_import_test.py: import + save ckpt + round-trip (all weights) + cosine sim
  - dsv4_fresh_generate.py: import + greedy generation with answer verification
  - dsv4_tiny_ref_vs_mg.py: tiny model layer-by-layer comparison vs official inference/model.py
- Update copyright year to 2026
Comment thread src/megatron/bridge/models/deepseek/deepseek_v4_bridge.py Outdated
Comment thread src/megatron/bridge/models/deepseek/deepseek_v4_bridge.py
weijiac0619 and others added 14 commits May 1, 2026 18:35
float8_e8m0fnu.to(float32) already returns 2^(e-127). The old code
applied an extra 2^(x-127), producing near-zero scales that zeroed
all expert weights. Also fix tiny model test init and seq_len.
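
The scale bug described in this commit message can be illustrated without any tensor library. What follows is a minimal sketch, assuming the usual E8M0 encoding (one byte holding a biased exponent with bias 127, no sign or mantissa bits, 0xFF reserved for NaN). The names `decode_e8m0` and `buggy_scale` are hypothetical and are not taken from the bridge code.

```python
import math

E8M0_BIAS = 127  # assumed bias of the E8M0 scale format


def decode_e8m0(raw: int) -> float:
    """Decode one raw E8M0 byte into the value it represents, 2 ** (raw - 127)."""
    if not 0 <= raw <= 255:
        raise ValueError("an E8M0 scale is a single byte")
    if raw == 255:  # reserved NaN encoding
        return math.nan
    return math.ldexp(1.0, raw - E8M0_BIAS)


def buggy_scale(raw: int) -> float:
    """Reproduce the double application: treat the decoded value as an exponent again."""
    already_decoded = decode_e8m0(raw)  # stands in for what .to(float32) returns
    return 2.0 ** (already_decoded - E8M0_BIAS)  # the extra 2^(x-127) step


if __name__ == "__main__":
    for raw in (120, 127, 128):
        print(raw, decode_e8m0(raw), buggy_scale(raw))
```

For a raw byte of 127 the correct scale is 1.0, while the doubly decoded value is 2^-126, which is small enough to wipe out any weight it multiplies.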

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: weijiac <weijiac@nvidia.com>
Signed-off-by: weijiac <weijiac@nvidia.com>
5 MTP params (hc_head_fn/base/scale, e_proj, h_proj) are unmapped after
MCore PR #4518 changed MTP from concatenated eh_proj to separate e_proj/h_proj.
This guard prevents crash during fresh import. Revert when MTP mappings are
fully implemented.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Active test scripts for DSv4 validation:
- dsv4_full_cosine_report.py: per-layer + sub-layer cosine sim report
- dsv4_cosine_analysis.py: capture hidden states from both official and MCore
- dsv4_last_hidden_cmp.py: post-contraction + logit comparison
- dsv4_fresh_import_save.py: fresh HF import with checkpoint save
- dsv4-bridge-handoff.md: handoff doc with results and instructions

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- dsv4_fresh_import_test.py: superseded by dsv4_fresh_import_save.py
- dsv4_tiny_ref_vs_mg.py: broken on TE GroupedLinear split, used mocked kernels

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Map the three previously-unmapped per-MTP-layer Hyper-Connection head
parameters (hc_head_fn / hc_head_base / hc_head_scale) using
ReplicatedMapping, mirroring the existing decoder.hc_head_* triplet.

Replace the concatenated _MTPEHProjMapping (legacy eh_proj) with two plain
AutoMappings for e_proj and h_proj. After MCore #4518, the MTP layer with
enable_hyper_connections=True instantiates these as separate
ColumnParallelLinear modules; AutoMapping auto-detects column parallelism
and shards along dim 0. The dead _MTPEHProjMapping class is removed.

Also clarify the bridge module docstring: parity is verified only for
DeepSeek-V4-Flash; Flash-Base, Pro, and Pro-Base share the architecture
and quant dispatch but logit parity is unmeasured. Document the two
quant variants explicitly (Flash uses FP8 attn + MXFP4 experts with
F8_E8M0 scales; Flash-Base/Pro/Pro-Base use uniform FP8 with F32 scales).

Verified: 4-rank EP=4 import on DeepSeek-V4-Flash-Base completes cleanly
with all five previously-unmapped MTP params now resolving.

Signed-off-by: chcui <chcui@nvidia.com>
Signed-off-by: root <root@nvl72160-T13.cm.cluster>
The two `if task is None or task.megatron_module is None:` guards in
build_conversion_tasks consumers were a temporary stopgap absorbing the
five unmapped DSv4 MTP parameters (hc_head_fn/base/scale, e_proj, h_proj).

With those mappings now in place (preceding commit), no None tasks reach
the load/export loops, so revert to the standard
`if task.megatron_module is None:` check. Missing mappings now fail
loudly again with AttributeError, restoring the safety property the
guard had bypassed.
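
The difference between the two guards can be sketched as follows. The `Task` dataclass and both loop functions are hypothetical stand-ins written for illustration; only the attribute name `megatron_module` and the two `if` conditions come from the commit message.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Task:
    """Hypothetical stand-in for a conversion task."""
    param_name: str
    megatron_module: Optional[object] = None


def names_lenient(tasks: List[Optional[Task]]) -> List[str]:
    """Stopgap guard: a None task (missing mapping) is skipped silently."""
    out = []
    for task in tasks:
        if task is None or task.megatron_module is None:
            continue
        out.append(task.param_name)
    return out


def names_strict(tasks: List[Optional[Task]]) -> List[str]:
    """Standard guard: a None task raises AttributeError on attribute access."""
    out = []
    for task in tasks:
        if task.megatron_module is None:
            continue
        out.append(task.param_name)
    return out


if __name__ == "__main__":
    tasks = [Task("a", object()), None]
    print(names_lenient(tasks))  # the missing mapping goes unnoticed
    try:
        names_strict(tasks)
    except AttributeError as err:
        print("missing mapping surfaced:", err)
```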

Verified: 4-rank EP=4 import on DSv4-Flash-Base completes cleanly with
no None-task interceptions.

Signed-off-by: chcui <chcui@nvidia.com>
Add unit tests that assert the DSv4 bridge's mapping_registry contains:
  - decoder.hc_head_{fn,base,scale} as ReplicatedMappings (existing)
  - mtp.layers.N.hc_head_{fn,base,scale} as ReplicatedMappings (new)
  - mtp.layers.N.{e,h}_proj.weight as AutoMappings (new)
  - no reference to the deprecated concatenated eh_proj path anywhere

Also asserts that with num_nextn_predict_layers=0 the MTP-side mappings
are absent (regression guard for environments without MTP).

mapping_registry only reads num_nextn_predict_layers from hf_config, so
mocking with SimpleNamespace is sufficient — no fixtures or GPU needed.
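
The mocking approach can be sketched as below. `mtp_param_names` is a hypothetical helper that only mimics how a registry might derive MTP-side names from `num_nextn_predict_layers`; it is not the bridge's `mapping_registry`, and the real tests assert on mapping objects rather than plain strings.

```python
from types import SimpleNamespace

HC_HEAD_SUFFIXES = ("fn", "base", "scale")


def mtp_param_names(hf_config):
    """Hypothetical helper: list MTP-side parameter names for a config object."""
    names = []
    for layer in range(hf_config.num_nextn_predict_layers):
        prefix = f"mtp.layers.{layer}"
        names.extend(f"{prefix}.hc_head_{s}" for s in HC_HEAD_SUFFIXES)
        names.extend(f"{prefix}.{p}_proj.weight" for p in ("e", "h"))
    return names


def test_mtp_names_present_when_enabled():
    # SimpleNamespace supplies the single attribute the helper reads.
    cfg = SimpleNamespace(num_nextn_predict_layers=1)
    names = mtp_param_names(cfg)
    assert "mtp.layers.0.hc_head_fn" in names
    assert "mtp.layers.0.e_proj.weight" in names
    assert not any("eh_proj" in n for n in names)


def test_mtp_names_absent_when_disabled():
    cfg = SimpleNamespace(num_nextn_predict_layers=0)
    assert mtp_param_names(cfg) == []
```

Because the config is a plain namespace, tests of this shape need neither a Hugging Face checkpoint nor a GPU.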

Signed-off-by: chcui <chcui@nvidia.com>
- docs/models/llm/deepseek-v4.md: variants table with parity status,
  architecture features (HC, CSA/DSA, hybrid attention, hash MoE, MTP),
  conversion / inference / parallelism guidance, MCore prerequisite list
- docs/models/llm/index.md: register the new page in the toctree
- examples/models/deepseek_v4/{README,conversion.sh,inference.sh}: real-model
  HF<->Megatron round-trip and generation, parameterized via WORKSPACE /
  MODEL_VARIANT / EP, defaulting to DeepSeek-V4-Flash-Base on 4xB200
- src/megatron/bridge/recipes/deepseek/deepseek_v4.py: deepseek_v4_pretrain_config
  with TP=1 (DSv4 constraint), EP=8 default, MTP=1, precision-aware optimizer
  with bf16 moments, alltoall dispatcher; parameterized for Flash / Pro variants
- src/megatron/bridge/recipes/deepseek/__init__.py: re-export the new recipe

Slurm launch scripts intentionally deferred until the recipe has a finalized
training-config name and parallelism layout for Pro / Pro-Base.

Signed-off-by: chcui <chcui@nvidia.com>
DeepSeek-V4 layers add two mapping_proj matmuls per layer (one before
attention, one before MLP) of shape hidden -> hidden * num_residual_streams,
plus negligible alpha scalars and sinkhorn iterations. Without modeling
this overhead, training-throughput TFLOPs/GPU readouts under-report
work and understate hardware efficiency.

Add a conditional term in transformer_flops() gated on
enable_hyper_connections, contributing 3 * 2 * num_layers * 2 * H^2 *
num_residual_streams * batch * seq_length FLOPs. The CSA per-layer
attention reduction (sparse top-k context instead of dense O(s^2)) is
intentionally not modeled here — that would be a smaller correction in
the opposite direction, and leaving it out keeps the throughput estimate
on the safe (over-estimating) side.
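
The hyper-connection term can be written as a small standalone function. This is a sketch, not the code in transformer_flops(): the function name `hc_flops` is hypothetical, and the reading of the constant factors (3 for forward plus backward, 2 FLOPs per multiply-accumulate, 2 mapping_proj matmuls per layer) is an interpretation of the commit message rather than something it states.

```python
def hc_flops(num_layers: int, hidden_size: int, num_residual_streams: int,
             batch_size: int, seq_length: int) -> int:
    """FLOPs added by the hyper-connection mapping_proj matmuls (hypothetical helper)."""
    fwd_bwd = 3          # assumed: forward pass plus two backward matmuls
    flops_per_mac = 2    # assumed: one multiply and one add
    projs_per_layer = 2  # one before attention, one before MLP
    return (fwd_bwd * flops_per_mac * num_layers * projs_per_layer
            * hidden_size ** 2 * num_residual_streams * batch_size * seq_length)


if __name__ == "__main__":
    base = hc_flops(num_layers=2, hidden_size=4, num_residual_streams=2,
                    batch_size=1, seq_length=8)
    doubled = hc_flops(num_layers=2, hidden_size=4, num_residual_streams=4,
                       batch_size=1, seq_length=8)
    print(base, doubled)
```

The term is linear in num_residual_streams, which is the property the residual-streams scaling test relies on.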

Tests:
- test_hc_flops_increase_when_enabled: HC-on > HC-off
- test_hc_exact_overhead: matches the closed-form formula
- test_hc_scales_with_residual_streams: doubling streams doubles delta

Also add a placeholder functional-test scaffold for the toy DSv4 HF<->Megatron
roundtrip. It is auto-skipped today because (a) transformers does not yet
ship DeepseekV4ForCausalLM, and (b) the bridge's pinned MCore lacks the
DSv4 prerequisites. Both skipif conditions auto-clear when prereqs land.

Signed-off-by: chcui <chcui@nvidia.com>
cuichenx added 3 commits May 8, 2026 16:56
- src/megatron/bridge/recipes/deepseek/deepseek_v4.py:
  - deepseek_v4_sft_config: full SFT, defaults to deepseek-ai/DeepSeek-V4-Flash
    (the post-trained variant), TP=1 EP=8 for a 2-node 4xB200 layout, MTP
    disabled for fine-tuning, fp32 master weights, max_lr=5e-6
  - deepseek_v4_peft_config: LoRA/DoRA via default_peft_config, TP=1 EP=1
    in the recipe (override via slurm to EP>=4 for Flash; the frozen base
    model still has to fit across ranks even though only adapters train),
    max_lr=1e-4
  - shared _apply_dsv4_finetune_common helper enforces TP=1, MoE settings,
    bf16, and disables MTP for fine-tune codepaths
- src/megatron/bridge/recipes/deepseek/__init__.py: export the new recipes
- examples/models/deepseek_v4/{slurm_sft.sh, slurm_peft.sh}: Slurm launchers
  modelled on examples/models/gpt_oss/, parameterized via env vars
  (WORKSPACE / MODEL_VARIANT / PRETRAINED_CHECKPOINT / PARALLELISM_CONFIGS)
  and tuned for 4-GPU GB200/B200 nodes. Both refuse TP>1 with an explicit
  error since DSv4 only supports TP=1.
- examples/models/deepseek_v4/README.md: updated file table to reflect the
  finalized recipe names and default layouts

Verified that both recipes instantiate cleanly against the local Flash HF
config: SFT yields TP=1 EP=8 lr=5e-6, PEFT yields TP=1 EP=1 lr=1e-4 with
LoRA scheme; MTP is None and PEFT-specific assertions hold.

Signed-off-by: chcui <chcui@nvidia.com>
@cuichenx

cuichenx commented May 19, 2026

Contributor

Dev CI passed with only one known failure, in DeepSeek V3 training. The current PR does not affect training.
DSV4 tests were not skipped.
https://github.com/NVIDIA-NeMo/Megatron-Bridge/actions/runs/26068735913/job/76660085078

Signed-off-by: Chen Cui <chcui@nvidia.com>
@cuichenx

Contributor

/claude review

Comment thread examples/conversion/convert_checkpoints_multi_gpu.py
Comment thread examples/conversion/convert_checkpoints_multi_gpu.py
Comment thread examples/conversion/hf_to_megatron_generate_text.py
Comment thread src/megatron/bridge/training/utils/config_utils.py Outdated
@claude

claude Bot commented May 19, 2026

Contributor

Review -- DeepSeek V4 Bridge

Overall this is solid work: the bridge, FP8/MXFP4 dequant, HC alpha mapping, MTP mappings, and tests are well-structured. A few items to address:

Bugs

  • Unclosed file handles (3 occurrences): yaml.safe_load(open(rc)) in convert_checkpoints_multi_gpu.py (import and export paths) and hf_to_megatron_generate_text.py leaks file descriptors. Use with open(rc) as f: instead. See inline comments with suggested fixes.
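
A minimal sketch of the suggested fix, written as a small loader. The helper name `load_config` and its `parse` parameter are hypothetical; the default parser is `json.load` only to keep the sketch free of third-party imports.

```python
import json
from typing import Any, Callable, IO


def load_config(path: str, parse: Callable[[IO[str]], Any] = json.load) -> Any:
    """Parse a config file and close the handle even if parsing raises."""
    with open(path, "r", encoding="utf-8") as f:
        return parse(f)
```

Passing `yaml.safe_load` as `parse` would cover the run_config.yaml case, assuming PyYAML is installed. The same helper could be reused at all three call sites.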

Code quality

  • PipelineParallelLayerLayout checked by class name string in config_utils.py is fragile. Prefer isinstance() with an import, or document why a string check is necessary (e.g., circular import avoidance).

  • Duplicated pipeline-layout-reading logic: The YAML-based run_config.yaml to pipeline_model_parallel_layout read block is copy-pasted across convert_checkpoints_multi_gpu.py (export path) and hf_to_megatron_generate_text.py. Consider extracting a shared helper to reduce drift risk. (Not blocking.)
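
The fragility of a class-name check can be shown with stand-in classes. In this sketch `PipelineParallelLayerLayout` is a local placeholder rather than the real class, and `is_layout_by_name` is only one plausible form of a string-based check, not a copy of the code in config_utils.py.

```python
class PipelineParallelLayerLayout:
    """Local placeholder; only the name matches the real class."""


class CustomLayout(PipelineParallelLayerLayout):
    """A subclass, as a user or a later refactor might introduce."""


def is_layout_by_name(obj: object) -> bool:
    """String check: breaks for subclasses and for renamed classes."""
    return type(obj).__name__ == "PipelineParallelLayerLayout"


def is_layout_by_type(obj: object) -> bool:
    """isinstance check: follows the class hierarchy."""
    return isinstance(obj, PipelineParallelLayerLayout)


if __name__ == "__main__":
    layout = CustomLayout()
    print(is_layout_by_name(layout), is_layout_by_type(layout))
```

The string check also accepts any unrelated class that happens to share the name, which `isinstance` does not.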

Test coverage

  • Unit tests cover mapping registry structure, HC head, MTP e/h_proj split, and config translation well.
  • Functional test covers EP=2 roundtrip.
  • Missing coverage: No test exercises PP > 1 (the generate_pipeline_layout path), the _dequant_mxfp4 code path, or the FP8 128-tile dequant. These are the highest-risk new code paths. Consider adding at least a unit test for _dequant_mxfp4 with a small synthetic tensor.
  • The functional test uses bare python instead of uv run python per CLAUDE.md conventions, though this may be intentional for the CI container.
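
A synthetic-tensor test of the kind suggested above could start from a pure-Python reference such as the sketch below. It assumes the OCP MX layout (FP4 E2M1 codes, 32 elements per shared E8M0 scale) and assumes two codes per byte with the low nibble first; the packing order in the real checkpoints may differ. `dequant_mxfp4_reference` is a hypothetical reference, not the bridge's `_dequant_mxfp4`.

```python
import math
from typing import List

# E2M1 magnitudes indexed by the low three bits; bit 3 is the sign.
E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK = 32  # elements sharing one scale (assumed MX block size)


def decode_fp4(code: int) -> float:
    """Decode one 4-bit E2M1 code."""
    mag = E2M1_VALUES[code & 0x7]
    return -mag if code & 0x8 else mag


def dequant_mxfp4_reference(packed: bytes, scales: bytes) -> List[float]:
    """Dequantize packed FP4 codes with one E8M0 scale byte per 32 elements."""
    if len(packed) * 2 != len(scales) * BLOCK:
        raise ValueError("expected one scale byte per 32 packed elements")
    out = []
    for i, byte in enumerate(packed):
        scale = math.ldexp(1.0, scales[(2 * i) // BLOCK] - 127)
        out.append(decode_fp4(byte & 0xF) * scale)  # low nibble first (assumed)
        out.append(decode_fp4(byte >> 4) * scale)   # then high nibble
    return out


def test_single_block():
    packed = bytes([0x21] * 16)  # codes 1 (0.5) and 2 (1.0), repeated
    scales = bytes([128])        # 2 ** (128 - 127) == 2.0
    assert dequant_mxfp4_reference(packed, scales) == [1.0, 2.0] * 16
```

A real unit test would compare the bridge function's output tensor against a reference like this on the same bytes.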

Suggested test cases

No perf tests impacted.


Comment thread src/megatron/bridge/models/deepseek/deepseek_v4_bridge.py
Comment thread examples/models/deepseek_v4/README.md
Signed-off-by: weijiac <weijiac@NVIDIA.com>
Signed-off-by: weijiac <weijiac@NVIDIA.com>
@weijiac0619

Contributor Author

/ok to test 6762bd7

cuichenx
cuichenx previously approved these changes May 19, 2026
@cuichenx

Contributor

/ok to test a9b3edd

@cuichenx

Contributor

/ok to test fe9131e

@cuichenx

Contributor

Main CI passed again, apart from coverage. Merging.


Labels

  • area:model: Model implementations and HF bridge logic
  • feature: New capabilities, enhancements, or enablement work
  • full-test-suite
  • needs-more-tests: Requires additional L0 and L1 test coverage before merge
  • needs-review: PR is ready for code review and waiting on a reviewer

3 participants