DeepSeek V4 Bridge #3562
Signed-off-by: weijiac <weijiac@NVIDIA.com>
- Rename linear_kv_up_proj → linear_kv_proj (MCore PR #4458)
- Restore HC head mappings (hc_head_fn/base/scale) for learned_output_contract (MCore PR #4518)
- Remove task-is-None guards in model_bridge.py (root cause fixed via allow_hf_name_mismatch)
- Add allow_hf_name_mismatch to _HCAlphaSecondaryMapping
- Handle transformers 5.x nested rope_scaling format
- Handle compress_ratios/compress_rates naming + length trim
- Explicit errors for missing config fields instead of silent fallbacks
- AutoConfig.register: re-raise non-"already registered" errors
- Remove old test scripts (dsv4_generate.py, test_dsv4_bridge_smoke.py, test_dsv4_full_import.py)
- Add new validation scripts:
  - dsv4_fresh_import_test.py: import + save ckpt + round-trip (all weights) + cosine sim
  - dsv4_fresh_generate.py: import + greedy generation with answer verification
  - dsv4_tiny_ref_vs_mg.py: tiny model layer-by-layer comparison vs official inference/model.py
- Update copyright year to 2026
float8_e8m0fnu.to(float32) already returns 2^(e-127). The old code applied an extra 2^(x-127), producing near-zero scales that zeroed all expert weights.
Also fix tiny model test init and seq_len.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
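The double-decode failure described above can be reproduced in plain Python. This is a minimal sketch, not the bridge's dequant code: `decode_e8m0` and `buggy_double_decode` are hypothetical names, and the only assumption taken from the commit message is that an E8M0 byte `e` represents the scale 2^(e-127).

```python
import math


def decode_e8m0(byte: int) -> float:
    """Decode one E8M0 scale byte (8 exponent bits, bias 127, no sign or mantissa)."""
    if not 0 <= byte <= 255:
        raise ValueError(f"E8M0 byte out of range: {byte}")
    if byte == 255:
        # 0xFF is the NaN encoding in E8M0.
        return math.nan
    return math.ldexp(1.0, byte - 127)  # exactly 2^(byte - 127)


def buggy_double_decode(byte: int) -> float:
    """Illustrates the bug: treat the already-decoded scale as a raw exponent."""
    scale = decode_e8m0(byte)      # already 2^(e-127)
    return 2.0 ** (scale - 127.0)  # the extra 2^(x-127) step collapses the scale


# Typical block scales sit near 1.0, so the second decode lands near 2^-126.
for raw in (120, 127, 130):
    print(raw, decode_e8m0(raw), buggy_double_decode(raw))
```

For scale bytes close to the bias, the doubly decoded value is on the order of 1e-38, so multiplying expert weights by it yields values that are effectively zero.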
Signed-off-by: weijiac <weijiac@nvidia.com>
Five MTP params (hc_head_fn/base/scale, e_proj, h_proj) are unmapped after MCore PR #4518 changed MTP from a concatenated eh_proj to separate e_proj/h_proj. This guard prevents a crash during fresh import. Revert when MTP mappings are fully implemented.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Active test scripts for DSv4 validation:
- dsv4_full_cosine_report.py: per-layer + sub-layer cosine sim report
- dsv4_cosine_analysis.py: capture hidden states from both official and MCore
- dsv4_last_hidden_cmp.py: post-contraction + logit comparison
- dsv4_fresh_import_save.py: fresh HF import with checkpoint save
- dsv4-bridge-handoff.md: handoff doc with results and instructions
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- dsv4_fresh_import_test.py: superseded by dsv4_fresh_import_save.py
- dsv4_tiny_ref_vs_mg.py: broken on TE GroupedLinear split, used mocked kernels
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Map the three previously-unmapped per-MTP-layer Hyper-Connection head parameters (hc_head_fn / hc_head_base / hc_head_scale) using ReplicatedMapping, mirroring the existing decoder.hc_head_* triplet.
Replace the concatenated _MTPEHProjMapping (legacy eh_proj) with two plain AutoMappings for e_proj and h_proj. After MCore #4518, the MTP layer with enable_hyper_connections=True instantiates these as separate ColumnParallelLinear modules; AutoMapping auto-detects column parallelism and shards along dim 0. The dead _MTPEHProjMapping class is removed.
Also clarify the bridge module docstring: parity is verified only for DeepSeek-V4-Flash; Flash-Base, Pro, and Pro-Base share the architecture and quant dispatch but logit parity is unmeasured. Document the two quant variants explicitly (Flash uses FP8 attn + MXFP4 experts with F8_E8M0 scales; Flash-Base/Pro/Pro-Base use uniform FP8 with F32 scales).
Verified: 4-rank EP=4 import on DeepSeek-V4-Flash-Base completes cleanly with all five previously-unmapped MTP params now resolving.
Signed-off-by: chcui <chcui@nvidia.com>
Signed-off-by: root <root@nvl72160-T13.cm.cluster>
The two `if task is None or task.megatron_module is None:` guards in build_conversion_tasks consumers were a temporary stopgap absorbing the five unmapped DSv4 MTP parameters (hc_head_fn/base/scale, e_proj, h_proj). With those mappings now in place (preceding commit), no None tasks reach the load/export loops, so revert to the standard `if task.megatron_module is None:` check. Missing mappings now fail loudly again with AttributeError, restoring the safety property the guard had bypassed.
Verified: 4-rank EP=4 import on DSv4-Flash-Base completes cleanly with no None-task interceptions.
Signed-off-by: chcui <chcui@nvidia.com>
Add unit tests that assert the DSv4 bridge's mapping_registry contains:
- decoder.hc_head_{fn,base,scale} as ReplicatedMappings (existing)
- mtp.layers.N.hc_head_{fn,base,scale} as ReplicatedMappings (new)
- mtp.layers.N.{e,h}_proj.weight as AutoMappings (new)
- no reference to the deprecated concatenated eh_proj path anywhere
Also asserts that with num_nextn_predict_layers=0 the MTP-side mappings
are absent (regression guard for environments without MTP).
mapping_registry only reads num_nextn_predict_layers from hf_config, so
mocking with SimpleNamespace is sufficient — no fixtures or GPU needed.
Signed-off-by: chcui <chcui@nvidia.com>
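The mocking approach described above can be sketched as follows. This is an illustration only, not the bridge's test code: `mtp_mapping_names` is a hypothetical stand-in for the real mapping_registry, and it models just the one behavior the commit message states, namely that the MTP-side entries depend only on num_nextn_predict_layers.

```python
from types import SimpleNamespace


def mtp_mapping_names(hf_config) -> list[str]:
    """Hypothetical stand-in: list the parameter names a registry would cover."""
    suffixes = ("fn", "base", "scale")
    # Decoder-level Hyper-Connection head triplet is always present.
    names = [f"decoder.hc_head_{s}" for s in suffixes]
    # MTP entries appear once per MTP layer, and not at all when the count is 0.
    for layer in range(hf_config.num_nextn_predict_layers):
        names += [f"mtp.layers.{layer}.hc_head_{s}" for s in suffixes]
        names += [f"mtp.layers.{layer}.{p}_proj.weight" for p in ("e", "h")]
    return names


# A SimpleNamespace is enough because only one attribute is ever read.
with_mtp = mtp_mapping_names(SimpleNamespace(num_nextn_predict_layers=1))
without_mtp = mtp_mapping_names(SimpleNamespace(num_nextn_predict_layers=0))

assert "mtp.layers.0.e_proj.weight" in with_mtp
assert not any("eh_proj" in name for name in with_mtp)
assert not any(name.startswith("mtp.") for name in without_mtp)
```

Because the mock carries a single attribute, a test written this way needs no model weights, fixtures, or GPU.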
- docs/models/llm/deepseek-v4.md: variants table with parity status,
architecture features (HC, CSA/DSA, hybrid attention, hash MoE, MTP),
conversion / inference / parallelism guidance, MCore prerequisite list
- docs/models/llm/index.md: register the new page in the toctree
- examples/models/deepseek_v4/{README,conversion.sh,inference.sh}: real-model
HF<->Megatron round-trip and generation, parameterized via WORKSPACE /
MODEL_VARIANT / EP, defaulting to DeepSeek-V4-Flash-Base on 4xB200
- src/megatron/bridge/recipes/deepseek/deepseek_v4.py: deepseek_v4_pretrain_config
with TP=1 (DSv4 constraint), EP=8 default, MTP=1, precision-aware optimizer
with bf16 moments, alltoall dispatcher; parameterized for Flash / Pro variants
- src/megatron/bridge/recipes/deepseek/__init__.py: re-export the new recipe
Slurm launch scripts intentionally deferred until the recipe has a finalized
training-config name and parallelism layout for Pro / Pro-Base.
Signed-off-by: chcui <chcui@nvidia.com>
DeepSeek-V4 layers add two mapping_proj matmuls per layer (one before attention, one before MLP) of shape hidden -> hidden * num_residual_streams, plus negligible alpha scalars and sinkhorn iterations. Without modeling this overhead, training-throughput TFLOPs/GPU readouts under-report work and understate hardware efficiency.
Add a conditional term in transformer_flops() gated on enable_hyper_connections, contributing 3 * 2 * num_layers * 2 * H^2 * num_residual_streams * batch * seq_length FLOPs.
The CSA per-layer attention reduction (sparse top-k context instead of dense O(s^2)) is intentionally not modeled here; that would be a smaller correction in the opposite direction, and leaving it out keeps the throughput estimate on the safe (over-estimating) side.
Tests:
- test_hc_flops_increase_when_enabled: HC-on > HC-off
- test_hc_exact_overhead: matches the closed-form formula
- test_hc_scales_with_residual_streams: doubling streams doubles delta
Also add a placeholder functional-test scaffold for the toy DSv4 HF<->Megatron roundtrip. It is auto-skipped today because (a) transformers does not yet ship DeepseekV4ForCausalLM, and (b) the bridge's pinned MCore lacks the DSv4 prerequisites. Both skipif conditions auto-clear when prereqs land.
Signed-off-by: chcui <chcui@nvidia.com>
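The closed-form Hyper-Connection term can be written out as a small function. This is a sketch of the formula as stated in the commit message, not the actual transformer_flops() code; `hc_flops_overhead` and its parameter names are hypothetical. The reading of the constant factors (3 for forward plus backward, the first 2 for multiply and add, the second 2 for the two mapping_proj matmuls per layer) is an assumption based on common FLOPs accounting.

```python
def hc_flops_overhead(
    num_layers: int,
    hidden_size: int,
    num_residual_streams: int,
    batch_size: int,
    seq_length: int,
) -> int:
    """FLOPs added by the two mapping_proj matmuls per layer, per training step."""
    fwd_bwd = 3             # assumed: forward pass plus roughly 2x for backward
    flops_per_mac = 2       # assumed: one multiply and one add
    matmuls_per_layer = 2   # one before attention, one before MLP
    # Each matmul maps hidden -> hidden * num_residual_streams for every token.
    macs_per_token = hidden_size * hidden_size * num_residual_streams
    tokens = batch_size * seq_length
    return (
        fwd_bwd * flops_per_mac * num_layers * matmuls_per_layer
        * macs_per_token * tokens
    )


# Toy sizes: 2 layers, H=4, 2 residual streams, 1 sequence of 8 tokens.
base = hc_flops_overhead(2, 4, 2, 1, 8)
doubled_streams = hc_flops_overhead(2, 4, 4, 1, 8)
print(base)                   # 6144
print(doubled_streams // base)  # 2
```

The second print mirrors the property checked by test_hc_scales_with_residual_streams: the term is linear in num_residual_streams, so doubling the streams doubles the overhead.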
- src/megatron/bridge/recipes/deepseek/deepseek_v4.py:
  - deepseek_v4_sft_config: full SFT, defaults to deepseek-ai/DeepSeek-V4-Flash
    (the post-trained variant), TP=1 EP=8 for a 2-node 4xB200 layout, MTP
    disabled for fine-tuning, fp32 master weights, max_lr=5e-6
  - deepseek_v4_peft_config: LoRA/DoRA via default_peft_config, TP=1 EP=1
    in the recipe (override via slurm to EP>=4 for Flash; the frozen base
    model still has to fit across ranks even though only adapters train),
    max_lr=1e-4
  - shared _apply_dsv4_finetune_common helper enforces TP=1, MoE settings,
    bf16, and disables MTP for fine-tune codepaths
- src/megatron/bridge/recipes/deepseek/__init__.py: export the new recipes
- examples/models/deepseek_v4/{slurm_sft.sh, slurm_peft.sh}: Slurm launchers
modelled on examples/models/gpt_oss/, parameterized via env vars
(WORKSPACE / MODEL_VARIANT / PRETRAINED_CHECKPOINT / PARALLELISM_CONFIGS)
and tuned for 4-GPU GB200/B200 nodes. Both refuse TP>1 with an explicit
error since DSv4 only supports TP=1.
- examples/models/deepseek_v4/README.md: updated file table to reflect the
finalized recipe names and default layouts
Verified that both recipes instantiate cleanly against the local Flash HF
config: SFT yields TP=1 EP=8 lr=5e-6, PEFT yields TP=1 EP=1 lr=1e-4 with
LoRA scheme; MTP is None and PEFT-specific assertions hold.
Signed-off-by: chcui <chcui@nvidia.com>
Dev CI passed with only one known failure, in DeepSeek V3 training. The current PR does not affect training.
Signed-off-by: Chen Cui <chcui@nvidia.com>
/claude review
Review -- DeepSeek V4 Bridge
Overall this is solid work: the bridge, FP8/MXFP4 dequant, HC alpha mapping, MTP mappings, and tests are well-structured. A few items to address:
Bugs
Code quality
Test coverage
Suggested test cases
No perf tests impacted.
Signed-off-by: weijiac <weijiac@NVIDIA.com>
/ok to test 6762bd7
/ok to test a9b3edd
/ok to test fe9131e
Main CI passed again apart from coverage. Merging.
Summary
uv.lock.
Validation
- `pre-commit run --all-files` passed.
- `uv lock --check` and locked `uv sync` dry-runs passed.