Deepseek v4 Support#1195

Merged
zhaoyinglia merged 9 commits into
flagos-ai:mainfrom
LiJunscs:deepseek_v4
May 24, 2026
Conversation

@LiJunscs LiJunscs commented May 7, 2026

Collaborator

PR Category

[Train] Most of the code is copied from the Megatron-LM dev branch, which differs from the main branch and from the release versions.
Megatron LM PR:
DeepSeek-V4:
NVIDIA/Megatron-LM#4458
NVIDIA/Megatron-LM#4481
NVIDIA/Megatron-LM#4518
mHC:
NVIDIA/Megatron-LM#2943

PR Types

[New features]

PR Description

Add the DeepSeek V4 model to FlagScale and Megatron-FL.
Supported:

  1. CSA and HCA
  2. Hash Router
  3. mHC
  4. Engram(optional)
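
As a rough illustration of the Hash Router item above: a hash router picks experts from the token ID instead of from a learned gate. The sketch below is a minimal pure-Python version under that assumption. The function name `hash_route`, the multiplier constant, and the "consecutive experts" scheme are all hypothetical and are not taken from this PR or from Megatron-LM.

```python
def hash_route(token_ids, num_experts, top_k=2):
    """Map each token ID to top_k distinct expert indices deterministically.

    Hypothetical illustration: the expert choice depends only on the token ID,
    so no learned router weights are involved.
    """
    if not 0 < top_k <= num_experts:
        raise ValueError("top_k must be in [1, num_experts]")
    routes = []
    for tok in token_ids:
        # Simple multiplicative hash of the token ID (assumed scheme).
        start = (tok * 2654435761) % num_experts
        # Take top_k consecutive experts so the choices are distinct.
        routes.append([(start + i) % num_experts for i in range(top_k)])
    return routes


routes = hash_route([0, 1, 7, 7], num_experts=8, top_k=2)
```

Because the mapping is a pure function of the token ID, the two occurrences of token 7 land on the same experts, which is the property that distinguishes this kind of routing from a score-based top-k router.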

Unsupported:

  1. Sqrtsoftplus router score function. ✅
  2. mHC recompute. ✅
  3. overlap_grad_reduce and overlap_param_gather with ZeRO-1. ✅
  4. Any infra optimizations.
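
For readers unfamiliar with the router score function named in item 1, the sketch below assumes it means the square root of softplus applied to each router logit, followed by a top-k selection with weights normalized to sum to 1. That reading is inferred from the name. The helper names and the normalization step are assumptions, not the code added in this PR.

```python
import math


def softplus(x):
    """Numerically stable softplus: log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sqrtsoftplus_scores(logits):
    """Assumed definition: score_i = sqrt(softplus(logit_i))."""
    return [math.sqrt(softplus(z)) for z in logits]


def top_k_normalized(scores, k):
    """Pick the k highest-scoring experts and normalize their weights to sum to 1."""
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    total = sum(scores[i] for i in chosen)
    return {i: scores[i] / total for i in chosen}


scores = sqrtsoftplus_scores([-2.0, 0.0, 3.0, 1.0])
weights = top_k_normalized(scores, k=2)
```

Softplus is strictly positive, so the square root is always defined and the resulting scores preserve the ordering of the logits.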

NOTE: This is only a draft PR; please review it and share suggestions.

For example:

  1. File structure.
    • All modules are moved to Megatron-FL. Only model_builder is left in FlagScale.
    • Should the Engram-related CI be deleted or not?

Next plan:

  1. Distributed training. ✅
  2. Muon optimizer with ZeRO-1 adaptation. 😢
  3. Low precision is out of scope for this PR because of limited resources.
  4. Possibly context parallelism for sparse attention.
  5. More suggestions are welcome.

@LiJunscs LiJunscs changed the title Deepseek v4 Deepseek v4 Support May 7, 2026
@LiJunscs LiJunscs self-assigned this May 7, 2026
@LiJunscs LiJunscs marked this pull request as ready for review May 12, 2026 11:21
LiJunscs added 4 commits May 20, 2026 20:06
1. fix incompatibility between engram and mhc;
2. validate training pipeline of deepseek v4.
3. add fake gold value of test deepseek_v4.
@LiJunscs

Collaborator Author

869b205 /ok to review.

@LiJunscs

Collaborator Author

The Muon optimizer with ZeRO-1 adaptation will be pushed after the next FlagOS version is released.

Copilot AI left a comment

Contributor

Pull request overview

Adds DeepSeek V4 training support into FlagScale/Megatron integration by introducing a dedicated DeepSeek V4 training entrypoint and model wiring (hybrid attention + hyper-connections + optional engram), along with a new functional test case and updated CLI/config plumbing.

Changes:

  • Introduce DeepSeek V4 model builder/model/block/layer implementations and a new train_deepseek_v4.py entrypoint.
  • Extend argument parsing/config translation for DeepSeek V4 hybrid attention and related settings.
  • Add/adjust functional test configs + gold values for a new DeepSeek V4 test case and update CUDA platform test selection.

Reviewed changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 17 comments.

| File | Description |
| --- | --- |
| tests/test_utils/config/platforms/cuda.yaml | Adds a DeepSeek test case to CUDA platform functional test selection. |
| tests/functional_tests/train/deepseek/gold_values/tp1_pp2_ep2_v4.json | New gold values baseline for the DeepSeek V4 functional training case. |
| tests/functional_tests/train/deepseek/conf/train/tp1_pp2_ep2_v4.yaml | New DeepSeek V4 training configuration used by functional tests. |
| tests/functional_tests/train/deepseek/conf/train/data.yaml | Updates DeepSeek functional test data/tokenizer paths. |
| tests/functional_tests/train/deepseek/conf/tp1_pp2_ep2_v4.yaml | New top-level Hydra experiment config pointing to the DeepSeek V4 entrypoint. |
| flagscale/train/megatron/training/arguments.py | Adds DeepSeek V4 hybrid attention arg parsing + config mapping changes. |
| flagscale/train/megatron/training/arguments_fs.py | Adds DeepSeek V4-related arg validation and a new optimizer flag; removes local engram arg registration. |
| flagscale/train/megatron/train_deepseek_v4.py | New DeepSeek V4 training script/entrypoint. |
| flagscale/models/megatron/engram/short_conv.py | Removes local Engram implementation code (moved upstream). |
| flagscale/models/megatron/engram/ngram_hash.py | Removes local Engram hashing/tokenizer implementation (moved upstream). |
| flagscale/models/megatron/engram/multi_head_embedding.py | Removes local Engram embedding implementation (moved upstream). |
| flagscale/models/megatron/engram/engram.py | Removes local Engram module (moved upstream). |
| flagscale/models/megatron/engram/engram_transformer_layer.py | Switches to using upstream megatron.core.transformer.engram.EngramModule. |
| flagscale/models/megatron/engram/engram_model.py | Switches hash mapping import to upstream megatron.core.transformer.engram. |
| flagscale/models/megatron/engram/engram_config.py | Removes local EngramConfig dataclass (moved upstream). |
| flagscale/models/megatron/deepseek_v4/deepseek_transformer_layer.py | New DeepSeek-specific transformer layer wrapper with engram hooks. |
| flagscale/models/megatron/deepseek_v4/deepseek_transformer_block.py | New DeepSeek-specific transformer block with hyper-connection + (planned) MHC recompute wiring. |
| flagscale/models/megatron/deepseek_v4/deepseek_model.py | New DeepSeek GPTModel subclass with lazy/async engram hash computation. |
| flagscale/models/megatron/deepseek_v4/deepseek_builder.py | New DeepSeek model builder/spec wiring for hybrid attention + optional engram. |
| examples/deepseek_v3/conf/train/next.yaml | Adds an example training config (DeepSeek v3 example directory). |

train:
aquila: ["tp2_pp2", "tp4_pp2"]
deepseek: ["tp2_pp2_ep2", "tp2_pp2_ep2_engram"]
deepseek: ["tp2_pp2_ep2", "tp2_pp2_ep2_engram", "tp1_pp2_ep3_v4"]
Comment on lines +15 to +16
},
"fake": true
Comment on lines 1 to 3
data:
data_path: /home/gitlab-runner/data/pile_wikipedia_demo/pile_wikipedia_demo
data_path: /workspace/data/enron_emails_demo_text_document_qwen
split: 1
tokenizer:
tokenizer_type: QwenTokenizerFS
tokenizer_path: /home/gitlab-runner/tokenizers/qwentokenizer
tokenizer_path: /workspace/tokenizers/qwentokenizer
shell_cmds: null
envs:
HYDRA_FULL_ERROR: 1
CUDA_VISIBLE_DEVICES: "4,5,6,7"
Comment on lines +84 to +88
_broadcast(batch['tokens'])
_broadcast(batch['attention_mask'])
_broadcast(batch['position_ids'])
######### FlagScale Begin ########
if mpu.get_dualpipev_pipeline_model_parallel_world_size() is not None:
Comment on lines +14 to +18
if TYPE_CHECKING:
from megatron.core.tensor_parallel.random import CheckpointManager
else:
CheckpointManager = None
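
The hunk above uses the standard `typing.TYPE_CHECKING` guard: the import is visible to static type checkers only, and the name is bound to `None` at runtime. The sketch below shows the same pattern with the standard-library `Decimal` class standing in for `CheckpointManager`. The `describe` function is a hypothetical example, not code from this PR.

```python
from typing import TYPE_CHECKING, Optional

if TYPE_CHECKING:
    # Seen by type checkers only; this branch never runs at runtime.
    from decimal import Decimal
else:
    # Runtime placeholder so the name exists without importing the module.
    Decimal = None


def describe(value: "Optional[Decimal]" = None) -> str:
    """The annotation mentions Decimal, but nothing here needs the real class."""
    return "no value" if value is None else f"value={value}"


result = describe()
```

One consequence of this pattern is that the name is only safe inside annotations: a runtime use such as `isinstance(x, Decimal)` would fail here because `Decimal` is `None` when the module actually executes.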

Comment on lines +296 to +300
next_layer = self.layers[l_no + 1]
if getattr(next_layer, "is_engram_layer", False):
next_layer.pre_compute_embedding(engram_hash_input_ids)
#### FlagScale End ####
hidden_states, context = layer(
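
The lookahead in the hunk above can be sketched in isolation as follows: before running layer `l_no`, the loop checks whether the next layer is an engram layer and, if so, starts its embedding pre-computation. The `Layer` class and `run_layers` function are hypothetical stand-ins, and the bounds check on `l_no + 1` is an assumption about how the last layer is handled, since the surrounding code is not visible in the hunk.

```python
class Layer:
    """Hypothetical stand-in for a transformer layer."""

    def __init__(self, name, is_engram_layer=False):
        self.name = name
        self.is_engram_layer = is_engram_layer
        self.prefetched = False

    def pre_compute_embedding(self, hash_input_ids):
        # In this sketch, "prefetching" just records that the call happened.
        self.prefetched = True

    def __call__(self, hidden_states):
        return hidden_states + [self.name]


def run_layers(layers, hidden_states, hash_input_ids):
    for l_no, layer in enumerate(layers):
        # Guard the lookahead so the last layer does not index past the end.
        if l_no + 1 < len(layers):
            next_layer = layers[l_no + 1]
            if getattr(next_layer, "is_engram_layer", False):
                next_layer.pre_compute_embedding(hash_input_ids)
        hidden_states = layer(hidden_states)
    return hidden_states


layers = [Layer("a"), Layer("b", is_engram_layer=True), Layer("c")]
out = run_layers(layers, [], hash_input_ids=[1, 2, 3])
```

In this sketch an engram layer at index 0 would never be prefetched, because no earlier layer exists to trigger the call. A real implementation would need to cover that case separately.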
Comment on lines +209 to +218
# Precompute the engram_hash_input_ids; they will be used to create a TransformerChunkSchedulePlan.
engram_hash_input_ids = LazyHashInputIds(
hash_mapping=self.engram_hash,
input_ids=input_ids,
hash_stream=self._hash_stream,
)
if extra_block_kwargs is None:
extra_block_kwargs = {
"engram_hash_input_ids": engram_hash_input_ids,
}
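
The idea behind a lazily evaluated hash input can be illustrated with a CPU-only analogue: the computation is started in the background and the consumer blocks only when it first needs the value. The sketch below uses a thread pool where the PR uses a separate hash stream. `LazyResult` and `toy_hash` are hypothetical names and do not reproduce the `LazyHashInputIds` class from this PR.

```python
from concurrent.futures import ThreadPoolExecutor


class LazyResult:
    """Hypothetical CPU analogue of a lazily computed, asynchronously produced value."""

    def __init__(self, fn, *args, executor=None):
        self._fn = fn
        self._args = args
        self._executor = executor
        self._future = None
        self._value = None
        self._done = False

    def start(self):
        """Kick off the computation in the background, if an executor was given."""
        if self._executor is not None and self._future is None and not self._done:
            self._future = self._executor.submit(self._fn, *self._args)

    def get(self):
        """Return the value, computing or waiting for it on first access."""
        if not self._done:
            if self._future is not None:
                self._value = self._future.result()
            else:
                self._value = self._fn(*self._args)
            self._done = True
        return self._value


def toy_hash(input_ids):
    # Placeholder for the real n-gram hash mapping (assumed, not from the PR).
    return [(i * 31 + 7) % 101 for i in input_ids]


with ThreadPoolExecutor(max_workers=1) as pool:
    lazy = LazyResult(toy_hash, [1, 2, 3], executor=pool)
    lazy.start()         # overlaps with other work on the main thread
    hashed = lazy.get()  # blocks only if the background job has not finished
```

Caching the result inside the wrapper means that several consumers can call `get()` without recomputing the hash.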
"""Build decoder block spec and attach STM/HC placeholders to each local layer."""

"""GPT block spec."""
layer_norm_impl = TENorm
aoyulong previously approved these changes May 24, 2026
@zhaoyinglia zhaoyinglia merged commit 5c04b69 into flagos-ai:main May 24, 2026
5 of 7 checks passed
4 participants