[Dev] Support EP with HSDP#2800

Merged
yanring merged 11 commits into NVIDIA:dev from wplf:jinliang/hsdp-ep
Jan 29, 2026

@wplf wplf commented Jan 4, 2026

Member

What does this PR do?

main PR: #2840

This PR adds HSDP support for Expert Parallelism. With this change, DeepSeek‑v3 and other MoE models can be trained using M‑FSDP as well as HSDP+EP. HSDP enables better use of local bandwidth within a larger NVLink domain.

We ran convergence tests on the DeepSeek Proxy model. The experiments were based on the latest dev branch, with an additional patch that fixes MoE checkpoint save/load.

To validate the correctness of HSDP+EP, we loaded an FSDP+EP checkpoint at step 100 and resumed training with HSDP+EP. The loss curves and gradient norms matched closely.
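One way to put a number on "matched closely" is to compare the two logged loss series step by step. The sketch below is only an illustration: the helper name `max_relative_gap`, the 1% tolerance, and the loss values are all hypothetical placeholders, not results or tooling from this PR.

```python
def max_relative_gap(reference, candidate):
    """Return the largest per-step relative difference between two loss series."""
    if len(reference) != len(candidate):
        raise ValueError("both runs must log the same number of steps")
    # Guard against division by zero for a (hypothetical) zero reference loss.
    return max(abs(r - c) / max(abs(r), 1e-12) for r, c in zip(reference, candidate))


# Placeholder values for illustration only; these are NOT measurements from the PR.
fsdp_ep_loss = [2.50, 2.41, 2.35]
hsdp_ep_loss = [2.50, 2.42, 2.35]

gap = max_relative_gap(fsdp_ep_loss, hsdp_ep_loss)
assert gap < 0.01  # hypothetical tolerance; choose one that fits your run
```

The same check can be applied to gradient norms logged after the resume point.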

image image

For example, to use HSDP=2 and EP size=2, add these arguments to your script:

--num-distributed-optimizer-instances 2 \
--outer-dp-sharding-strategy optim \
--expert-model-parallel-size 2
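To make the HSDP=2 setting more concrete, here is a minimal sketch of how data-parallel ranks could be split into an inner sharding group and an outer group. It assumes 8 data-parallel ranks, contiguous ranks within a shard group, and no other parallelism dimensions; the function name `hsdp_groups` is hypothetical, and the real rank ordering in Megatron-FSDP may differ. It does not model where the expert-parallel groups are placed.

```python
def hsdp_groups(dp_world_size: int, num_outer: int):
    """Split data-parallel ranks into `num_outer` replicas of equal-size shard groups."""
    if dp_world_size % num_outer != 0:
        raise ValueError("dp_world_size must be divisible by num_outer")
    shard_size = dp_world_size // num_outer
    # Assumption: ranks inside one shard group ("dp_shard") are contiguous.
    shard_groups = [
        list(range(i * shard_size, (i + 1) * shard_size)) for i in range(num_outer)
    ]
    # Outer groups ("dp_outer") connect the ranks that hold the same shard
    # position in each replica.
    outer_groups = [[group[j] for group in shard_groups] for j in range(shard_size)]
    return shard_groups, outer_groups


shard_groups, outer_groups = hsdp_groups(dp_world_size=8, num_outer=2)
print(shard_groups)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(outer_groups)  # [[0, 4], [1, 5], [2, 6], [3, 7]]
```

In this picture, `--num-distributed-optimizer-instances 2` corresponds to `num_outer=2`, so each shard group spans half of the data-parallel ranks.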

@wplf wplf requested review from a team as code owners January 4, 2026 05:55

copy-pr-bot Bot commented Jan 4, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@wplf wplf force-pushed the jinliang/hsdp-ep branch 3 times, most recently from 87e89a7 to 6cdd7ba Compare January 4, 2026 06:08
Comment thread megatron/core/distributed/fsdp/src/megatron_fsdp/param_and_grad_buffer.py Outdated
@wplf wplf marked this pull request as draft January 5, 2026 04:19
@shjwudp shjwudp requested a review from a team January 5, 2026 05:02
Comment thread megatron/core/optimizer/optimizer_config.py Outdated
@wplf wplf marked this pull request as ready for review January 6, 2026 02:15
@wplf wplf marked this pull request as draft January 6, 2026 02:16
@wplf wplf changed the title Jinliang/hsdp ep Support EP with HSDP Jan 6, 2026
@wplf wplf changed the title Support EP with HSDP [Dev] Support EP with HSDP Jan 6, 2026
@wplf wplf force-pushed the jinliang/hsdp-ep branch from 4057404 to a7e1464 Compare January 6, 2026 06:51
@wplf wplf force-pushed the jinliang/hsdp-ep branch from a7e1464 to 1c448e7 Compare January 6, 2026 06:54
@wplf wplf marked this pull request as ready for review January 7, 2026 08:52

wplf commented Jan 7, 2026

Member Author

@BoxiangW
Hi, Boxiang.
The loss convergence and grad norm curves of HSDP+EP vs FSDP+EP have been tested.
Is there anything else I need to do?

@wplf wplf force-pushed the jinliang/hsdp-ep branch from 4de1eef to 8e04f32 Compare January 8, 2026 02:34

shjwudp commented Jan 9, 2026

Contributor

/ok to test 8e04f32

shjwudp commented Jan 22, 2026

Contributor

Hi @NVIDIA/core-devtech @NVIDIA/core-nemo, could you help review this PR? Thanks.

@shjwudp shjwudp mentioned this pull request Jan 22, 2026
6 tasks
@shjwudp shjwudp added the Final Review PR is in the "final review" stage label Jan 23, 2026

cspades commented Jan 23, 2026

Member

Had nits on the main branch version of this PR, just documentation-related (use "dp_shard" and "dp_outer" terminology for public-facing documentation): #2840

This PR can be merged immediately.

wplf commented Jan 24, 2026

Member Author

Hi, Cory.
Do I need to rename dp_cp to dp_shard_cp in this branch?
Or do I need to rename it in both PR-2800 and PR-2840?

I'm kind of confused.

Signed-off-by: jinliangl <jinliangl@nvidia.com>
auto-merge was automatically disabled January 24, 2026 03:20

Head branch was pushed to by a user without write access

@yanring yanring enabled auto-merge January 27, 2026 05:34

yanring commented Jan 27, 2026

Contributor

Please add unit tests before merging, thanks!

auto-merge was automatically disabled January 28, 2026 06:16

Head branch was pushed to by a user without write access

wplf commented Jan 28, 2026

Member Author

> Please add unit tests before merging, thanks!

Hi, @yanring
Jack has added a unit test in the main branch, and I've merged it and tested it in wandb.

  • Main branch's HSDP + EP with outer dim no_shard is bugged.
  • Main branch's HSDP + EP with outer sharding strategy optim crashes.

Our PR-2840 fixes them, and the results look good. cc @shjwudp

Please let me know if there is anything else I can do.
image

@chtruong814 chtruong814 removed the needs-follow-up Issue needs follow-up label Jan 29, 2026
@yanring yanring enabled auto-merge January 29, 2026 16:06

yanring commented Jan 29, 2026

Contributor

/ok to test 1d011d8

@yanring yanring added this pull request to the merge queue Jan 29, 2026
Merged via the queue into NVIDIA:dev with commit bde9e32 Jan 29, 2026
46 of 48 checks passed
Labels

Final Review PR is in the "final review" stage module: megatron-fsdp

8 participants