[Dev] Support EP with HSDP by wplf · Pull Request #2800 · NVIDIA/Megatron-LM

wplf · 2026-01-04T05:55:04Z

What does this PR do?

main PR: #2840

This PR adds HSDP support for Expert Parallelism. With this change, DeepSeek‑v3 and other MoE models can be trained using M‑FSDP as well as HSDP+EP. HSDP enables better use of local bandwidth within a larger NVLink domain.

We ran convergence tests on the DeepSeek Proxy model. The experiments were based on the latest dev branch, with an additional patch that fixes MoE checkpoint save/load.

To validate the correctness of HSDP+EP, we loaded an FSDP+EP checkpoint at step 100 and resumed training with HSDP+EP. The loss curves and gradient norms matched closely.
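
One way to make "matched closely" concrete is a per-step relative-tolerance check between the two runs. This is only a sketch: the helper name `curves_match`, the 1% tolerance, and the sample values are hypothetical and are not results or tooling from this PR.

```python
import math


def curves_match(reference, candidate, rel_tol=1e-2):
    """Return True if two per-step metric curves agree within rel_tol at every step.

    `reference` and `candidate` are sequences of floats (e.g. loss per step).
    The 1% default tolerance is an arbitrary illustrative choice.
    """
    if len(reference) != len(candidate):
        raise ValueError("curves must cover the same number of steps")
    return all(
        math.isclose(r, c, rel_tol=rel_tol)
        for r, c in zip(reference, candidate)
    )


# Made-up placeholder values, NOT measurements from this PR.
fsdp_ep_loss = [2.50, 2.45, 2.41]
hsdp_ep_loss = [2.51, 2.44, 2.41]
print(curves_match(fsdp_ep_loss, hsdp_ep_loss))
```

The same check could be applied to the gradient-norm curve, possibly with a looser tolerance.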

For example, to use HSDP=2 and EP size=2, add these arguments to your script:

--num-distributed-optimizer-instances 2 \
--outer-dp-sharding-strategy optim \
--expert-model-parallel-size 2
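
As a rough sketch of the rank arithmetic these flags imply: assuming `--num-distributed-optimizer-instances` sets the outer data-parallel dimension (`dp_outer`) and the remaining data-parallel ranks form the inner sharding dimension (`dp_shard`), the split can be computed as below. The function `hsdp_layout`, the `expert_dp_shard` key, the way expert ranks are divided, and the 8-rank data-parallel size are all assumptions for illustration, not code from this PR.

```python
def hsdp_layout(dp_size, num_instances, ep_size):
    """Split a data-parallel group into outer and inner (sharding) dimensions.

    Assumptions (not confirmed by the PR):
      - dp_size is the data-parallel size for dense parameters.
      - Each optimizer instance shards over dp_size // num_instances ranks.
      - Expert parameters shard over dp_shard // ep_size ranks per instance.
    """
    if dp_size % num_instances != 0:
        raise ValueError("dp_size must be divisible by num_instances")
    dp_shard = dp_size // num_instances
    if dp_shard % ep_size != 0:
        raise ValueError("dp_shard must be divisible by ep_size")
    return {
        "dp_outer": num_instances,
        "dp_shard": dp_shard,
        "expert_dp_shard": dp_shard // ep_size,
    }


# HSDP=2 and EP=2 on a hypothetical data-parallel size of 8.
print(hsdp_layout(dp_size=8, num_instances=2, ep_size=2))
# → {'dp_outer': 2, 'dp_shard': 4, 'expert_dp_shard': 2}
```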

copy-pr-bot · 2026-01-04T05:55:08Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

wplf · 2026-01-07T08:54:19Z

@BoxiangW
Hi, Boxiang.
The loss convergence and grad-norm curves of HSDP+EP vs. FSDP+EP have been tested.
Is there anything else that I need to do?

shjwudp · 2026-01-09T07:23:29Z

/ok to test 8e04f32

shjwudp · 2026-01-22T07:45:24Z

Hi @NVIDIA/core-devtech @NVIDIA/core-nemo , could you help review this PR? thanks.

cspades · 2026-01-23T19:27:06Z

Had nits on the main branch version of this PR, just documentation-related (use "dp_shard" and "dp_outer" terminology for public-facing documentation): #2840

This PR can be merged immediately.

wplf · 2026-01-24T03:06:10Z

Hi, Cory.
Do I need to rename dp_cp to dp_shard_cp in this branch only?
Or do I need to rename it in both PR-2800 and PR-2840?

I'm kind of confused.

Signed-off-by: jinliangl <jinliangl@nvidia.com>

yanring · 2026-01-27T05:47:34Z

Please add unit tests before merging, thanks!

wplf · 2026-01-28T10:14:20Z

> Please add unit tests before the merging, thanks!

Hi, @yanring
Jack has added unit tests in the main branch; I've merged them and tested it with wandb.

On the main branch, HSDP + EP with the outer dimension set to no_shard is buggy.
On the main branch, HSDP + EP with the outer sharding strategy set to optim crashes.

PR-2840 fixes both, and the results look good. cc @shjwudp

Please let me know if there is anything else I can do.

yanring · 2026-01-29T16:06:56Z

/ok to test 1d011d8

wplf requested review from a team as code owners January 4, 2026 05:55

github-actions Bot added the community-request label Jan 4, 2026

wplf force-pushed the jinliang/hsdp-ep branch 3 times, most recently from 87e89a7 to 6cdd7ba Compare January 4, 2026 06:08

lilei199908 reviewed Jan 4, 2026

Comment thread megatron/core/distributed/fsdp/src/megatron_fsdp/param_and_grad_buffer.py Outdated

wplf marked this pull request as draft January 5, 2026 04:19

shjwudp requested a review from a team January 5, 2026 05:02

BoxiangW reviewed Jan 5, 2026

Comment thread megatron/core/optimizer/optimizer_config.py Outdated

wplf marked this pull request as ready for review January 6, 2026 02:15

wplf marked this pull request as draft January 6, 2026 02:16

wplf changed the title ~~Jinliang/hsdp ep~~ Support EP with HSDP Jan 6, 2026

wplf changed the title ~~Support EP with HSDP~~ [Dev] Support EP with HSDP Jan 6, 2026

Jinliang Li and others added 3 commits January 6, 2026 14:50

support hsdp with expert parallel

89c64fa

remove useless num_distributed_optimizer_instances in optimizer_config

8395118

remove useless group_id

ee1d15c

wplf force-pushed the jinliang/hsdp-ep branch from 4057404 to a7e1464 Compare January 6, 2026 06:51

add hybrid_fsdp_expt_group in pg_collection and other minor change

1c448e7

wplf force-pushed the jinliang/hsdp-ep branch from a7e1464 to 1c448e7 Compare January 6, 2026 06:54

lint code by bash ./tools/autoformat.sh

4de1eef

wplf marked this pull request as ready for review January 7, 2026 08:52

wplf force-pushed the jinliang/hsdp-ep branch from 4de1eef to 8e04f32 Compare January 8, 2026 02:34

BoxiangW approved these changes Jan 8, 2026

copy-pr-bot Bot temporarily deployed to nemo-ci January 9, 2026 07:23 Inactive

ko3n1g added this to the Core 0.16 milestone Jan 9, 2026

copy-pr-bot Bot temporarily deployed to nemo-ci January 22, 2026 00:50 Inactive

copy-pr-bot Bot temporarily deployed to test January 22, 2026 00:51 Inactive

copy-pr-bot Bot temporarily deployed to nemo-ci January 22, 2026 01:43 Inactive

shjwudp mentioned this pull request Jan 22, 2026

Support EP with HSDP #2840

Merged

6 tasks

shjwudp added the Final Review label (PR is in the "final review" stage) Jan 23, 2026

revert dp_cp naming back to dp_shard_cp

e6af256

Signed-off-by: jinliangl <jinliangl@nvidia.com>

auto-merge was automatically disabled January 24, 2026 03:20
Head branch was pushed to by a user without write access

yanring enabled auto-merge January 27, 2026 05:34

Merge branch 'dev' into hsdp-ep

1d011d8

auto-merge was automatically disabled January 28, 2026 06:16
Head branch was pushed to by a user without write access

lilei199908 approved these changes Jan 29, 2026

chtruong814 removed the needs-follow-up label (Issue needs follow-up) Jan 29, 2026

yanring enabled auto-merge January 29, 2026 16:06

yanring approved these changes Jan 29, 2026

copy-pr-bot Bot temporarily deployed to nemo-ci January 29, 2026 16:07 Inactive

copy-pr-bot Bot temporarily deployed to test January 29, 2026 16:07 Inactive

copy-pr-bot Bot temporarily deployed to nemo-ci January 29, 2026 16:46 Inactive

yanring added this pull request to the merge queue Jan 29, 2026

Merged via the queue into NVIDIA:dev with commit bde9e32 Jan 29, 2026
46 of 48 checks passed

yanring merged 11 commits into NVIDIA:dev from wplf:jinliang/hsdp-ep

8 participants
