[Dev] Support EP with HSDP #2800
@BoxiangW

/ok to test 8e04f32

Hi @NVIDIA/core-devtech @NVIDIA/core-nemo, could you help review this PR? Thanks.

I had nits on the main-branch version of this PR (#2840), all documentation-related: use the "dp_shard" and "dp_outer" terminology in public-facing documentation. This PR can be merged immediately.
Signed-off-by: jinliangl <jinliangl@nvidia.com>
Head branch was pushed to by a user without write access
Please add unit tests before merging, thanks!
Hi @yanring,
Our PR #2840 fixes them, and the results look good. cc @shjwudp

/ok to test 1d011d8

What does this PR do?
Main PR: #2840
This PR adds HSDP support for Expert Parallelism (EP). With this change, DeepSeek-V3 and other MoE models can be trained with M-FSDP as well as with HSDP+EP. HSDP makes better use of local bandwidth within a larger NVLink domain.
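As a rough illustration of the two-level layout that HSDP implies, the sketch below splits data-parallel ranks into shard groups (the "dp_shard" dimension) and replica groups (the "dp_outer" dimension). The function name `hsdp_groups` and the contiguous rank ordering are assumptions made for illustration only; this is not the grouping logic implemented in this PR, and EP groups are not modeled.

```python
from typing import List, Tuple


def hsdp_groups(world_size: int, dp_outer: int) -> Tuple[List[List[int]], List[List[int]]]:
    """Hypothetical helper: build shard and replica groups for an HSDP layout.

    Assumes ranks are laid out contiguously, so each shard group holds
    neighbouring ranks (for example, ranks in the same NVLink domain).
    """
    if dp_outer <= 0 or world_size % dp_outer != 0:
        raise ValueError("world_size must be a positive multiple of dp_outer")

    dp_shard = world_size // dp_outer

    # Parameters are sharded across the ranks inside each shard group.
    shard_groups = [
        list(range(outer * dp_shard, (outer + 1) * dp_shard))
        for outer in range(dp_outer)
    ]

    # Ranks holding the same shard in different shard groups form a replica group.
    replica_groups = [
        list(range(shard, world_size, dp_shard))
        for shard in range(dp_shard)
    ]
    return shard_groups, replica_groups


if __name__ == "__main__":
    shard_groups, replica_groups = hsdp_groups(world_size=8, dp_outer=2)
    print(shard_groups)    # [[0, 1, 2, 3], [4, 5, 6, 7]]
    print(replica_groups)  # [[0, 4], [1, 5], [2, 6], [3, 7]]
```

Under this assumed layout, the heavier sharding traffic stays inside each shard group, while only gradient synchronization crosses between replica groups.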
We ran convergence tests on the DeepSeek Proxy model. The experiments were based on the latest dev branch, with an additional patch that fixes MoE checkpoint save/load.
To validate the correctness of HSDP+EP, we loaded an FSDP+EP checkpoint at step 100 and resumed training with HSDP+EP. The loss curves and gradient norms matched closely.
For example, to use HSDP=2 and EP size=2, you need to add these arguments to your script: