[Dev] docs(megatron-fsdp): add Megatron-FSDP user guide by xuwchen · Pull Request #2397 · NVIDIA/Megatron-LM

xuwchen · 2025-11-25T14:32:17Z

What does this PR do ?

main PR: #2396

⚠️ For major changes (either in lines of code or in impact), please make sure to first share and discuss a design doc with the team.

Contribution process

flowchart LR
    A[Pre-checks] --> B[PR Tests]
    subgraph Code Review/Approval
        C1[Expert Review] --> C2[Final Review]
    end
    B --> C1
    C2 --> D[Merge]

Pre-checks

I want this PR in a versioned release and have added the appropriate Milestone (e.g., Core 0.8)
I have added relevant unit tests
I have added relevant functional tests
I have added proper typing to my code (see the Typing guidelines)
I have added relevant documentation
I have run the autoformatter.sh on my PR

Code review

The following process is enforced via the CODEOWNERS file for changes into megatron/core. For changes outside of megatron/core, it is up to the PR author whether or not to tag the Final Reviewer team.

For MRs into `main` branch

(Step 1): Add PR label `Expert Review`

(Step 2): Collect the expert reviewers' reviews

Attach the Expert Review label when your PR is ready for review.
GitHub auto-assigns expert reviewers based on your changes. They will get notified and pick up your PR soon.

⚠️ Only proceed to the next step once all reviewers have approved, merge conflicts are resolved, and the CI is passing.
Final Review might be declined if these requirements are not fulfilled.

(Step 3): Final Review

Add Final Review label
GitHub auto-assigns final reviewers based on your changes. They will get notified and pick up your PR soon.

(Optional Step 4): Cherry-pick into release branch

If this PR also needs to be merged into core_r* release branches, after this PR has been merged, select Cherry-pick to open a new PR into the release branch.

For MRs into `dev` branch

The proposed review process for `dev` branch is under active discussion.

MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

Merging your PR

Any member of core-adlr and core-nemo will be able to merge your PR.

copy-pr-bot · 2025-11-25T14:32:21Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

yanring · 2025-11-25T14:48:20Z

`docs/discussions/megatron-fsdp-user-guide/example-scripts/sbatch_mfsdp_deepseek_v3.sh`

Line 1:

🔴 Critical: The shebang uses a full-width exclamation mark (#！/bin/bash) instead of ASCII (#!/bin/bash). This will prevent the script from executing properly. Please replace the character with the standard ASCII !.
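A check like the following could catch this class of problem before review. It is a minimal sketch, not part of the PR: the helper names `check_shebang` and `fix_shebang` are hypothetical, and it only looks at the first line of a script.

```python
from typing import Optional

FULLWIDTH_BANG = "\uff01"  # the full-width exclamation mark flagged in the review


def check_shebang(first_line: str) -> Optional[str]:
    """Return a description of the problem, or None if the line looks fine."""
    if first_line.startswith("#" + FULLWIDTH_BANG):
        return "full-width exclamation mark (U+FF01) in shebang"
    if first_line.startswith("#!") and not first_line.isascii():
        return "non-ASCII character in shebang"
    return None


def fix_shebang(first_line: str) -> str:
    """Replace a leading '#' + U+FF01 with ASCII '#!'; return other lines unchanged."""
    if first_line.startswith("#" + FULLWIDTH_BANG):
        return "#!" + first_line[2:]
    return first_line


if __name__ == "__main__":
    bad = "#" + FULLWIDTH_BANG + "/bin/bash"
    print(check_shebang(bad))  # describes the full-width character
    print(fix_shebang(bad))    # #!/bin/bash
```

Running the check on only the first line keeps it cheap enough to use as a pre-commit step over every `.sh` file in a change.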

`docs/discussions/megatron-fsdp-user-guide/megatron-fsdp-user-guide.md`

Line 43:

Minor grammar suggestion: "This step avoids potential bubbles in the CUDA stream."

Line 51:

Typo: "douhle buffers" → "double buffers"

Line 67:

Typo: "registraion" → "registration"

Line 76:

Typo: "registraion" → "registration"

Line 88-89:

Consider adding a note clarifying where the output JSON file is saved:
This will create a `param_to_param_group_map.json` file in the `/path/to/param_to_param_group_map` directory.

`docs/discussions/megatron-fsdp-user-guide/example-scripts/sbatch_checkpoint_convert.sh`

Line 50:

The file is missing a trailing newline. Please add a newline at the end for POSIX compliance.
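As an illustration, the trailing-newline requirement can be checked and repaired on the raw bytes of a file. This is a sketch with hypothetical helper names; treating an empty file as acceptable is an assumption made here, not something stated in the review.

```python
def ends_with_newline(data: bytes) -> bool:
    """True if the content is empty or its last byte is a newline."""
    return data == b"" or data.endswith(b"\n")


def ensure_trailing_newline(data: bytes) -> bytes:
    """Append a single newline only when the content does not already end with one."""
    return data if ends_with_newline(data) else data + b"\n"


if __name__ == "__main__":
    print(ends_with_newline(b"echo done"))        # False
    print(ensure_trailing_newline(b"echo done"))  # b'echo done\n'
```

To apply it to a script, read the file in binary mode, pass the bytes through `ensure_trailing_newline`, and write the result back only if it changed.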

BestJuly · 2025-11-25T15:58:58Z

LGTM. Could you please delete the "Optimizing DeepSeek-V3 Training Performance on NVIDIA GB200 NVL72" part of README.md in this PR? It was already deleted in commit 3c1b98e.

yanring · 2025-11-26T06:46:15Z

/ok to test 5139bcd

Skylion007 · 2025-11-26T18:55:35Z

+
+Allows model initialization using meta device, followed by layer-by-layer initialization of distributed model weight buffers via the `Module.reset_parameters` API, facilitating the initialization of extremely large models.
+
+#### 4. Add `--grad-reduce-in-bf16`


This has numerical issues at large scale though right? Like very large models at very large batch sizes? Or have you not observed this for pretraining very large 100B+ models?

xuwchen · 2025-12-01

Thanks for bringing this up. Have you run into concrete numerical issues at very large model and batch sizes in your experiments?
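For readers unfamiliar with why reduced-precision accumulation raises this kind of question, the sketch below simulates bfloat16 rounding in pure Python and sums many equal values. It is only an illustration of bf16 rounding under a naive sequential sum; it makes no claim about how `--grad-reduce-in-bf16` or the underlying collectives accumulate, and `to_bf16` and `sum_in_bf16` are hypothetical helpers written for this example.

```python
import struct


def to_bf16(x: float) -> float:
    """Round a finite float to the nearest bfloat16 value (round-to-nearest-even).

    bfloat16 keeps the upper 16 bits of the float32 encoding, so the value is
    first converted to float32 and then the low 16 bits are rounded away.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # rounding bias, ties go to even
    bits &= 0xFFFF0000                   # drop the low 16 bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def sum_in_bf16(values):
    """Sequentially accumulate values, rounding the running sum to bf16 each step."""
    acc = 0.0
    for v in values:
        acc = to_bf16(acc + to_bf16(v))
    return acc


if __name__ == "__main__":
    ones = [1.0] * 1000
    print(sum(ones))          # 1000.0 with double-precision accumulation
    print(sum_in_bf16(ones))  # 256.0: the running sum stops growing at 256
```

With 8 significant bits, bf16 represents integers exactly only up to 256; beyond that the spacing is 2, so adding 1.0 to 256 rounds back to 256. Whether an effect of this kind matters in practice depends on the reduction order, the accumulation precision, and the scale of the gradients, which is what the question above is asking about.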

BoxiangW · 2025-12-01T02:41:08Z

Thanks, Xuwen, for this great documentation. Could you also move it, or at least link it, here (code)? I'm not sure where this doc/discussion is being published.

zhujian19891203 · 2026-01-05T03:13:17Z

+
+## Checkpoint Conversion from 3D-Parallel to Megatron-FSDP
+
+Megatron-FSDP introduces a new checkpoint format `fsdp_dtensor`. To help you smoothly transition from 3D-Parallel to Megatron-FSDP, we provide a script for converting checkpoints from the `torch_dist` format to the `fsdp_dtensor` format. Using DeepSeek-V3 as an example, the detailed conversion process is described below.


Thanks for your contribution; it's a great starting point for me to try Megatron-FSDP. Since our subsequent evaluation tasks are based on the torch_dist checkpoint format, could you provide a way to convert checkpoints from fsdp_dtensor to torch_dist or Hugging Face format?

xuwchen · 2026-01-15

For checkpoint conversion, this feature will be added to Megatron-Bridge and is currently under development. Once it is ready, it will support converting between fsdp_dtensor and Hugging Face formats. Since Megatron-Bridge already supports conversion between torch_dist and Hugging Face formats, this will make it straightforward to go from fsdp_dtensor to torch_dist as well.

xuwchen · 2026-01-16T06:50:46Z

/ok to test 8283f7b

xuwchen requested review from a team as code owners November 25, 2025 14:32

xuwchen force-pushed the mfsdp_user_guide_dev branch from 62101b8 to 55e3025 November 25, 2025 15:14

yanring approved these changes Nov 26, 2025

yanring enabled auto-merge November 26, 2025 06:45

copy-pr-bot Bot temporarily deployed to nemo-ci November 26, 2025 06:46 Inactive

ko3n1g added this to the Core 0.16 milestone Nov 26, 2025

copy-pr-bot Bot temporarily deployed to nemo-ci November 26, 2025 06:46 Inactive

yanring added this pull request to the merge queue Nov 26, 2025

github-merge-queue Bot removed this pull request from the merge queue due to no response for status checks Nov 26, 2025

Skylion007 reviewed Nov 26, 2025

shjwudp requested a review from a team November 27, 2025 03:43

zhujian19891203 reviewed Jan 5, 2026

zhujian19891203 mentioned this pull request Jan 5, 2026

How to convert checkpoint from fsdp_dtensor to torch_dist or huggingface in megatron-fsdp mode? #2805

Open

xuwchen added the `dev branch` label (Dev branch related issues and development) Jan 7, 2026

xuwchen added 7 commits January 15, 2026 22:40

add Megatron-FSDP user guide

da00549

update Megatron-FSDP user guide description in README

29e6373

fix typos and formatting issues

f6f94c7

remove DeepSeek-V3 GB200 optimization guide from README

13a3415

add note about nccl_ub incompatibility with segmentable allocator

e6cadeb

add link to Megatron-FSDP user guide in custom_fsdp.md

f3dd2e7

improve Megatron-FSDP user guide based on review feedback

8283f7b

xuwchen force-pushed the mfsdp_user_guide_dev branch from b24264f to 8283f7b January 16, 2026 06:41

copy-pr-bot Bot temporarily deployed to nemo-ci January 16, 2026 06:50 Inactive

copy-pr-bot Bot temporarily deployed to nemo-ci January 16, 2026 06:51 Inactive

xuwchen added this pull request to the merge queue Jan 16, 2026

Merged via the queue into NVIDIA:dev with commit b927e1f Jan 16, 2026
32 checks passed

xuwchen deleted the mfsdp_user_guide_dev branch January 16, 2026 06:59

yanring mentioned this pull request Jan 26, 2026

[ROADMAP][Updated on April 07] Megatron Core MoE Roadmap #1729

Open

48 tasks

xuwchen merged 7 commits into NVIDIA:dev from xuwchen:mfsdp_user_guide_dev
