[Dev] docs(megatron-fsdp): add Megatron-FSDP user guide #2397

Merged
xuwchen merged 7 commits into NVIDIA:dev from xuwchen:mfsdp_user_guide_dev
Jan 16, 2026
@xuwchen xuwchen commented Nov 25, 2025

Contributor

What does this PR do ?

main PR: #2396

⚠️ For major changes (either in lines of code or in impact), please make sure to first share and discuss a design doc with the team.

Contribution process

flowchart LR
    A[Pre-checks] --> B[PR Tests]
    subgraph Code Review/Approval
        C1[Expert Review] --> C2[Final Review]
    end
    B --> C1
    C2 --> D[Merge]

Pre-checks

  • I want this PR in a versioned release and have added the appropriate Milestone (e.g., Core 0.8)
  • I have added relevant unit tests
  • I have added relevant functional tests
  • I have added proper typing to my code Typing guidelines
  • I have added relevant documentation
  • I have run the autoformatter.sh on my PR

Code review

The following process is enforced via the CODEOWNERS file for changes into megatron/core. For changes outside of megatron/core, it is up to the PR author whether or not to tag the Final Reviewer team.

For MRs into `main` branch

(Step 1): Add PR label Expert Review

(Step 2): Collect the expert reviewers reviews

  1. Attach the Expert Review label when your PR is ready for review.
  2. GitHub auto-assigns expert reviewers based on your changes. They will get notified and pick up your PR soon.

⚠️ Only proceed to the next step once all reviewers have approved, merge conflicts are resolved, and the CI is passing.
Final Review might get declined if these requirements are not fulfilled.

(Step 3): Final Review

  1. Add Final Review label
  2. GitHub auto-assigns final reviewers based on your changes. They will get notified and pick up your PR soon.

(Optional Step 4): Cherry-pick into release branch

If this PR also needs to be merged into core_r* release branches, after this PR has been merged, select Cherry-pick to open a new PR into the release branch.

For MRs into `dev` branch

The proposed review process for `dev` branch is under active discussion.

MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

Merging your PR

Any member of core-adlr and core-nemo will be able to merge your PR.

@xuwchen xuwchen requested review from a team as code owners November 25, 2025 14:32

copy-pr-bot Bot commented Nov 25, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

yanring commented Nov 25, 2025

Contributor

docs/discussions/megatron-fsdp-user-guide/example-scripts/sbatch_mfsdp_deepseek_v3.sh

Line 1:

🔴 Critical: The shebang uses a full-width exclamation mark (#！/bin/bash) instead of the ASCII one (#!/bin/bash). This will prevent the script from executing properly. Please replace the character with the standard ASCII !.
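
A check like the following could catch this kind of shebang problem before review. It is a minimal sketch rather than part of the PR: `check_shebang` is a hypothetical helper name, and it only inspects the first line of a script for non-ASCII characters.

```python
import unicodedata
from typing import List


def check_shebang(first_line: str) -> List[str]:
    """Return a list of problems found in a script's first line."""
    if not first_line.startswith("#"):
        return ["first line is not a shebang"]

    problems = []
    # Any non-ASCII character in a shebang is suspicious; report each by name.
    for ch in first_line:
        if ord(ch) > 127:
            name = unicodedata.name(ch, "unknown")
            problems.append(f"non-ASCII character U+{ord(ch):04X} ({name})")

    # A pure-ASCII line that starts with '#' but not '#!' is just a comment.
    if not problems and not first_line.startswith("#!"):
        problems.append("line starts with '#' but not '#!'")
    return problems


# "\uff01" is the full-width exclamation mark described in the comment above.
for line in ("#!/bin/bash", "#\uff01/bin/bash"):
    print(repr(line), "->", check_shebang(line) or "ok")
```

Writing the full-width character as the escape `\uff01` keeps the two variants distinguishable even where they render identically.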


docs/discussions/megatron-fsdp-user-guide/megatron-fsdp-user-guide.md

Line 43:

Minor grammar suggestion: "This step avoids potential bubbles in the CUDA stream."

Line 51:

Typo: "douhle buffers" → "double buffers"

Line 67:

Typo: "registraion" → "registration"

Line 76:

Typo: "registraion" → "registration"

Line 88-89:

Consider adding a note clarifying where the output JSON file is saved:

This will create a `param_to_param_group_map.json` file in the `/path/to/param_to_param_group_map` directory.

docs/discussions/megatron-fsdp-user-guide/example-scripts/sbatch_checkpoint_convert.sh

Line 50:

The file is missing a trailing newline. Please add a newline at the end for POSIX compliance.
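
For reference, the trailing-newline check can be sketched in a few lines. The helper names below are hypothetical and not from the PR, and the functions work on raw bytes so they make no assumption about the script's encoding.

```python
def ends_with_newline(data: bytes) -> bool:
    """True if the content is empty or its last line is newline-terminated."""
    return data == b"" or data.endswith(b"\n")


def ensure_trailing_newline(data: bytes) -> bytes:
    """Append a single newline only when the final line lacks one."""
    return data if ends_with_newline(data) else data + b"\n"


# Hypothetical script content whose last line is not terminated.
script = b"#!/bin/bash\necho done"
print(ends_with_newline(script))
print(ensure_trailing_newline(script))
```

To apply it to a file, read it with `open(path, "rb")` and write the result back only when it differs.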

@xuwchen xuwchen force-pushed the mfsdp_user_guide_dev branch from 62101b8 to 55e3025 Compare November 25, 2025 15:14
@BestJuly

Contributor

LGTM. Could you please delete the "Optimizing DeepSeek-V3 Training Performance on NVIDIA GB200 NVL72" part of README.md in this PR? It was already deleted in commit 3c1b98e.

@yanring yanring enabled auto-merge November 26, 2025 06:45
yanring commented Nov 26, 2025

Contributor

/ok to test 5139bcd

@ko3n1g ko3n1g added this to the Core 0.16 milestone Nov 26, 2025
@yanring yanring added this pull request to the merge queue Nov 26, 2025
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to no response for status checks Nov 26, 2025

Allows model initialization using meta device, followed by layer-by-layer initialization of distributed model weight buffers via the `Module.reset_parameters` API, facilitating the initialization of extremely large models.

#### 4. Add `--grad-reduce-in-bf16`

Contributor

This has numerical issues at large scale though right? Like very large models at very large batch sizes? Or have you not observed this for pretraining very large 100B+ models?

Contributor Author

Thanks for bringing this up. Have you run into concrete numerical issues at very large model and batch sizes in your experiments?
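
The general concern behind this thread, that low-precision accumulation can lose small contributions, can be shown with a toy emulation. This is a sketch under loud assumptions. It emulates bfloat16 rounding in pure Python and sums values one at a time, which is not how a real reduce-scatter combines gradients. The values are made up, and it says nothing about whether `--grad-reduce-in-bf16` causes problems in Megatron-FSDP in practice.

```python
import struct


def to_bf16(x: float) -> float:
    """Round x to the nearest bfloat16 value (round-to-nearest-even).

    Emulation only: assumes finite inputs, no NaN/inf handling.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Keep the top 16 bits of the float32 pattern, rounding on the dropped half.
    bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def accumulate(values, round_fn=lambda v: v):
    """Sequentially sum values, rounding each input and each partial sum."""
    total = 0.0
    for v in values:
        total = round_fn(total + round_fn(v))
    return total


# Hypothetical data: many equal, small gradient contributions.
grads = [1e-3] * 4096
reference = accumulate(grads)               # float64 running sum
low_precision = accumulate(grads, to_bf16)  # emulated bfloat16 running sum

print(f"float64 sum: {reference:.4f}")
print(f"bf16 sum:    {low_precision:.4f}")
```

In this toy setting the bfloat16 running sum stops growing well below the reference, because each new contribution eventually falls under half the spacing between adjacent bfloat16 values. Whether real training shows a comparable effect depends on the reduction order, gradient magnitudes, and scaling, which is the open question in this thread.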

@shjwudp shjwudp requested a review from a team November 27, 2025 03:43
BoxiangW commented Dec 1, 2025

Contributor

Thanks Xuwen for this great documentation. Can you also move it, or at least link it, here (code)? I'm not sure where this doc/discussion is being published.


## Checkpoint Conversion from 3D-Parallel to Megatron-FSDP

Megatron-FSDP introduces a new checkpoint format `fsdp_dtensor`. To help you smoothly transition from 3D-Parallel to Megatron-FSDP, we provide a script for converting checkpoints from the `torch_dist` format to the `fsdp_dtensor` format. Using DeepSeek-V3 as an example, the detailed conversion process is described below.


Thanks for your contribution, it's a great starting point for me to try Megatron-FSDP. Since our subsequent evaluation tasks are based on the torch_dist checkpoint format, could you provide a path to convert checkpoints from fsdp_dtensor to torch_dist or Hugging Face format?

Contributor Author

For checkpoint conversion, this feature will be added to Megatron-Bridge and is currently under development. Once it is ready, it will support converting between fsdp_dtensor and Hugging Face formats. Since Megatron-Bridge already supports conversion between torch_dist and Hugging Face formats, this will make it straightforward to go from fsdp_dtensor to torch_dist as well.

@xuwchen xuwchen force-pushed the mfsdp_user_guide_dev branch from b24264f to 8283f7b Compare January 16, 2026 06:41
xuwchen commented Jan 16, 2026

Contributor Author

/ok to test 8283f7b

@xuwchen xuwchen added this pull request to the merge queue Jan 16, 2026
Merged via the queue into NVIDIA:dev with commit b927e1f Jan 16, 2026
32 checks passed
@xuwchen xuwchen deleted the mfsdp_user_guide_dev branch January 16, 2026 06:59

Labels

dev branch Dev branch related issues and development

7 participants