
docs - Update user manual with new MoE features and Megatron FSDP#2529

Merged
yaoyu-33 merged 5 commits into
NVIDIA-NeMo:mainfrom
onel:askmanu/moe-fsdp-docs
Mar 12, 2026

onel commented Feb 25, 2026

Contributor

Changes:

  1. docs/parallelisms.md - Adding DeepEP/HybridEP optimizations, token dropping, and advanced MoE features
  2. docs/training/megatron-fsdp.md - New comprehensive guide for Megatron FSDP
  3. docs/training/checkpointing.md - Updating with fsdp_dtensor checkpoint format information

Fixes #1722

copy-pr-bot Bot commented Feb 25, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@onel changed the title from "[Documentation] Update user manual with new MoE features and Megatron FSDP" to "docs - Update user manual with new MoE features and Megatron FSDP" Feb 25, 2026
coderabbitai Bot commented Feb 25, 2026

Contributor

Walkthrough

Documentation updates introduce a comprehensive Megatron FSDP guide and expand the MoE optimization content with the DeepEP and HybridEP dispatchers. The checkpoint format documentation now clarifies compatibility across parallelization strategies and FSDP variants.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **MoE Optimization Features**<br>`docs/parallelisms.md` | Renamed and expanded "DeepEP Optimization" section to "DeepEP and HybridEP Optimizations". Introduces two high-performance MoE token dispatchers (DeepEP and HybridEP) with architecture-specific availability. Adds configuration examples using `apply_flex_dispatcher_backend` with values `"deepep"` and `"hybridep"`. Includes new `GPTModelProvider` parameters (`moe_expert_capacity_factor`, `moe_router_topk`, `moe_token_dispatcher_type`, `moe_router_load_balancing_type`, `moe_ffn_hidden_size`). Documents Token Dropping for Load Balancing with capacity-factor semantics and related requirements. |
| **Checkpoint Format Documentation**<br>`docs/training/checkpointing.md` | Expands checkpoint format coverage with `torch_dist`, `zarr`, and `fsdp_dtensor` formats. Introduces "Available Formats" section detailing characteristics and applicability. Adds "Format Selection" code example for DDP/TP/PP and Megatron FSDP scenarios. Includes "Format Compatibility" matrix across DDP, Distributed Optimizer, Megatron FSDP, Torch FSDP2, and Async Save. Clarifies `fsdp_dtensor` requirement for Megatron FSDP. Adds "Performance Optimizations" section and "Related Documentation" subsection. |
| **Megatron FSDP Guide**<br>`docs/training/megatron-fsdp.md` | New comprehensive documentation file covering Megatron FSDP concepts, configuration, compatibility, automatic adjustments, complete configuration examples, migration from DDP, Torch FSDP2 alternatives, performance considerations, troubleshooting, and references. |
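
As a rough illustration of the capacity-factor semantics mentioned for token dropping, the sketch below uses the commonly cited formula: capacity = ceil(tokens × top-k / experts × capacity factor). The helper names `expert_capacity` and `dropped_per_expert` are hypothetical and are not part of Megatron-Bridge; the real implementation may add padding or a minimum capacity, so treat this as an assumption-laden sketch rather than the library's behavior.

```python
import math


def expert_capacity(num_tokens, num_experts, topk, capacity_factor):
    """Upper bound on token assignments one expert accepts (assumed formula)."""
    # Every token is routed to `topk` experts, so a perfectly balanced router
    # would give each expert num_tokens * topk / num_experts assignments.
    balanced_load = num_tokens * topk / num_experts
    return math.ceil(balanced_load * capacity_factor)


def dropped_per_expert(loads, capacity):
    """Assignments each expert would drop once its capacity is exceeded."""
    return [max(0, load - capacity) for load in loads]


# Hypothetical batch: 1024 tokens, 8 experts, top-2 routing.
capacity = expert_capacity(num_tokens=1024, num_experts=8, topk=2, capacity_factor=1.0)
loads = [300, 256, 200, 310, 250, 240, 246, 246]  # uneven router output
dropped = dropped_per_expert(loads, capacity)
print(capacity, dropped)
```

In general, a capacity factor above 1.0 leaves headroom for routing imbalance, so fewer assignments are dropped; this is the kind of trade-off that `moe_expert_capacity_factor` is described as controlling.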

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Suggested labels

docs-only

Suggested reviewers

  • chenopis
  • ko3n1g
  • ananthsub
🚥 Pre-merge checks | ✅ 6
✅ Passed checks (6 passed)
| Check name | Status | Explanation |
|---|---|---|
| Linked Issues check | ✅ Passed | All PR objectives align with issue #1722: documentation updates for MoE features (DeepEP, HybridEP, token dropping) and Megatron FSDP are present in parallelisms.md, checkpointing.md, and the new megatron-fsdp.md file. |
| Out of Scope Changes check | ✅ Passed | All three modified files contain documentation updates directly related to the linked issue requirements: MoE features and Megatron FSDP documentation, with no unrelated changes detected. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Test Results For Major Changes | ✅ Passed | This PR contains only documentation changes to markdown files in the docs/ directory, with no source code modifications, so test results are not required. |
| Title check | ✅ Passed | The title accurately summarizes the main changes: documentation updates covering both new MoE features (DeepEP, HybridEP, token dropping) and Megatron FSDP configuration, which are the primary focus across all three modified documentation files. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |


@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/parallelisms.md`:
- Around line 303-304: Update the inaccurate inline comment above
apply_flex_dispatcher_backend(model_config,
moe_flex_dispatcher_backend="deepep") to list all supported DeepEP
targets—include B200 and B300 in addition to Ampere/Hopper (e.g., "Apply DeepEP
optimization (Ampere, Hopper, B200, B300)") so the comment matches the GPU
Architecture Requirements section.

In `@docs/training/megatron-fsdp.md`:
- Around line 218-221: The doc line about Torch FSDP2 is ambiguous about
checkpoint formats; update the bullet "- Does not require `fsdp_dtensor`
checkpoint format" to explicitly state that Torch FSDP2 uses the `torch_dist`
checkpoint format (e.g., "- Uses `torch_dist` checkpoint format, not
`fsdp_dtensor`"), and add a short parenthetical or sentence pointing readers to
the relevant checkpointing.md section for details.
- Line 1: Add the new docs/training/megatron-fsdp.md to the table of contents by
editing docs/index.md: locate the "Training and Customization" toctree section
and add an entry for training/megatron-fsdp (use the same relative path style as
other entries in that section). Ensure the new line matches the existing
indentation and ordering convention used in docs/index.md so the file is
included in the generated docs.
- Around line 186-191: The fenced code block showing Python config settings
(dist_config.use_megatron_fsdp and ddp_config.use_megatron_fsdp) is missing a
language tag; update that markdown block to use a Python fence by adding the
language specifier (```python) so the two lines dist_config.use_megatron_fsdp =
True and ddp_config.use_megatron_fsdp = True get proper syntax highlighting and
resolve the MD040 warning.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b058b66 and 172db21.

📒 Files selected for processing (3)
  • docs/parallelisms.md
  • docs/training/checkpointing.md
  • docs/training/megatron-fsdp.md

Comment thread docs/parallelisms.md
Comment on lines +303 to +304
# Apply DeepEP optimization (for Ampere/Hopper GPUs)
apply_flex_dispatcher_backend(model_config, moe_flex_dispatcher_backend="deepep")

Contributor

⚠️ Potential issue | 🟡 Minor

Inaccurate GPU comment in complete MoE example.

The inline comment # Apply DeepEP optimization (for Ampere/Hopper GPUs) omits B200 and B300, which are explicitly listed in the GPU Architecture Requirements section at lines 228–230 as supported DeepEP targets.

✏️ Proposed fix
-# Apply DeepEP optimization (for Ampere/Hopper GPUs)
+# Apply DeepEP optimization (for Ampere, Hopper, B200, B300 GPUs)
 apply_flex_dispatcher_backend(model_config, moe_flex_dispatcher_backend="deepep")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/parallelisms.md` around lines 303 - 304, Update the inaccurate inline
comment above apply_flex_dispatcher_backend(model_config,
moe_flex_dispatcher_backend="deepep") to list all supported DeepEP
targets—include B200 and B300 in addition to Ampere/Hopper (e.g., "Apply DeepEP
optimization (Ampere, Hopper, B200, B300)") so the comment matches the GPU
Architecture Requirements section.

@@ -0,0 +1,277 @@
# Megatron FSDP

Contributor

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

rg -n 'megatron.*fsdp|megatron-fsdp' docs/index.md

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 53


🏁 Script executed:

# Check if docs/index.md exists and view its structure
ls -la docs/index.md 2>&1 && echo "---" && cat -n docs/index.md

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 2370


Add docs/training/megatron-fsdp.md to the "Training and Customization" section in docs/index.md.

The new documentation file must be registered in the table of contents. According to coding guidelines, when a new markdown document is added under docs/**/*.md, docs/index.md must be updated with the document in the appropriate section. Add the entry to the "Training and Customization" toctree section.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/training/megatron-fsdp.md` at line 1, Add the new
docs/training/megatron-fsdp.md to the table of contents by editing
docs/index.md: locate the "Training and Customization" toctree section and add
an entry for training/megatron-fsdp (use the same relative path style as other
entries in that section). Ensure the new line matches the existing indentation
and ordering convention used in docs/index.md so the file is included in the
generated docs.
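
The `rg` check above can also be sketched in Python for use in a docs lint step. This is a minimal sketch: `is_listed_in_index` is a hypothetical helper, and the sample index text is invented for illustration, since the real layout of docs/index.md is not shown in this thread.

```python
def is_listed_in_index(index_text, doc_path):
    """Return True if doc_path appears as its own entry line in index_text."""
    # Accept the entry with or without the .md suffix.
    stem = doc_path[:-3] if doc_path.endswith(".md") else doc_path
    accepted = {stem, stem + ".md"}
    return any(line.strip() in accepted for line in index_text.splitlines())


# Hypothetical index excerpt; the real docs/index.md layout may differ.
sample_index = "\n".join([
    ":caption: Training and Customization",
    "training/checkpointing.md",
    "parallelisms.md",
])

# The new guide is missing from the sample index until an entry is added.
print(is_listed_in_index(sample_index, "training/megatron-fsdp.md"))
fixed_index = sample_index + "\ntraining/megatron-fsdp.md"
print(is_listed_in_index(fixed_index, "training/megatron-fsdp.md"))
```

Matching whole entry lines rather than substrings avoids false positives from prose that merely mentions the file name.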

Comment on lines +186 to +191

1. **Enable FSDP** in both `dist` and `ddp` configurations:
```python
dist_config.use_megatron_fsdp = True
ddp_config.use_megatron_fsdp = True
```

Contributor

⚠️ Potential issue | 🟡 Minor

Fenced code block is missing a language specifier.

The code block at the start of the migration steps (lines 188–191) contains Python code but has no language tag, which triggers a markdownlint MD040 warning and degrades syntax highlighting.

✏️ Proposed fix
-```
+```python
 dist_config.use_megatron_fsdp = True
 ddp_config.use_megatron_fsdp = True
🧰 Tools
🪛 markdownlint-cli2 (0.21.0)

[warning] 186-186: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/training/megatron-fsdp.md` around lines 186 - 191, The fenced code block
showing Python config settings (dist_config.use_megatron_fsdp and
ddp_config.use_megatron_fsdp) is missing a language tag; update that markdown
block to use a Python fence by adding the language specifier (```python) so the
two lines dist_config.use_megatron_fsdp = True and ddp_config.use_megatron_fsdp
= True get proper syntax highlighting and resolve the MD040 warning.
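
For readers unfamiliar with MD040, the following is a simplified sketch of the kind of check markdownlint performs. It is not markdownlint's implementation: `untagged_fences` is a hypothetical helper that only handles three-backtick fences and ignores tilde fences, longer fences, and indentation rules.

```python
from typing import List

FENCE = "`" * 3  # built programmatically to avoid a literal fence in this example


def untagged_fences(markdown: str) -> List[int]:
    """Return 1-based line numbers of opening fences that lack a language tag."""
    hits = []
    inside_block = False
    for lineno, line in enumerate(markdown.splitlines(), start=1):
        stripped = line.strip()
        if not stripped.startswith(FENCE):
            continue
        if inside_block:
            inside_block = False  # this fence closes the current block
        else:
            inside_block = True  # this fence opens a new block
            if stripped[len(FENCE):].strip() == "":
                hits.append(lineno)
    return hits


# One untagged block followed by one tagged block.
sample = "\n".join([
    "text",
    FENCE,
    "dist_config.use_megatron_fsdp = True",
    FENCE,
    "",
    FENCE + "python",
    "ddp_config.use_megatron_fsdp = True",
    FENCE,
])
print(untagged_fences(sample))
```

Only the opening fence is reported, which matches how the warning above points at the line where the block starts.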

Comment on lines +218 to +221
**Limitations of Torch FSDP2**:
- Not currently compatible with Pipeline Parallelism
- Still in experimental stage with potential bugs
- Does not require `fsdp_dtensor` checkpoint format

Contributor

⚠️ Potential issue | 🟡 Minor

Torch FSDP2 checkpoint format guidance is incomplete.

The section states Torch FSDP2 "Does not require fsdp_dtensor checkpoint format" but does not tell the reader which format to use instead. Given the compatibility matrix in checkpointing.md shows torch_dist as the supported format for Torch FSDP2, adding that here avoids ambiguity.

✏️ Proposed fix
-- Does not require `fsdp_dtensor` checkpoint format
+- Does not require `fsdp_dtensor` checkpoint format; use the default `torch_dist` format instead
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/training/megatron-fsdp.md` around lines 218 - 221, The doc line about
Torch FSDP2 is ambiguous about checkpoint formats; update the bullet "- Does not
require `fsdp_dtensor` checkpoint format" to explicitly state that Torch FSDP2
uses the `torch_dist` checkpoint format (e.g., "- Uses `torch_dist` checkpoint
format, not `fsdp_dtensor`"), and add a short parenthetical or sentence pointing
readers to the relevant checkpointing.md section for details.
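
The format guidance discussed in this thread can be summarized as a small lookup. This is a sketch under stated assumptions: the strategy keys and the `select_ckpt_format` helper are hypothetical names chosen for illustration, and only the format strings `torch_dist` and `fsdp_dtensor` come from the documentation under review.

```python
# Mapping as described in the review: Megatron FSDP requires fsdp_dtensor,
# while DDP/TP/PP and Torch FSDP2 use the default torch_dist format.
CKPT_FORMAT_BY_STRATEGY = {
    "ddp": "torch_dist",
    "torch_fsdp2": "torch_dist",
    "megatron_fsdp": "fsdp_dtensor",
}


def select_ckpt_format(strategy):
    """Return the checkpoint format name for a (hypothetical) strategy key."""
    try:
        return CKPT_FORMAT_BY_STRATEGY[strategy]
    except KeyError:
        raise ValueError(f"unknown strategy: {strategy!r}") from None


for name in ("ddp", "torch_fsdp2", "megatron_fsdp"):
    print(name, "->", select_ckpt_format(name))
```

Raising on unknown keys, rather than silently defaulting to `torch_dist`, keeps a misspelled strategy from producing a checkpoint in the wrong format.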

askmanu Bot and others added 2 commits February 25, 2026 18:58
@yaoyu-33 added the docs-only label Mar 2, 2026

@yaoyu-33 yaoyu-33 left a comment

Contributor

Docs-only change, LGTM.

yaoyu-33 commented Mar 2, 2026

Contributor

/ok to test 75838d5

@yaoyu-33 yaoyu-33 merged commit af416ec into NVIDIA-NeMo:main Mar 12, 2026
24 checks passed
Comment on lines +124 to +127
**`zarr`**
- Zarr-based checkpoint format
- Alternative to `torch_dist` for certain use cases
- Compatible with distributed parallelism strategies

Contributor

zarr backend was removed in NVIDIA/Megatron-LM#2944

copy-pr-bot Bot pushed a commit that referenced this pull request Mar 19, 2026
)

Signed-off-by: Andrei Onel <onel@users.noreply.github.com>
Co-authored-by: askmanu[bot] <192355599+askmanu[bot]@users.noreply.github.com>

Labels

community-request, docs-only
