Unify and refactor Megatron-FSDP documentation. #4418
wujingyue
left a comment
LGTM in general. Thanks for consolidating the docs!
## Megatron-FSDP Feature Guide & API
| Optimization | Description | Config |
There was a problem hiding this comment.
I'm not so sure about the purpose of this table.
If it's for developers, it misses many key optimizations such as double buffering.
If it's for the user, do you expect users to proactively tune these configs? Many if not all of them should be by default on.
There was a problem hiding this comment.
These are MLM arguments; they should definitely not be turned on by default. The default should be DDP.
These are the bare minimum arguments someone would want to turn on to use Megatron-FSDP.
Technically, very recent PRs make the following optional:
- `--ckpt-format fsdp_dtensor`
- `--use-distributed-optimizer`
- `--fsdp-manual-registration`

but I need to let them stew for a bit before I document this. Plus, they are informative. We should soon be able to reduce the number of configs needed to use FSDP.
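For illustration only, the three flags named above could be collected into an argv-style list as sketched below. The helper name `megatron_fsdp_flags` and its `include_soon_optional` parameter are hypothetical and not part of Megatron-LM; only the flag strings come from this comment, and any other arguments needed to enable Megatron-FSDP are not shown.

```python
import shlex

# Flags quoted in the comment above; recent PRs are said to make them optional.
SOON_OPTIONAL_FLAGS = [
    ["--ckpt-format", "fsdp_dtensor"],
    ["--use-distributed-optimizer"],
    ["--fsdp-manual-registration"],
]


def megatron_fsdp_flags(include_soon_optional=True):
    """Hypothetical helper: flatten the flag groups into an argv-style list."""
    if not include_soon_optional:
        return []
    return [token for group in SOON_OPTIONAL_FLAGS for token in group]


if __name__ == "__main__":
    # Print the flags as a single shell-quoted string for a launch script.
    print(shlex.join(megatron_fsdp_flags()))
```

Keeping the flags in one place like this would make it easy to drop them once they are confirmed to be optional.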
There was a problem hiding this comment.
Got it.
To make sure I understood you, there are three mutually exclusive ways to enable MFSDP:
- MLM
- `FullyShardedDataParallel`
- `fully_shard`

Here you are listing the flags for just the first way. Correct?
There was a problem hiding this comment.
Yes, except it's more like:
- MLM / `FullyShardedDataParallel` (Adapter, equivalent to MCore's DDP wrapper.)
  - We might still need an adapter like this, even after the rewrite, though we simultaneously want something more modular with MiMo, so we need to code it out and see what it looks like.
- `fully_shard`

Both initialize `MegatronFSDP`.
There was a problem hiding this comment.
Could the doc somehow reflect that? To a new user or developer like me, it's not immediately clear that there are two disjoint ways of using MFSDP and which configs are for which way. Maybe the doc can have an API section where we talk about FullyShardedDataParallel (including its configs) and fully_shard (including its configs) in separate subsections. Then, the doc can talk about optimizations and implementation details.
There was a problem hiding this comment.
This will also clarify my other question, which might not be obvious to a new user. I think `FullyShardedDataParallel` is deeply integrated with MCore but `fully_shard` is not (in fact, it is designed not to be).
There was a problem hiding this comment.
I think I'll add another column to our tables that clarifies `fully_shard` vs. MCore arguments. I really want to keep the current design, where the explanation is very close to the arguments and the arguments are at the beginning of every section.
CC @megnvidia for technical documentation review
Signed-off-by: Cory Ye <cye@nvidia.com>
- **Buffer Management**: Efficient use of storage and [user buffer registration with NCCL](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/bufferreg.html#user-buffer-registration) allows Megatron-FSDP to leverage **NCCL Symmetric Memory**, introduced in NCCL `v2.27`, which enables switch offloading for **Multi-Node NVLink (MNNVL)** systems such as `GB200` / `GB300`, as well as other optimizations such as high-precision reduction collectives and zero-`COPY`.
- **Communication Overlap**: `All-Gather` (AG) and `Reduce-Scatter` (RS) collectives are optimized to be precisely overlapped with compute using various CUDA streams.
- **[TransformerEngine](https://github.com/NVIDIA/TransformerEngine) Mixed-Precision & Fused Kernels**: Native performance- and memory-optimal compatibility with MXFP8, NVFP4, and various other quantization recipes and fused kernels provided by TransformerEngine.
- **Optimized Communication & SM Utilization via SHARP**: Leverages [**SHARP** (Scalable Hierarchical Aggregation and Reduction Protocol)](https://docs.nvidia.com/networking/display/sharpv3130) to offload FSDP collectives to network switches (InfiniBand or NVLink-Switch) and significantly reduce utilization of GPU streaming multi-processors (SM) from 16-32 to 1-6, which lowers communication latency in large scaled-out workloads and frees up GPU-hosted processors for overlapped compute (GEMM) kernels. When FSDP sharding domains span both NVLink and InfiniBand, **hierarchical SHARP collectives** (NVL-SHARP and IB-SHARP) optimize communication paths across the entire system topology.
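As a reading aid for the `All-Gather` and `Reduce-Scatter` collectives mentioned in the bullets above, here is a minimal pure-Python sketch of their data-movement semantics. It is an assumption-laden toy: ranks are simulated as list entries, the function names are hypothetical, and it does not model CUDA streams, overlap, NCCL, or SHARP.

```python
def shard(flat, world_size):
    """Split a flat list into world_size equal shards (length must divide evenly)."""
    assert len(flat) % world_size == 0
    n = len(flat) // world_size
    return [flat[r * n:(r + 1) * n] for r in range(world_size)]


def all_gather(shards):
    """Every simulated rank ends up with the concatenation of all shards."""
    full = [x for s in shards for x in s]
    return [list(full) for _ in shards]


def reduce_scatter(per_rank_grads):
    """Sum gradients element-wise across ranks, then give rank r only shard r."""
    world_size = len(per_rank_grads)
    summed = [sum(vals) for vals in zip(*per_rank_grads)]
    return shard(summed, world_size)


if __name__ == "__main__":
    params = [1.0, 2.0, 3.0, 4.0]
    shards = shard(params, world_size=2)   # each rank stores half the parameters
    gathered = all_gather(shards)          # full parameters rebuilt before compute
    grads = [[0.5, 0.5, 0.5, 0.5], [1.5, 1.5, 1.5, 1.5]]
    print(reduce_scatter(grads))           # each rank keeps its reduced gradient shard
```

In a sharded data-parallel step, the gather would typically precede a layer's compute and the reduce-scatter would follow its backward pass, which is why overlapping them with compute matters.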
There was a problem hiding this comment.
I think we should document the following zero-copy optimizations we implement:
- Eliminate Parameter Copies Around Communication: Avoid redundant parameter copies before and after collectives, while remaining compatible with selected DTensor features such as torch distributed checkpointing. (Add to the Advanced Bucketing section.)
- Zero-Copy Communication via NCCL-UBR: Enable direct communication into NCCL-managed memory through NCCL User Buffer Registration (NCCL-UBR), achieving true zero-copy data movement. (Add to the Optimized Communication & SM Utilization via SHARP section.)
- Gradient Copy Avoidance with TransformerEngine: Leverage TransformerEngine’s `fuse_wgrad_accumulation` to eliminate intermediate gradient copies during the backward pass. (Add to the TransformerEngine section.)

Could we also add support for Hybrid FSDP with outer-DP operating in ZeRO-1 mode? This would provide a more flexible and efficient trade-off between memory footprint and communication overhead.
Note on Communication Overlap: AG/RS overlap with compute does not demonstrate a clear advantage over other ZeRO-DP approaches in practice. This feature may be omitted unless further differentiation is demonstrated.
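To make the copy-avoidance idea concrete, here is a small standard-library analogy, not Megatron-FSDP, NCCL-UBR, or TransformerEngine code. It assumes a hypothetical layout in which two "parameters" are views into one flat bucket, so accumulating gradients through the views writes straight into the buffer that would be communicated, with no staging copy.

```python
from array import array

# One flat float32 buffer standing in for a communication bucket (hypothetical layout).
bucket = array("f", [0.0] * 6)
view = memoryview(bucket)

# Two "parameters" defined as slices of the bucket: these are views, not copies.
weight = view[0:4]
bias = view[4:6]


def accumulate_in_place(dst, grad):
    """Add grad into dst element by element without allocating a new buffer."""
    for i, g in enumerate(grad):
        dst[i] += g


# Writes through the views land directly in the shared bucket.
accumulate_in_place(weight, [1.0, 2.0, 3.0, 4.0])
accumulate_in_place(bias, [0.5, 0.5])
accumulate_in_place(bias, [0.5, 0.5])  # second micro-batch accumulates in place

if __name__ == "__main__":
    print(bucket.tolist())
```

The design point is that the bucket is the single source of truth: anything that would otherwise copy gradients into a separate buffer before a collective has nothing left to copy.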
Signed-off-by: Cory Ye <cye@nvidia.com> Co-authored-by: megnvidia <mmiranda@nvidia.com>
What does this PR do?
Usage
To build the docs:
Contribution process
Pre-checks
Code review
Feel free to message or comment the @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!
All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.
Step 1: Mark PR as "Ready for Review"
.github/CODEOWNERS.
Final Review might get declined if these requirements are not fulfilled.
Step 2: Final Review
For PRs that change megatron/core, once all expert reviewers have approved, the Final Review label is applied automatically and final reviewers are assigned.
For PRs outside megatron/core, this step is skipped.
Step 3: Approved
Once all required reviewers have approved, the Approved label is applied automatically.
Merge
Any member of mcore-engineers will be able to merge your PR.
For MRs into `dev` branch
The proposed review process for `dev` branch is under active discussion.
MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.