[Dev] docs(megatron-fsdp): add Megatron-FSDP user guide #2397
62101b8 to 55e3025
LGTM. Could you please delete the "Optimizing DeepSeek-V3 Training Performance on NVIDIA GB200 NVL72" part in the
/ok to test 5139bcd
Allows model initialization using meta device, followed by layer-by-layer initialization of distributed model weight buffers via the `Module.reset_parameters` API, facilitating the initialization of extremely large models.
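The meta-device pattern described in this line can be sketched with plain PyTorch. This is a minimal single-process illustration, not Megatron-FSDP's implementation: the toy `nn.Sequential` model and the CPU target are assumptions, and Megatron-FSDP materializes into distributed weight buffers rather than into ordinary CPU tensors.

```python
import torch
import torch.nn as nn

# Build the model on the meta device: parameters carry shape and dtype
# but own no storage, so construction costs almost no memory.
with torch.device("meta"):
    model = nn.Sequential(
        nn.Linear(16, 64),
        nn.GELU(),
        nn.Linear(64, 16),
    )

meta_before = all(p.is_meta for p in model.parameters())

# Materialize and initialize one layer at a time (assumption: CPU target;
# a sharded setup would allocate only the local shard here).
for layer in model:
    layer.to_empty(device="cpu")          # allocate uninitialized storage
    if hasattr(layer, "reset_parameters"):
        layer.reset_parameters()          # fill the new storage with init values

out = model(torch.randn(2, 16))
```

Because initialization happens per layer, peak memory during setup is bounded by the largest layer instead of the whole model, which is the property the guide relies on for very large models.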
#### 4. Add `--grad-reduce-in-bf16`
This has numerical issues at large scale though right? Like very large models at very large batch sizes? Or have you not observed this for pretraining very large 100B+ models?
Thanks for bringing this up. Have you run into concrete numerical issues at very large model and batch sizes in your experiments?
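For readers following this thread, the kind of effect being asked about can be shown with a toy, pure-Python simulation of bfloat16 rounding. This is only a sketch under stated assumptions: the helper names are hypothetical, the 1024 equal contributions of 0.01 are made up, and the sequential accumulation order is a worst case that does not represent how NCCL or Megatron-FSDP actually reduce gradients.

```python
import struct

def to_bf16(x: float) -> float:
    """Round a finite float to the nearest bfloat16 value (round-to-nearest-even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)   # rounding bias on the dropped 16 bits
    bits &= 0xFFFF0000                    # keep sign, exponent, 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def bf16_sequential_sum(values):
    """Accumulate left to right, rounding the running sum to bf16 each step."""
    acc = 0.0
    for v in values:
        acc = to_bf16(acc + to_bf16(v))
    return acc

def bf16_tree_sum(values):
    """Pairwise (tree) reduction, rounding every partial sum to bf16."""
    vals = [to_bf16(v) for v in values]
    while len(vals) > 1:
        nxt = [to_bf16(vals[i] + vals[i + 1]) for i in range(0, len(vals) - 1, 2)]
        if len(vals) % 2:
            nxt.append(vals[-1])
        vals = nxt
    return vals[0]

# Hypothetical setup: 1024 contributions of equal magnitude.
grads = [0.01] * 1024
reference = sum(grads)                    # double-precision reference
seq = bf16_sequential_sum(grads)
tree = bf16_tree_sum(grads)
print(f"reference={reference:.4f} sequential_bf16={seq:.4f} tree_bf16={tree:.4f}")
```

In this toy setup the sequential sum stalls once the running total is large enough that each addend falls below half a bf16 unit in the last place, while the pairwise order stays close to the reference. Whether anything comparable shows up in real 100B+ pretraining depends on the actual reduction algorithm, gradient scaling, and data-parallel size, which this sketch does not model.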
## Checkpoint Conversion from 3D-Parallel to Megatron-FSDP
Megatron-FSDP introduces a new checkpoint format `fsdp_dtensor`. To help you smoothly transition from 3D-Parallel to Megatron-FSDP, we provide a script for converting checkpoints from the `torch_dist` format to the `fsdp_dtensor` format. Using DeepSeek-V3 as an example, the detailed conversion process is described below.
Thanks for your contribution; it's a great starting point for me to try Megatron-FSDP. Since our subsequent evaluation tasks are based on the torch_dist checkpoint type, could you provide a path to convert checkpoints from fsdp_dtensor to torch_dist or to Hugging Face format?
For checkpoint conversion, this feature will be added to Megatron-Bridge and is currently under development. Once it is ready, it will support converting between fsdp_dtensor and Hugging Face formats. Since Megatron-Bridge already supports conversion between torch_dist and Hugging Face formats, this will make it straightforward to go from fsdp_dtensor to torch_dist as well.
b24264f to 8283f7b
/ok to test 8283f7b
What does this PR do?
main PR: #2396
Contribution process
flowchart LR
    A[Pre-checks] --> B[PR Tests]
    subgraph Code Review/Approval
        C1[Expert Review] --> C2[Final Review]
    end
    B --> C1
    C2 --> D[Merge]

Pre-checks
Core 0.8)

Code review
The following process is enforced via the CODEOWNERS file for changes into `megatron/core`. For changes outside of `megatron/core`, it is up to the PR author whether or not to tag the Final Reviewer team.

For MRs into `main` branch
(Step 1): Add PR label `Expert Review`
(Step 2): Collect the expert reviewers' reviews
`Expert Review` label when your PR is ready for review.
Final Review might get declined if these requirements are not fulfilled.
(Step 3): Final Review
`Final Review` label
(Optional Step 4): Cherry-pick into release branch
If this PR also needs to be merged into `core_r*` release branches, after this PR has been merged, select `Cherry-pick` to open a new PR into the release branch.

For MRs into `dev` branch
The proposed review process for `dev` branch is under active discussion.
MRs are mergeable after one approval by either `eharper@nvidia.com` or `zijiey@nvidia.com`.

Merging your PR
Any member of core-adlr and `core-nemo` will be able to merge your PR.