Support HSDP + Monolith Checkpointing#128446
Support HSDP + Monolith Checkpointing#128446mvpatel2000 wants to merge 1 commit intopytorch:mainfrom
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/128446
Note: Links to docs will display an error until the docs builds have been completed. ❗ 1 Active SEVsThere are 1 currently active SEVs. If your PR is affected, please view them below: ❌ 1 New Failure, 1 Unrelated FailureAs of commit 7422e78 with merge base d0c0892 ( NEW FAILURE - The following job has failed:
FLAKY - The following job failed but was likely due to flakiness present on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
|
@pytorchbot rebase |
|
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here |
|
@pytorchbot rebase |
|
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here |
|
Successfully rebased |
| ) -> Dict[str, Any]: | ||
| objects: List[Any] = [None] | ||
| if fsdp_state.rank == 0: | ||
| if dist.get_rank(group) == 0: |
There was a problem hiding this comment.
This looks legit since we use dist.broadcast_object_list(objects, src=0, group=group) in the next few lines. If fsdp_state.rank mismatches dist.broadcast_object_list(objects, src=0, group=group), the broadcast won't work.
|
This PR looks legit, it seems that the HSDP1 does not support broadcast optimizer state_dict correctly. cc., @awgu I'll review it again after all the tests pass. |
|
@fegin is the test failure related to regressions in pytorch main or this PR? They look unrelated so not sure what action item is here |
…cast_from_rank0" HSDP1 + broadcast_from_rank0 actually behaves differently from FSDP1 + broadcast_from_rank0. So we need an unittest to cover this use case. This test relies on the fix from #128446. Differential Revision: [D58621436](https://our.internmc.facebook.com/intern/diff/D58621436/) cc mrshenli pritamdamania87 zhaojuanmao satgera gqchen aazzolini osalpekar jiayisuse H-Huang kwen2501 awgu penguinwu XilunWu wanchaol fduwjj wz337 tianyu-l wconstab yf225 chauhang d4l3k LucasLLC MeetVadakkanchery mhorowitz [ghstack-poisoned]
HSDP1 + broadcast_from_rank0 actually behaves differently from FSDP1 + broadcast_from_rank0. So we need an unittest to cover this use case. This test relies on the fix from #128446. Differential Revision: [D58621436](https://our.internmc.facebook.com/intern/diff/D58621436/) cc mrshenli pritamdamania87 zhaojuanmao satgera gqchen aazzolini osalpekar jiayisuse H-Huang kwen2501 awgu penguinwu XilunWu wanchaol fduwjj wz337 tianyu-l wconstab yf225 chauhang d4l3k LucasLLC MeetVadakkanchery mhorowitz [ghstack-poisoned]
|
@pytorchbot merge -f "The test failure is unrelated." |
Merge startedYour change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
…cast_from_rank0" HSDP1 + broadcast_from_rank0 actually behaves differently from FSDP1 + broadcast_from_rank0. So we need an unittest to cover this use case. This test relies on the fix from #128446. Differential Revision: [D58621436](https://our.internmc.facebook.com/intern/diff/D58621436/) cc mrshenli pritamdamania87 zhaojuanmao satgera gqchen aazzolini osalpekar jiayisuse H-Huang kwen2501 awgu penguinwu XilunWu wanchaol fduwjj wz337 tianyu-l wconstab yf225 chauhang d4l3k LucasLLC MeetVadakkanchery mhorowitz [ghstack-poisoned]
HSDP1 + broadcast_from_rank0 actually behaves differently from FSDP1 + broadcast_from_rank0. So we need an unittest to cover this use case. This test relies on the fix from #128446. Differential Revision: [D58621436](https://our.internmc.facebook.com/intern/diff/D58621436/) cc mrshenli pritamdamania87 zhaojuanmao satgera gqchen aazzolini osalpekar jiayisuse H-Huang kwen2501 awgu penguinwu XilunWu wanchaol fduwjj wz337 tianyu-l wconstab yf225 chauhang d4l3k LucasLLC MeetVadakkanchery mhorowitz [ghstack-poisoned]
Fixes #128444. Rank 0 check should be in the same group as the broadcast Pull Request resolved: #128446 Approved by: https://github.com/fegin
HSDP1 + broadcast_from_rank0 actually behaves differently from FSDP1 + broadcast_from_rank0. So we need an unittest to cover this use case. This test relies on the fix from #128446. Differential Revision: [D58621436](https://our.internmc.facebook.com/intern/diff/D58621436/) Pull Request resolved: #128755 Approved by: https://github.com/Skylion007, https://github.com/wz337 ghstack dependencies: #128685
Fixes #128444. Rank 0 check should be in the same group as the broadcast Pull Request resolved: #128446 Approved by: https://github.com/fegin (cherry picked from commit 153362f)
HSDP1 + broadcast_from_rank0 actually behaves differently from FSDP1 + broadcast_from_rank0. So we need an unittest to cover this use case. This test relies on the fix from #128446. Differential Revision: [D58621436](https://our.internmc.facebook.com/intern/diff/D58621436/) Pull Request resolved: #128755 Approved by: https://github.com/Skylion007, https://github.com/wz337 ghstack dependencies: #128685 (cherry picked from commit fe8558b)
…129255) HSDP1 + broadcast_from_rank0 actually behaves differently from FSDP1 + broadcast_from_rank0. So we need an unittest to cover this use case. This test relies on the fix from #128446. Differential Revision: [D58621436](https://our.internmc.facebook.com/intern/diff/D58621436/) Pull Request resolved: #128755 Approved by: https://github.com/Skylion007, https://github.com/wz337 ghstack dependencies: #128685 (cherry picked from commit fe8558b)
Fixes #128444. Rank 0 check should be in the same group as the broadcast Pull Request resolved: #128446 Approved by: https://github.com/fegin (cherry picked from commit 153362f) Co-authored-by: Mihir Patel <mihir.v.patel7@gmail.com>
Fixes #128444. Rank 0 check should be in the same group as the broadcast
cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k @LucasLLC @MeetVadakkanchery @mhorowitz