Support HSDP + Monolith Checkpointing (#128446) #129254

Merged: atalman merged 1 commit into release/2.4 from chienchin/cherry-pick-pr-128446 on Jun 26, 2024

fegin commented Jun 21, 2024

Fixes #128444. The rank 0 check should be made in the same process group as the broadcast.

Pull Request resolved: #128446
Approved by: https://github.com/fegin

(cherry picked from commit 153362f)
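
To illustrate why the group matters, here is a minimal pure-Python sketch, not the actual patch. It assumes an HSDP layout of 8 ranks split into 2 contiguous shard groups of 4, with a broadcast inside each shard group; the helper names (`shard_groups`, `group_rank`, `senders`) are hypothetical. In PyTorch, the group-local rank would normally come from `torch.distributed.get_rank(group)`.

```python
# Hypothetical illustration: which rank acts as the broadcast source
# in each shard group, depending on how "rank 0" is checked.

def shard_groups(world_size, shard_size):
    """Split global ranks into contiguous shard groups (assumed layout)."""
    return [
        list(range(start, start + shard_size))
        for start in range(0, world_size, shard_size)
    ]

def group_rank(global_rank, group):
    """Rank of a process within its group (position in the group)."""
    return group.index(global_rank)

def senders(group, use_group_rank):
    """Ranks in `group` that consider themselves the broadcast source."""
    if use_group_rank:
        # Check made in the same group as the broadcast.
        return [r for r in group if group_rank(r, group) == 0]
    # Check made against the global rank.
    return [r for r in group if r == 0]

groups = shard_groups(world_size=8, shard_size=4)

for g in groups:
    print(g, "global check:", senders(g, False), "group check:", senders(g, True))
```

With the global check, the second shard group has no rank that considers itself the source, so its broadcast has nothing to send. With the group-local check, every shard group has exactly one source.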

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k

pytorch-bot bot commented Jun 21, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/129254

Note: Links to docs will display an error until the docs builds have been completed.

❌ 4 New Failures, 8 Unrelated Failures

As of commit 0186007 with merge base b66e3f0:

NEW FAILURES - The following jobs have failed:

BROKEN TRUNK - The following job failed but was also failing on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

UNSTABLE - The following jobs failed, likely due to flakiness present on trunk, and have been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot bot added the labels ciflow/inductor, oncall: distributed, and release notes: distributed (fsdp) on Jun 21, 2024
atalman merged commit 2bf3798 into release/2.4 on Jun 26, 2024
atalman deleted the chienchin/cherry-pick-pr-128446 branch on June 26, 2024 14:31