
[DCP] Fixes the stateless optimizer issue of distributed state_dict (…#136000

Merged
kit1980 merged 1 commit intopytorch:release/2.5from
Skylion007:skylion007/cherry-pick-1d9feff
Sep 20, 2024


@Skylion007
Collaborator

@Skylion007 Skylion007 commented Sep 13, 2024

#135535)

Some optimizers don't have states, which can cause get_state_dict/set_state_dict to behave incorrectly. This PR fixes these issues.

fixes: #133415

Pull Request resolved: #135535
Approved by: https://github.com/wz337

Fixes #133415

cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o @LucasLLC @MeetVadakkanchery @mhorowitz @pradeepfn

@Skylion007 Skylion007 requested a review from malfet September 13, 2024 14:54
@pytorch-bot

pytorch-bot bot commented Sep 13, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/136000

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit 3912a7c with merge base b7eb725:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@fegin
Contributor

fegin commented Sep 13, 2024

@Skylion007 Thanks for the patch. The original diff landed on 9/9, so I thought it should be in 2.5 :(. cc @malfet

@kit1980 kit1980 merged commit 1db2a65 into pytorch:release/2.5 Sep 20, 2024

Labels

oncall: distributed, open source, release notes: distributed (checkpoint)


4 participants