
[DSD] Correctly handle shared parameters for optimizer state_dict (#1…#129252

Merged
atalman merged 1 commit into release/2.4 from chienchin/cherry-pick-pr-128685
Jun 26, 2024

fegin (Contributor) commented Jun 21, 2024

[DSD] Correctly handle shared parameters for optimizer state_dict (#128685)

Fixes #128011

See the discussion in #128076

The current implementation of `set_optimizer_state_dict()` assumes that every FQN returned by `_get_fqns()` exists in the optimizer state_dict. That assumption does not hold when the model has shared parameters: in that case, only one FQN of each shared parameter appears in the optimizer state_dict. This PR addresses the issue.
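The failure mode described above can be sketched with a toy example. This is only an illustration under stated assumptions, not the code changed by this PR: plain dicts stand in for the model and the optimizer state_dict, and the helpers `fqns_by_param`, `strict_lookup`, and `tolerant_lookup` are hypothetical names, not PyTorch APIs.

```python
# Toy illustration only: plain Python objects stand in for parameters and the
# optimizer state_dict. None of these helpers exist in PyTorch.


def fqns_by_param(named_params):
    """Group fully qualified names (FQNs) by the parameter object they refer to."""
    groups = {}
    for fqn, param in named_params:
        groups.setdefault(id(param), []).append(fqn)
    return list(groups.values())


def strict_lookup(optim_state, fqn_groups):
    """Assume every FQN has an entry; raises KeyError for shared parameters."""
    return {fqn: optim_state[fqn] for group in fqn_groups for fqn in group}


def tolerant_lookup(optim_state, fqn_groups):
    """Accept a group as long as at least one of its FQNs has an entry."""
    found = {}
    for group in fqn_groups:
        present = [fqn for fqn in group if fqn in optim_state]
        if not present:
            raise KeyError(f"no optimizer state for any of {group}")
        found[present[0]] = optim_state[present[0]]
    return found


shared = object()  # one parameter object reachable under two names
named_params = [
    ("encoder.weight", shared),
    ("decoder.weight", shared),
    ("bias", object()),
]
# Only one FQN of the shared parameter appears in the optimizer state.
optim_state = {"encoder.weight": {"step": 3}, "bias": {"step": 3}}

groups = fqns_by_param(named_params)
try:
    strict_lookup(optim_state, groups)
    strict_failed = False
except KeyError:
    strict_failed = True

resolved = tolerant_lookup(optim_state, groups)
print(strict_failed, sorted(resolved))
```

In this sketch the strict lookup fails on `decoder.weight` because the state is stored only under `encoder.weight`, while the tolerant lookup resolves each group through whichever FQN is present.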

Differential Revision: D58573487

Pull Request resolved: #128685
Approved by: https://github.com/LucasLLC

(cherry picked from commit 1a52791)


cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k @LucasLLC @MeetVadakkanchery @mhorowitz

pytorch-bot bot commented Jun 21, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/129252

Note: Links to docs will display an error until the docs builds have been completed.

❌ 4 New Failures

As of commit 743931d with merge base b66e3f0:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot bot added the module: distributed_checkpoint and oncall: distributed (Add this issue/PR to distributed oncall triage queue) labels on Jun 21, 2024

Labels

ciflow/trunk (Trigger trunk jobs on your pull request)
oncall: distributed (Add this issue/PR to distributed oncall triage queue)
