Skip to content

[DCP] Fixes the stateless optimizer issue of distributed state_dict#135535

Closed
fegin wants to merge 5 commits intogh/fegin/288/basefrom
gh/fegin/288/head
Closed

[DCP] Fixes the stateless optimizer issue of distributed state_dict#135535
fegin wants to merge 5 commits intogh/fegin/288/basefrom
gh/fegin/288/head

Conversation

@fegin
Copy link
Contributor

@fegin fegin commented Sep 9, 2024

Stack from ghstack (oldest at bottom):

Some optimizers don't have states that can cause get_state_dict/set_state_dict behave incorrectly. This PR fixes the issues.

fixes: #133415

cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o @LucasLLC @MeetVadakkanchery @mhorowitz @pradeepfn

[ghstack-poisoned]
@pytorch-bot
Copy link

pytorch-bot bot commented Sep 9, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/135535

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 5 Unrelated Failures

As of commit 87dc47f with merge base 9b76449 (image):

NEW FAILURE - The following job has failed:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added module: distributed_checkpoint oncall: distributed Add this issue/PR to distributed oncall triage queue labels Sep 9, 2024
@fegin fegin requested a review from wz337 September 9, 2024 21:13
@fegin fegin added the ciflow/trunk Trigger trunk jobs on your pull request label Sep 9, 2024
[ghstack-poisoned]
[ghstack-poisoned]
[ghstack-poisoned]
Copy link
Contributor

@wz337 wz337 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. stamp to unblock

{
"use_composable": [True, False],
"optimizer_class": [torch.optim.Adam, torch.optim.AdamW],
"optimizer_class": [torch.optim.SGD],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to add torch.optim.SGD to the original list instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch

[ghstack-poisoned]
fegin added a commit that referenced this pull request Sep 9, 2024
Some optimizers don't have states that can cause get_state_dict/set_state_dict behave incorrectly. This PR fixes the issues.

fixes: #133415

ghstack-source-id: 7dd1c09
Pull Request resolved: #135535
@fegin
Copy link
Contributor Author

fegin commented Sep 10, 2024

@pytorchbot mege

@pytorch-bot
Copy link

pytorch-bot bot commented Sep 10, 2024

❌ 🤖 pytorchbot command failed:

@pytorchbot: error: argument command: invalid choice: 'mege' (choose from 'merge', 'revert', 'rebase', 'label', 'drci', 'cherry-pick', 'close')

usage: @pytorchbot [-h] {merge,revert,rebase,label,drci,cherry-pick,close} ...

Try @pytorchbot --help for more info.

@fegin
Copy link
Contributor Author

fegin commented Sep 10, 2024

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Copy link
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / linux-focal-cuda12.4-py3.10-gcc9-sm86 / test (default, 5, 5, linux.g5.4xlarge.nvidia.gpu)

Details for Dev Infra team Raised by workflow job

@fegin
Copy link
Contributor Author

fegin commented Sep 10, 2024

@pytorchbot merge -f "The failing test is not related."

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

Skylion007 pushed a commit to Skylion007/pytorch that referenced this pull request Sep 13, 2024
…ytorch#135535)

Some optimizers don't have states that can cause get_state_dict/set_state_dict behave incorrectly. This PR fixes the issues.

fixes: pytorch#133415

Pull Request resolved: pytorch#135535
Approved by: https://github.com/wz337
Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Sep 20, 2024
…ytorch#135535)

Some optimizers don't have states that can cause get_state_dict/set_state_dict behave incorrectly. This PR fixes the issues.

fixes: pytorch#133415

Pull Request resolved: pytorch#135535
Approved by: https://github.com/wz337
kit1980 pushed a commit that referenced this pull request Sep 20, 2024
#136000)

[DCP] Fixes the stateless optimizer issue of distributed state_dict (#135535)

Some optimizers don't have states that can cause get_state_dict/set_state_dict behave incorrectly. This PR fixes the issues.

fixes: #133415

Pull Request resolved: #135535
Approved by: https://github.com/wz337

Co-authored-by: Chien-Chin Huang <chienchin@fb.com>
@github-actions github-actions bot deleted the gh/fegin/288/head branch October 12, 2024 02:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/trunk Trigger trunk jobs on your pull request Merged oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (checkpoint)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants