Skip to content

Support HSDP + Monolith Checkpointing#128446

Closed
mvpatel2000 wants to merge 1 commit intopytorch:mainfrom
mvpatel2000:patch-6
Closed

Support HSDP + Monolith Checkpointing#128446
mvpatel2000 wants to merge 1 commit intopytorch:mainfrom
mvpatel2000:patch-6

Conversation

@mvpatel2000
Copy link
Contributor

@mvpatel2000 mvpatel2000 commented Jun 11, 2024

@pytorch-bot
Copy link

pytorch-bot bot commented Jun 11, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/128446

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There are 1 currently active SEVs. If your PR is affected, please view them below:

❌ 1 New Failure, 1 Unrelated Failure

As of commit 7422e78 with merge base d0c0892 (image):

NEW FAILURE - The following job has failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (fsdp) release notes category labels Jun 11, 2024
@colesbury colesbury requested a review from fegin June 12, 2024 01:36
@colesbury colesbury added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Jun 12, 2024
@Skylion007
Copy link
Collaborator

@pytorchbot rebase

@pytorchmergebot
Copy link
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Copy link
Collaborator

Tried to rebase and push PR #128446, but it was already up to date. Try rebasing against main by issuing:
@pytorchbot rebase -b main

@Skylion007
Copy link
Collaborator

@pytorchbot rebase

@pytorchmergebot
Copy link
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Copy link
Collaborator

Successfully rebased patch-6 onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout patch-6 && git pull --rebase)

@Skylion007 Skylion007 changed the title Support HSDP + Monolith Checkpoiting Support HSDP + Monolith Checkpointing Jun 13, 2024
@fegin fegin added ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR ciflow/trunk Trigger trunk jobs on your pull request labels Jun 14, 2024
) -> Dict[str, Any]:
objects: List[Any] = [None]
if fsdp_state.rank == 0:
if dist.get_rank(group) == 0:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks legit since we use dist.broadcast_object_list(objects, src=0, group=group) in the next few lines. If fsdp_state.rank mismatches dist.broadcast_object_list(objects, src=0, group=group), the broadcast won't work.

@fegin
Copy link
Contributor

fegin commented Jun 14, 2024

This PR looks legit, it seems that the HSDP1 does not support broadcast optimizer state_dict correctly. cc., @awgu I'll review it again after all the tests pass.

@mvpatel2000
Copy link
Contributor Author

@fegin is the test failure related to regressions in pytorch main or this PR? They look unrelated so not sure what action item is here

fegin added a commit that referenced this pull request Jun 17, 2024
…cast_from_rank0"


HSDP1 + broadcast_from_rank0 actually behaves differently from FSDP1 + broadcast_from_rank0. So we need an unittest to cover this use case.

This test relies on the fix from #128446.

Differential Revision: [D58621436](https://our.internmc.facebook.com/intern/diff/D58621436/)

cc mrshenli pritamdamania87 zhaojuanmao satgera gqchen aazzolini osalpekar jiayisuse H-Huang kwen2501 awgu penguinwu XilunWu wanchaol fduwjj wz337 tianyu-l wconstab yf225 chauhang d4l3k LucasLLC MeetVadakkanchery mhorowitz

[ghstack-poisoned]
fegin added a commit that referenced this pull request Jun 17, 2024
HSDP1 + broadcast_from_rank0 actually behaves differently from FSDP1 + broadcast_from_rank0. So we need an unittest to cover this use case.

This test relies on the fix from #128446.

Differential Revision: [D58621436](https://our.internmc.facebook.com/intern/diff/D58621436/)

cc mrshenli pritamdamania87 zhaojuanmao satgera gqchen aazzolini osalpekar jiayisuse H-Huang kwen2501 awgu penguinwu XilunWu wanchaol fduwjj wz337 tianyu-l wconstab yf225 chauhang d4l3k LucasLLC MeetVadakkanchery mhorowitz

[ghstack-poisoned]
Copy link
Contributor

@fegin fegin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#128755 is the unittest to verify this PR.

@fegin
Copy link
Contributor

fegin commented Jun 17, 2024

@pytorchbot merge -f "The test failure is unrelated."

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

fegin added a commit that referenced this pull request Jun 17, 2024
…cast_from_rank0"


HSDP1 + broadcast_from_rank0 actually behaves differently from FSDP1 + broadcast_from_rank0. So we need an unittest to cover this use case.

This test relies on the fix from #128446.

Differential Revision: [D58621436](https://our.internmc.facebook.com/intern/diff/D58621436/)

cc mrshenli pritamdamania87 zhaojuanmao satgera gqchen aazzolini osalpekar jiayisuse H-Huang kwen2501 awgu penguinwu XilunWu wanchaol fduwjj wz337 tianyu-l wconstab yf225 chauhang d4l3k LucasLLC MeetVadakkanchery mhorowitz

[ghstack-poisoned]
fegin added a commit that referenced this pull request Jun 17, 2024
HSDP1 + broadcast_from_rank0 actually behaves differently from FSDP1 + broadcast_from_rank0. So we need an unittest to cover this use case.

This test relies on the fix from #128446.

Differential Revision: [D58621436](https://our.internmc.facebook.com/intern/diff/D58621436/)

cc mrshenli pritamdamania87 zhaojuanmao satgera gqchen aazzolini osalpekar jiayisuse H-Huang kwen2501 awgu penguinwu XilunWu wanchaol fduwjj wz337 tianyu-l wconstab yf225 chauhang d4l3k LucasLLC MeetVadakkanchery mhorowitz

[ghstack-poisoned]
PaliC pushed a commit that referenced this pull request Jun 17, 2024
Fixes #128444. Rank 0 check should be in the same group as the broadcast

Pull Request resolved: #128446
Approved by: https://github.com/fegin
pytorchmergebot pushed a commit that referenced this pull request Jun 18, 2024
HSDP1 + broadcast_from_rank0 actually behaves differently from FSDP1 + broadcast_from_rank0. So we need an unittest to cover this use case.

This test relies on the fix from #128446.

Differential Revision: [D58621436](https://our.internmc.facebook.com/intern/diff/D58621436/)

Pull Request resolved: #128755
Approved by: https://github.com/Skylion007, https://github.com/wz337
ghstack dependencies: #128685
fegin pushed a commit that referenced this pull request Jun 21, 2024
Fixes #128444. Rank 0 check should be in the same group as the broadcast

Pull Request resolved: #128446
Approved by: https://github.com/fegin

(cherry picked from commit 153362f)
fegin added a commit that referenced this pull request Jun 21, 2024
HSDP1 + broadcast_from_rank0 actually behaves differently from FSDP1 + broadcast_from_rank0. So we need an unittest to cover this use case.

This test relies on the fix from #128446.

Differential Revision: [D58621436](https://our.internmc.facebook.com/intern/diff/D58621436/)

Pull Request resolved: #128755
Approved by: https://github.com/Skylion007, https://github.com/wz337
ghstack dependencies: #128685

(cherry picked from commit fe8558b)
atalman pushed a commit that referenced this pull request Jun 26, 2024
…129255)

HSDP1 + broadcast_from_rank0 actually behaves differently from FSDP1 + broadcast_from_rank0. So we need an unittest to cover this use case.

This test relies on the fix from #128446.

Differential Revision: [D58621436](https://our.internmc.facebook.com/intern/diff/D58621436/)

Pull Request resolved: #128755
Approved by: https://github.com/Skylion007, https://github.com/wz337
ghstack dependencies: #128685

(cherry picked from commit fe8558b)
atalman pushed a commit that referenced this pull request Jun 26, 2024
Fixes #128444. Rank 0 check should be in the same group as the broadcast

Pull Request resolved: #128446
Approved by: https://github.com/fegin

(cherry picked from commit 153362f)

Co-authored-by: Mihir Patel <mihir.v.patel7@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/inductor ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR ciflow/trunk Trigger trunk jobs on your pull request Merged oncall: distributed Add this issue/PR to distributed oncall triage queue open source release notes: distributed (checkpoint) triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module

Projects

None yet

Development

Successfully merging this pull request may close these issues.

HSDP + set_optimizer_state_dict errors with monolithic checkpointing

7 participants