[DCP][DSD] Add a test case to demonstrate the workaround to load full state dict into a 2D model by wz337 · Pull Request #135763 · pytorch/pytorch

wz337 · 2024-09-11T22:01:37Z

Stack from ghstack (oldest at bottom):

-> [DCP][DSD] Add a test case to demonstrate the workaround to load full state dict into a 2D model #135763
[DSD] Fix distributed state dict full_state_dict option hang during set_state_dict #135725

cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wconstab @d4l3k @c-p-i-o

This is a workaround for loading a full state dict into an FSDP1+TP 2D model.
Since named_parameters() in FSDP1 does not return DTensors, there is not enough information to shard the full_state_dict and load it directly into the 2D model. To load a full state dict into an FSDP1+TP 2D model, we need to:

1. Load the full state dict into a 1D FSDP model.
2. dcp.save the full/sharded state dict to storage.
3. Initialize a 2D FSDP1+TP model.
4. Get the default sharded state dict for the 2D model (full_state_dict=False).
5. dcp.load the state dict from storage.
6. Load the state dict into the 2D model.
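The steps above can be sketched roughly as follows. This is a minimal illustration, not the test added in this PR: the function name and its parameters (`full_sd`, `make_model`, `tp_plan`, `ckpt_dir`, `dp_size`, `tp_size`) are hypothetical, and it assumes the process group is already initialized (e.g. via torchrun), each rank has its CUDA device set, every rank holds the full state dict, and `dp_size * tp_size` equals the world size.

```python
def load_full_state_dict_via_dcp(full_sd, make_model, tp_plan, ckpt_dir, dp_size, tp_size):
    """Round-trip a full state dict through DCP storage into an FSDP1+TP model.

    Hypothetical helper: `make_model` builds an unwrapped nn.Module and
    `tp_plan` is a parallelize_module plan for it (both caller-supplied).
    """
    # Imported lazily: the function only makes sense inside a multi-GPU
    # distributed job, so importing this module should not require one.
    import torch.distributed.checkpoint as dcp
    from torch.distributed.checkpoint.state_dict import (
        StateDictOptions,
        get_model_state_dict,
        set_model_state_dict,
    )
    from torch.distributed.device_mesh import init_device_mesh
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
    from torch.distributed.tensor.parallel import parallelize_module

    # Step 1: load the full state dict into a 1D FSDP model.
    mesh_1d = init_device_mesh("cuda", (dp_size * tp_size,))
    model_1d = FSDP(make_model().cuda(), device_mesh=mesh_1d, use_orig_params=True)
    set_model_state_dict(
        model_1d, full_sd, options=StateDictOptions(full_state_dict=True)
    )

    # Step 2: save the (sharded) state dict of the 1D model to storage.
    dcp.save({"model": get_model_state_dict(model_1d)}, checkpoint_id=ckpt_dir)

    # Step 3: initialize the 2D model: TP on the "tp" dim, FSDP1 on the "dp" dim.
    mesh_2d = init_device_mesh(
        "cuda", (dp_size, tp_size), mesh_dim_names=("dp", "tp")
    )
    tp_model = parallelize_module(make_model().cuda(), mesh_2d["tp"], tp_plan)
    model_2d = FSDP(tp_model, device_mesh=mesh_2d["dp"], use_orig_params=True)

    # Step 4: get the default sharded state dict (full_state_dict=False).
    to_load = {"model": get_model_state_dict(model_2d)}

    # Step 5: dcp.load fills the 2D shards in place from storage.
    dcp.load(to_load, checkpoint_id=ckpt_dir)

    # Step 6: load the now-populated sharded state dict into the 2D model.
    set_model_state_dict(model_2d, to_load["model"])
    return model_2d
```

The intermediate checkpoint is what carries the sharding information across: DCP reshards at load time against the 2D model's own sharded state dict, so the full state dict never has to be sharded by hand.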

pytorch-bot · 2024-09-11T22:01:41Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/135763

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

❌ 4 New Failures, 4 Unrelated Failures

As of commit bd462ec with merge base 011cae9:

NEW FAILURES - The following jobs have failed:

periodic / ios-build-test / build (default, 1, 1, macos-14-xlarge, SIMULATOR, arm64, 1, 0, 1) (gh)
periodic / linux-focal-cuda12.1-py3.10-gcc9 / test (nogpu_NO_AVX2, 1, 1, linux.2xlarge) (gh)
'Test'
periodic / linux-focal-cuda12.1-py3.10-gcc9-experimental-split-build / test (nogpu_NO_AVX2, 1, 1, linux.2xlarge) (gh)
'Test'
trunk / win-vs2019-cpu-py3 / test (default, 1, 3, windows.4xlarge.nonephemeral) (gh)
'Test'

FLAKY - The following job failed but was likely due to flakiness present on trunk:

pull / linux-focal-cuda12.1-py3.10-gcc9-sm86 / test (default, 2, 5, linux.g5.4xlarge.nvidia.gpu) (gh) (similar failure)
'test/profiler/test_cpp_thread.py::CppThreadTest::test_with_enable_profiler_in_child_thread'

BROKEN TRUNK - The following jobs failed but were already failing on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

Lint / lintrunner-noclang / linux-job (gh) (trunk failure)
>>> Lint for torch/_inductor/compile_fx.py:
pull / linux-docs / build-docs-python-false (gh) (trunk failure)

UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable:

periodic / linux-focal-rocm6.1-py3.8 / test (distributed, 1, 3, linux.rocm.gpu, unstable) (gh) (#129209)
distributed/_composable/test_replicate_with_compiler.py::DDP_TP_Test::test_ddp_tp

This comment was automatically generated by Dr. CI and updates every 15 minutes.

ghstack-source-id: bd1d283 Pull Request resolved: #135763

fegin

Thanks for adding the workaround test!

wz337 · 2024-09-12T19:57:37Z

@pytorchmergebot merge -i

pytorchmergebot · 2024-09-12T19:59:41Z

Merge started

Your change will be merged while ignoring the following 5 checks: Lint / lintrunner-noclang / linux-job, pull / linux-docs / build-docs-python-false, periodic / ios-build-test / build (default, 1, 1, macos-14-xlarge, SIMULATOR, arm64, 1, 0, 1), periodic / linux-focal-rocm6.1-py3.8 / test (distributed, 1, 3, linux.rocm.gpu, unstable), periodic / linux-focal-rocm6.1-py3.8 / test (distributed, 2, 3, linux.rocm.gpu, unstable)

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

pytorchmergebot · 2024-09-12T20:05:13Z

Merge failed

Reason: 1 job has failed: trunk / macos-py3-arm64 / build

Details for Dev Infra team

Raised by workflow job

ghstack-source-id: fa14330 Pull Request resolved: #135763

wz337 · 2024-09-13T03:37:53Z

@pytorchmergebot merge -i

pytorchmergebot · 2024-09-13T03:39:41Z

Merge started

Your change will be merged while ignoring the following 8 checks: pull / linux-docs / build-docs-python-false, pull / linux-focal-cuda12.1-py3.10-gcc9-sm86 / test (default, 2, 5, linux.g5.4xlarge.nvidia.gpu), Lint / lintrunner-noclang / linux-job, periodic / ios-build-test / build (default, 1, 1, macos-14-xlarge, SIMULATOR, arm64, 1, 0, 1), periodic / linux-focal-cuda12.1-py3.10-gcc9 / test (nogpu_NO_AVX2, 1, 1, linux.2xlarge), periodic / linux-focal-rocm6.1-py3.8 / test (distributed, 1, 3, linux.rocm.gpu, unstable), periodic / linux-focal-cuda12.1-py3.10-gcc9-experimental-split-build / test (nogpu_NO_AVX2, 1, 1, linux.2xlarge), trunk / win-vs2019-cpu-py3 / test (default, 1, 3, windows.4xlarge.nonephemeral)

… state dict into a 2D model (pytorch#135763)

Fix pytorch#134095

This is a workaround for loading full state dict into a FSDP1+TP 2D model. Since named_parameters() in FSDP1 does not return DTensor, we don't have the information to shard the full_state_dict and load it directly into the 2d model. In order to load a full state dict in FSDP1+TP 2D model, we need to do:

- load the full state dict into a 1D FSDP model
- dcp.save the full/shard state dict into storage
- initialize a 2D FSDP1+TP model
- get the default sharded state dict for the 2D model (full_state_dict=False)
- dcp.load the state dict from storage
- load the state dict into the 2D model

Pull Request resolved: pytorch#135763
Approved by: https://github.com/fegin
ghstack dependencies: pytorch#135725

Update

18bfc47

[ghstack-poisoned]

wz337 mentioned this pull request Sep 11, 2024

[DSD] Fix distributed state dict full_state_dict option hang during set_state_dict #135725

Closed

pytorch-bot bot added the oncall: distributed and topic: not user facing labels Sep 11, 2024

wz337 changed the title ~~demonstrate how to load 2d~~ [DCP][DSD] Add a test case to demonstrate the workaround to load full state dict into a 2D model Sep 11, 2024

wz337 requested a review from fegin September 11, 2024 22:05

wz337 mentioned this pull request Sep 12, 2024

[DCP] DCP hangs when full_state_dict=True for FSDP+TP parallelism #134095

Closed

Update

64abaf9

[ghstack-poisoned]

Update

c40e0ee

[ghstack-poisoned]

wz337 added a commit that referenced this pull request Sep 12, 2024

demonstrate how to load 2d

bc0a32b

ghstack-source-id: bd1d283 Pull Request resolved: #135763

wz337 added the ciflow/periodic label Sep 12, 2024

fegin approved these changes Sep 12, 2024

pytorch-bot bot added the ciflow/trunk label Sep 12, 2024

pytorchmergebot added the merging label Sep 12, 2024

pytorchmergebot removed the merging label Sep 12, 2024

Update

bd462ec

[ghstack-poisoned]

wz337 added a commit that referenced this pull request Sep 12, 2024

demonstrate how to load 2d

5ffc736

ghstack-source-id: fa14330 Pull Request resolved: #135763

pytorchmergebot added the merging label Sep 13, 2024

pytorchmergebot added the Merged label Sep 13, 2024

pytorchmergebot closed this in eea5e6f Sep 13, 2024

pytorchmergebot removed the merging label Sep 13, 2024

pytorchmergebot pushed a commit that referenced this pull request Sep 17, 2024

[DSD][EZ] Minor update in _state_dict_utils.py (#136165)

408fe41

Pull Request resolved: #136165 Approved by: https://github.com/kwen2501 ghstack dependencies: #135725, #135763

github-actions bot deleted the gh/wz337/29/head branch October 14, 2024 06:24

[DCP][DSD] Add a test case to demonstrate the workaround to load full state dict into a 2D model #135763
wz337 wants to merge 4 commits into gh/wz337/29/base from gh/wz337/29/head

3 participants
