[DSD] Fix loading uneven full tensor into sharded state dict by wz337 · Pull Request #136365 · pytorch/pytorch

wz337 · 2024-09-20T17:35:54Z

Stack from ghstack (oldest at bottom):

-> [DSD] Fix loading uneven full tensor into sharded state dict #136365

This is a follow-up to #135725. We need to pass the shape and stride of the original DTensor to `from_local`: in the uneven case, `from_local` would otherwise compute the global shape and stride from the local tensor, assuming the tensor is evenly sharded.
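To illustrate why the inferred shape goes wrong, here is a minimal pure-Python sketch of the arithmetic. It is not the PyTorch implementation. The helper names `chunk_sizes` and `inferred_global_size` are hypothetical, and the ceil-style split is an assumption modeled on how `torch.chunk` divides a dimension.

```python
import math


def chunk_sizes(dim_size: int, num_shards: int) -> list[int]:
    """Per-rank shard lengths along one dimension, using a ceil-style split
    (assumed here to mirror torch.chunk-like sharding)."""
    full = math.ceil(dim_size / num_shards)
    sizes = []
    remaining = dim_size
    for _ in range(num_shards):
        size = min(full, max(remaining, 0))
        sizes.append(size)
        remaining -= size
    return sizes


def inferred_global_size(local_size: int, num_shards: int) -> int:
    """Global length a rank would infer if it assumed every shard is equal."""
    return local_size * num_shards


world_size = 2

# Even case: a length-4 dimension gives every rank the same shard length,
# so each rank infers the correct global length.
even = chunk_sizes(4, world_size)
even_inferred = [inferred_global_size(s, world_size) for s in even]

# Uneven case: a length-5 dimension splits into unequal shards, so the
# ranks infer different global lengths and neither matches the real one.
uneven = chunk_sizes(5, world_size)
uneven_inferred = [inferred_global_size(s, world_size) for s in uneven]

print(even, even_inferred)
print(uneven, uneven_inferred)
```

Passing the known global shape and stride explicitly, as this PR does through the `shape` and `stride` keyword arguments of `DTensor.from_local`, avoids relying on that per-rank inference.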

cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wconstab @d4l3k @c-p-i-o @LucasLLC @MeetVadakkanchery @mhorowitz @pradeepfn

[ghstack-poisoned]

pytorch-bot · 2024-09-20T17:35:58Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/136365

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (8 Unrelated Failures)

As of commit e33a4b3 with merge base d3647d1:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

pull / linux-focal-py3.11-clang10 / test (dynamo, 3, 3, lf.linux.2xlarge) (gh) (disabled by #128551 but the issue was closed recently and a rebase is needed to make it pass)
test_dataloader.py::TestDataLoader::test_segfault
pull / linux-focal-py3.12-clang10 / test (dynamo, 2, 3, lf.linux.2xlarge) (gh) (disabled by #128551 but the issue was closed recently and a rebase is needed to make it pass)
test_dataloader.py::TestDataLoader::test_segfault
pull / linux-focal-py3.12-clang10-experimental-split-build / test (dynamo, 3, 3, linux.2xlarge) (gh) (disabled by #128551 but the issue was closed recently and a rebase is needed to make it pass)
test_dataloader.py::TestDataLoader::test_segfault
pull / linux-focal-py3.9-clang10 / test (dynamo, 3, 3, lf.linux.2xlarge) (gh) (disabled by #128551 but the issue was closed recently and a rebase is needed to make it pass)
test_dataloader.py::TestDataLoader::test_segfault
trunk / linux-focal-cuda12.4-py3.10-gcc9-experimental-split-build-test / test (default, 1, 5, lf.linux.4xlarge.nvidia.gpu) (gh) (disabled by #128551 but the issue was closed recently and a rebase is needed to make it pass)
test_dataloader.py::TestDataLoader::test_segfault
trunk / linux-focal-cuda12.4-py3.10-gcc9-experimental-split-build-test / test (nogpu_AVX512, 2, 2, lf.linux.2xlarge) (gh) (disabled by #128551 but the issue was closed recently and a rebase is needed to make it pass)
test_dataloader.py::TestDataLoader::test_segfault
trunk / linux-focal-cuda12.4-py3.10-gcc9-experimental-split-build-test / test (nogpu_NO_AVX2, 1, 2, lf.linux.2xlarge) (gh) (disabled by #128551 but the issue was closed recently and a rebase is needed to make it pass)
test_dataloader.py::TestDataLoader::test_segfault
trunk / linux-focal-cuda12.4-py3.10-gcc9-sm86 / test (default, 1, 5, lf.linux.g5.4xlarge.nvidia.gpu) (gh) (disabled by #128551 but the issue was closed recently and a rebase is needed to make it pass)
test_dataloader.py::TestDataLoader::test_segfault

This comment was automatically generated by Dr. CI and updates every 15 minutes.

[ghstack-poisoned]

ghstack-source-id: e6b781e Pull Request resolved: #136365 fix ghstack-source-id: e6b781e Pull Request resolved: #136366

kwen2501 · 2024-09-23T07:22:57Z

test/distributed/checkpoint/test_state_dict_utils.py


+    @with_comms
+    @skip_if_lt_x_gpu(2)
+    def test_state_dict_util_distribute_tensors(self):


nit: comment on purpose of test, expected results, etc

wz337 · 2024-09-23T12:36:15Z

@pytorchmergebot merge

pytorchmergebot · 2024-09-23T12:37:59Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

pytorchmergebot · 2024-09-23T14:03:07Z

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / linux-focal-cuda12.4-py3.10-gcc9-experimental-split-build-test / test (nogpu_NO_AVX2, 1, 2, lf.linux.2xlarge)

Details for Dev Infra team

Raised by workflow job

wz337 · 2024-09-23T16:28:18Z

@pytorchmergebot merge -i

pytorchmergebot · 2024-09-23T16:30:07Z

Merge started

Your change will be merged while ignoring the following 8 checks: pull / linux-focal-py3.12-clang10 / test (dynamo, 2, 3, lf.linux.2xlarge), pull / linux-focal-py3.12-clang10-experimental-split-build / test (dynamo, 3, 3, linux.2xlarge), pull / linux-focal-py3.11-clang10 / test (dynamo, 3, 3, lf.linux.2xlarge), pull / linux-focal-py3.9-clang10 / test (dynamo, 3, 3, lf.linux.2xlarge), trunk / linux-focal-cuda12.4-py3.10-gcc9-experimental-split-build-test / test (nogpu_AVX512, 2, 2, lf.linux.2xlarge), trunk / linux-focal-cuda12.4-py3.10-gcc9-experimental-split-build-test / test (nogpu_NO_AVX2, 1, 2, lf.linux.2xlarge), trunk / linux-focal-cuda12.4-py3.10-gcc9-experimental-split-build-test / test (default, 1, 5, lf.linux.4xlarge.nvidia.gpu), trunk / linux-focal-cuda12.4-py3.10-gcc9-sm86 / test (default, 1, 5, lf.linux.g5.4xlarge.nvidia.gpu)

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

…#136365) Fix pytorch#136228. This is a follow up on pytorch#135725. We need to pass shape and stride from the original dtensor, since for uneven case, `from_local` would calculate shape and stride assuming the tensor is evenly-sharded based on the local tensor. Pull Request resolved: pytorch#136365 Approved by: https://github.com/fegin

…#136365) Fix pytorch#136228. This is a follow up on pytorch#135725. We need to pass shape and stride from the original dtensor, since for uneven case, `from_local` would calculate shape and stride assuming the tensor is evenly-sharded based on the local tensor. Pull Request resolved: pytorch#136365 Approved by: https://github.com/fegin (cherry picked from commit 637d5c4)

…hang during set_state_dict (#135725) and Fix loading uneven full tensor into sharded state dict (#136365) (#136903)

* [DSD] Fix distributed state dict full_state_dict option hang during set_state_dict (#135725)

Fix #134095

This fixes the hang in the distributed state dict `full_state_dict` option during `set_state_dict`. We switch `_distribute_tensors` in _state_dict_utils.py to use `DTensor.from_local` instead of `distribute_tensor` to support the FSDP2+TP 2D strided sharding use case, as `distribute_tensor` cannot handle strided sharding yet. `distribute_tensor` incurs a scatter behind the scenes, while `DTensor.from_local` takes the local slice from the full tensor on each rank to create the DTensor (no collective). This means it is the user's responsibility to make sure the full_tensor from the full_state_dict is the same across all ranks.

Pull Request resolved: #135725
Approved by: https://github.com/fegin
(cherry picked from commit 0cdc6a8)

* [DSD] Fix loading uneven full tensor into sharded state dict (#136365)

Fix #136228.

This is a follow up on #135725. We need to pass shape and stride from the original dtensor, since for uneven case, `from_local` would calculate shape and stride assuming the tensor is evenly-sharded based on the local tensor.

Pull Request resolved: #136365
Approved by: https://github.com/fegin
(cherry picked from commit 637d5c4)
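The "local slice, no collective" idea in the commit message above can be sketched in plain Python. This is an illustration only, not the code in `_state_dict_utils.py`. The function `local_slice` is hypothetical, a list stands in for a 1-D tensor, and the ceil-style split is an assumption.

```python
def local_slice(full: list, rank: int, world_size: int) -> list:
    """Return the part of `full` that `rank` would keep, computed locally.

    No communication happens here: every rank is assumed to already hold
    an identical copy of `full` and simply slices out its own portion.
    """
    chunk = -(-len(full) // world_size)  # ceil division
    start = min(rank * chunk, len(full))
    end = min(start + chunk, len(full))
    return full[start:end]


full_tensor = list(range(5))  # stand-in for a full 1-D tensor of length 5
world_size = 2

shards = [local_slice(full_tensor, rank, world_size) for rank in range(world_size)]

# Concatenating the per-rank slices recovers the full tensor only if every
# rank started from the same data, which is why the commit message places
# that responsibility on the user.
reassembled = [x for shard in shards for x in shard]
print(shards, reassembled == full_tensor)
```

Because the shards have different lengths here, the global shape cannot be recovered from any single shard alone, which is the gap the follow-up fix closes by passing the shape and stride explicitly.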

Update

b70ce75

[ghstack-poisoned]

pytorch-bot bot added the oncall: distributed label Sep 20, 2024

wz337 mentioned this pull request Sep 20, 2024

fix #136366

Closed

Update

d66c464

[ghstack-poisoned]

pytorch-bot bot added the module: distributed_checkpoint label Sep 20, 2024

Update

e33a4b3

[ghstack-poisoned]

wz337 added a commit that referenced this pull request Sep 20, 2024

add shape and stride from local state

ba586ef

ghstack-source-id: e6b781e Pull Request resolved: #136365 fix ghstack-source-id: e6b781e Pull Request resolved: #136366

wz337 marked this pull request as draft September 20, 2024 18:27

wz337 added the topic: not user facing label Sep 20, 2024

wz337 changed the title ~~add shape and stride from local state~~ [DSD] add shape and stride from local state Sep 20, 2024

wz337 changed the title ~~[DSD] add shape and stride from local state~~ [DSD] Fix loading uneven full tensor into sharded state dict Sep 20, 2024

wz337 marked this pull request as ready for review September 20, 2024 19:58

wz337 requested a review from fegin September 20, 2024 19:59

wz337 added the topic: bug fixes label Sep 20, 2024

fegin approved these changes Sep 20, 2024


kwen2501 reviewed Sep 23, 2024


pytorch-bot bot added the ciflow/trunk label Sep 23, 2024

pytorchmergebot added the merging label Sep 23, 2024

pytorchmergebot removed the merging label Sep 23, 2024

pytorchmergebot added the merging label Sep 23, 2024

pytorchmergebot added the Merged label Sep 23, 2024

pytorchmergebot closed this in 637d5c4 Sep 23, 2024

pytorchmergebot removed the merging label Sep 23, 2024

int3 mentioned this pull request Sep 24, 2024

Correctly convert Python float to float64 when passing argument as Tensor #136413

Closed

wz337 mentioned this pull request Sep 27, 2024

[v.2.5.0] Release Tracker #135522

Closed

github-actions bot deleted the gh/wz337/32/head branch October 25, 2024 02:08

wz337 wants to merge 3 commits into gh/wz337/32/base from gh/wz337/32/head


4 participants
