
[Cherry-pick][DSD] Fix distributed state dict full_state_dict option hang during set_state_dict (#135725) and Fix loading uneven full tensor into sharded state dict (#136365) #136903

Merged
kit1980 merged 2 commits into pytorch:release/2.5 from wz337:release/2.5
Sep 30, 2024

Conversation


@wz337 wz337 commented Sep 27, 2024

Fix distributed state dict full_state_dict option hang during set_state_dict (pytorch#135725)

Fix pytorch#134095
This fixes the distributed state dict `full_state_dict` option hanging during `set_state_dict`. We switch `_distribute_tensors` in `_state_dict_utils.py` to use `DTensor.from_local` instead of `distribute_tensor` to support the FSDP2+TP 2D strided-sharding use case, as `distribute_tensor` cannot handle strided sharding yet. `distribute_tensor` incurs a scatter behind the scenes, while `DTensor.from_local` takes the local slice of the full tensor on each rank to create the DTensor (no collective). This means it is the user's responsibility to ensure that the full tensor from the `full_state_dict` is the same across all ranks.
Pull Request resolved: pytorch#135725
Approved by: https://github.com/fegin

(cherry picked from commit 0cdc6a8)
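The collective-free behavior described above can be sketched in plain Python: because every rank already holds an identical full tensor, each rank can slice out its own shard locally with no scatter. This is an illustrative sketch assuming even 1-D sharding; `local_shard_1d` is a hypothetical helper, not a PyTorch API.

```python
# Hypothetical sketch (plain Python, no torch) of what `DTensor.from_local`
# relies on: each rank slices its own shard out of a full tensor it
# already holds, so no communication happens. This is only correct if
# the full tensor is identical on every rank, as the PR notes.

def local_shard_1d(full_tensor, rank, world_size):
    """Return this rank's shard of `full_tensor` (a list),
    assuming the dimension divides evenly across ranks."""
    shard_size = len(full_tensor) // world_size
    start = rank * shard_size
    return full_tensor[start:start + shard_size]

full = [0, 1, 2, 3, 4, 5, 6, 7]  # assumed identical on every rank
shards = [local_shard_1d(full, r, world_size=4) for r in range(4)]
print(shards)  # [[0, 1], [2, 3], [4, 5], [6, 7]]
```

By contrast, a `distribute_tensor`-style scatter would require one rank to hold the full tensor and send each shard over the wire, which is where the 2D strided-sharding case previously hung.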
Fix loading uneven full tensor into sharded state dict (pytorch#136365)

Fix pytorch#136228.

This is a follow up on pytorch#135725. We need to pass shape and stride from the original dtensor, since for uneven case, `from_local` would calculate shape and stride assuming the tensor is evenly-sharded based on the local tensor.

Pull Request resolved: pytorch#136365
Approved by: https://github.com/fegin

(cherry picked from commit 637d5c4)
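The uneven-sharding pitfall this follow-up fixes can be illustrated with a small sketch. With 10 elements over 4 ranks, shards are uneven, so inferring the global size from rank 0's local shard under the even-sharding assumption gives the wrong answer. This is plain Python; `uneven_shard_sizes` is a hypothetical helper mimicking chunked sharding, not the DTensor API.

```python
# Illustrative sketch of why `from_local` must be given the original
# shape and stride for unevenly sharded tensors.

def uneven_shard_sizes(dim_size, world_size):
    """Chunk-style shard sizes: ceil-sized shards first, so trailing
    ranks may get smaller (possibly empty) shards."""
    chunk = -(-dim_size // world_size)  # ceiling division
    sizes = []
    remaining = dim_size
    for _ in range(world_size):
        sizes.append(max(min(chunk, remaining), 0))
        remaining -= chunk
    return sizes

print(uneven_shard_sizes(10, 4))  # [3, 3, 3, 1] -- uneven shards

# If rank 0 hands its size-3 shard to `from_local` and the global shape
# is inferred under the even-sharding assumption, the reconstructed
# dimension would be 3 * 4 = 12 instead of the true 10.
inferred = uneven_shard_sizes(10, 4)[0] * 4
print(inferred)  # 12, not 10 -- hence pass the original shape/stride
```

Passing the original DTensor's shape and stride explicitly sidesteps this inference entirely, which is what the fix does.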

pytorch-bot bot commented Sep 27, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/136903

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 1 Unrelated Failure

As of commit 08ae534 with merge base b7eb725:

NEW FAILURE - The following job has failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added module: distributed_checkpoint oncall: distributed Add this issue/PR to distributed oncall triage queue labels Sep 27, 2024
@wz337 wz337 added this to the 2.5.0 milestone Sep 27, 2024
@wz337 wz337 marked this pull request as ready for review September 27, 2024 22:28
@kit1980 kit1980 merged commit 70298e9 into pytorch:release/2.5 Sep 30, 2024