[3/N][dtensor] Strided Sharding offset calculation util by XilunWu · Pull Request #132391 · pytorch/pytorch

XilunWu · 2024-08-01T07:55:36Z

Stack from ghstack (oldest at bottom):

Summary

1. Change `compute_local_shape_and_global_offset` to correctly compute the local shape and global offset for strided sharding placements (currently it handles only 2D and some 3D+ sharding).
2. Add a new property, `num_shards_map`, to `DTensorSpec` that records how many shards each tensor dimension has. This is needed to construct the `_StridedShard` placement when we call `distribute_tensor(dtensor_tp, dp_device_mesh, [Shard(0)])`: the `split_factor` argument is simply the number of shards on the tensor dimension being sharded.
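
To illustrate the second point, here is a minimal pure-Python sketch of the idea behind `num_shards_map`. It is not the code in this PR: `num_shards_per_dim` and its `shard_dims` argument are hypothetical names, and placements are modeled as plain integers (the sharded tensor dim) or `None` (replicated) rather than real `Placement` objects.

```python
from typing import List, Optional, Sequence


def num_shards_per_dim(
    ndim: int,
    mesh_shape: Sequence[int],
    shard_dims: Sequence[Optional[int]],
) -> List[int]:
    """Hypothetical stand-in for the idea behind ``num_shards_map``.

    ``shard_dims[i]`` is the tensor dim sharded by mesh dim ``i``,
    or ``None`` if the tensor is replicated on that mesh dim.
    """
    counts = [1] * ndim
    for mesh_dim_size, tensor_dim in zip(mesh_shape, shard_dims):
        if tensor_dim is not None:
            # Each mesh dim that shards this tensor dim multiplies its shard count.
            counts[tensor_dim] *= mesh_dim_size
    return counts


# A 2D tensor already sharded on dim 0 over a TP mesh of size 4.
tp_only = num_shards_per_dim(ndim=2, mesh_shape=(4,), shard_dims=(0,))
print(tp_only)  # [4, 1]

# The same tensor on a (dp=2, tp=4) mesh, sharded on dim 0 by both mesh dims.
dp_and_tp = num_shards_per_dim(ndim=2, mesh_shape=(2, 4), shard_dims=(0, 0))
print(dp_and_tp)  # [8, 1]
```

Under this model, sharding the TP-only tensor again on dim 0 would use a split factor of `tp_only[0]`, that is 4, which matches the description above of `split_factor` as the existing shard count on that dim.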

Test
test/distributed/_tensor/test_utils.py

cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

[ghstack-poisoned]

pytorch-bot · 2024-08-01T07:55:39Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/132391


✅ You can merge normally! (1 Unrelated Failure)

As of commit 7d064d6 with merge base da32021:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

trunk / win-vs2019-cpu-py3 / test (default, 1, 3, windows.4xlarge.nonephemeral) (gh) (similar failure)
'Test'

This comment was automatically generated by Dr. CI and updates every 15 minutes.


XilunWu · 2024-08-01T19:08:25Z

@XilunWu has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

torch/distributed/_tensor/_utils.py

test/distributed/_tensor/test_utils.py

cc H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o Differential Revision: [D60606115](https://our.internmc.facebook.com/intern/diff/D60606115) [ghstack-poisoned]


XilunWu · 2024-08-06T10:53:26Z

@XilunWu has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.


wanchaol

lgtm, have one more comment, please address before landing

torch/distributed/_tensor/_utils.py


**Test**

- `pytest test/distributed/_composable/fsdp/test_fully_shard_training.py`
- `pytest test/distributed/_composable/fsdp/test_fully_shard_state_dict.py`
- `pytest test/distributed/checkpoint/fsdp/test_fsdp_dsd.py`
- `pytest test/distributed/_composable/fsdp/test_fully_shard_init.py`

Pull Request resolved: #131408
Approved by: https://github.com/fegin
ghstack dependencies: #126697, #130239, #132391

…rrect full_tensor() result (#130760)

Fixes issues #129229 and #129206.

**Summary**

1. Have `FSDP` choose the `_StridedShard` placement for FSDP+TP sharding.
2. Add a parity test to FSDP to ensure that FSDP+TP sharding (i.e. strided) and plain TP sharding (i.e. non-strided) produce the same `full_tensor()` result.
3. Re-enable the tests that were disabled in #129519.

**Test**

- `pytest test/distributed/_composable/fsdp/`
- `pytest test/distributed/_composable/test_composability/test_2d_composability.py`
- `pytest test/distributed/checkpoint/fsdp/test_fsdp_dsd.py`

Differential Revision: [D60606114](https://our.internmc.facebook.com/intern/diff/D60606114)
Pull Request resolved: #130760
Approved by: https://github.com/wanchaol, https://github.com/fegin, https://github.com/wz337
ghstack dependencies: #126697, #130239, #132391, #131408
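
The sketch below shows why FSDP+TP sharding of the same tensor dim needs a strided description. It is a toy model in plain Python, not PyTorch code. It assumes that TP shards dim 0 first and FSDP then shards each local TP shard, while a plain `[Shard(0), Shard(0)]` placement on a `(dp, tp)` mesh is read left to right. All function names are hypothetical, and rows are assumed to divide evenly.

```python
from typing import Dict, List, Tuple

Layout = Dict[Tuple[int, int], List[int]]  # (dp_rank, tp_rank) -> global row indices


def split_even(rows: List[int], n: int) -> List[List[int]]:
    """Split ``rows`` into ``n`` contiguous chunks (assumes even divisibility)."""
    size = len(rows) // n
    return [rows[i * size:(i + 1) * size] for i in range(n)]


def shard_tp_then_dp(num_rows: int, dp: int, tp: int) -> Layout:
    """Rows each rank actually holds when TP shards first and FSDP second."""
    layout: Layout = {}
    for tp_rank, tp_chunk in enumerate(split_even(list(range(num_rows)), tp)):
        for dp_rank, chunk in enumerate(split_even(tp_chunk, dp)):
            layout[(dp_rank, tp_rank)] = chunk
    return layout


def shard_left_to_right(num_rows: int, dp: int, tp: int) -> Layout:
    """Rows implied by reading the placements left to right on a (dp, tp) mesh."""
    layout: Layout = {}
    for dp_rank, dp_chunk in enumerate(split_even(list(range(num_rows)), dp)):
        for tp_rank, chunk in enumerate(split_even(dp_chunk, tp)):
            layout[(dp_rank, tp_rank)] = chunk
    return layout


actual = shard_tp_then_dp(num_rows=8, dp=2, tp=2)
implied = shard_left_to_right(num_rows=8, dp=2, tp=2)
print(actual[(1, 0)])   # [2, 3]
print(implied[(1, 0)])  # [4, 5]
```

In this model, rank `(dp=1, tp=0)` holds rows 2 and 3, but the left-to-right reading places rows 4 and 5 there, so reassembling the full tensor under that reading would permute the rows. A strided placement records the extra split so that the global offsets can be computed from the actual layout.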

[3/N][dtensor] Strided Sharding offset calculation util

aae9f5e

[ghstack-poisoned]

XilunWu mentioned this pull request Jul 25, 2024

[1/N][dtensor] introduce StridedShard placement type and _split_tensor() logic #126697

Closed

XilunWu mentioned this pull request Aug 1, 2024

[2/N][dtensor] Strided Sharding shard_to_replicate #130239

Closed

pytorch-bot bot added the ciflow/inductor and oncall: distributed labels Aug 1, 2024

This was referenced Aug 1, 2024

[FSDP][dtensor] add FSDP2+TP distributed state dict test #131408

Closed

[FSDP][dtensor] use _StridedShard to represent nested sharding for correct full_tensor() result #130760

Closed

Update on "[3/N][dtensor] Strided Sharding offset calculation util"

30e8fa5


XilunWu requested review from fegin, wanchaol and wz337 August 1, 2024 09:07

XilunWu added the topic: not user facing label Aug 1, 2024

wanchaol reviewed Aug 2, 2024

torch/distributed/_tensor/_utils.py (outdated, resolved)

test/distributed/_tensor/test_utils.py (resolved)

XilunWu added 2 commits August 5, 2024 16:09

Update on "[3/N][dtensor] Strided Sharding offset calculation util"

809b7ce


XilunWu requested a review from wanchaol August 6, 2024 10:10

wanchaol approved these changes Aug 6, 2024

torch/distributed/_tensor/_utils.py (resolved)

pytorch-bot bot added the ciflow/trunk label Aug 6, 2024

pytorchmergebot closed this in ad0ce89 Aug 7, 2024

pytorchmergebot added the Merged label Aug 7, 2024

github-actions bot deleted the gh/XilunWu/92/head branch September 8, 2024 02:07

XilunWu wants to merge 6 commits into gh/XilunWu/92/base from gh/XilunWu/92/head

3 participants
