
[3/N][dtensor] Strided Sharding offset calculation util #132391

Closed
XilunWu wants to merge 6 commits into gh/XilunWu/92/base from gh/XilunWu/92/head

Conversation


@XilunWu XilunWu commented Aug 1, 2024

Stack from ghstack (oldest at bottom):

Summary

  1. Change `compute_local_shape_and_global_offset` to correctly compute the local shape and global offset for strided sharding placements (currently it only handles 2D and some 3D+ sharding cases).
  2. Add a new property `num_shards_map` to `DTensorSpec` denoting how many shards each tensor dimension has. This is necessary for constructing the `_StridedShard` placement when we call `distribute_tensor(dtensor_tp, dp_device_mesh, [Shard(0)])`, where the `split_factor` argument is simply the number of existing shards on that tensor dimension.
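To illustrate point 2, here is a minimal, hypothetical sketch of what a `num_shards_map`-style property computes: for each tensor dimension, the product of the sizes of the mesh dimensions that shard it. The tuple-based placement encoding and the standalone function are assumptions for illustration, not DTensor's actual API.

```python
# Hypothetical illustration (not DTensor's real implementation) of a
# num_shards_map-style computation. Placements are encoded as a simple
# per-mesh-dim value: the tensor dim being sharded, or None for replicate.

def num_shards_map(ndim, mesh_shape, shard_dims):
    """For each tensor dim, count how many shards it is split into.

    ndim:       number of tensor dimensions
    mesh_shape: size of each device-mesh dimension, e.g. (dp, tp)
    shard_dims: shard_dims[i] = tensor dim sharded on mesh dim i, or None
    """
    counts = [1] * ndim
    for mesh_size, dim in zip(mesh_shape, shard_dims):
        if dim is not None:
            counts[dim] *= mesh_size  # each mesh dim multiplies the shard count
    return counts
```

With a (dp=2, tp=4) mesh where both mesh dims shard tensor dim 0 of a 2-D tensor, this yields `[8, 1]`; in the scenario the summary describes, the shard count contributed by the earlier (TP) sharding is what would feed `split_factor`.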

Test
test/distributed/_tensor/test_utils.py
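The offset arithmetic that this util needs to get right can be sketched as follows. This is a simplified illustration under the assumption of even divisibility and a 1-D sharded tensor dim over a (dp, tp) mesh with placements `[_StridedShard(0, split_factor=tp_size), Shard(0)]`; the function names are hypothetical, not the PR's actual code.

```python
# Hypothetical sketch of strided-sharding offset math, assuming even
# divisibility. TP first splits the dim into tp_size contiguous blocks;
# FSDP then splits each TP block into dp_size pieces, so the DP shards
# are interleaved (strided) in the global tensor, not contiguous.

def strided_shard_offset(global_size: int, dp_size: int, tp_size: int,
                         dp_rank: int, tp_rank: int) -> tuple[int, int]:
    """Return (local_size, global_offset) for a strided FSDP+TP shard."""
    tp_block = global_size // tp_size      # contiguous block owned by a TP rank
    local_size = tp_block // dp_size       # final local shard size
    offset = tp_rank * tp_block + dp_rank * local_size
    return local_size, offset

def contiguous_shard_offset(global_size: int, world: int, rank: int) -> tuple[int, int]:
    """Naive contiguous sharding over the flattened mesh, which gives a
    different (wrong) offset for the strided case."""
    local_size = global_size // world
    return local_size, rank * local_size
```

For example, with `global_size=16`, `dp=2`, `tp=4`, the rank at `(dp_rank=1, tp_rank=2)` owns the strided shard at offset `2*4 + 1*2 = 10`, while naive contiguous sharding over the flattened rank 6 would place it at offset 12.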

cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o


pytorch-bot bot commented Aug 1, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/132391

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 7d064d6 with merge base da32021:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@XilunWu XilunWu requested review from fegin, wanchaol and wz337 August 1, 2024 09:07
@XilunWu XilunWu added the `topic: not user facing` label Aug 1, 2024

XilunWu commented Aug 1, 2024

@XilunWu has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

XilunWu added 2 commits August 5, 2024 16:09
Differential Revision: [D60606115](https://our.internmc.facebook.com/intern/diff/D60606115)
@XilunWu XilunWu requested a review from wanchaol August 6, 2024 10:10

XilunWu commented Aug 6, 2024

@XilunWu has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Collaborator

@wanchaol wanchaol left a comment


lgtm, have one more comment, please address before landing

@pytorch-bot pytorch-bot bot added the `ciflow/trunk` label Aug 6, 2024
pytorchmergebot pushed a commit that referenced this pull request Aug 7, 2024
**Test**
`pytest test/distributed/_composable/fsdp/test_fully_shard_training.py`
`pytest test/distributed/_composable/fsdp/test_fully_shard_state_dict.py`
`pytest test/distributed/checkpoint/fsdp/test_fsdp_dsd.py`
`pytest test/distributed/_composable/fsdp/test_fully_shard_init.py`

Pull Request resolved: #131408
Approved by: https://github.com/fegin
ghstack dependencies: #126697, #130239, #132391
pytorchmergebot pushed a commit that referenced this pull request Aug 8, 2024
…rrect full_tensor() result (#130760)

Fixes issues #129229 and #129206
**Summary**

1. Have `FSDP` choose the `_StridedShard` placement for FSDP+TP sharding.
2. Add a parity test to FSDP ensuring that FSDP+TP sharding (i.e. strided) and plain TP sharding (i.e. non-strided) produce the same `full_tensor()` result.
3. Re-enable the tests that were disabled in #129519.

**Test**
`pytest test/distributed/_composable/fsdp/`
`pytest test/distributed/_composable/test_composability/test_2d_composability.py`
`pytest test/distributed/checkpoint/fsdp/test_fsdp_dsd.py`
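The parity property being tested above can be sketched in plain Python: assembling the global tensor from FSDP+TP strided shards must give the same result as assembling it from plain TP shards. The helper names and list-based "tensors" are illustrative assumptions, not the actual test code.

```python
# Illustrative sketch of the full_tensor() parity idea, using plain
# Python lists in place of tensors. Each shard is an (offset, values)
# pair; assembling all shards must reproduce the original data
# regardless of whether the sharding was plain TP or strided FSDP+TP.

def assemble(global_size, shards):
    """Reconstruct the global 1-D 'tensor' from (offset, values) shards."""
    out = [None] * global_size
    for offset, values in shards:
        for i, v in enumerate(values):
            out[offset + i] = v
    return out

def tp_shards(data, tp_size):
    """Plain contiguous TP sharding of a 1-D list."""
    block = len(data) // tp_size
    return [(r * block, data[r * block:(r + 1) * block]) for r in range(tp_size)]

def strided_dp_tp_shards(data, dp_size, tp_size):
    """FSDP-over-TP sharding: each contiguous TP block is further split
    across DP ranks, so DP shards are strided in the global tensor."""
    block = len(data) // tp_size
    local = block // dp_size
    return [
        (t * block + d * local, data[t * block + d * local:t * block + (d + 1) * local])
        for d in range(dp_size)
        for t in range(tp_size)
    ]
```

Both sharding layouts cover the global tensor exactly once, so both reassemble to the original data, which is the invariant the parity test checks at the DTensor level.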

Differential Revision: [D60606114](https://our.internmc.facebook.com/intern/diff/D60606114)
Pull Request resolved: #130760
Approved by: https://github.com/wanchaol, https://github.com/fegin, https://github.com/wz337
ghstack dependencies: #126697, #130239, #132391, #131408
@github-actions github-actions bot deleted the gh/XilunWu/92/head branch September 8, 2024 02:07

Labels

`ciflow/inductor`, `ciflow/trunk`, `Merged`, `oncall: distributed`, `topic: not user facing`
