[2/N][dtensor] Strided Sharding shard_to_replicate#130239

Closed
XilunWu wants to merge 10 commits into gh/XilunWu/87/base from gh/XilunWu/87/head

@XilunWu XilunWu commented Jul 8, 2024

pytorch-bot bot commented Jul 8, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/130239

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit bbd3f0a with merge base da32021:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the ciflow/inductor and oncall: distributed labels Jul 8, 2024
XilunWu added a commit that referenced this pull request Jul 8, 2024
ghstack-source-id: 7ddecc0
Pull Request resolved: #130239
@XilunWu XilunWu marked this pull request as draft July 8, 2024 08:49
@XilunWu XilunWu changed the title [2/N][dtensor] Strided Sharding shard_to_replicate [WIP][2/N][dtensor] Strided Sharding shard_to_replicate Jul 8, 2024
@XilunWu XilunWu requested review from wanchaol and wz337 July 8, 2024 08:50
**Test**
`pytest test/distributed/_tensor/test_utils.py -s -k strided_sharding`
`pytest test/distributed/_tensor/test_utils.py -s -k test_fsdp2_tp_2d_dtensor_local_shards_and_offsets`

cc mrshenli pritamdamania87 zhaojuanmao satgera gqchen aazzolini osalpekar jiayisuse H-Huang kwen2501 awgu fegin wanchaol fduwjj wz337 tianyu-l wconstab chauhang d4l3k

[ghstack-poisoned]
XilunWu added a commit that referenced this pull request Jul 11, 2024
ghstack-source-id: 5872742
Pull Request resolved: #130239
francograndegmailcom pushed a commit to francograndegmailcom/pytorch-pytorch that referenced this pull request Jul 23, 2024
@XilunWu XilunWu marked this pull request as ready for review July 23, 2024 21:54
@XilunWu XilunWu changed the title [WIP][2/N][dtensor] Strided Sharding shard_to_replicate [2/N][dtensor] Strided Sharding shard_to_replicate Jul 23, 2024
@wanchaol wanchaol left a comment

lgtm

**Summary**
This PR adds the necessary util function to `_StridedShard` for correct shard-to-replicate resharding.
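
To make the shard-to-replicate problem concrete, here is a minimal pure-Python simulation. It is not the PyTorch implementation: all function names are hypothetical, it assumes a 2D `(dp, tp)` mesh, even divisibility, and that TP shards dim 0 first and FSDP then shards each TP shard on the same dim. Under those assumptions, gathering in rank order yields a permuted tensor, and one extra reorder step recovers the original order.

```python
def shard_rows(rows, num_shards):
    """Split a list into equal contiguous chunks (assumes even divisibility)."""
    chunk = len(rows) // num_shards
    return [rows[i * chunk:(i + 1) * chunk] for i in range(num_shards)]


def fsdp_tp_local_shards(rows, dp_size, tp_size):
    """Map (dp_rank, tp_rank) -> local rows: TP shards first, FSDP second."""
    local = {}
    for tp_rank, tp_shard in enumerate(shard_rows(rows, tp_size)):
        for dp_rank, dp_shard in enumerate(shard_rows(tp_shard, dp_size)):
            local[(dp_rank, tp_rank)] = dp_shard
    return local


def naive_replicate(local, dp_size, tp_size):
    """Gather along tp, then along dp, concatenating in rank order."""
    per_dp = [
        sum((local[(d, t)] for t in range(tp_size)), [])
        for d in range(dp_size)
    ]
    return sum(per_dp, [])


def strided_replicate(local, dp_size, tp_size):
    """Same gathers, followed by a reorder that undoes the stride."""
    gathered = naive_replicate(local, dp_size, tp_size)
    chunk = len(gathered) // (dp_size * tp_size)
    pieces = shard_rows(gathered, dp_size * tp_size)
    assert all(len(p) == chunk for p in pieces)
    # Pieces arrive ordered by (dp, tp); the global order is (tp, dp).
    reordered = [
        pieces[d * tp_size + t]
        for t in range(tp_size)
        for d in range(dp_size)
    ]
    return sum(reordered, [])


rows = list(range(8))
local = fsdp_tp_local_shards(rows, dp_size=2, tp_size=2)
print(naive_replicate(local, 2, 2))    # [0, 1, 4, 5, 2, 3, 6, 7]
print(strided_replicate(local, 2, 2))  # [0, 1, 2, 3, 4, 5, 6, 7]
```

The reorder in `strided_replicate` plays the role that the util function added here is described as serving for `_StridedShard`; the real code operates on tensors and collectives rather than lists.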

**Test**
`pytest test/distributed/_tensor/test_utils.py -s -k strided_sharding`
`pytest test/distributed/_tensor/test_utils.py -s -k test_fsdp2_tp_2d_dtensor_local_shards_and_offsets`

cc mrshenli pritamdamania87 zhaojuanmao satgera gqchen aazzolini osalpekar jiayisuse H-Huang kwen2501 awgu fegin wanchaol fduwjj wz337 tianyu-l wconstab chauhang d4l3k

[ghstack-poisoned]
@XilunWu XilunWu added the topic: not user facing label Aug 1, 2024
XilunWu commented Aug 1, 2024

@XilunWu has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

**Summary**
This PR adds the necessary util function to `_StridedShard` for correct shard-to-replicate resharding.

**Test**
`pytest test/distributed/_tensor/test_utils.py -s -k strided_sharding`
`pytest test/distributed/_tensor/test_utils.py -s -k test_fsdp2_tp_2d_dtensor_local_shards_and_offsets`

cc H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o mrshenli pritamdamania87 zhaojuanmao satgera gqchen aazzolini osalpekar jiayisuse tianyu-l chauhang

Differential Revision: [D60606117](https://our.internmc.facebook.com/intern/diff/D60606117)

[ghstack-poisoned]
XilunWu commented Aug 6, 2024

@XilunWu has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

pytorchmergebot pushed a commit that referenced this pull request Aug 7, 2024
**Summary**
1. Change `compute_local_shape_and_global_offset` to correctly compute the local shape and global offset for strided sharding placements (currently it only handles 2D and some 3D+ sharding).
2. Add a new property `num_shards_map` to `DTensorSpec` that records how many shards each tensor dimension has. This is needed to construct the `_StridedShard` placement when we call `distribute_tensor(dtensor_tp, dp_device_mesh, [Shard(0)])`: the `split_factor` argument is simply the number of shards on the tensor dim being sharded.
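
The two points above can be sketched in plain Python. This is an illustration only, not the code in `compute_local_shape_and_global_offset` or `DTensorSpec`: the helper names are hypothetical, and it assumes a 2D `(dp, tp)` mesh, a single sharded tensor dim, and even divisibility.

```python
def contiguous_offset(n, dp_size, tp_size, dp_rank, tp_rank):
    """(local size, offset) if the dim were sharded by dp first, then tp."""
    local = n // (dp_size * tp_size)
    return local, dp_rank * (n // dp_size) + tp_rank * local


def strided_offset(n, dp_size, tp_size, dp_rank, tp_rank):
    """(local size, offset) when TP shards first and FSDP shards second."""
    local = n // (dp_size * tp_size)
    return local, tp_rank * (n // tp_size) + dp_rank * local


def num_shards_per_dim(tensor_ndim, mesh_shape, shard_dims):
    """Count shards per tensor dim.

    shard_dims[i] is the tensor dim sharded on mesh dim i,
    or None if that mesh dim replicates the tensor.
    """
    counts = [1] * tensor_ndim
    for mesh_size, dim in zip(mesh_shape, shard_dims):
        if dim is not None:
            counts[dim] *= mesh_size
    return counts


# Rank (dp=1, tp=0) on a 2x2 mesh, 8 rows: the two orders disagree.
print(contiguous_offset(8, 2, 2, dp_rank=1, tp_rank=0))  # (2, 4)
print(strided_offset(8, 2, 2, dp_rank=1, tp_rank=0))     # (2, 2)

# A TP-only tensor on a 1D mesh of size 2 has 2 shards on dim 0,
# which is the value a split factor would take in this sketch.
print(num_shards_per_dim(2, (2,), (0,)))                 # [2, 1]
```

The local shape is the same in both orders; only the global offset changes, which is why the offset computation needs to know about the strided placement.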

**Test**
`test/distributed/_tensor/test_utils.py`

Pull Request resolved: #132391
Approved by: https://github.com/wanchaol
ghstack dependencies: #126697, #130239
pytorchmergebot pushed a commit that referenced this pull request Aug 7, 2024
**Test**
`pytest test/distributed/_composable/fsdp/test_fully_shard_training.py`
`pytest test/distributed/_composable/fsdp/test_fully_shard_state_dict.py`
`pytest test/distributed/checkpoint/fsdp/test_fsdp_dsd.py`
`pytest test/distributed/_composable/fsdp/test_fully_shard_init.py`

Pull Request resolved: #131408
Approved by: https://github.com/fegin
ghstack dependencies: #126697, #130239, #132391
pytorchmergebot pushed a commit that referenced this pull request Aug 8, 2024
…rrect full_tensor() result (#130760)

Fixes issues #129229 and #129206
**Summary**

1. Have `FSDP` choose the `_StridedShard` placement for FSDP+TP sharding
2. Add a parity test to FSDP to ensure that FSDP+TP sharding (i.e. strided) and TP-only sharding (i.e. non-strided) have the same `full_tensor()` result
3. Re-enable the tests that were disabled in #129519

**Test**
`pytest test/distributed/_composable/fsdp/`
`pytest test/distributed/_composable/test_composability/test_2d_composability.py`
`pytest test/distributed/checkpoint/fsdp/test_fsdp_dsd.py`

Differential Revision: [D60606114](https://our.internmc.facebook.com/intern/diff/D60606114)
Pull Request resolved: #130760
Approved by: https://github.com/wanchaol, https://github.com/fegin, https://github.com/wz337
ghstack dependencies: #126697, #130239, #132391, #131408
@github-actions github-actions bot deleted the gh/XilunWu/87/head branch September 8, 2024 02:06
Labels

ciflow/inductor, Merged, oncall: distributed, topic: not user facing

3 participants