[dtensor] have DTensorSpec report how many shards on each tensor dimension #130587
XilunWu wants to merge 2 commits into gh/XilunWu/89/base
Conversation
[dtensor] have DTensorSpec report how many shards on each tensor dimension [ghstack-poisoned]
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/130587
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure as of commit bcdbc23 with merge base dc7725c. The following job has failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Update on "[dtensor] have DTensorSpec report how many shards on each tensor dimension"

**Summary** Add a new property `num_shards_map` to `DTensorSpec` denoting how many shards each tensor dimension has. This is necessary for constructing the `_StridedShard` placement when we call `distribute_tensor(dtensor_tp, dp_device_mesh, [Shard(0)])`: the `split_factor` argument is just the number of shards on the tensor dim being sharded.

cc H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o [ghstack-poisoned]
```python
@property
def num_shards_map(self) -> List[int]:
    """
    dim_map is a property we derive from `placements` of
```
nit: I think we may prefer to start the comment with what `num_shards_map` is directly (at least at a high level) and then compare it with `dim_map`. Or do you think it is a requirement for users to know what `dim_map` is first, before understanding `num_shards_map`?
```python
    For example, we have a dist tensor of shape [18, 20, 30],
    a device_mesh ([[0, 1, 2, 3], [4, 5, 6, 7]]), and placements
    ([Shard(1), Shard(0)]); the num_shards_map of this distributed tensor
    would be: [4, 2, 1].
```
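The quoted example can be checked with a small standalone helper. This is a hypothetical sketch that mirrors the property's logic in plain Python (list-based stand-ins for `DeviceMesh` and `Shard` placements), not the actual `DTensorSpec` implementation:

```python
from typing import List, Optional

def num_shards_map(ndim: int, mesh_shape: List[int],
                   shard_dims: List[Optional[int]]) -> List[int]:
    """Count how many shards each tensor dimension has.

    shard_dims[i] is the tensor dim sharded on mesh dim i,
    or None for a Replicate placement (a stand-in for real
    Placement objects).
    """
    counts = [1] * ndim
    for mesh_dim, tensor_dim in enumerate(shard_dims):
        if tensor_dim is not None:
            # each Shard placement multiplies the shard count on that
            # tensor dim by the size of the mesh dim it lives on
            counts[tensor_dim] *= mesh_shape[mesh_dim]
    return counts

# mesh [[0, 1, 2, 3], [4, 5, 6, 7]] has shape [2, 4];
# placements [Shard(1), Shard(0)] become shard_dims [1, 0]
print(num_shards_map(3, [2, 4], [1, 0]))  # [4, 2, 1]
```

Mesh dim 0 (size 2) shards tensor dim 1 and mesh dim 1 (size 4) shards tensor dim 0, which reproduces the `[4, 2, 1]` result from the docstring.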
Can we also add a test? Maybe we can just use this example as the test.
```python
    For example, we have a dist tensor of shape [18, 20, 30],
    a device_mesh ([[0, 1, 2, 3], [4, 5, 6, 7]]), and placements
    ([Shard(1), Shard(0)]); the num_shards_map of this distributed tensor
    would be: [4, 2, 1].
```
This is great! Could we add one more example showing that when a tensor dim is sharded multiple times, the shard count is computed globally? I think this is the information that `dim_map` is not able to capture but `num_shards_map` can.
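The multi-shard case the reviewer asks about can be illustrated with a toy helper. This is a hypothetical pure-Python mock of the property (`shard_dims[i]` is the tensor dim sharded on mesh dim `i`, `None` for Replicate), not the real implementation:

```python
from typing import List, Optional

def num_shards_map(ndim: int, mesh_shape: List[int],
                   shard_dims: List[Optional[int]]) -> List[int]:
    # shard counts start at 1 (unsharded) for every tensor dim
    counts = [1] * ndim
    for mesh_dim, tensor_dim in enumerate(shard_dims):
        if tensor_dim is not None:
            # repeated sharding of the same tensor dim multiplies
            # the counts, so the result is a global shard count
            counts[tensor_dim] *= mesh_shape[mesh_dim]
    return counts

# placements [Shard(0), Shard(0)] on a [2, 4] mesh: tensor dim 0
# is sharded by both mesh dims, giving 2 * 4 = 8 shards globally
print(num_shards_map(3, [2, 4], [0, 0]))  # [8, 1, 1]
```

A tensor-dim-to-mesh-dim `dim_map` can only record one mesh dim per tensor dim, so it cannot represent this doubly-sharded state; the multiplied count in `num_shards_map` can.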
wanchaol left a comment:
LGTM. Could we merge this PR together with the PR that actually uses it (i.e. the FSDP2 integration PR)? That way we can add an end-to-end test to cover this code.
[dtensor] have DTensorSpec report how many shards on each tensor dimension

ghstack-source-id: 82263e6
Pull Request resolved: pytorch/pytorch#130587
Stack from ghstack (oldest at bottom):
Summary
Add a new property `num_shards_map` to `DTensorSpec` denoting how many shards each tensor dimension has. This is necessary for constructing the `_StridedShard` placement when we call `distribute_tensor(dtensor_tp, dp_device_mesh, [Shard(0)])`: the `split_factor` argument is just the number of shards on the tensor dim being sharded.

cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o
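A rough sketch of how the `split_factor` mentioned in the summary could be read off `num_shards_map`. The variable names and the assumed starting state are hypothetical, and the actual `_StridedShard` construction in the FSDP2 integration may differ:

```python
# Suppose a TP-sharded DTensor reports num_shards_map == [2, 1, 1],
# i.e. tensor dim 0 is already split in 2 across the TP mesh (assumed
# state for illustration). When distribute_tensor(dtensor_tp,
# dp_device_mesh, [Shard(0)]) shards dim 0 again, the _StridedShard
# placement's split_factor is just the pre-existing shard count on
# that dim:
num_shards_map = [2, 1, 1]  # assumed: reported by the TP DTensorSpec
shard_dim = 0               # the dim the DP mesh is about to shard
split_factor = num_shards_map[shard_dim]
print(split_factor)  # 2
```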