[WIP][3/N]/dtensor] allow distribute_tensor to shard DTensor by XilunWu · Pull Request #130551 · pytorch/pytorch

XilunWu · 2024-07-11T17:08:23Z

Stack from ghstack (oldest at bottom):

-> [WIP][3/N]/dtensor] allow distribute_tensor to shard DTensor #130551
[FSDP][dtensor] use _StridedShard to represent nested sharding for correct full_tensor() result #130760
[2/N][dtensor] Strided Sharding shard_to_replicate #130239
[1/N][dtensor] introduce StridedShard placement type and _split_tensor() logic #126697
[dtensor] have DTensorSpec report how many shards on each tensor dimension #130587

cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

[ghstack-poisoned]

pytorch-bot · 2024-07-11T17:08:26Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/130551

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

❌ 5 New Failures, 1 Cancelled Job, 1 Unrelated Failure

As of commit 2711050 with merge base dc7725c:

NEW FAILURES - The following jobs have failed:

inductor / cuda12.1-py3.10-gcc9-sm86 / test (inductor_distributed, 1, 1, linux.g5.12xlarge.nvidia.gpu) (gh)
distributed/_composable/fsdp/test_fully_shard_training.py::TestFullyShard2DTraining::test_train_parity_2d_mlp
Lint / lintrunner-noclang / linux-job (gh)
>>> Lint for torch/distributed/_tensor/placement_types.py:
pull / linux-focal-cuda11.8-py3.10-gcc9 / test (distributed, 2, 3, linux.8xlarge.nvidia.gpu) (gh)
distributed/_tensor/test_redistribute.py::RedistributeTest::test_strided_shard_to_replicate_forward_backward
pull / linux-focal-cuda11.8-py3.10-gcc9 / test (distributed, 3, 3, linux.8xlarge.nvidia.gpu) (gh)
distributed/_composable/fsdp/test_fully_shard_training.py::TestFullyShard2DTraining::test_tp_with_fsdp_offloading
pull / linux-jammy-py3.8-gcc11 / test (distributed, 2, 2, linux.2xlarge) (gh)
distributed/_tensor/test_redistribute.py::RedistributeTest::test_strided_shard_to_replicate_forward_backward

CANCELLED JOB - The following job was cancelled. Please retry:

pull / linux-focal-py3.8-clang10-onnx / test (default, 1, 2, linux.2xlarge) (gh)
##[error]The operation was canceled.

FLAKY - The following job failed but was likely due to flakiness present on trunk:

pull / linux-focal-cuda12.1-py3.10-gcc9-sm86 / test (default, 5, 5, linux.g5.4xlarge.nvidia.gpu) (gh) (similar failure)
inductor/test_cudagraph_trees.py::CudaGraphTreeTests::test_fallback_to_eager_if_recompiling_too_many_times

This comment was automatically generated by Dr. CI and updates every 15 minutes.

fegin · 2024-07-16T17:22:19Z

torch/distributed/_tensor/api.py

+        right_to_left_distribute = (
+            device_mesh.ndim == 1  # the device mesh arg must be a 1-d mesh
+            and len(placements) == 1  # must be True for 1-d mesh
+            and input_parent_mesh is not None
+            and input_parent_mesh == dtensor_parent_mesh  # distribute over the same mesh
+        )


Can we avoid another level of indentation by moving the return tensor and the raise exception below up to here? Then we don't need the extra indentation.
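One reading of this suggestion is a guard clause: evaluate the combined condition, raise immediately when it fails, and leave the main path unindented. The sketch below illustrates only that shape. `FakeMesh` and `distribute_over_submesh` are hypothetical stand-ins, not the PR's code or the real DeviceMesh API, and the placements are plain strings.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class FakeMesh:
    """Hypothetical stand-in for a device mesh (not torch's DeviceMesh)."""
    name: str
    ndim: int = 1


def distribute_over_submesh(
    device_mesh: FakeMesh,
    placements: Sequence[str],
    input_parent_mesh: Optional[FakeMesh],
    dtensor_parent_mesh: Optional[FakeMesh],
) -> str:
    # Same four conditions as in the diff above, evaluated up front.
    right_to_left_distribute = (
        device_mesh.ndim == 1
        and len(placements) == 1
        and input_parent_mesh is not None
        and input_parent_mesh == dtensor_parent_mesh
    )
    # Guard clause: fail early so the success path needs no extra nesting.
    if not right_to_left_distribute:
        raise ValueError("unsupported mesh/placement combination")
    # Placeholder for the real sharding work, which the excerpt does not show.
    return f"shard over {device_mesh.name}"
```

With this shape the error path sits right next to the condition it reports on, and the remaining body stays at one indentation level.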

fegin · 2024-07-16T17:26:31Z

torch/distributed/_tensor/api.py

+                    "Calling distribute_tensor() on DTensor objects require the "
+                    f"placement be a Shard() placement. Input args: tensor={tensor}, "
+                    f"device_mesh={device_mesh}, placements={placements}."
+                )


Isn't Replicate() also required, since we need to support HSDP?
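A minimal sketch of what accepting both placement kinds could look like, assuming the check is a simple isinstance test. `FakeShard`, `FakeReplicate`, and `validate_placement` are hypothetical stand-ins written for this illustration; they are not torch's placement classes or the PR's actual validation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeShard:
    """Hypothetical stand-in for a shard placement on tensor dim `dim`."""
    dim: int


@dataclass(frozen=True)
class FakeReplicate:
    """Hypothetical stand-in for a replicate placement."""


def validate_placement(placement):
    # Accept either kind: HSDP replicates over one mesh dimension
    # and shards over another, so both need to pass the check.
    if not isinstance(placement, (FakeShard, FakeReplicate)):
        raise ValueError(
            "expected a shard or replicate placement, "
            f"got {placement!r}"
        )
    return placement
```

Anything else, for example a partial placement, would still be rejected by this check.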

fegin · 2024-07-16T17:28:56Z

torch/distributed/_tensor/api.py

+        # mesh, and potentially so on.
+        input_parent_mesh = _mesh_resources.get_parent_mesh(device_mesh)
+        dtensor_parent_mesh = _mesh_resources.get_parent_mesh(tensor.device_mesh)
+        right_to_left_distribute = (


We don't actually check whether we are sharding from right to left. If the input mesh is actually to the right of the tensor's submesh, we should raise an exception.
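The missing ordering check could be sketched as follows, assuming each mesh dimension can be identified by name within the parent mesh. The excerpt does not show how the real code would look up those positions, so the name tuples and both helper functions are hypothetical.

```python
from typing import Sequence


def is_left_of_submesh(
    parent_dim_names: Sequence[str],
    input_dim_name: str,
    tensor_dim_names: Sequence[str],
) -> bool:
    """True if the input mesh dim precedes every dim the tensor already uses."""
    names = list(parent_dim_names)
    input_idx = names.index(input_dim_name)
    tensor_idxs = [names.index(name) for name in tensor_dim_names]
    return input_idx < min(tensor_idxs)


def check_right_to_left(
    parent_dim_names: Sequence[str],
    input_dim_name: str,
    tensor_dim_names: Sequence[str],
) -> None:
    # Right-to-left sharding: the new mesh dim must sit to the left of the
    # tensor's existing submesh inside the parent mesh; otherwise raise.
    if not is_left_of_submesh(parent_dim_names, input_dim_name, tensor_dim_names):
        raise ValueError(
            f"mesh dim {input_dim_name!r} is not to the left of "
            f"the tensor's submesh {tuple(tensor_dim_names)!r}"
        )
```

For a parent mesh named ("dp", "tp"), sharding a tensor that already lives on "tp" over "dp" passes, while the reverse order raises.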

wanchaol

Overall I think this is valuable to add to distribute_tensor, but only after the strided shard placement has matured (i.e. all DTensor ops work with this placement, or we merge this placement back into the Shard placement).

Before that happens, I think we should focus on making FSDP2 + TP use it first (i.e. add the integration to the FSDP init shard param path).
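For context on why nested sharding needs special handling, here is a pure-Python illustration using lists instead of tensors. It assumes a 2x2 mesh where TP shards a tensor dimension first and FSDP then shards each TP shard along the same dimension. It does not use DTensor or reflect how _StridedShard is implemented; it only shows that gathering in the wrong order scrambles the result.

```python
def chunk(seq, n):
    """Split seq into n equal contiguous chunks (assumes len(seq) % n == 0)."""
    size = len(seq) // n
    return [seq[i * size:(i + 1) * size] for i in range(n)]


DP, TP = 2, 2
full = list(range(8))

# Right-to-left sharding: TP (the right mesh dim) splits first,
# then FSDP (the left mesh dim, "dp") splits each TP shard.
local = {}
for tp, tp_shard in enumerate(chunk(full, TP)):
    for dp, piece in enumerate(chunk(tp_shard, DP)):
        local[(dp, tp)] = piece


def gather_left_to_right(shards):
    # Assumes dp sharded first: concatenate over tp inside each dp group.
    out = []
    for dp in range(DP):
        for tp in range(TP):
            out += shards[(dp, tp)]
    return out


def gather_right_to_left(shards):
    # Matches the actual sharding order: concatenate over dp inside each tp group.
    out = []
    for tp in range(TP):
        for dp in range(DP):
            out += shards[(dp, tp)]
    return out


naive = gather_left_to_right(local)    # [0, 1, 4, 5, 2, 3, 6, 7]
correct = gather_right_to_left(local)  # [0, 1, 2, 3, 4, 5, 6, 7]
```

The local shards are identical in both cases; only the assumed sharding order differs, which is the information a placement type has to carry for the full tensor to be reassembled correctly.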

github-actions · 2024-09-14T19:33:55Z

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

[3/N]/dtensor] allow distribute_tensor to shard DTensor

05bb415

[ghstack-poisoned]

This was referenced Jul 11, 2024

[1/N][dtensor] introduce StridedShard placement type and _split_tensor() logic #126697

Closed

[2/N][dtensor] Strided Sharding shard_to_replicate #130239

Closed

pytorch-bot bot added the ciflow/inductor and oncall: distributed labels Jul 11, 2024

XilunWu marked this pull request as draft July 11, 2024 17:08

XilunWu changed the title ~~[3/N]/dtensor] allow distribute_tensor to shard DTensor~~ [WIP][3/N]/dtensor] allow distribute_tensor to shard DTensor Jul 11, 2024

Update on "[WIP][3/N]/dtensor] allow distribute_tensor to shard DTensor"

d4e718a

cc H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o [ghstack-poisoned]

XilunWu mentioned this pull request Jul 11, 2024

[dtensor] have DTensorSpec report how many shards on each tensor dimension #130587

Closed

XilunWu added a commit that referenced this pull request Jul 11, 2024

[3/N]/dtensor] allow distribute_tensor to shard DTensor

951dfaf

ghstack-source-id: 2e520db Pull Request resolved: #130551

Update on "[WIP][3/N]/dtensor] allow distribute_tensor to shard DTensor"

dd9a183

cc H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o [ghstack-poisoned]

XilunWu mentioned this pull request Jul 15, 2024

[FSDP][dtensor] use _StridedShard to represent nested sharding for correct full_tensor() result #130760

Closed

Update on "[WIP][3/N]/dtensor] allow distribute_tensor to shard DTensor"

2711050

cc H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o [ghstack-poisoned]

XilunWu added a commit that referenced this pull request Jul 16, 2024

[3/N]/dtensor] allow distribute_tensor to shard DTensor

85f5cfb

ghstack-source-id: e7661bb Pull Request resolved: #130551

fegin reviewed Jul 16, 2024

wanchaol reviewed Jul 16, 2024

github-actions bot added the Stale label Sep 14, 2024

github-actions bot closed this Oct 14, 2024

github-actions bot deleted the gh/XilunWu/88/head branch November 14, 2024 02:05

XilunWu wants to merge 4 commits into gh/XilunWu/88/base from gh/XilunWu/88/head
