
[DTensor] support _StridedShard in view op #159656

Closed

XilunWu wants to merge 2 commits into gh/XilunWu/162/base from gh/XilunWu/162/head

Conversation

@XilunWu (Contributor) commented Aug 1, 2025

Stack from ghstack (oldest at bottom):

Summary
Some thoughts on the interaction between the view op and _StridedShard:

  1. _StridedShard has no impact on sharding (i.e. how the tensor is
    partitioned) compared to Shard. It only changes how shards are permuted
    across the devices.
  2. The view() op on DTensor strictly forbids shard redistribution, which
    means that if view() may cause shard permutation across devices, it
    should be rejected. This is enforced in today's sharding propagation
    for view().
  3. Since DTensor view() won't introduce any redistribution, it is certain
    that placements won't change, except for the inner dim attribute of
    Shard or _StridedShard.

Therefore, to support _StridedShard in the view() op, the only change required
is to keep _StridedShard as _StridedShard in the output spec.
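A small pure-Python sketch may help make point 1 concrete. This is a toy ownership model (my own assumption about how the strided piece assignment works, not the DTensor implementation): both placements cut the dimension into equal pieces and cover every element exactly once; `_StridedShard(dim, split_factor=sf)` only reassigns which pieces each device holds.

```python
def shard_chunks(n, world_size):
    # Contiguous Shard: device i owns the i-th of world_size equal chunks.
    per = n // world_size
    return {i: list(range(i * per, (i + 1) * per)) for i in range(world_size)}

def strided_shard_chunks(n, world_size, split_factor):
    # _StridedShard-style (toy model): the dim is cut into
    # split_factor * world_size equal pieces, and device i owns pieces
    # i, i + world_size, i + 2 * world_size, ...
    pieces = split_factor * world_size
    per = n // pieces
    out = {}
    for dev in range(world_size):
        idx = []
        for s in range(split_factor):
            p = dev + s * world_size
            idx.extend(range(p * per, (p + 1) * per))
        out[dev] = idx
    return out

contig = shard_chunks(8, 2)               # {0: [0, 1, 2, 3], 1: [4, 5, 6, 7]}
strided = strided_shard_chunks(8, 2, 2)   # {0: [0, 1, 4, 5], 1: [2, 3, 6, 7]}

# Every element is owned exactly once in both cases and per-device shard
# sizes match; only the piece-to-device assignment is permuted.
assert sorted(sum(contig.values(), [])) == sorted(sum(strided.values(), []))
assert all(len(v) == len(contig[0]) for v in strided.values())
```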

Test
`pytest test/distributed/tensor/test_view_ops.py`

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @tianyu-l

XilunWu added a commit that referenced this pull request Aug 1, 2025
ghstack-source-id: 247789b
Pull Request resolved: #159656
@pytorch-bot bot added the ciflow/inductor and oncall: distributed labels Aug 1, 2025
@pytorch-bot (bot) commented Aug 1, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/159656

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 Cancelled Job, 1 Unrelated Failure

As of commit f4bc2d3 with merge base ba37f58:

CANCELLED JOB - The following job was cancelled. Please retry:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

  • pull / linux-jammy-py3_9-clang9-xla / test (xla, 1, 1, lf.linux.12xlarge, unstable) (gh) (#158876)
    /var/lib/jenkins/workspace/xla/torch_xla/csrc/runtime/BUILD:476:14: Compiling torch_xla/csrc/runtime/xla_util_test.cpp failed: (Exit 1): gcc failed: error executing CppCompile command (from target //torch_xla/csrc/runtime:xla_util_test) /usr/bin/gcc -U_FORTIFY_SOURCE -fstack-protector -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object -fno-omit-frame-pointer -g0 -O2 '-D_FORTIFY_SOURCE=1' -DNDEBUG -ffunction-sections ... (remaining 229 arguments skipped)

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@XilunWu XilunWu marked this pull request as draft August 1, 2025 17:27
@XilunWu XilunWu requested a review from wconstab August 1, 2025 17:27
@wconstab (Contributor) left a comment

I think if you can explain what the requirements are for the new StridedShard to still be valid, and convince me you're checking all those requirements, then we can land this (but we need a test too).

@XilunWu added the topic: not user facing and module: dtensor labels Aug 6, 2025
XilunWu added a commit that referenced this pull request Aug 7, 2025
ghstack-source-id: 1095f42
Pull Request resolved: #159656
@XilunWu changed the title from "[DTensor] draft: support _StridedShard in view op" to "[DTensor] support _StridedShard in view op" Aug 7, 2025
@XilunWu XilunWu requested a review from wconstab August 7, 2025 00:55
@XilunWu XilunWu marked this pull request as ready for review August 7, 2025 00:55
@XilunWu XilunWu requested review from tianyu-l, wanchaol and zpcore August 7, 2025 00:55
@wconstab (Contributor) left a comment

Lgtm

@XilunWu added the ciflow/trunk label Aug 7, 2025
@XilunWu (Contributor, Author) commented Aug 7, 2025

@pytorchbot merge

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status.

@pytorchmergebot (Collaborator)

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / linux-jammy-rocm-py3.10 / test (default, 2, 2, linux.rocm.gpu.2)

Details for Dev Infra team: raised by workflow job.

@XilunWu (Contributor, Author) commented Aug 7, 2025

@pytorchbot merge -f "cancelled rocm job"

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status.

@github-actions github-actions bot deleted the gh/XilunWu/162/head branch September 7, 2025 02:14
markc-614 pushed a commit to markc-614/pytorch that referenced this pull request Sep 17, 2025
Pull Request resolved: pytorch#159656
Approved by: https://github.com/wconstab
@zpcore (Member) commented Dec 3, 2025

Directly propagating `_StridedShard` may not always be correct. Think of the following case:

```python
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Replicate, Shard, distribute_tensor
from torch.distributed.tensor.placement_types import _StridedShard

mesh = init_device_mesh("cpu", (2, 2, 2, 2))
input_tensor = torch.randn(4, 4, 8)
A = distribute_tensor(
    input_tensor,
    mesh,
    [_StridedShard(2, split_factor=2), Shard(2), Replicate(), Replicate()],
)
B = A.view(4, 4, 2, 4)
```

Then `B.placements` is expected to be `[Shard(3), Shard(2), Replicate(), Replicate()]`: no `split_factor` anymore.
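The arithmetic behind this counterexample can be checked with a small pure-Python ownership model (my own toy assumption about the strided piece assignment, not the DTensor code): each device's elements on the size-8 dim land contiguously on the viewed `(2, 4)` dims, which is exactly a plain `Shard` pattern.

```python
# Toy model of the size-8 dim sharded over mesh dims 0 and 1 (each of size 2)
# with [_StridedShard(2, split_factor=2), Shard(2)]: the dim is cut into
# 4 pieces of 2 elements, and device (i, j) ends up owning piece i + 2 * j
# (assumption: strided piece assignment as sketched here).
owner = {}
for i in range(2):          # mesh dim 0
    for j in range(2):      # mesh dim 1
        piece = i + 2 * j
        owner[(i, j)] = list(range(2 * piece, 2 * piece + 2))

# view(4, 4, 2, 4) maps global index k on the size-8 dim to
# (k // 4, k % 4) on the new size-2 and size-4 dims.
viewed = {dev: [(k // 4, k % 4) for k in ks] for dev, ks in owner.items()}

# Each device's elements are now contiguous: the size-2 dim is sharded by
# mesh dim 1 (Shard(2)) and the size-4 dim by mesh dim 0 (Shard(3)),
# with no split_factor left.
assert viewed == {
    (0, 0): [(0, 0), (0, 1)],
    (1, 0): [(0, 2), (0, 3)],
    (0, 1): [(1, 0), (1, 1)],
    (1, 1): [(1, 2), (1, 3)],
}
```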
