
[DTensor] fix copy_ strategy #158538

Closed

wconstab wants to merge 8 commits into gh/wconstab/431/base from gh/wconstab/431/head

Conversation


@wconstab wconstab commented Jul 17, 2025

Stack from ghstack (oldest at bottom):

The previous strategy directly used the 'self' input strategy for the 'src' input. The fixed strategy correctly maps each self dim to the corresponding src dim, so it works even when the src input is broadcast.

E.g., for this program, broadcasting will occur on dims 0, 1, and 3 of self:

```
self = torch.ones((2, 3, 4, 5))
src = torch.ones((4, 1))
self.copy_(src)
```

These are the correct sharding combinations:

| self     | src         |
|----------|-------------|
| Shard(0) | Replicate() |
| Shard(1) | Replicate() |
| Shard(2) | Shard(0)    |
| Shard(3) | Shard(1)    |
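The mapping above follows right-aligned broadcasting: trailing dims of self line up with dims of src, and any leading dims exist only on self. A minimal standalone sketch of that mapping (a hypothetical helper for illustration, not the actual DTensor code):

```python
def map_self_dim_to_src_dim(self_dim: int, self_ndim: int, src_ndim: int):
    """Map a dim of `self` to the corresponding dim of `src` under
    right-aligned broadcasting; return None for dims src doesn't have."""
    offset = self_ndim - src_ndim
    if self_dim < offset:
        # This dim exists only on self; src must stay Replicate() here.
        return None
    return self_dim - offset

# self: (2, 3, 4, 5), src: (4, 1)
assert map_self_dim_to_src_dim(0, 4, 2) is None  # Shard(0) -> Replicate()
assert map_self_dim_to_src_dim(1, 4, 2) is None  # Shard(1) -> Replicate()
assert map_self_dim_to_src_dim(2, 4, 2) == 0     # Shard(2) -> Shard(0)
assert map_self_dim_to_src_dim(3, 4, 2) == 1     # Shard(3) -> Shard(1)
```

This is why reusing the 'self' strategy verbatim for 'src' was wrong: Shard(2) on self must become Shard(0) on src, not Shard(2).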

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @d4l3k @pragupta


pytorch-bot bot commented Jul 17, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/158538

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 2 Unrelated Failures

As of commit 3e65cb3 with merge base 1e86fa2:

NEW FAILURE - The following job has failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added ciflow/inductor oncall: distributed Add this issue/PR to distributed oncall triage queue labels Jul 17, 2025
wconstab added a commit that referenced this pull request Jul 17, 2025
ghstack-source-id: 06392d0
Pull Request resolved: #158538
wconstab added a commit that referenced this pull request Jul 17, 2025
ghstack-source-id: 555e5eb
Pull Request resolved: #158538
wconstab added a commit that referenced this pull request Jul 17, 2025
ghstack-source-id: d026b77
Pull Request resolved: #158538
wconstab added a commit that referenced this pull request Jul 17, 2025
ghstack-source-id: d026b77
Pull Request resolved: #158538
@wconstab wconstab requested review from XilunWu, wanchaol and zpcore and removed request for wanchaol July 17, 2025 20:32
for strategy in first_input_strategy.strategies
]
)
)
Member
The specs looks good to me, can you also add the redistribute_cost?

Contributor Author

@wconstab wconstab Jul 17, 2025

oh yea! thanks. i had it in my first version and then forgot about it.

Can we actually make it a required argument for creating an OpSpec? (Unless we do your proposal of autogenerating them always, which i prefer)

Member

Yea, I think we can make it a required field once we fix all currently supported ops in the list #157495.

wconstab added a commit that referenced this pull request Jul 17, 2025
ghstack-source-id: d339bbc
Pull Request resolved: #158538
@wconstab wconstab added the release notes: distributed (dtensor) release notes category label Jul 17, 2025

@zpcore zpcore left a comment


LGTM!


@XilunWu XilunWu left a comment


some nits

# self.assertEqual(dst_dtensor.full_tensor(), dst_tensor)
@with_comms
def test_copy_broadcast_redistribute(self):
device_mesh = DeviceMesh(self.device_type, list(range(self.world_size)))
Contributor

i suggest we universally use the init_device_mesh API instead

Contributor Author

I'm just sticking with the convention in the file. I also want to change things to MultiProcContinuousTest (or threaded, via PR #158082, if that can be made to work), so I'll leave it alone in this PR.

Contributor

I closed PR #158082 since I don't have enough experience with the codebase yet to know how to get it to work, so you can do those changes now if you want to.

Probably better to split those changes into separate commits anyway.

def test_copy_broadcast_redistribute(self):
device_mesh = DeviceMesh(self.device_type, list(range(self.world_size)))
# The src specs in this case are designed to not be compatible with the dst_specs, redistribute should happen
src_specs = [[Shard(1)], [Shard(1)], [Shard(1)]]
Contributor

src_spec = [Shard(1)]. Remove src_specs from the loop variable.

Contributor Author

i don't understand? seems ok to me.

src_tensor = torch.randn((64, 1))

dst_tensor = torch.zeros(16, 32, 64, 128)
dst_specs = [[Replicate()], [Shard(1)], [Shard(2)]]
Contributor

no need for [Shard(0)]?

Contributor Author

not really, it's the same edge case as Shard(1); both of them are going to trigger the Shard->Redistribute path. But I can add it, just for completeness.

Comment on lines +83 to +85
dst_dtensor.copy_(src_dtensor)
dst_tensor.copy_(src_tensor)
self.assertEqual(dst_dtensor.full_tensor(), dst_tensor)
Contributor

since we're not tracking comm counts, use the _run_test_on_dtensor util?

Contributor Author

actually, I should check the comm counts, I was just lazy. I should ensure that a redistribute happened.

self.assertEqual(dst_dtensor.full_tensor(), dst_tensor)

@with_comms
def test_copy_broadcast(self):
Contributor

Do we need 2 separate tests for copy_ + broadcast? Can they merge?

Contributor Author

merged them

Contributor Author

note: I prefer separate tests (or using parametrize) so it's easy to run a subtest without commenting stuff out.

but it's horribly slow with the current multiproc test case, so I merged them for now

assert isinstance(op_schema.args_schema[0], OpStrategy)
assert isinstance(op_schema.args_schema[1], OpStrategy)
self_strategy: OpStrategy = op_schema.args_schema[0]
mesh = self_strategy.mesh
Contributor

TODO: consider supporting cross-mesh copy

Contributor Author

hmm, is this possible?
at least, the meshes would have to have compatible shapes.

if we already have this support in other places, then i should follow it here. Is it documented somewhere / example code to point to?

Contributor

I see people are supporting cross-mesh ops to some degree such as in: #157682 and #157049


@wanchaol wanchaol left a comment


Please see inlined comments; I think we should just register copy_ as a pointwise strategy in _tensor_ops.py

# that is invalid for dst tensor.
# It is also problematic to assume that shard(0) on src maps to shard(0) on self, since we
# may broadcast a new dim to the left or right of 0 when copying.
def copy_inplace_strategy(op_schema: OpSchema) -> StrategyType:
Collaborator

Hmmm I think you probably want to remove the copy strategy, just register it as a pointwise strategy.

Contributor Author

hmm. i would be ok with that, but does pointwise strategy properly enforce the extra requirement of inplace_ ops, where broadcasting 'self' to match 'src' is NOT allowed?

Collaborator

Yeah, for inplace ops it just follows the first argument
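The constraint under discussion can be stated as: for an in-place op, `src` must be broadcastable to `self`'s shape, never the reverse, since `self` cannot be resized. A small illustrative check (an assumed helper for explanation, not DTensor's actual validation code):

```python
def src_broadcastable_to_self(self_shape, src_shape):
    """Return True if an in-place op like self.copy_(src) is legal:
    src may broadcast up to self, but self must never be broadcast."""
    # In-place ops cannot grow self, so src may not have more dims.
    if len(src_shape) > len(self_shape):
        return False
    # Right-aligned comparison: each src dim must equal self's dim or be 1.
    for self_dim, src_dim in zip(reversed(self_shape), reversed(src_shape)):
        if src_dim != 1 and src_dim != self_dim:
            return False
    return True

assert src_broadcastable_to_self((2, 3, 4, 5), (4, 1))      # legal copy_
assert not src_broadcastable_to_self((4, 1), (2, 3, 4, 5))  # would broadcast self
```

A pointwise strategy that always follows the first argument respects this asymmetry, which is why it fits in-place ops here.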

Contributor Author

ok, yea i think we can try pointwise. Wish I had known about it!

assert isinstance(first_input_strategy, OpStrategy)
return OpStrategy(
[
# DTensor semantics for inplace ops also dictates that we may NOT redistribute our 'self' input.
Collaborator

I don't quite like the fact that we are implementing the broadcasting semantics separately; it would be better if you could just reuse the pointwise strategy for copy_

wconstab added a commit that referenced this pull request Jul 17, 2025
ghstack-source-id: 86c3ff2
Pull Request resolved: #158538
wconstab added a commit that referenced this pull request Jul 17, 2025
ghstack-source-id: 92e0320
Pull Request resolved: #158538
@pytorchmergebot
Collaborator

@wconstab your PR has been reverted as part of the stack under #158490.

@pytorchmergebot pytorchmergebot added Reverted ci-no-td Do not run TD on this PR labels Jul 18, 2025
wconstab added a commit that referenced this pull request Jul 18, 2025
ghstack-source-id: 8aa8a68
Pull Request resolved: #158538
@wconstab
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team


@pytorchmergebot
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: inductor / cuda12.8-py3.10-gcc9-sm86 / test (inductor_torchbench, 1, 2, linux.g5.4xlarge.nvidia.gpu)

Details for Dev Infra team Raised by workflow job

@wconstab
Contributor Author

@pytorchbot merge -i

@pytorchmergebot
Collaborator

Merge started

Your change will be merged while ignoring the following 3 checks: pull / cuda12.8-py3.10-gcc9-sm75 / test (pr_time_benchmarks, 1, 1, linux.g4dn.metal.nvidia.gpu, unstable), inductor / cuda12.8-py3.10-gcc9-sm86 / test (inductor_torchbench, 1, 2, linux.g5.4xlarge.nvidia.gpu), trunk / linux-jammy-rocm-py3.10 / test (default, 1, 2, linux.rocm.gpu.2)

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team


pytorchmergebot pushed a commit that referenced this pull request Jul 22, 2025
This addresses reviews made for:
#158538
#108749

It replaced all the specific DeviceMesh constructor calls with the API provided by the test cases, to improve abstraction.

Pull Request resolved: #158675
Approved by: https://github.com/wconstab
@github-actions github-actions bot deleted the gh/wconstab/431/head branch August 18, 2025 02:21
pytorchmergebot pushed a commit that referenced this pull request Sep 10, 2025
Fixing issue introduced in #158538
where `aten.copy_.default` is registered as a pointwise op, but without linearity.

In particular, when both `src` and `dst` tensors have the same `Partial` placements, a direct copy should happen without a redistribute, instead of redistributing both to `Replicate` before making the copy.

This was discovered from silent incorrect results e.g. on `torch.einsum` backward.
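To illustrate why the direct copy is valid (a toy model of Partial placement, not DTensor code): under `Partial(sum)`, each rank holds a local contribution and the logical tensor value is the sum across ranks. Copying rank-locally between two such tensors preserves the logical value, so no communication is needed:

```python
# Toy model: one list entry per rank's local shard of a Partial(sum) tensor.
src_local = [1.5, 0.5]   # logical src value = 1.5 + 0.5 = 2.0
dst_local = [0.0, 0.0]   # dst also has Partial(sum) placement

# Direct per-rank copy: dst keeps its Partial placement, and its
# logical value now equals src's logical value, with zero comms.
dst_local = [s for s in src_local]

assert sum(dst_local) == sum(src_local) == 2.0
```

Forcing both sides to `Replicate` first (what the linearity-less pointwise registration did) is not only wasteful; mishandling it was what produced the silent incorrectness mentioned above.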

Pull Request resolved: #162460
Approved by: https://github.com/zpcore
markc-614 pushed a commit to markc-614/pytorch that referenced this pull request Sep 17, 2025
mansiag05 pushed a commit to mansiag05/pytorch that referenced this pull request Sep 22, 2025
cleonard530 pushed a commit to cleonard530/pytorch that referenced this pull request Sep 22, 2025
dsashidh pushed a commit to dsashidh/pytorch that referenced this pull request Sep 26, 2025

Labels

ci-no-td Do not run TD on this PR ciflow/inductor ciflow/trunk Trigger trunk jobs on your pull request Merged oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (dtensor) release notes category Reverted


6 participants