[DTensor] Document redistribute_costs by wconstab · Pull Request #158495 · pytorch/pytorch

wconstab · 2025-07-16T22:15:38Z

Stack from ghstack (oldest at bottom):

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @d4l3k

[ghstack-poisoned]

pytorch-bot · 2025-07-16T22:15:41Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/158495

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 87b7510 with merge base 900fba4:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

pull / cuda12.8-py3.10-gcc9-sm75 / test (pr_time_benchmarks, 1, 1, linux.g4dn.metal.nvidia.gpu, unstable) (gh) (#153987)
MISSING REGRESSION TEST

This comment was automatically generated by Dr. CI and updates every 15 minutes.

zpcore · 2025-07-17T18:14:49Z

torch/distributed/tensor/_op_schema.py

+            0.0,  # cost of redistributing tensor_a from 'Replicate()'
+            K,    # cost of redistributing tensor_a from 'Shard(0)'


What about:
0.0, # cost of redistributing tensor_a from Replicate() -> Replicate()
K, # cost of redistributing tensor_a from 'Shard(0)' -> Replicate()
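
To make the suggested comment wording concrete, here is a minimal sketch of one such cost row in plain Python. It assumes the layout is one row per tensor input and one entry per strategy of the op that produces that input. The placements are plain strings, and `toy_redistribute_cost` and the value of `K` are hypothetical stand-ins, not part of the DTensor API.

```python
# Toy model of one redistribute_cost row. Placements are plain strings here,
# not torch.distributed.tensor placement objects.
K = 1.0  # hypothetical placeholder for a nonzero communication cost


def toy_redistribute_cost(src: str, dst: str) -> float:
    """Hypothetical cost model: free if the placement is unchanged, else K."""
    return 0.0 if src == dst else K


# Placement this strategy requires for its first input (tensor_a).
required_input_placement = "Replicate()"

# Placements the producer of tensor_a can emit, one per producer strategy.
producer_output_placements = ["Replicate()", "Shard(0)"]

# Outer list: one row per tensor input.
# Inner list: one cost per producer strategy, each read as "src -> required".
redistribute_cost = [
    [
        toy_redistribute_cost(src, required_input_placement)
        for src in producer_output_placements
    ]
]

for src, cost in zip(producer_output_placements, redistribute_cost[0]):
    print(f"{src} -> {required_input_placement}: {cost}")
```

Reading each entry as "source placement -> required placement" is the same framing the review comment proposes for the docstring.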

zpcore

LGTM!

wconstab · 2025-07-17T21:44:22Z

@pytorchbot merge -i

XilunWu

LGTM

pytorchmergebot · 2025-07-17T21:46:34Z

Merge started

Your change will be merged while ignoring the following 1 checks: pull / cuda12.8-py3.10-gcc9-sm75 / test (pr_time_benchmarks, 1, 1, linux.g4dn.metal.nvidia.gpu, unstable)

pytorchmergebot · 2025-07-18T00:33:41Z

Starting merge as part of PR stack under #158490

Fixes several bugs in the original.

- Foremost, fixes a serious bug where we returned incorrect strategies by mixing input_specs that were frozen from select_strategy.strategies[0] with output_specs that varied across select_strategy.strategies[0..N] (e.g. we could create a nonsense strategy like input: Shard(0), output: Replicate() for an op like clone).
- Fixes the redistribute costs: they should not actually be 0; they should be the cost of redistributing our single input from another strategy to the current strategy, in our list of output strategies.
- Adds a note, wondering if we should have just literally returned the input strategy instead of creating this new object.
- Currently, using default_strategy for copy_ is incorrect because it maps the 'self' tensor's strategies directly onto the 'src' tensor without accounting for the fact that copy_ supports broadcasting a smaller-rank tensor into a larger one. Separates out the copy_ op from the default strategy and adds the missing test case, but does not fix the underlying issue with copy_; that is left for a future PR.

Renames to `propagate_single_input_strategy` since that's more descriptive.

Pull Request resolved: #158490
Approved by: https://github.com/wanchaol, https://github.com/XilunWu
ghstack dependencies: #158495
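
The first two fixes in the commit message above can be illustrated with a toy sketch. It is not the code from the PR: `ToyStrategy`, `toy_cost`, `propagate_single_input`, and `K` are hypothetical names, and placements are modelled as plain strings. The point is only that each generated strategy takes its input and output placement from the same candidate, and that its cost row is computed against every candidate rather than being all zeros.

```python
from dataclasses import dataclass

K = 1.0  # hypothetical nonzero redistribution cost


@dataclass
class ToyStrategy:
    """Hypothetical stand-in for one sharding strategy; not a DTensor class."""

    input_placement: str
    output_placement: str
    redistribute_cost: list[list[float]]


def toy_cost(src: str, dst: str) -> float:
    """Hypothetical cost model: free if the placement is unchanged, else K."""
    return 0.0 if src == dst else K


def propagate_single_input(input_placements: list[str]) -> list[ToyStrategy]:
    """Build one strategy per candidate placement for an op like clone."""
    strategies = []
    for placement in input_placements:
        strategies.append(
            ToyStrategy(
                # Input and output come from the same candidate, never a fixed
                # first candidate paired with a varying output.
                input_placement=placement,
                output_placement=placement,
                # One row (single input), one cost per candidate source placement.
                redistribute_cost=[
                    [toy_cost(src, placement) for src in input_placements]
                ],
            )
        )
    return strategies


strategies = propagate_single_input(["Replicate()", "Shard(0)"])
for strategy in strategies:
    print(strategy)
```

With two candidates, the cost rows differ per strategy: the zero sits at whichever source already matches that strategy's placement.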

The previous strategy directly used 'self' input strategy for 'src' input. The fixed strategy correctly maps the self dim to src dim so that it works even if the src input is broadcast.

E.g. for this program, broadcasting will occur on dims 0,1,3 of self.

```
self = torch.ones((2,3,4,5))
src = torch.ones((4,1))
self.copy_(src)
```

These are the correct sharding combinations:

| self     | src         |
|----------|-------------|
| Shard(0) | Replicate() |
| Shard(1) | Replicate() |
| Shard(2) | Shard(0)    |
| Shard(3) | Shard(1)    |

Pull Request resolved: #158538
Approved by: https://github.com/zpcore, https://github.com/XilunWu, https://github.com/wanchaol
ghstack dependencies: #158495, #158490
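
The self-to-src dim mapping described above follows from right-aligning the two shapes, as broadcasting does. The sketch below works on plain shape tuples and does not use torch; `map_self_dims_to_src_dims` is a hypothetical helper written for illustration, not the function added in the PR. It reports which self dims have a src counterpart and which self dims are broadcast, and leaves the choice of placement to the strategy code.

```python
def map_self_dims_to_src_dims(self_shape, src_shape):
    """Right-align src against self, as broadcasting does.

    Returns (dim_map, broadcast_dims). dim_map sends a self dim to the src dim
    aligned with it. broadcast_dims lists self dims where src has no dim at
    all, or has size 1 while self does not.
    """
    offset = len(self_shape) - len(src_shape)
    if offset < 0:
        raise ValueError("src must not have more dims than self")

    dim_map = {}
    broadcast_dims = []
    for self_dim, self_size in enumerate(self_shape):
        src_dim = self_dim - offset
        if src_dim < 0:
            # src has no dim here, so it is implicitly expanded.
            broadcast_dims.append(self_dim)
            continue
        dim_map[self_dim] = src_dim
        src_size = src_shape[src_dim]
        if src_size == 1 and self_size != 1:
            # A size-1 src dim is expanded to match self.
            broadcast_dims.append(self_dim)
        elif src_size != self_size:
            raise ValueError(f"shapes do not broadcast at self dim {self_dim}")
    return dim_map, broadcast_dims


dim_map, broadcast_dims = map_self_dims_to_src_dims((2, 3, 4, 5), (4, 1))
print(dim_map)         # {2: 0, 3: 1}
print(broadcast_dims)  # [0, 1, 3]
```

For the shapes in the example program, self dims 2 and 3 line up with src dims 0 and 1, and dims 0, 1, 3 of self are the broadcast ones, matching the commit message.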

[DTensor] Document redistribute_costs

e98e0b2

[ghstack-poisoned]

wconstab mentioned this pull request Jul 16, 2025

[DTensor] Fix default_strategy and rename for clarity #158490

Closed

pytorch-bot bot added the ciflow/inductor and oncall: distributed labels Jul 16, 2025

wconstab added a commit that referenced this pull request Jul 16, 2025

[DTensor] Document redistribute_costs

d03da24

ghstack-source-id: eb3da25 Pull Request resolved: #158495

Update on "[DTensor] Document redistribute_costs"

ca4b1ef

cc H-Huang awgu wanchaol fegin fduwjj wz337 d4l3k [ghstack-poisoned]

wconstab added a commit that referenced this pull request Jul 16, 2025

[DTensor] Document redistribute_costs

6116c75

ghstack-source-id: fc49f31 Pull Request resolved: #158495

wconstab added the release notes: distributed (dtensor) label Jul 16, 2025

Update on "[DTensor] Document redistribute_costs"

54638ee

cc H-Huang awgu wanchaol fegin fduwjj wz337 d4l3k [ghstack-poisoned]

wconstab added a commit that referenced this pull request Jul 16, 2025

[DTensor] Document redistribute_costs

4e73a1f

ghstack-source-id: b88f375 Pull Request resolved: #158495

Update on "[DTensor] Document redistribute_costs"

4a6608e

cc H-Huang awgu wanchaol fegin fduwjj wz337 d4l3k [ghstack-poisoned]

wconstab added a commit that referenced this pull request Jul 17, 2025

[DTensor] Document redistribute_costs

21113ff

ghstack-source-id: e3b9b6c Pull Request resolved: #158495

wconstab mentioned this pull request Jul 17, 2025

[DTensor] fix copy_ strategy #158538

Closed

zpcore reviewed Jul 17, 2025

zpcore approved these changes Jul 17, 2025

wconstab added 2 commits July 17, 2025 12:38

Update on "[DTensor] Document redistribute_costs"

d85fd1b

cc H-Huang awgu wanchaol fegin fduwjj wz337 d4l3k [ghstack-poisoned]

Update on "[DTensor] Document redistribute_costs"

87b7510

[ghstack-poisoned]

pytorch-bot bot added the ciflow/trunk label Jul 17, 2025

XilunWu approved these changes Jul 17, 2025

pytorchmergebot added the merging label Jul 17, 2025

pytorchmergebot added the Merged label Jul 18, 2025

pytorchmergebot closed this in ddbecdf Jul 18, 2025

pytorchmergebot removed the merging label Jul 18, 2025

github-actions bot deleted the gh/wconstab/430/head branch August 17, 2025 02:20

[DTensor] Document redistribute_costs #158495
wconstab wants to merge 6 commits into gh/wconstab/430/base from gh/wconstab/430/head

4 participants
