Skip to content

[DTensor] Document redistribute_costs#158495

Closed
wconstab wants to merge 6 commits intogh/wconstab/430/basefrom
gh/wconstab/430/head
Closed

[DTensor] Document redistribute_costs#158495
wconstab wants to merge 6 commits intogh/wconstab/430/basefrom
gh/wconstab/430/head

Conversation

@wconstab
Copy link
Contributor

@wconstab wconstab commented Jul 16, 2025

@pytorch-bot
Copy link

pytorch-bot bot commented Jul 16, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/158495

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 87b7510 with merge base 900fba4 (image):

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added ciflow/inductor oncall: distributed Add this issue/PR to distributed oncall triage queue labels Jul 16, 2025
wconstab added a commit that referenced this pull request Jul 16, 2025
ghstack-source-id: eb3da25
Pull Request resolved: #158495
cc H-Huang awgu wanchaol fegin fduwjj wz337 d4l3k

[ghstack-poisoned]
wconstab added a commit that referenced this pull request Jul 16, 2025
ghstack-source-id: fc49f31
Pull Request resolved: #158495
@wconstab wconstab added the release notes: distributed (dtensor) release notes category label Jul 16, 2025
cc H-Huang awgu wanchaol fegin fduwjj wz337 d4l3k

[ghstack-poisoned]
wconstab added a commit that referenced this pull request Jul 16, 2025
ghstack-source-id: b88f375
Pull Request resolved: #158495
cc H-Huang awgu wanchaol fegin fduwjj wz337 d4l3k

[ghstack-poisoned]
wconstab added a commit that referenced this pull request Jul 17, 2025
ghstack-source-id: e3b9b6c
Pull Request resolved: #158495
Comment on lines +103 to +104
0.0, # cost of redistributing tensor_a from 'Replicate()'
K, # cost of redistributing tensor_a from 'Shard(0)'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about:
0.0, # cost of redistributing tensor_a from Replicate() -> Replicate()
K, # cost of redistributing tensor_a from 'Shard(0)' -> Replicate()

Copy link
Member

@zpcore zpcore left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

wconstab added 2 commits July 17, 2025 12:38
cc H-Huang awgu wanchaol fegin fduwjj wz337 d4l3k

[ghstack-poisoned]
@wconstab
Copy link
Contributor Author

@pytorchbot merge -i

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Jul 17, 2025
Copy link
Contributor

@XilunWu XilunWu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged while ignoring the following 1 checks: pull / cuda12.8-py3.10-gcc9-sm75 / test (pr_time_benchmarks, 1, 1, linux.g4dn.metal.nvidia.gpu, unstable)

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Copy link
Collaborator

Starting merge as part of PR stack under #158490

pytorchmergebot pushed a commit that referenced this pull request Jul 18, 2025
Fixes several bugs in the original.
- foremost, fixes a serious bug where we returned incorrect strategies
  by mixing input_specs that were frozen from
  select_strategy.strategies[0] with output_specs that varied across
  select_strategy.strategies[0..N] (e.g. we could create a nonsense
  strategy like input:Shard(0) output(Replicate) for an op like clone
- fixes the redistribute costs: they should not actually be 0, they
  should be the cost of redistributing our single input from another
  strategy to the current strategy, in our list of output strategies
- adds a note, wondering if we should have just literally returned the
  input strategy instead of creating this new object
- Currently, using default_strategy is incorrect becuase it maps 'self'
  tensor's strategies directly onto 'src' tensor without accounting for
  the fact that copy_ supports broadcasting a smaller rank tensor into a
  larger one.

Separates out copy_  op from default strategy, adds missing test case,
but does not fix the underlying issue with copy_, leaves that for future
PR

Renames to `propagate_single_input_strategy` since that's more
descriptive

Pull Request resolved: #158490
Approved by: https://github.com/wanchaol, https://github.com/XilunWu
ghstack dependencies: #158495
pytorchmergebot pushed a commit that referenced this pull request Jul 18, 2025
The previous strategy directly used 'self' input strategy for 'src'
input.  The fixed strategy correctly maps the self dim to src dim
so that it works even if the src input is broadcast.

E.g. for this program, broadcasting will occur on dims 0,1,3 of self.

```
self = torch.ones((2,3,4,5))
src = torch.ones((4,1))
self.copy_(src)
```

These are the correct sharding combinations:

|   self   |     src |
|-------|------|
| Shard(0)  |   Replicate() |
| Shard(1)  |   Replicate() |
| Shard(2)  |   Shard(0) |
| Shard(3)  |   Shard(1) |

Pull Request resolved: #158538
Approved by: https://github.com/zpcore, https://github.com/XilunWu, https://github.com/wanchaol
ghstack dependencies: #158495, #158490
@github-actions github-actions bot deleted the gh/wconstab/430/head branch August 17, 2025 02:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/inductor ciflow/trunk Trigger trunk jobs on your pull request Merged oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (dtensor) release notes category

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants