[DTensor] Fix default_strategy and rename for clarity#158490

Closed

wconstab wants to merge 7 commits into gh/wconstab/429/base from gh/wconstab/429/head

Conversation

@wconstab
Contributor

@wconstab wconstab commented Jul 16, 2025

Stack from ghstack (oldest at bottom):

Fixes several bugs in the original.

  • foremost, fixes a serious bug where we returned incorrect strategies
    by mixing input_specs that were frozen from
    select_strategy.strategies[0] with output_specs that varied across
    select_strategy.strategies[0..N] (e.g. we could create a nonsense
    strategy like input: Shard(0), output: Replicate() for an op like clone)
  • fixes the redistribute costs: they should not actually be 0, they
    should be the cost of redistributing our single input from another
    strategy to the current strategy, in our list of output strategies
  • adds a note, wondering if we should have just literally returned the
    input strategy instead of creating this new object
  • Currently, using default_strategy is incorrect because it maps 'self'
    tensor's strategies directly onto 'src' tensor without accounting for
    the fact that copy_ supports broadcasting a smaller rank tensor into a
    larger one.

Separates the copy_ op out from the default strategy and adds a missing
test case, but does not fix the underlying issue with copy_; that is left
for a future PR.

Renames to propagate_single_input_strategy since that's more
descriptive

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @d4l3k @pragupta
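To make the first two bullets concrete, here is a minimal, self-contained sketch of the corrected propagation logic. The classes and the cost model below are toy stand-ins invented for illustration; the names mirror, but are not, the real torch.distributed.tensor internals:

```python
from dataclasses import dataclass
from typing import List

# Toy stand-ins for DTensor's DTensorSpec/OpSpec, illustration only.
@dataclass
class Spec:
    placement: str  # e.g. "Shard(0)" or "Replicate()"

@dataclass
class OpSpec:
    output_spec: Spec
    input_specs: List[Spec]
    redistribute_cost: List[List[float]]

def cost(src: Spec, dst: Spec) -> float:
    # Hypothetical cost model: free if placements match, 1.0 otherwise.
    return 0.0 if src.placement == dst.placement else 1.0

def propagate_single_input_strategy(input_strategies: List[Spec]) -> List[OpSpec]:
    # For each candidate strategy of the single tensor input, the output
    # spec must match *that* strategy. The old bug froze input_specs to
    # strategies[0] while varying output_specs over strategies[0..N],
    # producing mismatched pairs like input Shard(0) / output Replicate().
    return [
        OpSpec(
            output_spec=s,
            input_specs=[s],  # input and output placements vary together
            # non-zero costs: redistributing the input from each
            # alternative strategy to the one this OpSpec assumes
            redistribute_cost=[[cost(other, s) for other in input_strategies]],
        )
        for s in input_strategies
    ]

specs = [Spec("Shard(0)"), Spec("Replicate()")]
for op_spec in propagate_single_input_strategy(specs):
    # no nonsense pairs: input always agrees with output
    assert op_spec.input_specs[0].placement == op_spec.output_spec.placement
```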

@pytorch-bot

pytorch-bot bot commented Jul 16, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/158490

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 3 Unrelated Failures

As of commit 4f6998f with merge base 1e86fa2:

NEW FAILURE - The following job has failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

BROKEN TRUNK - The following job failed but was present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

wconstab added a commit that referenced this pull request Jul 16, 2025
ghstack-source-id: 613b2d7
Pull Request resolved: #158490
@pytorch-bot pytorch-bot bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Jul 16, 2025
@wconstab wconstab added the release notes: distributed (dtensor) release notes category label Jul 16, 2025
```diff
     aten.zero_.default,
     ]
-)(default_strategy)
+)(propagate_single_input_strategy)
```
Member

Doesn't look like all ops here only have one input arg, e.g., aten.fill_.Scalar.

Contributor Author

i was wondering about that. but fill_.scalar should still only have one 'tensor' input, so i think my statement still holds and I just need to find a better way to check the schema?

Contributor Author

trying this instead
assert len([s for s in op_schema.args_schema if isinstance(s, OpStrategy)]) == 1
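For illustration, the proposed check can be exercised against a toy args_schema. The OpStrategy class below is a stand-in invented for this sketch, not the real DTensor type:

```python
# Toy stand-in for DTensor's OpStrategy, only to exercise the check above.
class OpStrategy:
    pass

# e.g. fill_.Scalar: one tensor input plus a non-tensor scalar argument,
# so counting OpStrategy args (rather than all args) gives exactly 1.
args_schema = (OpStrategy(), 3.0)
assert len([s for s in args_schema if isinstance(s, OpStrategy)]) == 1

# copy_ has two tensor inputs (self, src), so the same check would fail:
copy_args = (OpStrategy(), OpStrategy(), False)
assert len([s for s in copy_args if isinstance(s, OpStrategy)]) == 2
```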

Member

Yes, one Tensor input should hold. LGTM now!

Member

Oh, wait, func: copy_(Tensor(a!) self, Tensor src, bool non_blocking=False) -> Tensor(a!) still has two.

Contributor Author

you're right. hmm. i don't really like using this strategy func for an op that has 2 tensor inputs.

the way i interpret that is, we actually have 2 tensor inputs, each with different strategies. Let's pick the first one arbitrarily, ignore the second one, and then create an op strategy where we use the first input's strategy in place of the second input.

Will this work for to_copy? I am suspicious that it only works in some cases.

Counter example:
If we feed an input tensor of shape (2,) into copy_, we can 'broadcast' it implicitly.

>>> import torch
>>> x = torch.ones([2,2])
>>> y = torch.ones([2,]) * 3
>>> x.copy_(y)
tensor([[3., 3.],
        [3., 3.]])

If 'self' tensor (our first tensor input) has a sharding on dim 1, we can NOT use that sharding to describe 'src' tensor, which only has dim 0. So it is not correct to use this strategy for copy_ in general.
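A small pure-Python sketch of the distinction between reusing self's placement verbatim and mapping it across the broadcast; the helper names are hypothetical, not from the PR:

```python
def naive_src_shard_dim(self_shard_dim: int, src_ndim: int) -> int:
    # old default_strategy behavior: reuse self's sharded dim verbatim
    # for src, ignoring src's rank entirely
    return self_shard_dim

def mapped_src_shard_dim(self_shard_dim: int, self_ndim: int, src_ndim: int):
    # right-aligned broadcast mapping: self dim d lines up with
    # src dim d - (self_ndim - src_ndim); leading self dims have no
    # src counterpart and must fall back to Replicate (None here)
    d = self_shard_dim - (self_ndim - src_ndim)
    return d if d >= 0 else None

# self: shape (2, 2) sharded on dim 1; src: shape (2,)
assert naive_src_shard_dim(1, src_ndim=1) == 1   # invalid: src has no dim 1
assert mapped_src_shard_dim(1, self_ndim=2, src_ndim=1) == 0  # valid src dim
```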

Contributor Author

it also appears that we do not have a unit test for copy_ in the first place, but i added one that confirms that the assert len strategies == 1 will fail and we need a fix.

wconstab added 3 commits July 16, 2025 15:26
Collaborator

@wanchaol left a comment

Could you explain more about the bug? I thought it could be easily fixed by dynamically generating redistribute_costs. This changes the intention of default_strategy and I wonder if it's necessary

mesh=select_strategy.mesh,
placements=select_strategy.strategies[0].output_spec.placements,
tensor_meta=select_strategy.strategies[0].output_spec.tensor_meta,
def propagate_single_input_strategy(op_schema: OpSchema) -> StrategyType:
Collaborator

The default_strategy is not a single-input strategy; if you look at the old code, it actually loops through all the args_schema and appends a new DTensorSpec for each argument, so it is meant to be able to handle multiple input arguments.

Contributor Author

yes, i realized i was changing the semantics of the function. However, it was not clear that the original design was fully thought through. 5/6 ops using it were single-input, and the one multi-input op was using it incorrectly by accident.

Can you convince me that there are ops for which the multi-input default strategy would be both wanted and correct? If not, let's just make it simpler and avoid the possibility of using it incorrectly by accident.

Collaborator

Oh actually i think it's probably fine, it was added in this PR ae86e8f

Before that default strategy does not handle multiple inputs anyways.

input_specs=[
DTensorSpec(
mesh=first_input_strategy.mesh,
placements=strategy.output_spec.placements,
Collaborator

The intention of default strategy is to let every argument follow the first input argument's strategy, but here it looks like you assume the op has a single input. That probably makes sense given your renaming, but it is not the same intended behavior as before.

Contributor Author

I found that out of all the ops we used 'default_strategy' for, only copy_ was "benefiting" from applying the first input's strategy to the second input.

however, for copy_, this supposed benefit was actually a hidden correctness bug. It works when src and self have the same rank, but it fails when src has a smaller rank than self and requires broadcasting. I added a new test case that will fail with the current default_strategy applied to copy_.

Based on this, and since we do not have an example of an op with multiple inputs that can safely share the first input's strategy, I decided it's better to just simplify the helper to a narrower purpose. Wdyt?

tensor_meta=strategy.output_spec.tensor_meta,
),
input_specs=input_specs,
redistribute_cost=redistribute_cost,
Collaborator

shouldn't the real fix here simply be to generate the redistribute_cost by calling generate_redistribute_costs?

Contributor Author

this fixes only the second bug in my PR desc. I'll explain the first bug more in a comment below.

@wconstab
Contributor Author

Could you explain more about the bug? I thought it could be easily fixed by dynamically generating redistribute_costs. This changes the intention of default_strategy and I wonder if it's necessary

Yes, the first bug in my PR desc is related to the input specs not the redistribute costs.

(e.g. we could create a nonsense strategy like input:Shard(0) output(Replicate) for an op like clone

i found empirically that we were creating a strategy for Clone() where input strategy was Shard(0) and output strategy was Replicate(). I realized the reason was that we were for-looping over input strategies but hardcoding output strategy to be the one from the first input strategy.

wconstab added 2 commits July 17, 2025 12:38
Contributor

@XilunWu left a comment

The idea that we need a single_operand_strategy looks good to me. Would like to discuss more:

The pattern of "all DTensor inputs should follow a given input's sharding" seems common. I believe we still need it (of course it needs a fix) under a new name. But I'm okay with removing it in this PR and adding a correct version back in another, since the only use case right now is copy_ and copy_ needs to handle broadcast.

BTW, do some foreach_* ops need single-input propagate and follow-some-input propagate as well?

Comment on lines +45 to +46
# Note: this may be a complete waste of work, because it should be equivalent to
# `return first_input_strategy` (unless creating a deep copy is important for some reason)
Contributor

I don't think we're modifying OpSpec anywhere so it should be safe to directly return.

Comment on lines +105 to +133
```
num_tensor_args = 2
first_input_strategy = op_schema.args_schema[0]
assert isinstance(first_input_strategy, OpStrategy)
return OpStrategy(
    [
        OpSpec(
            output_specs=DTensorSpec(
                mesh=first_input_strategy.mesh,
                placements=strategy.output_spec.placements,
                tensor_meta=strategy.output_spec.tensor_meta,
            ),
            input_specs=[
                DTensorSpec(
                    mesh=first_input_strategy.mesh,
                    placements=strategy.output_spec.placements,
                    tensor_meta=strategy.output_spec.tensor_meta,
                )
                for _ in range(num_tensor_args)
            ],
            redistribute_cost=[
                generate_redistribute_costs(
                    first_input_strategy, strategy.output_spec
                )
                for _ in range(num_tensor_args)
            ],
        )
        for strategy in first_input_strategy.strategies
    ]
)
```
Contributor

IIUC this doesn't address the broadcast issue either. Do we address it in another PR?

Contributor Author

see the next PR in the stack

@wconstab
Contributor Author

BTW, do some foreach_* ops need single-input propagate and follow-some-input propagate as well?

foreach ops are different in that they are many-to-many, not many-to-one; the previous default_strategy was many-to-one (one output). It would be bad to use 'default_strategy' for foreach, because each output needs to follow its own input, not the first input.
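A toy illustration of the many-to-many point, with placements as plain strings (purely hypothetical, not DTensor code):

```python
# foreach-style ops are many-to-many: output i must follow input i's
# own strategy, not the first input's.
inputs = ["Shard(0)", "Replicate()", "Shard(1)"]

per_input = [placement for placement in inputs]  # correct: follow own input
follow_first = [inputs[0]] * len(inputs)         # default_strategy-style: wrong

assert per_input == inputs
assert follow_first != per_input  # outputs 1 and 2 would get the wrong placement
```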

@wconstab
Contributor Author

The pattern of "all DTensor input should follow a given input's sharding" seems common. I believe we still need it (of course it needs fix) with a new name. But I'm okay with removing it in this PR and add a correct version back in another since the only use case right now is copy_ and copy_ needs to handle broadcast.

can you give an example of an op where this pattern is true?

Contributor

@XilunWu left a comment

The strategy part LGTM. Didn't review the copy_ part because it's overridden in the next PR.


# @pytest.mark.xfail
# @with_comms
# def test_copy_broadcast(self):
Collaborator

why is this commented out? I thought you were marking it xfail

Contributor Author

xfail did not work, haha, so i commented it out. it's uncommented and passing in the next PR.


wconstab added a commit to meta-pytorch/autoparallel that referenced this pull request Jul 17, 2025
Renamed after changing the semantic upstream
pytorch/pytorch#158490
@wconstab
Contributor Author

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Jul 18, 2025
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status.

pytorchmergebot pushed a commit that referenced this pull request Jul 18, 2025
The previous strategy directly used 'self' input strategy for 'src'
input.  The fixed strategy correctly maps the self dim to src dim
so that it works even if the src input is broadcast.

E.g. for this program, broadcasting will occur on dims 0,1,3 of self.

```
self = torch.ones((2,3,4,5))
src = torch.ones((4,1))
self.copy_(src)
```

These are the correct sharding combinations:

| self     | src         |
|----------|-------------|
| Shard(0) | Replicate() |
| Shard(1) | Replicate() |
| Shard(2) | Shard(0)    |
| Shard(3) | Shard(1)    |
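One way to reproduce the table above is a right-aligned dim mapping. This is a hedged sketch of that idea, not the exact code from #158538:

```python
def src_placement(self_shard_dim: int, self_ndim: int, src_ndim: int) -> str:
    # Right-align src's dims against self's trailing dims: self dim d
    # corresponds to src dim d - (self_ndim - src_ndim). Leading self
    # dims have no src counterpart, so src must be Replicate there.
    d = self_shard_dim - (self_ndim - src_ndim)
    return f"Shard({d})" if d >= 0 else "Replicate()"

# self shape (2, 3, 4, 5) -> ndim 4; src shape (4, 1) -> ndim 2
table = {d: src_placement(d, 4, 2) for d in range(4)}
assert table == {0: "Replicate()", 1: "Replicate()",
                 2: "Shard(0)", 3: "Shard(1)"}
```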

Pull Request resolved: #158538
Approved by: https://github.com/zpcore, https://github.com/XilunWu, https://github.com/wanchaol
ghstack dependencies: #158495, #158490
@clee2000
Contributor

@pytorchbot revert -m "broke lint? GH job link HUD commit link" -c landrace

cc @seemethere tentatively marking this as landrace but the misspelling does exist in the PR as well

@pytorchmergebot
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

@pytorchmergebot
Collaborator

@wconstab your PR has been successfully reverted.

@pytorchmergebot pytorchmergebot added Reverted ci-no-td Do not run TD on this PR labels Jul 18, 2025
@pytorchmergebot
Collaborator

Starting merge as part of PR stack under #158538

wconstab added a commit to meta-pytorch/autoparallel that referenced this pull request Jul 18, 2025
Renamed after changing the semantic upstream
pytorch/pytorch#158490
@pytorchmergebot
Collaborator

Starting merge as part of PR stack under #158538

pytorchmergebot pushed a commit that referenced this pull request Jul 18, 2025
fmassa pushed a commit to meta-pytorch/autoparallel that referenced this pull request Jul 19, 2025
@github-actions github-actions bot deleted the gh/wconstab/429/head branch August 18, 2025 02:21

Labels

ci-no-td Do not run TD on this PR ciflow/inductor ciflow/trunk Trigger trunk jobs on your pull request Merged oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (dtensor) release notes category Reverted

6 participants