[DTensor] Fix default_strategy and rename for clarity #158490
wconstab wants to merge 7 commits into gh/wconstab/429/base
Conversation
Fixes several bugs in the original.

- Foremost, fixes a serious bug where we returned incorrect strategies by mixing input_specs that were frozen from select_strategy.strategies[0] with output_specs that varied across select_strategy.strategies[0..N] (e.g. we could create a nonsense strategy like input: Shard(0), output: Replicate() for an op like clone).
- Fixes the redistribute costs: they should not actually be 0; they should be the cost of redistributing our single input from another strategy to the current strategy, in our list of output strategies.
- Adds a note, wondering if we should have just literally returned the input strategy instead of creating this new object.

Renames to `propagate_single_input_strategy` since that's more descriptive.

[ghstack-poisoned]
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/158490
Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 3 Unrelated Failures as of commit 4f6998f with merge base 1e86fa2:

- NEW FAILURE - The following job has failed:
- FLAKY - The following job failed but was likely due to flakiness present on trunk:
- BROKEN TRUNK - The following job failed but was present on the merge base: 👉 Rebase onto the `viable/strict` branch to avoid these failures
- UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.
ghstack-source-id: 613b2d7 Pull Request resolved: #158490
```diff
     aten.zero_.default,
 ]
-)(default_strategy)
+)(propagate_single_input_strategy)
```
Doesn't look like all ops here only have one input arg, e.g., aten.fill_.Scalar.
i was wondering about that. but fill_.scalar should still only have one 'tensor' input, so i think my statement still holds and I just need to find a better way to check the schema?
trying this instead

```python
assert len([s for s in op_schema.args_schema if isinstance(s, OpStrategy)]) == 1
```
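That check can be modeled in plain Python (with a stand-in `OpStrategy` class and a hypothetical `has_single_tensor_input` helper; this is an illustrative sketch, not the DTensor code): only `OpStrategy` entries in `args_schema` correspond to Tensor arguments, so a scalar arg like the fill value in `fill_.Scalar` does not count.

```python
class OpStrategy:
    """Stand-in for DTensor's OpStrategy; used here only to tag tensor args."""


def has_single_tensor_input(args_schema) -> bool:
    # scalars and other non-tensor args are plain values, not OpStrategy
    return len([s for s in args_schema if isinstance(s, OpStrategy)]) == 1


# fill_.Scalar: one tensor plus a scalar fill value -> passes
assert has_single_tensor_input([OpStrategy(), 3.0])
# copy_: two tensor inputs -> violates the single-tensor-input assumption
assert not has_single_tensor_input([OpStrategy(), OpStrategy()])
```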
Yes, one Tensor input should hold. LGTM now!
Oh, wait, `copy_(Tensor(a!) self, Tensor src, bool non_blocking=False) -> Tensor(a!)` still has two.
you're right. hmm. i don't really like using this strategy func for an op that has 2 tensor inputs.
the way i interpret that is, we actually have 2 tensor inputs, each with different strategies. Let's pick the first one arbitrarily, ignore the second one, and then create an op strategy where we use the first input's strategy in place of the second input.
Will this work for to_copy? I am suspicious that it only works in some cases.
Counterexample:
If we feed an input tensor of shape (2,) into copy_, we can 'broadcast' it implicitly.
```python
>>> import torch
>>> x = torch.ones([2, 2])
>>> y = torch.ones([2]) * 3
>>> x.copy_(y)
tensor([[3., 3.],
        [3., 3.]])
```
If the 'self' tensor (our first tensor input) has a sharding on dim 1, we can NOT use that sharding to describe the 'src' tensor, which only has dim 0. So it is not correct to use this strategy for copy_ in general.
it also appears that we do not have a unit test for copy_ in the first place, but i added one that confirms the single-tensor-input assert will fail, so we need a fix.
wanchaol
left a comment
Could you explain more about the bug? I thought it could be easily fixed by dynamically generating redistribute_costs. This changes the intention of default_strategy and I wonder if it's necessary.
```python
mesh=select_strategy.mesh,
placements=select_strategy.strategies[0].output_spec.placements,
tensor_meta=select_strategy.strategies[0].output_spec.tensor_meta,

def propagate_single_input_strategy(op_schema: OpSchema) -> StrategyType:
```
The default_strategy is not a single_input strategy; if you look at the old code, it actually loops through all the args_schema and appends a new DTensorSpec for each argument, so it is meant to be able to handle multiple input arguments.
yes, i realized i was changing the semantics of the function. However, it was not clear if the original design was fully thought through. 5/6 ops using it were single-input, and the one op that has multiple inputs was using it incorrectly by accident.
Can you convince me that there are ops that we want to use the multi-input default strategy for and would be correct? If not, let's just make it simpler, and avoid the possibility of using it incorrectly by accident.
Oh actually i think it's probably fine, it was added in this PR ae86e8f.
Before that, default strategy did not handle multiple inputs anyway.
```python
input_specs=[
    DTensorSpec(
        mesh=first_input_strategy.mesh,
        placements=strategy.output_spec.placements,
```
The intention of default strategy is to let every argument follow the first input argument's strategy, but here it looks like you assume the op has only a single input. That probably makes sense for the renaming you have, but it was not the same intended behavior as before.
I found that out of all the ops that we used 'default_strategy' for, only copy_ was "benefitting" from applying the first input's strategy to the second input.
However, for copy_, this supposed benefit was actually a hidden correctness bug. It works for cases where src and self have the same rank, but it fails for cases where src has a smaller rank than self and requires broadcasting. I added a new test case that will fail with the current default_strategy applied to copy_.
Based on this, and since we do not have an example of an op that has multiple inputs that can safely share the first input's strategy, I decided it's better to just simplify the helper to a narrower purpose. Wdyt?
```python
        tensor_meta=strategy.output_spec.tensor_meta,
    ),
    input_specs=input_specs,
    redistribute_cost=redistribute_cost,
```
shouldn't the real fix here simply be generating the redistribute_cost by calling generate_redistribute_costs?
this fixes only the second bug in my PR desc. I'll explain the first bug more in a comment below.
Yes, the first bug in my PR desc is related to the input specs not the redistribute costs.
i found empirically that we were creating a strategy for clone() where the input strategy was Shard(0) and the output strategy was Replicate(). I realized the reason was that we were for-looping over input strategies but hardcoding the output strategy to be the one from the first input strategy.
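The frozen-input bug can be sketched with a toy loop over placements (plain strings standing in for DTensorSpec placements; this is an illustration, not the DTensor code):

```python
# candidate placements for clone's single input on a 1-D mesh
input_placements = ["Shard(0)", "Replicate()"]

# buggy: input spec frozen to strategies[0] while the output varies per strategy
buggy = [(input_placements[0], out) for out in input_placements]
assert ("Shard(0)", "Replicate()") in buggy  # nonsense input/output pair for clone

# fixed: input and output specs are taken from the SAME strategy
fixed = [(p, p) for p in input_placements]
assert all(inp == out for inp, out in fixed)
```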
- Currently, using default_strategy is incorrect for copy_ because it maps 'self' tensor's strategies directly onto the 'src' tensor without accounting for the fact that copy_ supports broadcasting a smaller rank tensor into a larger one.

Separates out the copy_ op from default strategy and adds the missing test case, but does not fix the underlying issue with copy_; leaves that for a future PR.

[ghstack-poisoned]
XilunWu
left a comment
The idea that we need a single_operand_strategy looks good to me. Would like to discuss more:
The pattern of "all DTensor inputs should follow a given input's sharding" seems common. I believe we still need it (of course it needs a fix) with a new name. But I'm okay with removing it in this PR and adding a correct version back in another, since the only use case right now is copy_ and copy_ needs to handle broadcast.
BTW, do some foreach_* ops need single-input propagate and follow-some-input propagate as well?
```python
# Note: this may be a complete waste of work, because it should be equivalent to
# `return first_input_strategy` (unless creating a deep copy is important for some reason)
```
I don't think we're modifying OpSpec anywhere so it should be safe to directly return.
```python
num_tensor_args = 2
first_input_strategy = op_schema.args_schema[0]
assert isinstance(first_input_strategy, OpStrategy)
return OpStrategy(
    [
        OpSpec(
            output_specs=DTensorSpec(
                mesh=first_input_strategy.mesh,
                placements=strategy.output_spec.placements,
                tensor_meta=strategy.output_spec.tensor_meta,
            ),
            input_specs=[
                DTensorSpec(
                    mesh=first_input_strategy.mesh,
                    placements=strategy.output_spec.placements,
                    tensor_meta=strategy.output_spec.tensor_meta,
                )
                for _ in range(num_tensor_args)
            ],
            redistribute_cost=[
                generate_redistribute_costs(
                    first_input_strategy, strategy.output_spec
                )
                for _ in range(num_tensor_args)
            ],
        )
        for strategy in first_input_strategy.strategies
    ]
)
```
IIUC this doesn't address the broadcast issue either. Do we address it in another PR?
see the next PR in the stack
foreach ops are different in that they are many-to-many, not many-to-one (the previous default_strategy was many-to-one: one output). It would be bad to use 'default_strategy' for foreach, because each output needs to follow its own input, not the first input.
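The many-to-one vs many-to-many distinction can be sketched with placement strings (an illustrative toy, not DTensor's actual foreach op registrations):

```python
# one placement per tensor in a foreach list
input_placements = ["Shard(0)", "Shard(1)", "Replicate()"]

# many-to-one (old default_strategy): the single output follows the FIRST input
many_to_one_output = input_placements[0]

# many-to-many (foreach_*): each output follows ITS OWN input
many_to_many_outputs = list(input_placements)

assert many_to_one_output == "Shard(0)"
assert many_to_many_outputs[1] == "Shard(1)"  # not forced to follow input 0
```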
can you give an example of an op where this pattern is true?
XilunWu
left a comment
The strategy part LGTM. Didn't review the copy_ part because it's overridden in the next PR.
```python
# @pytest.mark.xfail
# @with_comms
# def test_copy_broadcast(self):
```
why is this commented out? I thought you were marking xfails
xfail did not work haha. so i commented it out. it's uncommented and passing in the next PR.
Renamed after changing the semantic upstream pytorch/pytorch#158490
@pytorchbot merge

Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
The previous strategy directly used the 'self' input strategy for the 'src' input. The fixed strategy correctly maps the self dim to the src dim so that it works even if the src input is broadcast. E.g. for this program, broadcasting will occur on dims 0, 1, 3 of self.

```
self = torch.ones((2,3,4,5))
src = torch.ones((4,1))
self.copy_(src)
```

These are the correct sharding combinations:

| self | src |
|------|-----|
| Shard(0) | Replicate() |
| Shard(1) | Replicate() |
| Shard(2) | Shard(0) |
| Shard(3) | Shard(1) |

Pull Request resolved: #158538
Approved by: https://github.com/zpcore, https://github.com/XilunWu, https://github.com/wanchaol
ghstack dependencies: #158495, #158490
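The dim mapping behind this table can be sketched in plain Python: broadcasting right-aligns trailing dims, so self dim `d` corresponds to src dim `d - (self.ndim - src.ndim)`, and a self dim with no src counterpart must be Replicate() on src. (This `src_placement_for_self_shard` helper is hypothetical, with string placements; it is not the actual implementation in the PR, which also has to consider size-1 broadcast dims.)

```python
def src_placement_for_self_shard(self_ndim: int, src_ndim: int, shard_dim: int) -> str:
    """Map a Shard(dim) placement on self to the matching placement on a
    (possibly lower-rank) src that broadcasts into self."""
    offset = self_ndim - src_ndim  # broadcasting right-aligns trailing dims
    src_dim = shard_dim - offset
    if src_dim < 0:
        # no corresponding src dim: src is fully broadcast there
        return "Replicate()"
    return f"Shard({src_dim})"


# self: (2, 3, 4, 5), src: (4, 1) -> reproduces the table above
assert src_placement_for_self_shard(4, 2, 0) == "Replicate()"
assert src_placement_for_self_shard(4, 2, 1) == "Replicate()"
assert src_placement_for_self_shard(4, 2, 2) == "Shard(0)"
assert src_placement_for_self_shard(4, 2, 3) == "Shard(1)"
```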
@pytorchbot revert -m "broke lint? GH job link HUD commit link" -c landrace cc @seemethere tentatively marking this as landrace but the misspelling does exist in the PR as well
@pytorchbot successfully started a revert job. Check the current status here.
This reverts commit 7b05bdd. Reverted #158538 on behalf of https://github.com/clee2000 due to broke lint? [GH job link](https://github.com/pytorch/pytorch/actions/runs/16361950974/job/46231492581) [HUD commit link](https://hud.pytorch.org/pytorch/pytorch/commit/d8b084312b54e97bdbaf6a178fe2fc628a23243b) ([comment](#158490 (comment)))
This reverts commit d8b0843. Reverted #158490 on behalf of https://github.com/clee2000 due to broke lint? [GH job link](https://github.com/pytorch/pytorch/actions/runs/16361950974/job/46231492581) [HUD commit link](https://hud.pytorch.org/pytorch/pytorch/commit/d8b084312b54e97bdbaf6a178fe2fc628a23243b) ([comment](#158490 (comment)))
@wconstab your PR has been successfully reverted.
Starting merge as part of PR stack under #158538
Starting merge as part of PR stack under #158538
Stack from ghstack (oldest at bottom):

Fixes several bugs in the original.

- Foremost, fixes a serious bug where we returned incorrect strategies by mixing input_specs that were frozen from select_strategy.strategies[0] with output_specs that varied across select_strategy.strategies[0..N] (e.g. we could create a nonsense strategy like input: Shard(0), output: Replicate() for an op like clone).
- Fixes the redistribute costs: they should not actually be 0; they should be the cost of redistributing our single input from another strategy to the current strategy, in our list of output strategies.
- Adds a note, wondering if we should have just literally returned the input strategy instead of creating this new object.
- Currently, using default_strategy is incorrect for copy_ because it maps 'self' tensor's strategies directly onto the 'src' tensor without accounting for the fact that copy_ supports broadcasting a smaller rank tensor into a larger one.

Separates out the copy_ op from default strategy and adds the missing test case, but does not fix the underlying issue with copy_; leaves that for a future PR.

Renames to `propagate_single_input_strategy` since that's more descriptive.

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @d4l3k @pragupta