[DTensor] Fix _to_copy to reduce Partial before non-linear dtype conversions #172696
stmcgovern wants to merge 2 commits into pytorch:main
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/172696
Note: Links to docs will display an error until the docs builds have been completed. ❌ 1 New Failure as of commit f03d781 with merge base 007b6a4. The following job has failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Shouldn't this have a test?
b083109 to a8083cd
target_dtype = cast(torch.dtype | None, op_schema.kwargs_schema.get("dtype", None))
strategies = []
for strategy in first_input_strategy.strategies:
not specific to this PR, but in general we're trying to move away from the style of 'find existing input strategies, and modify/filter/tweak them' and instead use 'single mesh dim' strategies. The way a single-dim strategy would work is you just worry about a single mesh dim and you get TensorMetas for your tensor input arguments and then list out all the viable placements. Infra will take care of expanding this combinatorially to the real mesh.
I think the single_dim strategy equivalent of your PR would look something like
@register_single_dim_strategy(
    aten._to_copy.default, schema_info=RuntimeSchemaInfo(static_kwargkey=["dtype"])
)
def _to_copy_single_dim_strategy(
    op: OpOverload, args_schema: ArgsType, kwargs_schema: KwargsType
) -> list[list[Placement | _ShardingPlaceholder]]:
    strategies = []
    input = args_schema[0]
    # don't bother adding the all-replicate strategy, infra adds it for you
    for i in range(input.dim()):
        strategies.append([_ShardingPlaceholder(i), _ShardingPlaceholder(i)])
    for partial in (Partial("sum"), Partial("avg"), ...):
        if dtype_allows_partial(partial):  # pseudocode: the cast must commute with this reduce op
            strategies.append([partial, partial])
    return strategies
cc @anshul-si - maybe we should add some helper for nicely iterating the partial types for cases like these?
@stmcgovern were you planning to continue on this PR as single-dim, or do you want me to review this further as-is?
yes, I'll switch it over to single-dim. Got sidetracked with the squeeze issues.
else:
    # Unknown reduce_op, be conservative
    return True
I suppose we should just work out the cases and be complete, other than NormPartial which we plan to deprecate. cc @anshul-si who is helping with 'all the partial stuff'
this should be addressed now, since the caller iterates Partial.ALL_REDUCE_OPS
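A plain-Python illustration of why working out each case matters (hypothetical helper names, no DTensor required): a monotonic integer-style cast such as trunc commutes with max and min, but not with sum, so each reduce op in Partial.ALL_REDUCE_OPS needs its own answer:

```python
import math

def reduce_then_cast(op, xs):
    # Reduce the "global" values first, then cast: the correct result.
    return math.trunc(op(xs))

def cast_then_reduce(op, xs):
    # Cast each shard-local value first, then reduce:
    # what keeping the Partial placement across the cast would compute.
    return op(math.trunc(x) for x in xs)

xs = [0.6, 0.6]
# max commutes with a monotonic cast like trunc: both orders give 0
assert reduce_then_cast(max, xs) == cast_then_reduce(max, xs)
# sum does not: trunc(0.6 + 0.6) = 1, but trunc(0.6) + trunc(0.6) = 0
assert reduce_then_cast(sum, xs) != cast_then_reduce(sum, xs)
```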
a8083cd to 39e73a1
    aten._to_copy.default, schema_info=RuntimeSchemaInfo(static_kwargkey=["dtype"])
)(propagate_single_input_strategy)
)
def _to_copy_single_dim_strategy(
@anshul-si does this conflict with your pointwise rewrite?
def test_to_copy_partial_reduces_for_nonlinear_cast(self):
    # (reduce_op, target_dtype, expect_partial)
    cases = [
        ("sum", torch.int32, False),  # truncation breaks additivity
@pianpwk does this address the thing you found with strategy validation claiming incorrect rules on various ops that decompose to to_copies?
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 mandatory check(s) failed. Dig deeper by viewing the failures on hud.
39e73a1 to eed4c51
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 mandatory check(s) failed. Dig deeper by viewing the failures on hud.
eed4c51 to e41f21a
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
@pytorchbot revert -m "Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable" -c autorevert
This PR is attributed to have caused regression in: … Please investigate and fix the issues.
Claude finished @pytorch-auto-revert[bot]'s task in 3m 57s. Revert Diagnostics for PR #172696
Failure Summary: The test at …
Root Cause Analysis: This is a legitimate revert. The PR changed … Here's the chain of events: …
How to Fix: The author has a few options.
Option 1 is likely the simplest and safest fix: the PR's correctness concern only applies when there's a non-linear dtype conversion, so device-only …
@pytorchbot successfully started a revert job. Check the current status here.
…ype conversions (#172696)" This reverts commit 46cd90c. Reverted #172696 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](#172696 (comment)))
@stmcgovern your PR has been successfully reverted.
This PR was reopened (likely due to being reverted), so your approval was removed. Please request another review.
@claude can you explain the fsdp issue in more detail? Give an example of the sequence of operations in the before (working) and after (broken) case, including sizes/metadata at each step. Is FSDP abusing an odd behavior of the old rule, or is the new rule overly simple? cc @weifengpy in case you have any suggestion
Claude finished @wconstab's task in 8m 39s.
Detailed Analysis: Before (Working) vs After (Broken)
Concrete scenario. Setup: How FSDP Creates the Sharded Parameter
Global weight: …; FSDP shards on dim 0 with …
The local tensor of the DTensor is …; the DTensor's …
Before (Working): …
Claude's suggestion on option 1 seems reasonable to me. @anshul-si brought up whether …
Rewrite _to_copy strategy using register_single_dim_strategy. The old propagate_single_input_strategy unconditionally passed through all placements, which is incorrect when the dtype conversion does not commute with the reduce op (e.g. trunc(0.6+0.6)=1 ≠ trunc(0.6)+trunc(0.6)=0).
FSDP creates uneven shards (dim < world_size) with padding. Without allow_uneven_sharding, the single_dim_strategy infra rejects Shard placements via is_tensor_shardable, forcing a redistribution to Replicate that breaks reset_sharded_param().
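The uneven-shard situation above can be sketched in plain Python (hypothetical helper, mirroring how a dim-0 shard pads the trailing ranks when the dim size is not divisible by the world size):

```python
def shard_sizes_with_padding(dim_size: int, world_size: int) -> list[tuple[int, int]]:
    # Each rank holds ceil(dim_size / world_size) rows after padding;
    # returns (real_rows, padded_rows) per rank.
    per_rank = -(-dim_size // world_size)  # ceil division
    out = []
    for rank in range(world_size):
        start = min(rank * per_rank, dim_size)
        end = min(start + per_rank, dim_size)
        out.append((end - start, per_rank))
    return out

# dim 0 of size 3 sharded over 4 ranks: the last rank holds only padding,
# so a naive "dim size must cover every shard" check rejects this layout
print(shard_sizes_with_padding(3, 4))  # → [(1, 1), (1, 1), (1, 1), (0, 1)]
```

The zero-size shard on the last rank is exactly the case that a shardability check without allow_uneven_sharding rules out, which is why the strategy fell back to Replicate and broke reset_sharded_param().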
e41f21a to f03d781
@pytorchbot merge
This PR needs to be approved by an authorized maintainer before merge.
@wconstab Would you please re-review/approve? I added the recommended FSDP fix. The failed CI test is an unrelated timeout.
Fixes #172684
Updated to use single_dim_strategy.
Type conversion to int/bool on Partial(sum) incorrectly preserved the Partial placement, producing wrong results. trunc(a+b) != trunc(a) + trunc(b).
This adds a custom strategy for _to_copy that checks whether the dtype conversion is linear with respect to the reduce operation before preserving Partial.
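The bug and the fix can be illustrated without a mesh. In the sketch below (plain Python, hypothetical setup), a Partial(sum) tensor is simulated as per-rank partial values of one logical scalar, and the two orderings of cast and reduction are compared:

```python
import math

# Two ranks each hold a partial contribution of the same logical scalar.
local_partials = [0.6, 0.6]  # full (reduced) value = 1.2

# Fixed behavior: reduce the Partial first, then apply the non-linear cast.
reduce_first = math.trunc(sum(local_partials))           # trunc(1.2) = 1

# Old buggy behavior: cast each local shard, keep Partial, reduce later.
cast_first = sum(math.trunc(p) for p in local_partials)  # 0 + 0 = 0

print(reduce_first, cast_first)  # → 1 0
```

Reducing first reproduces what a replicated tensor would see; casting first silently changes the value, which is why the Partial placement must not survive a non-linear dtype conversion.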
This PR is offered in support of the Partial correctness stabilization efforts.