[DTensor] Prevent squeeze from redistributing with strict_view #175798
stmcgovern wants to merge 3 commits into pytorch:main from …
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/175798
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit ce19418 with merge base 9daaaa2.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Force-pushed from e2dfacb to dd7b23b
@claude review this PR
Claude finished @wconstab's task in 4m 38s — View job

PR Review: #175798

Summary
This PR fixes a real bug where an inplace squeeze left the DTensor inconsistent: the dispatch path updated `_spec` but never wrote the redistributed local tensor back to `_local_tensor`.

Code Quality
The change is well-scoped, and the generalization from a single op check (special-casing `squeeze_` by name) to tag-based dispatch is the right direction. One observation worth discussing: …
Testing
The tests are well-structured and directly target the bug. Two suggestions: …
Security
No security concerns identified.

Thread Safety
No thread safety concerns — this code path is per-op dispatch, not shared mutable state.

Backward Compatibility
No BC-breaking changes. This is purely a bug fix; the generalization from …

Performance
No performance concerns.

Recommendation
Approve. This is a correct, well-targeted bug fix with good test coverage. The generalization from a single-op special case to tag-based dispatch is the right design choice. The only minor suggestion is adding a …
```python
# use return_and_correct_aliasing to match the outer and the inner
# aliasing. See https://github.com/pytorch/pytorch/pull/158954
if not isinstance(local_results, torch.Tensor):
    raise AssertionError
```
nit: add text about what type the local result is
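A minimal sketch of the suggested change, in line with the later commit note "Add error message to AssertionError for debuggability"; the exact message wording is an assumption:

```python
if not isinstance(local_results, torch.Tensor):
    raise AssertionError(
        f"expected inplace op to produce a torch.Tensor local result, "
        f"got {type(local_results)}"
    )
```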
```python
# update the spec to handle tensor meta changes
if torch.Tag.inplace_view in op_call.tags:
    # Inplace view ops may change tensor meta. Update both
    # spec and local tensor -- redistribution may produce a
```
nit: the comment says "redistribution may produce…", but I think in this path a redistribution is actually not happening; it's purely a metadata change, right?
```python
dt = distribute_tensor(x, mesh, [Shard(0)])
dt.squeeze_(0)
self.assertEqual(dt.shape, torch.Size([4]))
self.assertEqual(dt._local_tensor.shape, torch.Size([4]))
```
Now I'm wondering: is this squeeze actually doing a redistribution, or just changing placements (i.e. a view operation)? Let's add a comm counter to disambiguate this in the test (and update the comment above if needed).
Respond to wconstab's review comments on PR pytorch#175798:
- Clarify comment about when/why local tensor writeback is needed
- Add error message to AssertionError for debuggability
- Add CommDebugMode assertions to verify allgather happens for Shard->Replicate squeeze and zero comms for no-op squeeze
Force-pushed from dd7b23b to 071ed37
Force-pushed from 071ed37 to 7ac3e4c
Force-pushed from d88c851 to 67adb02
Reordered the checks to address the benchmark regression CI failure.

@wconstab this one is ready for review when you have the chance.
```python
dt = distribute_tensor(x, mesh, [Shard(0)])
comm_mode = CommDebugMode()
with comm_mode:
    dt.squeeze_(0)
```
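Per the commit message above, the test also gained CommDebugMode assertions for this snippet; a hedged sketch of what those checks could look like (the exact counter key and counts are assumptions):

```python
# Case 1: squeezing a globally-singleton *sharded* dim forced a
# Shard(0) -> Replicate redistribution at this point in the PR,
# which surfaces as an allgather in the comm counters.
self.assertEqual(
    comm_mode.get_comm_counts()[torch.ops.c10d_functional.all_gather_into_tensor],
    1,
)

# Case 2 (a separate test body): a squeeze touching no sharded dim
# must be communication-free.
# self.assertEqual(comm_mode.get_total_counts(), 0)
```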
Related to my original comment: it seems like this PR is going against the design for DTensor views. Views are supposed to be guaranteed fast and to share the same memory as the original tensor, so doing a redistribution implicitly during a view operation is semantics-breaking even if it's convenient in some cases. We prefer to error as unsupported in such cases and require the user to intervene: do a redistribution first manually, then perform a true view operation.
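A short sketch of the user-side pattern this design implies, continuing the snippet above (the call sequence is an illustration, not taken from the PR):

```python
from torch.distributed.tensor import Replicate

# Instead of relying on an implicit allgather inside the view op,
# redistribute explicitly, then perform the now communication-free view.
dt = dt.redistribute(mesh, [Replicate()])  # explicit collective
dt = dt.squeeze(0)                         # pure view, shares memory
```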
I see. Yes, this PR is cleaning up after a redistribution, but that redistribution should be banned. I'll use strict_view=True for squeeze, following #149764.
Force-pushed from 856a96d to 85dc8d8
Force-pushed from 0885f50 to 364f8d0
Force-pushed from 12736e2 to ff4769b
Force-pushed from ff4769b to 8582689
Force-pushed from 8582689 to 9b58744
These failures seem unrelated; going to try to merge.

@pytorchbot merge
Merge started
Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed
Reason: 2 mandatory check(s) failed. The first few are: …
Dig deeper by viewing the failures on hud.
… allgather

Squeeze ops are views and must be communication-free. Enable strict_view=True for all squeeze registrations so squeezing a sharded singleton dim raises instead of silently allgathering. Generalize the inplace dispatch handler to check the redistribute schema for actual placement changes rather than special-casing squeeze_ ops by name.

Fixes pytorch#174136
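A hedged sketch of the user-visible behavior this commit describes; the mesh setup, shapes, and the exact exception type are assumptions:

```python
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import distribute_tensor, Shard

mesh = init_device_mesh("cpu", (2,))  # assumes a 2-rank process group is initialized
dt = distribute_tensor(torch.randn(1, 4), mesh, [Shard(0)])  # sharded singleton dim

# With strict_view=True on squeeze, this view would require
# Shard(0) -> Replicate, so it raises instead of silently allgathering.
try:
    dt.squeeze_(0)
except RuntimeError as e:
    print("strict view rejected implicit redistribution:", e)
```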
Use placement tuple comparison instead of DTensorSpec.__eq__ on the inplace fast path. DTensorSpec.__eq__ compares mesh, placements, shard_order, shape, stride, and dtype — unnecessary for the fast-path decision where only placements matter. This restores the instruction count to match main (56530 expected vs 66916 with __eq__).
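A minimal sketch of the fast-path comparison this commit describes (the variable names are assumptions):

```python
# Hot-path decision: only the placements matter here.
placements_unchanged = new_spec.placements == old_spec.placements

# The avoided alternative: DTensorSpec.__eq__ also compares mesh,
# shard_order, shape, stride, and dtype, which is wasted work on this path.
# specs_equal = new_spec == old_spec
```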
…guard

Remove the needs_redistribute check for inplace ops in dispatch. With strict_view=True on squeeze ops, sharding propagation already errors when an inplace view would require redistribution of arg 0. The remaining code handles two cases:
1. Placements unchanged (add_, mul_, etc.) — return self
2. Placements reindexed (squeeze_ removing a non-sharded dim: Shard(1) → Shard(0)) — update spec, no communication
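An illustrative sketch of case 2 (placement reindexing with no communication); the shapes and mesh setup are assumptions:

```python
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import distribute_tensor, Shard
from torch.distributed.tensor.debug import CommDebugMode

mesh = init_device_mesh("cpu", (2,))  # assumes 2 ranks are initialized
dt = distribute_tensor(torch.randn(1, 8), mesh, [Shard(1)])  # sharded on dim 1

with CommDebugMode() as comm_mode:
    dt.squeeze_(0)  # removes the non-sharded dim 0

assert dt.placements == (Shard(0),)       # Shard(1) reindexed to Shard(0)
assert comm_mode.get_total_counts() == 0  # pure metadata change, no comms
```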
Force-pushed from 9b58744 to ce19418
@pytorchbot merge
Merge started
Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Fixes #174136
When `squeeze_.dim` triggers redistribution (e.g. squeezing a globally-singleton sharded dim forces Shard(0) → Replicate), `redistribute_local_args` creates a new local tensor. The dispatch path updated `_spec` but never assigned the new tensor back to `_local_tensor`, leaving the DTensor in an inconsistent state where the spec claims shape [4] but the local tensor is still [1, 4].

Update: Prevent squeeze, as a view op, from doing a redistribution, using the `strict_view` keyword.

This PR syncs `_spec` and `_local_tensor`. It uses the `inplace_view` tag, so other inplace view ops (`unsqueeze_`, `t_`, `transpose_`) won't hit this once strategies are registered for them. The C++ fast path in `dispatchDTensorOp` explicitly skips inplace ops, so no C++ change is needed.
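The tag check mentioned above relies on public op metadata; a small standalone example of inspecting it, outside the dispatch code:

```python
import torch

op = torch.ops.aten.squeeze_.dim
# squeeze_ is tagged inplace_view in native_functions.yaml, so the
# dispatch handler can detect such ops generically instead of by name.
print(torch.Tag.inplace_view in op.tags)  # True
```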