
[DTensor] fix copy_ strategy to support linearity#162460

Closed
tianyu-l wants to merge 1 commit into gh/tianyu-l/5/base from gh/tianyu-l/5/head

Conversation

@tianyu-l
Contributor

@tianyu-l tianyu-l commented Sep 9, 2025

Stack from ghstack (oldest at bottom):

Fixing an issue introduced in #158538, where `aten.copy_.default` was registered as a pointwise op but without linearity.

In particular, when both the `src` and `dst` tensors have the same `Partial` placements, the copy should happen directly without redistribution, instead of redistributing both to `Replicate` before making the copy.

This was discovered via silently incorrect results, e.g. in the backward pass of `torch.einsum`.
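The linearity argument can be illustrated without any distributed setup. In the sketch below (plain Python lists standing in for per-rank local tensors, not the DTensor API), each rank holds one addend of the logical tensor under a `Partial(sum)` placement; copying addend-by-addend preserves the elementwise sum, which is why no redistribution to `Replicate` is needed when `src` and `dst` placements match:

```python
# Plain-Python sketch (hypothetical shard values, not DTensor) of why copy_
# is linear over matching Partial(sum) placements: the full tensor is the
# elementwise sum of per-rank shards, and a per-rank copy preserves that sum.
world_size = 2
src_shards = [[1.0, 2.0], [3.0, 4.0]]  # Partial(sum): full src == [4.0, 6.0]
dst_shards = [[9.0, 9.0], [9.0, 9.0]]  # Partial(sum): full dst == [18.0, 18.0]

# Direct per-rank copy -- the behavior this PR restores; no communication.
for rank in range(world_size):
    dst_shards[rank] = list(src_shards[rank])

# Reconstruct the logical tensors by summing across ranks.
full_src = [sum(vals) for vals in zip(*src_shards)]
full_dst = [sum(vals) for vals in zip(*dst_shards)]
assert full_dst == full_src == [4.0, 6.0]
```

If the placements did not match (e.g. different `Partial` reduction types), a per-rank copy would no longer preserve the logical value, which is why redistribution is still needed in that case.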

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @ezyang @msaroufim @dcci

[ghstack-poisoned]
@pytorch-bot

pytorch-bot Bot commented Sep 9, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/162460

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 93d92fa with merge base 4840a1a:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot Bot added the ciflow/inductor and oncall: distributed labels Sep 9, 2025
tianyu-l added a commit that referenced this pull request Sep 9, 2025
ghstack-source-id: 2c80ec6
Pull Request resolved: #162460
@tianyu-l tianyu-l requested review from ezyang and wanchaol September 9, 2025 06:16
@tianyu-l tianyu-l added the ciflow/trunk, topic: not user facing, and release notes: distributed (dtensor) labels Sep 9, 2025
@tianyu-l tianyu-l requested review from XilunWu and zpcore September 9, 2025 06:18
Comment on lines +104 to +105
dst_dtensor.copy_(src_dtensor)
dst_tensor.copy_(src_tensor)
Member


Curious, shouldn't `dst_dtensor.copy_(src_dtensor)` have already modified `dst_tensor`? I think we don't need `dst_tensor.copy_(src_tensor)`. The same pattern also appears in the tests above, though.

Contributor Author


Oh interesting! I tried, but it didn't copy through to `dst_tensor`. Maybe let's land this PR and follow up offline.

Member

@zpcore zpcore left a comment


LGTM! Left a comment for the test.

@tianyu-l
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here

@wconstab
Contributor

Nice catch. I intuitively thought linearity ought to have nothing to do with copy.

The silent incorrectness part bothers me: is that because we copied `src` into a replicated copy of `dst` and then threw that away without modifying the original `dst`?

If someone tried to `copy_` from a partial of one type into a partial of another type, we'd still need to do replication for correctness' sake. Would this PR still have the above bug in that case?

@tianyu-l tianyu-l deleted the gh/tianyu-l/5/head branch September 10, 2025 04:37
markc-614 pushed a commit to markc-614/pytorch that referenced this pull request Sep 17, 2025
Fixing issue introduced in pytorch#158538
where `aten.copy_.default` is registered as a pointwise op, but without linearity.

In particular, when both `src` and `dst` tensors have same `Partial` placements, direct copy should happen without redistribute, instead of redistributing both to `Replicate` before making the copy.

This was discovered from silent incorrect results e.g. on `torch.einsum` backward.

Pull Request resolved: pytorch#162460
Approved by: https://github.com/zpcore
mansiag05 pushed a commit to mansiag05/pytorch that referenced this pull request Sep 22, 2025
cleonard530 pushed a commit to cleonard530/pytorch that referenced this pull request Sep 22, 2025
dsashidh pushed a commit to dsashidh/pytorch that referenced this pull request Sep 26, 2025