
Fix inplace ops on Partial DTensors to preserve aliasing semantics#164729

Closed
RohitRathore1 wants to merge 5 commits into pytorch:main from RohitRathore1:inplace_ops_new

Conversation


@RohitRathore1 RohitRathore1 commented Oct 6, 2025

Fixes #163374.

Here is the output from reproducible code:

```
W1006 09:09:26.329000 2457 /home/fedora/github/pytorch/torch/distributed/run.py:811]
W1006 09:09:26.329000 2457 /home/fedora/github/pytorch/torch/distributed/run.py:811] *****************************************
W1006 09:09:26.329000 2457 /home/fedora/github/pytorch/torch/distributed/run.py:811] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W1006 09:09:26.329000 2457 /home/fedora/github/pytorch/torch/distributed/run.py:811] *****************************************
  aten::clamp_(dt: f32[][R], None, 2)
    redistribute_input(0, [P] -> [R])
      redistribute_input(t: f32[], [P] -> [R])
        _c10d_functional::all_reduce(t: f32[], sum, 0)
        _c10d_functional::wait_tensor(t: f32[])
    aten::clamp_(t: f32[], None, 2)
    aten::view(t: f32[], [])
(Replicate(),)
tensor(2., device='cuda:0')
```

The behavior now matches what was expected in issue #163374:

Expected behavior (from the issue):

  1. Placement should change from Partial(sum) to Replicate()
  2. Value should be tensor(2.) instead of tensor(144.)

Actual output from this build:

  1. (Replicate(),) - placement is correct
  2. tensor(2., device='cuda:0') - value is correct

So the inplace operation now properly redistributes the partial DTensor to replicate before performing the clamp and maintains the correct aliasing semantics. It also produces the expected clamped value.
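Why the order of redistribution and clamp matters can be illustrated without torch. A minimal stdlib-only sketch (the shard values are invented for illustration; a real Partial(sum) DTensor stores per-rank contributions whose sum is the logical value):

```python
# Stdlib-only illustration (no torch); the shard values are made up.
# clamp_(None, 2) must see the *reduced* value, so the redistribution
# (all-reduce) has to happen before the inplace op.
shards = [100.0, 44.0]        # per-rank partial contributions; 100 + 44 = 144
full = sum(shards)            # redistribute: Partial(sum) -> Replicate
clamped = min(full, 2.0)      # aten::clamp_(t, None, 2) on the replicated value
print(clamped)                # 2.0, matching the expected tensor(2.)

# Clamping each shard first and then reducing gives a different answer:
wrong = sum(min(s, 2.0) for s in shards)  # 2.0 + 2.0 = 4.0
```

This is only an analogy for the ordering constraint, not the actual DTensor code path.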

cc: @SherlockNoMad @ezyang @janeyx99 @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @msaroufim @dcci



pytorch-bot bot commented Oct 6, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/164729

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit ec907a6 with merge base 573a79f:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.


linux-foundation-easycla bot commented Oct 6, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

@pytorch-bot pytorch-bot bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Oct 6, 2025
@RohitRathore1
Collaborator Author

@pytorchbot label "topic: not user facing"

@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label Oct 6, 2025
@janeyx99 janeyx99 requested a review from tianyu-l October 7, 2025 21:56
@janeyx99 janeyx99 added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Oct 7, 2025
@fduwjj fduwjj requested review from XilunWu and zpcore October 8, 2025 18:27

fduwjj commented Oct 8, 2025

lgtm but I also cced two more DTensor people. If they have not reviewed it by next week, I will stamp on this PR.

@RohitRathore1
Collaborator Author

@fduwjj @zpcore if everything looks good here, could you please approve and merge this PR? It has passed all the tests, thanks!


@ezyang ezyang left a comment


I don't think this is right. The problem is if there is a view into the partial tensor; just overwriting the local tensor is not enough to cause the views to update. In fact, I tend to think you are just toast if you need a redistribute on the tensor being inplace'd. I'd look more positively on something that just errors when this happens.
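The aliasing hazard described here can be sketched without torch: rebinding the underlying storage (as an assignment like `dt._local_tensor = new_tensor` would) leaves existing views pointing at the old storage, whereas a true inplace write is visible through them. A minimal stdlib analogy, with `bytearray` standing in for the local tensor's storage and `memoryview` for a view into it:

```python
# Stdlib analogy for the aliasing hazard; not actual DTensor code.
buf = bytearray(b"\x01\x02")
view = memoryview(buf)        # aliases buf's storage, like a tensor view

buf[0] = 9                    # a real inplace write...
assert view[0] == 9           # ...is visible through the view

buf = bytearray(b"\x07\x08")  # rebinding, like overwriting _local_tensor
assert view[0] == 9           # the old view still sees the old storage
```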


RohitRathore1 commented Oct 15, 2025

I don't think this is right. The problem is if there is a view into the partial tensor; just overwriting the local tensor is not enough to cause the views to update. In fact, I tend to think you are just toast if you need a redistribute on the tensor being inplace'd. I'd look more positively on something that just errors when this happens.

Thanks for the review @ezyang! You're absolutely right about the aliasing issue. I hadn't considered the case where there might be views into the partial tensor i.e., simply overwriting _local_tensor would break those views. So instead of trying to fix this with _local_tensor assignment, I should just error out when an inplace operation requires redistribution. If this approach sounds good, let me update the PR to raise a RuntimeError in this case and adjust the test accordingly. Should the error be raised for all inplace ops that need redistribution, or are there specific cases where it would be safe?
cc: @fduwjj @zpcore @XilunWu

@RohitRathore1
Collaborator Author

@ezyang should I go ahead with this fix?


ezyang commented Oct 17, 2025

sure

@RohitRathore1
Collaborator Author

sure

@ezyang Thanks, I have updated my PR.


# update the spec for all inplace ops to handle placement changes
# that don't require redistribution
args[0]._spec = output_spec

I think it's still OK (and in fact preferred) for the squeeze_ logic to live inside the squeeze_ branch. Most inplace ops do NOT change the tensor meta. So because no redistribution could have occurred, the input cannot have a placement change.
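A rough sketch of the structure being suggested, with invented names (this is not the actual DTensor dispatch code): only `squeeze_` mutates the tensor meta, so only its branch refreshes the cached spec.

```python
# Hypothetical sketch: the spec update lives inside the squeeze_ branch,
# because most inplace ops leave the tensor meta unchanged.
class FakeDTensor:
    def __init__(self, spec):
        self._spec = spec

def handle_inplace(op_name, dt, output_spec):
    if op_name == "aten.squeeze_.dim":
        # squeeze_ changes the tensor meta, so the spec must follow it
        dt._spec = output_spec
    # other inplace ops: no meta change, and since no redistribution
    # occurred, the input's placement cannot have changed either
    return dt

dt = FakeDTensor(spec="Partial(sum)")
handle_inplace("aten.clamp_", dt, output_spec="Replicate()")
print(dt._spec)  # still Partial(sum): meta-preserving ops leave the spec alone
```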

Collaborator Author


Thanks @ezyang! You're absolutely right. I'll move the args[0]._spec = output_spec inside the squeeze_.dim branch.


ezyang commented Oct 27, 2025

Just a small tweak please, thanks

@RohitRathore1
Collaborator Author

Just a small tweak please, thanks

Done, thanks!


ezyang commented Oct 29, 2025

need to update tests

@RohitRathore1
Collaborator Author

need to update tests

@ezyang are you referring to the failing test, or do I need to update an existing test?


ezyang commented Nov 3, 2025

your ci failed, fix it

@RohitRathore1
Collaborator Author

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased inplace_ops_new onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout inplace_ops_new && git pull --rebase)

@RohitRathore1
Collaborator Author

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

Per @ezyang's feedback, inplace operations that require redistribution
(e.g., Partial -> Replicate) now raise a RuntimeError instead of trying
to work around the issue by overwriting _local_tensor.
@pytorchmergebot
Collaborator

Successfully rebased inplace_ops_new onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout inplace_ops_new && git pull --rebase)

@RohitRathore1
Collaborator Author

@ezyang I've fixed the CI failure. The issue was that the placement change check was happening before the squeeze_.dim special case, so I moved it into the else block to exempt squeeze_.dim. I also removed the spec update for other inplace ops, as you mentioned: since most inplace ops don't change the tensor meta, and no redistribution occurs when placements match, the spec update isn't needed. Let me know if there's anything else that needs to be addressed.
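The guard described above can be sketched roughly as follows, with invented names (the real check lives inside DTensor's dispatch logic): an inplace op whose input would need redistribution errors out instead of silently rebinding the local tensor and breaking views.

```python
# Hypothetical guard, not the actual PyTorch implementation: raise when an
# inplace op would require redistribution (e.g. Partial -> Replicate).
def check_inplace_placements(op_name, input_placements, required_placements):
    if tuple(input_placements) != tuple(required_placements):
        raise RuntimeError(
            f"{op_name}: in-place operations that require redistribution "
            f"({input_placements} -> {required_placements}) are not "
            "supported, as they cannot preserve aliasing with existing views"
        )

# Matching placements pass through without error:
check_inplace_placements("aten.clamp_", ("Replicate()",), ("Replicate()",))
```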


ezyang commented Nov 14, 2025

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Nov 14, 2025
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced debugging: check the merge workflow status.

Silv3S pushed a commit to Silv3S/pytorch that referenced this pull request Nov 18, 2025
…ytorch#164729)

Pull Request resolved: pytorch#164729
Approved by: https://github.com/ezyang

Development

Successfully merging this pull request may close these issues.

[DTensor] Inplace ops produces wrong result

7 participants