Skip to content

Fix longstanding race condition around getAllOperatorsFor#167860

Closed
swolchok wants to merge 1 commit intogh/swolchok/871/basefrom
gh/swolchok/871/head
Closed

Fix longstanding race condition around getAllOperatorsFor#167860
swolchok wants to merge 1 commit intogh/swolchok/871/basefrom
gh/swolchok/871/head

Conversation

@swolchok
Copy link
Contributor

@swolchok swolchok commented Nov 14, 2025

Stack from ghstack (oldest at bottom):

getAllOperatorsFor returns a const reference to internal state that is protected by a lock. Presuming that the lock is necessary in the first place (about which I offer no opinion because it's unclear to what extent the GIL should help here), this is a straightforward way to cause callers to create race conditions.

This should fix those race conditions by copying the state instead. I modified calling code to stop binding a const reference to the result for clarity.

Differential Revision: D87088731

NOTE FOR REVIEWERS: This PR has internal Meta-specific changes or comments, please review them on Phabricator!

cc @EikanWang @jgong5 @wenzhe-nrv @sanchitintel

getAllOperatorsFor returns a const reference to internal state that is protected by a lock. Presuming that the lock is necessary in the first place (about which I offer no opinion), this is a straightforward way to cause callers to create race conditions.

This should fix those race conditions by copying the state instead. I modified calling code to stop binding a const reference to the result for clarity.

Differential Revision: [D87088731](https://our.internmc.facebook.com/intern/diff/D87088731/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D87088731/)!

[ghstack-poisoned]
@pytorch-bot pytorch-bot bot added the release notes: jit release notes category label Nov 14, 2025
@pytorch-bot
Copy link

pytorch-bot bot commented Nov 14, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/167860

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 8d169ad with merge base 2aba180 (image):

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

swolchok added a commit that referenced this pull request Nov 14, 2025
getAllOperatorsFor returns a const reference to internal state that is protected by a lock. Presuming that the lock is necessary in the first place (about which I offer no opinion), this is a straightforward way to cause callers to create race conditions.

This should fix those race conditions by copying the state instead. I modified calling code to stop binding a const reference to the result for clarity.

Differential Revision: [D87088731](https://our.internmc.facebook.com/intern/diff/D87088731/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D87088731/)!

ghstack-source-id: 323363321
Pull Request resolved: #167860
@facebook-github-bot facebook-github-bot added the oncall: jit Add this issue/PR to JIT oncall triage queue label Nov 14, 2025
@pytorch-bot pytorch-bot bot removed the oncall: jit Add this issue/PR to JIT oncall triage queue label Nov 14, 2025
@pytorch-bot
Copy link

pytorch-bot bot commented Nov 14, 2025

The label oncall: jit is only applicable to issues and has been removed. Please only use this label on issues.

// the "first" op may change depending on registration order.
std::vector<std::shared_ptr<Operator>> sortedOps;
sortedOps.reserve(unsortedOps.size());
std::copy_if(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this really faster than a stable Boolean sort?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR makes no judgement on that; this is existing code that we're not changing. (however I note that this is two passes over the input whereas a sort is going to take N log N)

}

const std::vector<std::shared_ptr<Operator>>& getOperatorsWithLockHeld(
Symbol name) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is symbol really trivial to take by value or should this be const ref?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it fits in a register and has no copy ctor

auto it = operators.find(name);
if (it != operators.end())
return it->second;
return empty;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ugh this is weird. If only we had constexpr std::vector before C++20...

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Nov 17, 2025
@swolchok
Copy link
Contributor Author

@pytorchbot merge -i

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged while ignoring the following 1 checks: trunk / linux-jammy-py3-clang12-executorch / test (executorch, 1, 1, lf.linux.2xlarge, unstable)

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

Silv3S pushed a commit to Silv3S/pytorch that referenced this pull request Nov 18, 2025
…7860)

getAllOperatorsFor returns a const reference to internal state that is protected by a lock. Presuming that the lock is necessary in the first place (about which I offer no opinion because it's unclear to what extent the GIL should help here), this is a straightforward way to cause callers to create race conditions.

This should fix those race conditions by copying the state instead. I modified calling code to stop binding a const reference to the result for clarity.

Differential Revision: [D87088731](https://our.internmc.facebook.com/intern/diff/D87088731/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D87088731/)!

Pull Request resolved: pytorch#167860
Approved by: https://github.com/zou3519
pytorchmergebot pushed a commit that referenced this pull request Nov 21, 2025
```
git revert --no-commit 567dcdb 200156e 3d801a4 2034ca9 480b4ff f570e58
```

    And Revert "[DTensor] Document fast-path dispatch (#168192)"
    And Revert "[DTensor] Fix deadlock after fast cache clear (#168069)"

Reverts:
* #167860
* #167588
* #167475
* #166808
* #166372
* #168192
* #168069

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: #168264
Approved by: https://github.com/seemethere, https://github.com/malfet
ezyang added a commit that referenced this pull request Dec 1, 2025
This reverts commit 567dcdb.
This reverts commit 0e7ccc0.


ghstack-source-id: fec7bf5
Pull-Request: #169281
pytorchmergebot pushed a commit that referenced this pull request Dec 1, 2025
This reverts commit 567dcdb.
This reverts commit 0e7ccc0.


ghstack-source-id: 950b6d8
Pull-Request: #169281
ezyang added a commit to ezyang/pytorch that referenced this pull request Dec 4, 2025
…torch#169281

Summary:
This reverts commit 567dcdb.
This reverts commit 0e7ccc0.

These changes break downstream builds and also are the suspected culprit of a
memory corruption.

Test Plan: sandcastle

Differential Revision: D88394221
pytorchmergebot pushed a commit that referenced this pull request Dec 5, 2025
This reverts commit 567dcdb.
This reverts commit 0e7ccc0.

These changes break downstream builds and also are the suspected culprit of a memory corruption.
Pull Request resolved: #169281
Approved by: https://github.com/zou3519, https://github.com/albanD
umechand-amd pushed a commit to ROCm/pytorch that referenced this pull request Dec 8, 2025
…ytorch#169281)

This reverts commit 567dcdb.
This reverts commit 0e7ccc0.

These changes break downstream builds and also are the suspected culprit of a memory corruption.
Pull Request resolved: pytorch#169281
Approved by: https://github.com/zou3519, https://github.com/albanD
JacobSzwejbka pushed a commit that referenced this pull request Dec 8, 2025
```
git revert --no-commit 567dcdb 200156e 3d801a4 2034ca9 480b4ff f570e58
```

    And Revert "[DTensor] Document fast-path dispatch (#168192)"
    And Revert "[DTensor] Fix deadlock after fast cache clear (#168069)"

Reverts:
* #167860
* #167588
* #167475
* #166808
* #166372
* #168192
* #168069

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: #168264
Approved by: https://github.com/seemethere, https://github.com/malfet
JacobSzwejbka pushed a commit that referenced this pull request Dec 8, 2025
This reverts commit 567dcdb.
This reverts commit 0e7ccc0.

These changes break downstream builds and also are the suspected culprit of a memory corruption.
Pull Request resolved: #169281
Approved by: https://github.com/zou3519, https://github.com/albanD
@github-actions github-actions bot deleted the gh/swolchok/871/head branch December 18, 2025 02:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/trunk Trigger trunk jobs on your pull request fb-exported Merged meta-exported release notes: jit release notes category

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants