Fix longstanding race condition around getAllOperatorsFor#167860
Fix longstanding race condition around getAllOperatorsFor#167860swolchok wants to merge 1 commit intogh/swolchok/871/basefrom
Conversation
getAllOperatorsFor returns a const reference to internal state that is protected by a lock. Presuming that the lock is necessary in the first place (about which I offer no opinion), this is a straightforward way to cause callers to create race conditions. This should fix those race conditions by copying the state instead. I modified calling code to stop binding a const reference to the result for clarity. Differential Revision: [D87088731](https://our.internmc.facebook.com/intern/diff/D87088731/) **NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D87088731/)! [ghstack-poisoned]
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/167860
Note: Links to docs will display an error until the docs builds have been completed. ✅ You can merge normally! (1 Unrelated Failure)As of commit 8d169ad with merge base 2aba180 ( UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
getAllOperatorsFor returns a const reference to internal state that is protected by a lock. Presuming that the lock is necessary in the first place (about which I offer no opinion), this is a straightforward way to cause callers to create race conditions. This should fix those race conditions by copying the state instead. I modified calling code to stop binding a const reference to the result for clarity. Differential Revision: [D87088731](https://our.internmc.facebook.com/intern/diff/D87088731/) **NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D87088731/)! ghstack-source-id: 323363321 Pull Request resolved: #167860
|
The label |
| // the "first" op may change depending on registration order. | ||
| std::vector<std::shared_ptr<Operator>> sortedOps; | ||
| sortedOps.reserve(unsortedOps.size()); | ||
| std::copy_if( |
There was a problem hiding this comment.
Is this really faster than a stable Boolean sort?
There was a problem hiding this comment.
This PR makes no judgement on that; this is existing code that we're not changing. (however I note that this is two passes over the input whereas a sort is going to take N log N)
| } | ||
|
|
||
| const std::vector<std::shared_ptr<Operator>>& getOperatorsWithLockHeld( | ||
| Symbol name) { |
There was a problem hiding this comment.
Is symbol really trivial to take by value or should this be const ref?
There was a problem hiding this comment.
it fits in a register and has no copy ctor
| auto it = operators.find(name); | ||
| if (it != operators.end()) | ||
| return it->second; | ||
| return empty; |
There was a problem hiding this comment.
Ugh this is weird. If only we had constexpr std::vector before C++20...
|
@pytorchbot merge -i |
Merge startedYour change will be merged while ignoring the following 1 checks: trunk / linux-jammy-py3-clang12-executorch / test (executorch, 1, 1, lf.linux.2xlarge, unstable) Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
…7860) getAllOperatorsFor returns a const reference to internal state that is protected by a lock. Presuming that the lock is necessary in the first place (about which I offer no opinion because it's unclear to what extent the GIL should help here), this is a straightforward way to cause callers to create race conditions. This should fix those race conditions by copying the state instead. I modified calling code to stop binding a const reference to the result for clarity. Differential Revision: [D87088731](https://our.internmc.facebook.com/intern/diff/D87088731/) **NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D87088731/)! Pull Request resolved: pytorch#167860 Approved by: https://github.com/zou3519
``` git revert --no-commit 567dcdb 200156e 3d801a4 2034ca9 480b4ff f570e58 ``` And Revert "[DTensor] Document fast-path dispatch (#168192)" And Revert "[DTensor] Fix deadlock after fast cache clear (#168069)" Reverts: * #167860 * #167588 * #167475 * #166808 * #166372 * #168192 * #168069 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: #168264 Approved by: https://github.com/seemethere, https://github.com/malfet
…torch#169281 Summary: This reverts commit 567dcdb. This reverts commit 0e7ccc0. These changes break downstream builds and also are the suspected culprit of a memory corruption. Test Plan: sandcastle Differential Revision: D88394221
This reverts commit 567dcdb. This reverts commit 0e7ccc0. These changes break downstream builds and also are the suspected culprit of a memory corruption. Pull Request resolved: #169281 Approved by: https://github.com/zou3519, https://github.com/albanD
…ytorch#169281) This reverts commit 567dcdb. This reverts commit 0e7ccc0. These changes break downstream builds and also are the suspected culprit of a memory corruption. Pull Request resolved: pytorch#169281 Approved by: https://github.com/zou3519, https://github.com/albanD
``` git revert --no-commit 567dcdb 200156e 3d801a4 2034ca9 480b4ff f570e58 ``` And Revert "[DTensor] Document fast-path dispatch (#168192)" And Revert "[DTensor] Fix deadlock after fast cache clear (#168069)" Reverts: * #167860 * #167588 * #167475 * #166808 * #166372 * #168192 * #168069 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: #168264 Approved by: https://github.com/seemethere, https://github.com/malfet
This reverts commit 567dcdb. This reverts commit 0e7ccc0. These changes break downstream builds and also are the suspected culprit of a memory corruption. Pull Request resolved: #169281 Approved by: https://github.com/zou3519, https://github.com/albanD
Stack from ghstack (oldest at bottom):
getAllOperatorsFor returns a const reference to internal state that is protected by a lock. Presuming that the lock is necessary in the first place (about which I offer no opinion because it's unclear to what extent the GIL should help here), this is a straightforward way to cause callers to create race conditions.
This should fix those race conditions by copying the state instead. I modified calling code to stop binding a const reference to the result for clarity.
Differential Revision: D87088731
NOTE FOR REVIEWERS: This PR has internal Meta-specific changes or comments, please review them on Phabricator!
cc @EikanWang @jgong5 @wenzhe-nrv @sanchitintel