Re-land "Fix thread safety in getCurrentCUDABlasHandle and getCUDABlasLtWorkspace"#167722
t-ivan-gr wants to merge 1 commit into pytorch:main from
Conversation
…sLtWorkspace"

Summary: getCurrentCUDABlasHandle() and getCUDABlasLtWorkspace() use static mutable maps that are not protected from concurrent read-and-write. This leads to crashes. This diff adds mutexes to synchronize access to the static maps.

Note: this is a re-land of D86316117 / pytorch#167248

Test Plan: Use a GPU OD, run multi-threaded tests (cuda_cublas_handle_pool_test) with TSAN:

```
buck test fbcode//mode/dev-tsan fbcode//caffe2:cuda_cublas_handle_pool_test -- --stress-runs 100
```

https://www.internalfb.com/intern/testinfra/testrun/14355223937501118

TSAN output (before synchronization was added): P2026731804

Differential Revision: D86964261
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/167722
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure, 2 Unrelated Failures. As of commit 27665b6 with merge base 4de24bc:
NEW FAILURE - The following job has failed:
FLAKY - The following job failed but was likely due to flakiness present on trunk:
UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@t-ivan-gr has exported this pull request. If you are a Meta employee, you can view the originating Diff in D86964261.

@pytorchbot label "topic: not user facing"
Note: this is a re-land of D86316117 / #167248. Please refer to the discussion on that PR. It was reverted, I believe, in error, since re-creation of the same PR no longer triggers any CI failures. The PR was recreated (rather than reused) to prevent issues with phabricator<>GitHub sync tooling.

With regards to the reason why the initial PR was reverted: I was able to bisect the failure to #166891. The failure is host-specific for some reason: it consistently fails on some hosts but consistently passes on others. This flakiness pattern caused the attribution logic to get confused and revert the wrong PR.
@pytorchbot merge

Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
@pytorchbot revert -m "Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable" -c autorevert

This PR is attributed to have caused a regression in:
Please investigate and fix the issues.

@pytorchbot successfully started a revert job. Check the current status here.
…tCUDABlasLtWorkspace" (#167722)"

This reverts commit 40e6f09. Reverted #167722 on behalf of https://github.com/pytorch-auto-revert due to "Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable" ([comment](#167722 (comment)))

@t-ivan-gr your PR has been successfully reverted.
It looks like the tolerance needs to be slightly relaxed on those tests.
…sLtWorkspace" (pytorch#167722)

Summary: getCurrentCUDABlasHandle() and getCUDABlasLtWorkspace() use static mutable maps that are not protected from concurrent read-and-write. This leads to crashes. This diff adds mutexes to synchronize access to the static maps.

Note: this is a re-land of D86316117 / pytorch#167248 (see comments for details)

Test Plan: Use a GPU OD, run multi-threaded tests (cuda_cublas_handle_pool_test) with TSAN:

```
buck test fbcode//mode/dev-tsan fbcode//caffe2:cuda_cublas_handle_pool_test -- --stress-runs 100
```

https://www.internalfb.com/intern/testinfra/testrun/14355223937501118

TSAN output (before synchronization was added): P2026731804

Differential Revision: D86964261
Pull Request resolved: pytorch#167722
Approved by: https://github.com/malfet
Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as