
Re-land "Fix thread safety in getCurrentCUDABlasHandle and getCUDABlasLtWorkspace"#167722

Closed
t-ivan-gr wants to merge 1 commit into pytorch:main from t-ivan-gr:export-D86964261

Conversation

@t-ivan-gr
Contributor

@t-ivan-gr t-ivan-gr commented Nov 13, 2025

Summary:
getCurrentCUDABlasHandle() and getCUDABlasLtWorkspace() use static mutable maps that are not protected against concurrent reads and writes, which leads to crashes.
This diff adds mutexes to synchronize access to those static maps.

Note: this is a re-land of D86316117 / #167248 (see comments for details)
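The pattern described in the summary can be sketched as follows. This is a minimal, illustrative example only: the names (`Handle`, `getHandleForDevice`, `stressTest`) are hypothetical stand-ins, not PyTorch's actual internals, and an `int` stands in for `cublasHandle_t`.

```cpp
#include <map>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical handle type standing in for cublasHandle_t.
using Handle = int;

// Static mutable map shared across threads -- the kind of per-device
// cache the PR summary says getCurrentCUDABlasHandle() maintains.
static std::map<int, Handle> handle_cache;
// Mutex serializing all reads and writes of the cache, as the fix adds.
static std::mutex handle_cache_mutex;

Handle getHandleForDevice(int device) {
  // Without this lock, concurrent find/emplace on a std::map is a
  // data race that TSAN would report.
  std::lock_guard<std::mutex> guard(handle_cache_mutex);
  auto it = handle_cache.find(device);
  if (it == handle_cache.end()) {
    // Simulate creating a new handle (cublasCreate in the real code).
    it = handle_cache.emplace(device, device * 100 + 1).first;
  }
  return it->second;
}

// Hammer the cache from many threads, roughly what a stress test like
// cuda_cublas_handle_pool_test does; returns true if every thread saw
// a consistent handle for its device.
bool stressTest(int threads, int iterations) {
  std::vector<std::thread> workers;
  std::vector<Handle> results(threads);
  for (int t = 0; t < threads; ++t) {
    workers.emplace_back([&, t] {
      Handle h = 0;
      for (int i = 0; i < iterations; ++i) {
        h = getHandleForDevice(t % 4);
      }
      results[t] = h;  // each thread writes its own slot: no race here
    });
  }
  for (auto& w : workers) {
    w.join();
  }
  for (int t = 0; t < threads; ++t) {
    if (results[t] != (t % 4) * 100 + 1) return false;
  }
  return true;
}
```

A `std::lock_guard` keeps the critical section exception-safe; the lock covers both the lookup and the insertion so a handle is created at most once per device.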

Test Plan:
Use a GPU OD, run multi-threaded tests (cuda_cublas_handle_pool_test) with TSAN:

```
buck test fbcode//mode/dev-tsan fbcode//caffe2:cuda_cublas_handle_pool_test -- --stress-runs 100
```

https://www.internalfb.com/intern/testinfra/testrun/14355223937501118

TSAN output (before synchronization was added): P2026731804

Differential Revision: D86964261

@pytorch-bot

pytorch-bot bot commented Nov 13, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/167722

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 2 Unrelated Failures

As of commit 27665b6 with merge base 4de24bc:

NEW FAILURE - The following job has failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-codesync

meta-codesync bot commented Nov 13, 2025

@t-ivan-gr has exported this pull request. If you are a Meta employee, you can view the originating Diff in D86964261.

@t-ivan-gr
Contributor Author

@pytorchbot label "topic: not user facing"

@pytorch-bot pytorch-bot bot added the "topic: not user facing" label Nov 13, 2025
@t-ivan-gr
Contributor Author

Note: this is a re-land of D86316117 / #167248. Please refer to the discussion on that PR. It was reverted, I believe, in error, since re-creating the same PR no longer triggers any CI failures. The PR was recreated (rather than reused) to prevent issues with phabricator<>github sync tooling.

@t-ivan-gr t-ivan-gr marked this pull request as ready for review November 13, 2025 14:54
@t-ivan-gr t-ivan-gr requested a review from malfet November 13, 2025 14:55
@malfet malfet added the "ci-no-td" (Do not run TD on this PR) and "ciflow/trunk" (Trigger trunk jobs on your pull request) labels Nov 13, 2025
@t-ivan-gr
Contributor Author

Regarding why the initial PR was reverted: I was able to bisect the failure to #166891.

The failure is host-specific for some reason: it consistently fails on some hosts but consistently passes on others.

This flakiness pattern confused the attribution logic, which ended up reverting the wrong PR.

@t-ivan-gr
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here


@pytorchmergebot
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot added a commit that referenced this pull request Nov 14, 2025
…tCUDABlasLtWorkspace" (#167722)"

This reverts commit 40e6f09.

Reverted #167722 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](#167722 (comment)))
@pytorchmergebot
Collaborator

@t-ivan-gr your PR has been successfully reverted.

@ngimel
Collaborator

ngimel commented Nov 14, 2025

It looks like the tolerance needs to be slightly relaxed on those tests
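For context, "relaxing the tolerance" in a numerical test usually means widening the relative/absolute tolerance in a comparison like the following. This is an illustrative sketch, not the PyTorch test code; the function name and default values are assumptions.

```cpp
#include <cmath>

// Combined relative+absolute tolerance check, the knob that a flaky
// numerical test typically loosens. Defaults are illustrative.
bool approxEqual(double actual, double expected,
                 double rtol = 1e-7, double atol = 1e-8) {
  return std::fabs(actual - expected) <=
         atol + rtol * std::fabs(expected);
}
```

Bumping `rtol`/`atol` a little lets results that differ only by benign, host-dependent floating-point noise still pass.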

jsuarez5341 pushed a commit to PufferAI/pytorch that referenced this pull request Nov 15, 2025
…tCUDABlasLtWorkspace" (pytorch#167722)"

This reverts commit 40e6f09.

Reverted pytorch#167722 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](pytorch#167722 (comment)))
Silv3S pushed a commit to Silv3S/pytorch that referenced this pull request Nov 18, 2025
…sLtWorkspace" (pytorch#167722)

Summary:
getCurrentCUDABlasHandle() and getCUDABlasLtWorkspace() use static mutable maps that are not protected from concurrent read-and-write. This leads to crashes.
This diff adds mutexes to synchronize access to the static maps.

Note: this is a re-land of D86316117 / pytorch#167248 (see comments for details)

Test Plan:
Use a GPU OD, run multi-threaded tests (cuda_cublas_handle_pool_test) with TSAN:
```
buck test fbcode//mode/dev-tsan fbcode//caffe2:cuda_cublas_handle_pool_test  -- --stress-runs 100
```
https://www.internalfb.com/intern/testinfra/testrun/14355223937501118

TSAN output (before synchronization was added): P2026731804

Differential Revision: D86964261

Pull Request resolved: pytorch#167722
Approved by: https://github.com/malfet
Silv3S pushed a commit to Silv3S/pytorch that referenced this pull request Nov 18, 2025
…tCUDABlasLtWorkspace" (pytorch#167722)"

This reverts commit 40e6f09.

Reverted pytorch#167722 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](pytorch#167722 (comment)))
@github-actions
Contributor

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions github-actions bot added the Stale label Jan 14, 2026
@t-ivan-gr t-ivan-gr closed this Jan 14, 2026

Labels

ci-no-td (Do not run TD on this PR), ciflow/trunk (Trigger trunk jobs on your pull request), fb-exported, Merged, meta-exported, Reverted, Stale, topic: not user facing (topic category)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants