Re-land "Fix thread safety in getCurrentCUDABlasHandle and getCUDABlasLtWorkspace"#167722
t-ivan-gr wants to merge 1 commit into pytorch:main from
Conversation
…sLtWorkspace"

Summary: getCurrentCUDABlasHandle() and getCUDABlasLtWorkspace() use static mutable maps that are not protected from concurrent read-and-write. This leads to crashes. This diff adds mutexes to synchronize access to the static maps.

Note: this is a re-land of D86316117 / pytorch#167248

Test Plan: Use a GPU OD, run multi-threaded tests (cuda_cublas_handle_pool_test) with TSAN:

```
buck test fbcode//mode/dev-tsan fbcode//caffe2:cuda_cublas_handle_pool_test -- --stress-runs 100
```

https://www.internalfb.com/intern/testinfra/testrun/14355223937501118

TSAN output (before synchronization was added): P2026731804

Differential Revision: D86964261
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/167722
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure, 2 Unrelated Failures. As of commit 27665b6 with merge base 4de24bc:
NEW FAILURE - The following job has failed:
FLAKY - The following job failed but was likely due to flakiness present on trunk:
UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@t-ivan-gr has exported this pull request. If you are a Meta employee, you can view the originating Diff in D86964261.

@pytorchbot label "topic: not user facing"
Note: this is a re-land of D86316117 / #167248. Please refer to the discussion on that PR. It was reverted, I believe, in error, since re-creation of the same PR no longer triggers any CI failures. The PR was recreated (rather than reused) to prevent issues with phabricator<>GitHub sync tooling.

With regards to the reason why the initial PR was reverted: I was able to bisect the failure to #166891. The failure is host-specific for some reason: it consistently fails on some hosts but consistently passes on others. This flakiness pattern caused the attribution logic to get confused and revert the wrong PR.
@pytorchbot merge

Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
@pytorchbot revert -m "Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable" -c autorevert

This PR is attributed to have caused a regression in:
Please investigate and fix the issues.

@pytorchbot successfully started a revert job. Check the current status here.
…tCUDABlasLtWorkspace" (#167722)"

This reverts commit 40e6f09. Reverted #167722 on behalf of https://github.com/pytorch-auto-revert due to "Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable" ([comment](#167722 (comment)))

@t-ivan-gr your PR has been successfully reverted.
It looks like the tolerance needs to be slightly relaxed on those tests.
…sLtWorkspace" (pytorch#167722)

Summary: getCurrentCUDABlasHandle() and getCUDABlasLtWorkspace() use static mutable maps that are not protected from concurrent read-and-write. This leads to crashes. This diff adds mutexes to synchronize access to the static maps.

Note: this is a re-land of D86316117 / pytorch#167248 (see comments for details)

Test Plan: Use a GPU OD, run multi-threaded tests (cuda_cublas_handle_pool_test) with TSAN:

```
buck test fbcode//mode/dev-tsan fbcode//caffe2:cuda_cublas_handle_pool_test -- --stress-runs 100
```

https://www.internalfb.com/intern/testinfra/testrun/14355223937501118

TSAN output (before synchronization was added): P2026731804

Differential Revision: D86964261
Pull Request resolved: pytorch#167722
Approved by: https://github.com/malfet
Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as