Refactor CUDAAllocatorConfig to reuse AcceleratorAllocatorConfig#150312

Closed
guangyey wants to merge 101 commits into gh/guangyey/133/base from gh/guangyey/133/head

Conversation

@guangyey
Collaborator

@guangyey guangyey commented Mar 31, 2025

Stack from ghstack (oldest at bottom):

Motivation

Refactor CUDAAllocatorConfig to reuse AcceleratorAllocatorConfig and ConfigTokenizer. We will deprecate the options that overlap with AcceleratorAllocatorConfig in a follow-up PR and keep them only for backward compatibility.

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k

@guangyey guangyey requested review from eqy and syed-ahmed as code owners March 31, 2025 16:12
@pytorch-bot

pytorch-bot bot commented Mar 31, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/150312

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

✅ You can merge normally! (3 Unrelated Failures)

As of commit df9befb with merge base bb67660:

UNSTABLE - The following jobs are marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@guangyey guangyey changed the title Refactor CUDAAllocatorConfig to reuse AllocatorConfig [WIP] Refactor CUDAAllocatorConfig to reuse AllocatorConfig Mar 31, 2025
guangyey added 21 commits March 31, 2025 23:46
[ghstack-poisoned] (×21)
@guangyey guangyey added the release notes: cpp (release notes category) and topic: not user facing (topic category) labels Apr 15, 2025
guangyey added 3 commits July 16, 2025 16:03
[ghstack-poisoned] (×3)
@pytorchmergebot
Collaborator

Starting merge as part of PR stack under #156175

pytorchmergebot pushed a commit that referenced this pull request Jul 30, 2025
…llocatorConfig instead (#156165)

Pull Request resolved: #156165
Approved by: https://github.com/albanD
ghstack dependencies: #149601, #157908, #150312
pytorchmergebot pushed a commit that referenced this pull request Jul 30, 2025
# Motivation
This PR moves the implementation of `torch.cuda.memory._set_allocator_settings` to `torch._C._accelerator_setAllocatorSettings`.
Since the original API was intended as a temporary/internal utility, I am not exposing the new function as a public API.

Pull Request resolved: #156175
Approved by: https://github.com/albanD
ghstack dependencies: #149601, #157908, #150312, #156165
[ghstack-poisoned]
@ScottTodd
Contributor

Hi, I'm noticing downstream failures that a bisect tracked back to this commit: ROCm/TheRock#1155.

We're building PyTorch with ROCm on Windows and at runtime we have new failures to load c10_hip.dll:

(.venv) λ python
Python 3.12.8 (tags/v3.12.8:2dc476b, Dec  3 2024, 19:30:04) [MSC v.1942 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "D:\projects\TheRock\external-builds\pytorch\.venv\Lib\site-packages\torch\__init__.py", line 281, in <module>
    _load_dll_libraries()
  File "D:\projects\TheRock\external-builds\pytorch\.venv\Lib\site-packages\torch\__init__.py", line 264, in _load_dll_libraries
    raise err
OSError: [WinError 1114] A dynamic link library (DLL) initialization routine failed. Error loading "D:\projects\TheRock\external-builds\pytorch\.venv\Lib\site-packages\torch\lib\c10_hip.dll" or one of its dependencies.

This commit does not cleanly revert because there is a sequence of other commits stacked on top: https://github.com/pytorch/pytorch/commits/main/c10/cuda

I'm trying to debug what has changed and why the load is failing. Do you have any advice or ideas?

@guangyey
Collaborator Author

Hi @ScottTodd, I think the main code change in this PR is that CUDACachingAllocator now reuses CUDAAllocatorConfig as follows, decoupling the mutual dependence between them (CUDACachingAllocator now depends on CUDAAllocatorConfig, but not vice versa).

// If the environment variable is set, we use the CudaMallocAsync allocator.
if (CUDAAllocatorConfig::use_async_allocator()) {
  return CudaMallocAsync::allocator();
}
return &Native::allocator;

This line ensures that CUDAAllocatorConfig is initialized at static initialization time. I don't think this should trigger the static initialization order fiasco. However, I'm not entirely sure how this behaves on ROCm under Windows. You might try commenting out this line and rebuilding to see if the issue persists.

@ScottTodd
Contributor

Same errors with this diff applied:

λ git diff
diff --git a/c10/cuda/CUDACachingAllocator.cpp b/c10/cuda/CUDACachingAllocator.cpp
index b0b1be8937a..052d6006034 100644
--- a/c10/cuda/CUDACachingAllocator.cpp
+++ b/c10/cuda/CUDACachingAllocator.cpp
@@ -4128,9 +4128,9 @@ CUDAAllocator* allocator();
 struct BackendStaticInitializer {
   CUDAAllocator* parseEnvForBackend() {
     // If the environment variable is set, we use the CudaMallocAsync allocator.
-    if (CUDAAllocatorConfig::use_async_allocator()) {
-      return CudaMallocAsync::allocator();
-    }
+    // if (CUDAAllocatorConfig::use_async_allocator()) {
+    //   return CudaMallocAsync::allocator();
+    // }
     return &Native::allocator;
   }

diff --git a/c10/hip/HIPCachingAllocator.cpp b/c10/hip/HIPCachingAllocator.cpp
index a2455ac42d5..f4c61fcbe1d 100644
--- a/c10/hip/HIPCachingAllocator.cpp
+++ b/c10/hip/HIPCachingAllocator.cpp
@@ -4129,9 +4129,9 @@ HIPAllocator* allocator();
 struct BackendStaticInitializer {
   HIPAllocator* parseEnvForBackend() {
     // If the environment variable is set, we use the HipMallocAsync allocator.
-    if (HIPAllocatorConfig::use_async_allocator()) {
-      return HipMallocAsync::allocator();
-    }
+    // if (HIPAllocatorConfig::use_async_allocator()) {
+    //   return HipMallocAsync::allocator();
+    // }
     return &Native::allocator;
   }

I'd like to get more logs, even printfs, or a debugger attached somehow. This is tricky to debug when all I'm getting is: A dynamic link library (DLL) initialization routine failed. Error loading "D:\projects\TheRock\external-builds\pytorch\.venv\Lib\site-packages\torch\lib\c10_hip.dll" or one of its dependencies.

  • Maybe there are other build flags I can set to get output from TORCH_CHECK, if that is detecting issues?
  • Maybe if I run some of the c10 c++ tests? I've had BUILD_TEST=0 set when running setup.py bdist_wheel due to other unrelated issues though 🤔

@ScottTodd
Contributor

Alright, not much luck debugging, but on closer inspection of the code there is an initialization order issue here. It is fixed by moving keys_ from an inline static at class scope into the function:

--- a/c10/hip/HIPAllocatorConfig.h
+++ b/c10/hip/HIPAllocatorConfig.h
@@ -112,7 +112,22 @@ class C10_HIP_API HIPAllocatorConfig {
   }

   static const std::unordered_set<std::string>& getKeys() {
-    return keys_;
+
+    static std::unordered_set<std::string> keys = {
+      "backend",
+      // keep BC for Rocm: `cuda` -> `cud` `a`, to avoid hipify issues
+      // NOLINTBEGIN(bugprone-suspicious-missing-comma,-warnings-as-errors)
+      "release_lock_on_cud"
+      "amalloc",
+      "pinned_use_cud"
+      "a_host_register",
+      // NOLINTEND(bugprone-suspicious-missing-comma,-warnings-as-errors)
+      "release_lock_on_hipmalloc",
+      "pinned_use_hip_host_register",
+      "pinned_num_register_threads",
+    };
+
+    return keys;
   }

   static HIPAllocatorConfig& instance() {
@@ -164,18 +179,6 @@ class C10_HIP_API HIPAllocatorConfig {
   std::atomic<bool> m_pinned_use_hip_host_register{false};
   std::atomic<bool> m_use_async_allocator{false};
   std::atomic<bool> m_is_allocator_loaded{false};
-  inline static std::unordered_set<std::string> keys_{
-      "backend",
-      // keep BC for Rocm: `cuda` -> `cud` `a`, to avoid hipify issues
-      // NOLINTBEGIN(bugprone-suspicious-missing-comma,-warnings-as-errors)
-      "release_lock_on_cud"
-      "amalloc",
-      "pinned_use_cud"
-      "a_host_register",
-      // NOLINTEND(bugprone-suspicious-missing-comma,-warnings-as-errors)
-      "release_lock_on_hipmalloc",
-      "pinned_use_hip_host_register",
-      "pinned_num_register_threads"};
 };

The C++ runtime must be fully initialized before STL containers like std::string and std::unordered_set can be used, but static initialization of a DLL can run before the CRT is ready.

I can send a PR with the fix.

@ScottTodd
Contributor

Ah, though c8cf811 also mutates keys_, so another fix, such as lazy initialization with std::once_flag, may be needed there. What would you like to do? Revert? Fix forward yourself? I can try to put together a full fix too.

@guangyey
Collaborator Author

guangyey commented Aug 1, 2025

@pytorchbot revert -m 'Static initialization order issue impact the downstream repo' -c ghfirst

@pytorchmergebot
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot added a commit that referenced this pull request Aug 1, 2025
…56175)"

This reverts commit d3ce450.

Reverted #156175 on behalf of https://github.com/guangyey due to Static initialization order issue impact the downstream repo ([comment](#150312 (comment)))
pytorchmergebot added a commit that referenced this pull request Aug 1, 2025
…leratorAllocatorConfig instead (#156165)"

This reverts commit 1fc010a.

Reverted #156165 on behalf of https://github.com/guangyey due to Static initialization order issue impact the downstream repo ([comment](#150312 (comment)))
@pytorchmergebot
Collaborator

@guangyey your PR has been successfully reverted.

@guangyey
Collaborator Author

guangyey commented Aug 1, 2025

Ah, though c8cf811 also mutates keys_, so another fix like using lazy initialization with std::once_flag may be needed there. What would you like to do? Revert? Fix forward yourself? I can try to put together a full fix too.

Sorry for the disruption. Let's revert this PR first to unblock your progress.

[ghstack-poisoned]
@pytorchmergebot
Collaborator

Starting merge as part of PR stack under #156175

3 similar comments


Labels

ci-no-td (Do not run TD on this PR), ciflow/h100-distributed, ciflow/rocm (Trigger "default" config CI on ROCm), ciflow/trunk (Trigger trunk jobs on your pull request), Merged, no-stale, oncall: distributed (Add this issue/PR to distributed oncall triage queue), open source, release notes: cpp (release notes category), Reverted, Stale, topic: not user facing (topic category)

Projects

None yet


8 participants