Refactor CUDAAllocatorConfig to reuse AcceleratorAllocatorConfig#150312

Closed
guangyey wants to merge 101 commits into gh/guangyey/133/base from gh/guangyey/133/head

Conversation

@guangyey
Collaborator

@guangyey guangyey commented Mar 31, 2025

Stack from ghstack (oldest at bottom):

Motivation

Refactor CUDAAllocatorConfig to reuse AcceleratorAllocatorConfig and ConfigTokenizer. We will deprecate the options that overlap with AcceleratorAllocatorConfig in a follow-up PR and keep them only for backward compatibility.

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k

@guangyey guangyey requested review from eqy and syed-ahmed as code owners March 31, 2025 16:12
@pytorch-bot

pytorch-bot bot commented Mar 31, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/150312

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

✅ You can merge normally! (3 Unrelated Failures)

As of commit df9befb with merge base bb67660:

UNSTABLE - The following jobs are marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@guangyey guangyey changed the title Refactor CUDAAllocatorConfig to reuse AllocatorConfig [WIP] Refactor CUDAAllocatorConfig to reuse AllocatorConfig Mar 31, 2025
guangyey added 21 commits March 31, 2025 23:46
[ghstack-poisoned] (×21)
@guangyey guangyey added the release notes: cpp (release notes category) and topic: not user facing (topic category) labels Apr 15, 2025
guangyey added 3 commits July 16, 2025 16:03
[ghstack-poisoned] (×3)
@pytorchmergebot
Collaborator

Starting merge as part of PR stack under #156175

pytorchmergebot pushed a commit that referenced this pull request Jul 30, 2025
…llocatorConfig instead (#156165)

Pull Request resolved: #156165
Approved by: https://github.com/albanD
ghstack dependencies: #149601, #157908, #150312
pytorchmergebot pushed a commit that referenced this pull request Jul 30, 2025
# Motivation
This PR moves the implementation of `torch.cuda.memory._set_allocator_settings` to `torch._C._accelerator_setAllocatorSettings`.
Since the original API was intended as a temporary/internal utility, I am not exposing the new function as a public API.

Pull Request resolved: #156175
Approved by: https://github.com/albanD
ghstack dependencies: #149601, #157908, #150312, #156165
[ghstack-poisoned]
@ScottTodd
Contributor

Hi, I'm noticing downstream failures that a bisect tracked back to this commit: ROCm/TheRock#1155.

We're building PyTorch with ROCm on Windows and at runtime we have new failures to load c10_hip.dll:

(.venv) λ python
Python 3.12.8 (tags/v3.12.8:2dc476b, Dec  3 2024, 19:30:04) [MSC v.1942 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "D:\projects\TheRock\external-builds\pytorch\.venv\Lib\site-packages\torch\__init__.py", line 281, in <module>
    _load_dll_libraries()
  File "D:\projects\TheRock\external-builds\pytorch\.venv\Lib\site-packages\torch\__init__.py", line 264, in _load_dll_libraries
    raise err
OSError: [WinError 1114] A dynamic link library (DLL) initialization routine failed. Error loading "D:\projects\TheRock\external-builds\pytorch\.venv\Lib\site-packages\torch\lib\c10_hip.dll" or one of its dependencies.

This commit does not cleanly revert because there is a sequence of other commits stacked on top: https://github.com/pytorch/pytorch/commits/main/c10/cuda

I'm trying to debug what has changed and why the load is failing. Do you have any advice or ideas?

@guangyey
Collaborator Author

Hi @ScottTodd, I think the main code change in this PR is that CUDACachingAllocator now reuses CUDAAllocatorConfig as follows, decoupling the mutual dependence between them (CUDACachingAllocator now depends on CUDAAllocatorConfig, but not vice versa).

// If the environment variable is set, we use the CudaMallocAsync allocator.
if (CUDAAllocatorConfig::use_async_allocator()) {
  return CudaMallocAsync::allocator();
}
return &Native::allocator;

This line ensures that CUDAAllocatorConfig is initialized at static initialization time. I don't think this should trigger the static initialization order fiasco. However, I'm not entirely sure how this behaves on ROCm under Windows. You might try commenting out this line and rebuilding to see if the issue persists.

@ScottTodd
Contributor

Same errors with this diff applied:

λ git diff
diff --git a/c10/cuda/CUDACachingAllocator.cpp b/c10/cuda/CUDACachingAllocator.cpp
index b0b1be8937a..052d6006034 100644
--- a/c10/cuda/CUDACachingAllocator.cpp
+++ b/c10/cuda/CUDACachingAllocator.cpp
@@ -4128,9 +4128,9 @@ CUDAAllocator* allocator();
 struct BackendStaticInitializer {
   CUDAAllocator* parseEnvForBackend() {
     // If the environment variable is set, we use the CudaMallocAsync allocator.
-    if (CUDAAllocatorConfig::use_async_allocator()) {
-      return CudaMallocAsync::allocator();
-    }
+    // if (CUDAAllocatorConfig::use_async_allocator()) {
+    //   return CudaMallocAsync::allocator();
+    // }
     return &Native::allocator;
   }

diff --git a/c10/hip/HIPCachingAllocator.cpp b/c10/hip/HIPCachingAllocator.cpp
index a2455ac42d5..f4c61fcbe1d 100644
--- a/c10/hip/HIPCachingAllocator.cpp
+++ b/c10/hip/HIPCachingAllocator.cpp
@@ -4129,9 +4129,9 @@ HIPAllocator* allocator();
 struct BackendStaticInitializer {
   HIPAllocator* parseEnvForBackend() {
     // If the environment variable is set, we use the HipMallocAsync allocator.
-    if (HIPAllocatorConfig::use_async_allocator()) {
-      return HipMallocAsync::allocator();
-    }
+    // if (HIPAllocatorConfig::use_async_allocator()) {
+    //   return HipMallocAsync::allocator();
+    // }
     return &Native::allocator;
   }

I'd like to get more logs, even printfs, or a debugger attached somehow. This is tricky to debug when all I'm getting is: A dynamic link library (DLL) initialization routine failed. Error loading "D:\projects\TheRock\external-builds\pytorch\.venv\Lib\site-packages\torch\lib\c10_hip.dll" or one of its dependencies.

  • Maybe there are other build flags I can set to get output from TORCH_CHECK, if that is detecting issues?
  • Maybe if I run some of the c10 c++ tests? I've had BUILD_TEST=0 set when running setup.py bdist_wheel due to other unrelated issues though 🤔

@ScottTodd
Contributor

Alright, not much luck debugging, but on closer inspection of the code there is an initialization order issue here. It is fixed by moving keys_ from an inline static at class scope into the function:

--- a/c10/hip/HIPAllocatorConfig.h
+++ b/c10/hip/HIPAllocatorConfig.h
@@ -112,7 +112,22 @@ class C10_HIP_API HIPAllocatorConfig {
   }

   static const std::unordered_set<std::string>& getKeys() {
-    return keys_;
+
+    static std::unordered_set<std::string> keys = {
+      "backend",
+      // keep BC for Rocm: `cuda` -> `cud` `a`, to avoid hipify issues
+      // NOLINTBEGIN(bugprone-suspicious-missing-comma,-warnings-as-errors)
+      "release_lock_on_cud"
+      "amalloc",
+      "pinned_use_cud"
+      "a_host_register",
+      // NOLINTEND(bugprone-suspicious-missing-comma,-warnings-as-errors)
+      "release_lock_on_hipmalloc",
+      "pinned_use_hip_host_register",
+      "pinned_num_register_threads",
+    };
+
+    return keys;
   }

   static HIPAllocatorConfig& instance() {
@@ -164,18 +179,6 @@ class C10_HIP_API HIPAllocatorConfig {
   std::atomic<bool> m_pinned_use_hip_host_register{false};
   std::atomic<bool> m_use_async_allocator{false};
   std::atomic<bool> m_is_allocator_loaded{false};
-  inline static std::unordered_set<std::string> keys_{
-      "backend",
-      // keep BC for Rocm: `cuda` -> `cud` `a`, to avoid hipify issues
-      // NOLINTBEGIN(bugprone-suspicious-missing-comma,-warnings-as-errors)
-      "release_lock_on_cud"
-      "amalloc",
-      "pinned_use_cud"
-      "a_host_register",
-      // NOLINTEND(bugprone-suspicious-missing-comma,-warnings-as-errors)
-      "release_lock_on_hipmalloc",
-      "pinned_use_hip_host_register",
-      "pinned_num_register_threads"};
 };

The C++ runtime must be fully initialized before STL containers like std::string and std::unordered_set can be used, but static initialization of a DLL can run before the CRT is ready.

I can send a PR with the fix.

@ScottTodd
Contributor

Ah, though c8cf811 also mutates keys_, so another fix, such as lazy initialization with std::once_flag, may be needed there. What would you like to do? Revert? Fix forward yourself? I can try to put together a full fix too.

@guangyey
Collaborator Author

guangyey commented Aug 1, 2025

@pytorchbot revert -m 'Static initialization order issue impact the downstream repo' -c ghfirst

@pytorchmergebot
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot added a commit that referenced this pull request Aug 1, 2025
…56175)"

This reverts commit d3ce450.

Reverted #156175 on behalf of https://github.com/guangyey due to Static initialization order issue impact the downstream repo ([comment](#150312 (comment)))
pytorchmergebot added a commit that referenced this pull request Aug 1, 2025
…leratorAllocatorConfig instead (#156165)"

This reverts commit 1fc010a.

Reverted #156165 on behalf of https://github.com/guangyey due to Static initialization order issue impact the downstream repo ([comment](#150312 (comment)))
@pytorchmergebot
Collaborator

@guangyey your PR has been successfully reverted.

@guangyey
Collaborator Author

guangyey commented Aug 1, 2025

Ah, though c8cf811 also mutates keys_, so another fix like using lazy initialization with std::once_flag may be needed there. What would you like to do? Revert? Fix forward yourself? I can try to put together a full fix too.

Sorry for the disruption. Let's revert this PR first to unblock your progress.

[ghstack-poisoned]
@pytorchmergebot
Collaborator

Starting merge as part of PR stack under #156175

3 similar comments


Labels

ci-no-td (Do not run TD on this PR), ciflow/h100-distributed, ciflow/rocm (Trigger "default" config CI on ROCm), ciflow/trunk (Trigger trunk jobs on your pull request), Merged, no-stale, oncall: distributed (Add this issue/PR to distributed oncall triage queue), open source, release notes: cpp (release notes category), Reverted, Stale, topic: not user facing (topic category)

Projects

None yet


8 participants