Add DeviceAllocator as the base device allocator#138222

Closed
guangyey wants to merge 95 commits into gh/guangyey/79/base from gh/guangyey/79/head

Conversation

@guangyey
Collaborator

@guangyey guangyey commented Oct 17, 2024

Stack from ghstack (oldest at bottom):

Motivation

In line with [RFC] A device-agnostic Python device memory related API design for stream-based accelerators, some memory-related APIs are widely used in popular repositories such as HuggingFace, which currently rely on many if-else conditionals to pick the right backend. We would like to introduce a generic API set under the torch.accelerator namespace to generalize these use cases.

| Device-specific memory APIs (torch.xxx.foo) | Device-agnostic memory APIs (torch.accelerator.foo) |
| --- | --- |
| torch.xxx.empty_cache | torch.accelerator.empty_cache |
| torch.xxx.reset_peak_memory_stats | torch.accelerator.reset_peak_memory_stats |
| torch.xxx.reset_accumulated_memory_stats | torch.accelerator.reset_accumulated_memory_stats |
| torch.xxx.memory_stats | torch.accelerator.memory_stats |
| torch.xxx.memory_allocated | torch.accelerator.memory_allocated |
| torch.xxx.max_memory_allocated | torch.accelerator.max_memory_allocated |
| torch.xxx.memory_reserved | torch.accelerator.memory_reserved |
| torch.xxx.max_memory_reserved | torch.accelerator.max_memory_reserved |
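To illustrate the motivation, here is a hypothetical sketch (not actual HuggingFace or PyTorch code; the names `empty_cache_if_else` and `empty_cache_dispatch` are made up for illustration) of how per-backend if-else branching collapses into one device-agnostic call:

```python
# Hypothetical sketch: the kind of per-backend branching this RFC removes.
# The function names and BACKENDS set are illustrative, not real APIs.

# Before: callers branch on the backend to reach torch.xxx.empty_cache.
def empty_cache_if_else(backend: str) -> str:
    if backend == "cuda":
        return "torch.cuda.empty_cache()"
    elif backend == "xpu":
        return "torch.xpu.empty_cache()"
    elif backend == "mps":
        return "torch.mps.empty_cache()"
    raise ValueError(f"unsupported backend: {backend}")

# After: one generic entry point resolves the backend-specific call,
# so downstream repositories no longer need the if-else ladder.
BACKENDS = {"cuda", "xpu", "mps"}

def empty_cache_dispatch(backend: str) -> str:
    if backend not in BACKENDS:
        raise ValueError(f"unsupported backend: {backend}")
    return f"torch.{backend}.empty_cache()"

print(empty_cache_dispatch("xpu"))  # torch.xpu.empty_cache()
```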

Solution

This design follows a similar pattern to HostAllocator. We're introducing a base class DeviceAllocator, from which CUDAAllocator and XPUAllocator will inherit. This allows us to provide a unified call path like: torch.accelerator.empty_cache() -> GetDeviceAllocator(allocator)->empty_cache().
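The inheritance scheme described above might be sketched roughly as follows. This is a Python stand-in for the C++ classes: the names DeviceAllocator, CUDAAllocator, and XPUAllocator come from this PR, while the registry dict and `get_device_allocator` helper are simplifications of the C++ REGISTER_ALLOCATOR / GetDeviceAllocator machinery, not its real API:

```python
# Rough Python stand-in for the C++ design: a DeviceAllocator base class,
# backend subclasses, and a lookup that gives torch.accelerator one call path.
from abc import ABC, abstractmethod

class DeviceAllocator(ABC):
    @abstractmethod
    def empty_cache(self) -> str: ...

class CUDAAllocator(DeviceAllocator):
    def empty_cache(self) -> str:
        return "cuda cache emptied"

class XPUAllocator(DeviceAllocator):
    def empty_cache(self) -> str:
        return "xpu cache emptied"

# Simplified analogue of REGISTER_ALLOCATOR / GetDeviceAllocator.
_allocators: dict[str, DeviceAllocator] = {
    "cuda": CUDAAllocator(),
    "xpu": XPUAllocator(),
}

def get_device_allocator(device_type: str) -> DeviceAllocator:
    return _allocators[device_type]

# Unified call path: torch.accelerator.empty_cache()
# -> GetDeviceAllocator(device)->empty_cache()
print(get_device_allocator("xpu").empty_cache())  # xpu cache emptied
```

The benefit is that torch.accelerator only ever talks to the base-class interface; each backend registers its concrete allocator once.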

cc @albanD @EikanWang

@pytorch-bot

pytorch-bot bot commented Oct 17, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/138222

Note: Links to docs will display an error until the docs builds have been completed.

⏳ 5 Pending, 2 Unrelated Failures

As of commit 5521326 with merge base 178515d:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

guangyey added a commit that referenced this pull request Oct 17, 2024
@guangyey guangyey marked this pull request as draft October 17, 2024 15:06
@guangyey guangyey changed the title from "Add CachingDeviceAllocatorInterface as the base device allocator" to "[WIP] Add CachingDeviceAllocatorInterface as the base device allocator" on Oct 17, 2024
[ghstack-poisoned]
@github-actions
Contributor

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions github-actions bot added the Stale label Dec 16, 2024
guangyey added a commit that referenced this pull request Mar 4, 2025
[ghstack-poisoned]
@guangyey guangyey added the topic: improvements and topic: not user facing (topic category) labels Mar 18, 2025
guangyey added a commit that referenced this pull request Mar 18, 2025
guangyey added a commit that referenced this pull request Mar 18, 2025
[ghstack-poisoned]
guangyey added a commit that referenced this pull request Mar 20, 2025
guangyey added a commit that referenced this pull request Mar 20, 2025
[ghstack-poisoned]
guangyey added a commit that referenced this pull request Mar 21, 2025
@Camyll
Contributor

Camyll commented Jul 25, 2025

torch.version.hip

it's hundreds of tests, so it's not a trivial change. But the issue persists in tests that use this method of detecting ROCm, which is why I'm still confused about how the change impacted it

@guangyey
Collaborator Author

guangyey commented Jul 26, 2025

it's hundreds of tests, so it's not a trivial change. But the issue persists in tests that use this method of detecting ROCm, which is why I'm still confused about how the change impacted it

Do you mean the internal tests are already using torch.version.hip to detect CUDA vs. ROCm, but it returns an unexpected value after integrating this PR?
The logic that generates torch.version.hip is listed below and should remain unaffected by this PR.

add_custom_target(
  gen_torch_version ALL
  "${Python_EXECUTABLE}" "${TOOLS_PATH}/generate_torch_version.py"
    --is-debug=${TORCH_VERSION_DEBUG}
    --cuda-version=${CUDA_VERSION}
    --hip-version=${HIP_VERSION}
    --xpu-version=${SYCL_COMPILER_VERSION}
  BYPRODUCTS ${TORCH_SRC_DIR}/version.py
  COMMENT "Regenerating version file..."
  WORKING_DIRECTORY ${TORCH_ROOT}
)

So I don't understand why this is the root cause.
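For context, the build step quoted above writes a version.py module whose attributes downstream code branches on. The detection idiom the tests use (checking torch.version.hip) might be sketched like this; the values and the `is_rocm_build` helper here are made up for illustration, since the real module is generated by generate_torch_version.py:

```python
# Illustrative sketch of the generated version-module pattern: the CMake
# target above writes a version.py with attributes such as cuda, hip, xpu.
# The values below are placeholders, not real build output.
import types

version = types.SimpleNamespace(cuda="12.4", hip=None, xpu=None)

def is_rocm_build() -> bool:
    # Common test idiom: a ROCm build is detected by a non-None hip version.
    return version.hip is not None

print(is_rocm_build())  # False
```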

@Camyll
Contributor

Camyll commented Jul 28, 2025

@guangyey could the change have anything to do with this error? I wonder if the issue isn't the check for ROCm but maybe specifically the check for CUDA

RuntimeError: r.cuDeviceGetAttribute_ INTERNAL ASSERT FAILED at "fbcode/caffe2/c10/cuda/driver_api.cpp":23, please report a bug to PyTorch. Can't find cuDeviceGetAttribute

@guangyey
Collaborator Author

guangyey commented Jul 29, 2025

I think the error is related to #158295, which introduces a CUDA version check: on CUDA 12.5+ it uses cudaGetDriverEntryPointByVersion, and otherwise it falls back to cudaGetDriverEntryPoint to check whether cuDeviceGetAttribute exists.
ROCm should not support those driver APIs. Refer to

#if !defined(USE_ROCM) && defined(PYTORCH_C10_DRIVER_API_SUPPORTED)
#include <c10/cuda/CUDAException.h>

@Camyll
Contributor

Camyll commented Jul 31, 2025

@guangyey sorry for the delayed response. It looks like the segfault stems from this file: aten/src/ATen/hip/impl/HIPAllocatorMasqueradingAsCUDA.h, specifically line 19

https://github.com/pytorch/pytorch/blob/b95cf5c91d8b4a9a2b905edd035ef331d7e0609a/aten/src/ATen/hip/impl/HIPAllocatorMasqueradingAsCUDA.h#L19C1-L20C1

@guangyey
Collaborator Author

guangyey commented Aug 1, 2025

@Camyll Thanks so much for the detailed debug info. Based on what you shared, I suspect this issue might be caused by a static initialization order problem—which I’ve been dealing with recently. Would you mind helping me verify if that’s the case by trying the following patch?

diff --git a/aten/src/ATen/hip/impl/HIPCachingAllocatorMasqueradingAsCUDA.cpp b/aten/src/ATen/hip/impl/HIPCachingAllocatorMasqueradingAsCUDA.cpp
index 19bc0a6b34e..68ddc09b53b 100644
--- a/aten/src/ATen/hip/impl/HIPCachingAllocatorMasqueradingAsCUDA.cpp
+++ b/aten/src/ATen/hip/impl/HIPCachingAllocatorMasqueradingAsCUDA.cpp
@@ -4,9 +4,10 @@
 namespace c10 { namespace hip {
 namespace HIPCachingAllocatorMasqueradingAsCUDA {
 
-static HIPAllocatorMasqueradingAsCUDA allocator(HIPCachingAllocator::get());
 
 Allocator* get() {
+  static HIPAllocatorMasqueradingAsCUDA allocator(HIPCachingAllocator::get());
   return &allocator;
 }
 
@@ -16,7 +17,7 @@ void recordStreamMasqueradingAsCUDA(const DataPtr& ptr, HIPStreamMasqueradingAsC
 
 // Register this HIP allocator as CUDA allocator to enable access through both
 // c10::GetAllocator(kCUDA) and c10::getDeviceAllocator(kCUDA) APIs
-REGISTER_ALLOCATOR(kCUDA, &allocator)
+// REGISTER_ALLOCATOR(kCUDA, &allocator)
 
 } // namespace HIPCachingAllocatorMasqueradingAsCUDA
 }} // namespace c10::hip

#define HIP_MASQUERADING_AS_CUDA \
"cud" \
"a"
at::SetAllocator(c10::Device(HIP_MASQUERADING_AS_CUDA).type(), r, 0);
Collaborator Author

@albanD As @Camyll pointed out here, I think the root cause is a static initialization order fiasco (SIOF), something I’ve been struggling with recently (e.g., this PR).
To avoid static initializer dependencies during static init time, I addressed the issue in this commit: 4134108.
Additionally, I performed the CUDA masquerading here to support both c10::GetAllocator(kCUDA) and c10::getDeviceAllocator(kCUDA) for the HIP backend. Do you think that is acceptable?

Contributor

@guangyey thanks for the changes! I tested this internally and the tests are no longer failing. We are good to reland this

Collaborator Author

@Camyll Thanks, I will rebase and then try to reland it.

[ghstack-poisoned]
Collaborator

@albanD albanD left a comment

Sounds good!

@albanD
Collaborator

albanD commented Aug 5, 2025

@pytorchbot merge

1 similar comment
@guangyey
Collaborator Author

guangyey commented Aug 5, 2025

@pytorchbot merge

@pytorchmergebot
Collaborator

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command
For more information see pytorch-bot wiki.

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: Check mergeability of ghstack PR / ghstack-mergeability-check

Details for Dev Infra team Raised by workflow job

@pytorchmergebot
Collaborator

Starting merge as part of PR stack under #155200

@guangyey
Collaborator Author

guangyey commented Aug 6, 2025

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@pytorchmergebot
Collaborator

Merge failed

Reason: 2 jobs have failed, first few of them are: rocm / linux-jammy-rocm-py3.10 / test (default, 2, 6, linux.rocm.gpu.2), xpu / linux-jammy-xpu-2025.1-py3.9 / test (default, 5, 6, linux.idc.xpu)

Details for Dev Infra team Raised by workflow job

@guangyey
Collaborator Author

guangyey commented Aug 6, 2025

@pytorchbot merge -i

@pytorchmergebot
Collaborator

[ghstack-poisoned]
@jithunnair-amd
Collaborator

@pytorchbot revert -c nosignal -m "Broke ROCm periodic runs on MI300 e.g. https://github.com/pytorch/pytorch/actions/runs/16764977800/job/47470050573"

cc @guangyey If this revert doesn't go through because it's part of a stack, please forward fix the issue.

@pytorchmergebot
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

@pytorchmergebot
Collaborator

@guangyey your PR has been successfully reverted.

@guangyey
Collaborator Author

guangyey commented Aug 8, 2025

@jithunnair-amd I think the failure is not introduced by this PR; I see the same failure pynvml.NVMLError_LibraryNotFound: NVML Shared Library Not Found in File "/var/lib/jenkins/pytorch/test/distributed/test_c10d_nccl.py", line 689, in test_extra_cuda_context on the previous commit.
Anyway, let's add the ciflow/periodic-rocm-mi300 label to see what CI says, and I will reland this PR once that CI passes.

@guangyey
Collaborator Author

guangyey commented Aug 8, 2025

Hi @jithunnair-amd, the ciflow/periodic-rocm-mi300 run has passed on this PR, see https://github.com/pytorch/pytorch/actions/runs/16825034603/job/47659415983?pr=138222
I would like to reland this PR.

[ghstack-poisoned]
@pytorchmergebot
Collaborator

Starting merge as part of PR stack under #155200


Labels

ci-no-td (Do not run TD on this PR), ciflow/periodic-rocm-mi300 (Trigger "distributed" config CI on ROCm MI300/MI325), ciflow/rocm (Trigger "default" config CI on ROCm), ciflow/trunk (Trigger trunk jobs on your pull request), ciflow/xpu (Run XPU CI tasks), Merged, module: accelerator (Issues related to the shared accelerator API), no-stale, open source, Reverted, Stale, topic: improvements (topic category), topic: not user facing (topic category)
