Add DeviceAllocator as the base device allocator #138222
guangyey wants to merge 95 commits into gh/guangyey/79/base from
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/138222
Note: Links to docs will display an error until the docs builds have been completed.
⏳ 5 Pending, 2 Unrelated Failures as of commit 5521326 with merge base 178515d (FLAKY - the following jobs failed but were likely due to flakiness present on trunk):
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as `Stale`.

It's hundreds of tests, so it's not a trivial change. But the issue persists in tests that use this method of detecting ROCm, which is why I'm still confused about how the change impacted it.
Do you mean the internal test is already using the code at lines 485 to 495 in 3db8623? If so, I can't understand why this is the root cause.

@guangyey could the change have anything to do with this error? I wonder if the issue isn't checking for ROCm, but maybe specifically checking for CUDA.

I think the error is related to #158295, which introduces a CUDA version check (see pytorch/c10/cuda/driver_api.cpp, lines 1 to 2 in 67e68e0, for what CUDA 12.5+ supports).
@guangyey sorry for the delayed response. It looks like the segfault stems from this file: aten/src/ATen/hip/impl/HIPAllocatorMasqueradingAsCUDA.h, specifically line 19 |
@Camyll Thanks so much for the detailed debug info. Based on what you shared, I suspect this issue might be caused by a static initialization order fiasco. The following diff may fix it:

```diff
diff --git a/aten/src/ATen/hip/impl/HIPCachingAllocatorMasqueradingAsCUDA.cpp b/aten/src/ATen/hip/impl/HIPCachingAllocatorMasqueradingAsCUDA.cpp
index 19bc0a6b34e..68ddc09b53b 100644
--- a/aten/src/ATen/hip/impl/HIPCachingAllocatorMasqueradingAsCUDA.cpp
+++ b/aten/src/ATen/hip/impl/HIPCachingAllocatorMasqueradingAsCUDA.cpp
@@ -4,9 +4,10 @@
 namespace c10 { namespace hip {
 namespace HIPCachingAllocatorMasqueradingAsCUDA {
-static HIPAllocatorMasqueradingAsCUDA allocator(HIPCachingAllocator::get());
 Allocator* get() {
+  static HIPAllocatorMasqueradingAsCUDA allocator(HIPCachingAllocator::get());
   return &allocator;
 }
@@ -16,7 +17,7 @@ void recordStreamMasqueradingAsCUDA(const DataPtr& ptr, HIPStreamMasqueradingAsC
 // Register this HIP allocator as CUDA allocator to enable access through both
 // c10::GetAllocator(kCUDA) and c10::getDeviceAllocator(kCUDA) APIs
-REGISTER_ALLOCATOR(kCUDA, &allocator)
+// REGISTER_ALLOCATOR(kCUDA, &allocator)
 } // namespace HIPCachingAllocatorMasqueradingAsCUDA
 }} // namespace c10::hip
```
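To illustrate why moving the static into the accessor helps, here is a hypothetical, minimal sketch (not the actual PyTorch classes): a function-local static is constructed on first call, so no caller can observe it before it exists, whereas the initialization order of namespace-scope statics across translation units is unspecified.

```cpp
#include <cassert>

// Minimal stand-in for the allocator type; not the real
// HIPAllocatorMasqueradingAsCUDA class.
struct FakeAllocator {
  int id;
};

// "Construct on first use": the local static is initialized the first time
// get() runs (thread-safe since C++11), so no other static initializer can
// observe it in an unconstructed state.
FakeAllocator* get() {
  static FakeAllocator allocator{42};
  return &allocator;
}
```

Every call returns the same fully constructed instance, regardless of which translation unit's static initializers run first.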
Referenced snippet:

```cpp
#define HIP_MASQUERADING_AS_CUDA \
  "cud" \
  "a"

at::SetAllocator(c10::Device(HIP_MASQUERADING_AS_CUDA).type(), r, 0);
```
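The macro above relies on adjacent string-literal concatenation: the compiler joins `"cud" "a"` into the single literal `"cuda"`, which presumably keeps automated hipify-style source rewriters from matching the token. A minimal sketch of the mechanism (the function name here is illustrative):

```cpp
#include <cassert>
#include <cstring>

// Adjacent string literals are concatenated during translation, so the split
// spelling below yields exactly the string "cuda" at runtime.
const char* masqueraded_device_name() {
  return "cud" "a";  // compiles to the single literal "cuda"
}
```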
@albanD As @Camyll pointed out here, I think the root cause is a static initialization order fiasco (SIOF), something I've been struggling with recently (e.g., this PR).
To avoid dependencies between static initializers at static init time, I addressed the issue in this commit: 4134108.
Additionally, I performed the CUDA masquerading here to support both c10::GetAllocator(kCUDA) and c10::getDeviceAllocator(kCUDA) for the HIP backend. Do you think this is acceptable?
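To make the dual-registration idea concrete, here is a hypothetical sketch (illustrative names and signatures, not the real c10 APIs) of a single allocator instance being made visible through two separate lookup paths, mirroring `c10::GetAllocator(kCUDA)` and `c10::getDeviceAllocator(kCUDA)`:

```cpp
#include <cassert>

// Stand-in allocator type; not the real c10::Allocator.
struct FakeAllocator {};

// Two separate lookup slots, standing in for the legacy GetAllocator
// registry and the newer getDeviceAllocator registry.
static FakeAllocator* g_legacy_slot = nullptr;
static FakeAllocator* g_device_slot = nullptr;

// Registering once populates both tables, so either API resolves to the
// same underlying allocator instance.
void register_in_both(FakeAllocator* a) {
  g_legacy_slot = a;
  g_device_slot = a;
}

FakeAllocator* get_allocator() { return g_legacy_slot; }
FakeAllocator* get_device_allocator() { return g_device_slot; }
```

Under this scheme the HIP allocator masquerading as CUDA would answer both query paths, so generic frontends keyed on either API see the same object.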
@guangyey thanks for the changes! I tested this internally and the tests are no longer failing. We are good to reland this.
@Camyll Thanks, I will rebase and then try to reland it.
@pytorchbot merge

1 similar comment

@pytorchbot merge

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.

Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.

Merge failed. Reason: 1 job has failed; the first few are: Check mergeability of ghstack PR / ghstack-mergeability-check. Details for Dev Infra team: raised by workflow job.
Starting merge as part of PR stack under #155200

@pytorchbot merge

Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.

Merge failed. Reason: 2 jobs have failed; the first few are: rocm / linux-jammy-rocm-py3.10 / test (default, 2, 6, linux.rocm.gpu.2), xpu / linux-jammy-xpu-2025.1-py3.9 / test (default, 5, 6, linux.idc.xpu). Details for Dev Infra team: raised by workflow job.

@pytorchbot merge -i

Merge started. Your change will be merged while ignoring the following 6 checks: Check Labels / Check labels, Check mergeability of ghstack PR / ghstack-mergeability-check, pull / linux-jammy-py3_9-clang9-xla / test (xla, 1, 1, linux.12xlarge, unstable), rocm / linux-jammy-rocm-py3.10 / test (default, 2, 6, linux.rocm.gpu.2), xpu / linux-jammy-xpu-2025.1-py3.9 / test (default, 2, 6, linux.idc.xpu), xpu / linux-jammy-xpu-2025.1-py3.9 / test (default, 5, 6, linux.idc.xpu). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
@pytorchbot revert -c nosignal -m "Broke ROCm periodic runs on MI300 e.g. https://github.com/pytorch/pytorch/actions/runs/16764977800/job/47470050573" cc @guangyey If this revert doesn't go through because it's part of a stack, please forward fix the issue.

@pytorchbot successfully started a revert job. Check the current status here.

@guangyey your PR has been successfully reverted.

@jithunnair-amd I think the failure is not introduced by this PR; I see the same failure

Hi @jithunnair-amd, the

Starting merge as part of PR stack under #155200
Stack from ghstack (oldest at bottom):
Motivation
In line with [RFC] A device-agnostic Python device memory related API design for stream-based accelerators, some memory-related APIs are widely used in popular repositories such as HuggingFace, which currently requires a lot of if-else conditional code. We would like to introduce a generic API set under the torch.accelerator namespace to generalize these use cases.
Solution
This design follows a similar pattern to `HostAllocator`. We're introducing a base class `DeviceAllocator`, from which `CUDAAllocator` and `XPUAllocator` will inherit. This allows us to provide a unified call path like: `torch.accelerator.empty_cache()` -> `GetDeviceAllocator(allocator)` -> `empty_cache()`.

cc @albanD @EikanWang
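The described shape can be sketched as follows. This is a hypothetical illustration with made-up names and signatures, not the actual PyTorch implementation: a `DeviceAllocator` base class exposing `empty_cache()`, a backend allocator inheriting from it, and a per-device-type lookup that a generic frontend such as `torch.accelerator.empty_cache()` could dispatch through.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Illustrative device-type enum; the real code keys on c10::DeviceType.
enum class DeviceType { CUDA = 0, XPU = 1, COUNT = 2 };

// Base class every backend allocator would inherit from.
struct DeviceAllocator {
  virtual ~DeviceAllocator() = default;
  virtual void empty_cache() = 0;
};

// Simple per-device-type registry.
static std::array<DeviceAllocator*, static_cast<std::size_t>(DeviceType::COUNT)>
    g_registry{};

void register_device_allocator(DeviceType t, DeviceAllocator* a) {
  g_registry[static_cast<std::size_t>(t)] = a;
}

DeviceAllocator* get_device_allocator(DeviceType t) {
  return g_registry[static_cast<std::size_t>(t)];
}

// Backend-specific allocator, standing in for CUDAAllocator / XPUAllocator.
struct FakeCUDAAllocator : DeviceAllocator {
  bool cache_emptied = false;
  void empty_cache() override { cache_emptied = true; }
};
```

A device-generic frontend would then resolve the active backend's allocator and call `empty_cache()` on it, instead of branching per backend in Python.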