Fix CUDA allocator SIOF causing torch.accelerator API failures #176817

jmswen wants to merge 1 commit into pytorch:main

Conversation
Summary: The `CUDACachingAllocator` (a `DeviceAllocator`) and Caffe2's legacy `DefaultCUDAAllocator` (a plain `Allocator`) both registered for `DeviceType::CUDA` at priority 0. Since `SetAllocator` uses a `>=` comparison, whichever static initializer ran last would win. When the legacy allocator won the race, the `dynamic_cast<DeviceAllocator*>` in `getDeviceAllocator()` would fail, crashing `torch.accelerator.empty_cache()` and other `torch.accelerator` APIs. To be clear, this is not an issue in pure OSS PyTorch, where the Caffe2 legacy CUDA allocator does not exist.

Fix by bumping `CUDACachingAllocator`'s registration priority to 1 so it always takes precedence over the legacy Caffe2 allocator regardless of static initialization order.

This SIOF surfaced recently in vLLM after some code was generalized to use `torch.accelerator.empty_cache()` instead of `torch.cuda.empty_cache()` in vllm-project/vllm#30681.

Test Plan:
```
buck test fbcode//mode/opt fbcode//vllm/omni:test_kernels_rotary_embedding -- --exact 'fbcode//vllm/omni:test_kernels_rotary_embedding - test_rotary_embedding.py::test_rotary_embedding_opcheck[False-False-1024-108-32-True-11-cuda]'
```

Previously: 1 passed, 1 error (`RuntimeError` during teardown)
Now: 2 passed, 0 errors

Differential Revision: D95703075
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/176817
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 1cbe619 with merge base 2b8b4ff.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot label "topic: not user facing"
@albanD Another approach that I think is functionally quite similar is simply to not
@jmswen can you lower the priority of the caffe2 default allocator?
@zou3519 Unfortunately not since the priority is a
@pytorchbot merge (Initiating merge automatically since Phabricator Diff has merged)

Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.