[XPU] Use Level Zero zeMemAllocDevice to avoid host memory shadowing#180145
[XPU] Use Level Zero zeMemAllocDevice to avoid host memory shadowing#180145Conradzz wants to merge 3 commits into
Conversation
On discrete Intel GPUs (Xe2 and later), the xe kernel driver creates a 1:1 host-side memory mirror for every sycl::aligned_alloc_device call via the DMA-buf/TTM path. This can consume all available host RAM when the device has large VRAM (e.g. 32 GB on Arc B-series). Replace the SYCL allocation path with direct Level Zero zeMemAllocDevice calls, which use the SVM/P2P path and do not create host-side mirrors. The implementation uses a callback registration pattern so that c10 (which cannot depend on ATen headers) delegates to ATen at init time. Opt out by setting PYTORCH_XPU_ALLOC_LEVEL_ZERO=0. Linux only; Windows is unaffected (no xe kernel driver).
…queries The oneAPI Unified Runtime Level Zero adapter dereferences a NULL extension function pointer (ze_device_vector_width_properties_ext_t) when querying half_fp_config, preferred_vector_width_*, or native_vector_width_* on Battlemage G31 devices. This causes a segfault on the first kernel dispatch since getDeviceProperties() is called before every kernel launch. Replace the AT_FORALL_XPU_DEVICE_PROPERTIES macro expansion with individual property assignments, using safe defaults for the affected queries.
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/180145
Note: Links to docs will display an error until the docs builds have been completed. ❗ 1 Active SEVsThere are 1 currently active SEVs. If your PR is affected, please view them below: This comment was automatically generated by Dr. CI and updates every 15 minutes. |
This PR needs a
|
|
@pytorchbot label "release notes: xpu" |
|
@pytorchbot label "topic: not user facing" |
|
@Conradzz Does this happen on multi-GPU system only? Or, it happens on both single and multi-GPU systems? |
I wish I had two available to let you know, however unfortunately I cannot answer that question. |
8b24da9 to
f55a390
Compare
f55a390 to
321eb95
Compare
|
Thanks for the PR. I’ll need some time to review it and confirm. I’ll update here if anything comes up. |
|
Hi @Conradzz import torch
cache = []
def get_mem_available():
with open("/proc/meminfo") as f:
for line in f:
if line.startswith("MemAvailable"):
return int(line.split()[1]) / 1024 / 1024 # GB
def allocate_2G_cpu_tensor():
global cache
before = get_mem_available()
a = torch.zeros(1024*1024*512, device='cpu')
cache.append(a)
after = get_mem_available()
print(f"Allocate 2GB on CPU, Before {before:.2f}GB, After {after:.2f}GB, Changed {before-after:.2f}GB")
def allocate_2G_gpu_tensor():
global cache
before = get_mem_available()
a = torch.zeros(1024*1024*512, device='xpu')
cache.append(a)
after = get_mem_available()
print(f"Allocate 2GB on GPU, Before {before:.2f}GB, After {after:.2f}GB, Changed {before-after:.2f}GB")
print(torch.xpu.get_device_properties())
allocate_2G_cpu_tensor()
allocate_2G_gpu_tensor()
allocate_2G_gpu_tensor()
allocate_2G_gpu_tensor()
allocate_2G_gpu_tensor()
allocate_2G_gpu_tensor()
allocate_2G_cpu_tensor()
Could you please provide your reproducer, or anything I am missing? |
|
Dug into this more — I think the driver upgrade was the actual fix, not the allocation path change. The process I measured had been running on compute-runtime 25.18, and the patched binary loaded 26.09. Two things changed at once. Tested both paths on 26.09 with 264 allocations (4.7GB) — zero host RAM shadow either way. Can't reproduce it. Closing this out, thanks for the review. |

Summary
On discrete Intel GPUs using the xe kernel driver (Battlemage/Xe2 and later),
sycl::aligned_alloc_devicecreates device memory through a DMA-buf/TTM codepath that allocates a 1:1 host-side memory mirror for every device allocation.
On a 32 GB card this can exhaust all available host RAM.
This PR replaces the SYCL device allocation path with direct Level Zero
zeMemAllocDevice/zeMemFreecalls, which use the SVM/P2P allocation pathand do not create host-side mirrors. The same approach was validated in
llama.cpp (ggml-org/llama.cpp#21597).
What's in the patch
121 lines across 5 files:
ATenLevelZero.hzeMemAllocDeviceandzeMemFreeto the Level Zero function pointer tableLazyLevelZero.cppXPUCachingAllocator.hXPUCachingAllocator.cppallocPrimitive/deletePrimitive, fall back to SYCLXPUHooks.cppXPUHooks::init()The c10 layer cannot include ATen headers, so the Level Zero implementation
lives in ATen and is registered into c10 via function pointer callbacks at init
time.
Measured impact (Arc B70, 32 GB GDDR6, xe driver 1.14.37435+1)
MemAvailableafter allocating 24 GB on deviceOpt-out
Set
PYTORCH_XPU_ALLOC_LEVEL_ZERO=0to revert to the SYCL allocation path.Enabled by default on Linux. Windows is unaffected (no xe kernel driver).
Test plan
torch.empty(N, device='xpu')+tensor.zero_()— allocation and kernel dispatchtorch.xpu.memory_allocated()reports correct valuesPYTORCH_XPU_ALLOC_LEVEL_ZERO=0falls back to SYCL path cleanly