Skip to content

[XPU] Use Level Zero zeMemAllocDevice to avoid host memory shadowing#180145

Closed
Conradzz wants to merge 3 commits into
pytorch:mainfrom
Conradzz:fix-xpu-battlemage-alloc
Closed

[XPU] Use Level Zero zeMemAllocDevice to avoid host memory shadowing#180145
Conradzz wants to merge 3 commits into
pytorch:mainfrom
Conradzz:fix-xpu-battlemage-alloc

Conversation

@Conradzz

Copy link
Copy Markdown

Summary

On discrete Intel GPUs using the xe kernel driver (Battlemage/Xe2 and later),
sycl::aligned_alloc_device creates device memory through a DMA-buf/TTM code
path that allocates a 1:1 host-side memory mirror for every device allocation.
On a 32 GB card this can exhaust all available host RAM.

This PR replaces the SYCL device allocation path with direct Level Zero
zeMemAllocDevice / zeMemFree calls, which use the SVM/P2P allocation path
and do not create host-side mirrors. The same approach was validated in
llama.cpp (ggml-org/llama.cpp#21597).

What's in the patch

121 lines across 5 files:

File Change
ATenLevelZero.h Add zeMemAllocDevice and zeMemFree to the Level Zero function pointer table
LazyLevelZero.cpp Add 6-arg lazy stub macro + stubs for the two new functions
XPUCachingAllocator.h Declare callback function types and registration API
XPUCachingAllocator.cpp Check for registered callbacks in allocPrimitive / deletePrimitive, fall back to SYCL
XPUHooks.cpp Implement Level Zero alloc/free, register during XPUHooks::init()

The c10 layer cannot include ATen headers, so the Level Zero implementation
lives in ATen and is registered into c10 via function pointer callbacks at init
time.

Measured impact (Arc B70, 32 GB GDDR6, xe driver 1.14.37435+1)

Metric Before (SYCL path) After (Level Zero path)
Host RAM consumed per GB of VRAM allocated ~1 GB 0
MemAvailable after allocating 24 GB on device ~4 GB ~28 GB

Opt-out

Set PYTORCH_XPU_ALLOC_LEVEL_ZERO=0 to revert to the SYCL allocation path.
Enabled by default on Linux. Windows is unaffected (no xe kernel driver).

Test plan

  • torch.empty(N, device='xpu') + tensor.zero_() — allocation and kernel dispatch
  • BF16 matmul, Conv2d, LayerNorm, SDPA — all pass on Arc B70
  • torch.xpu.memory_allocated() reports correct values
  • GPU ↔ CPU transfers round-trip with zero diff
  • PYTORCH_XPU_ALLOC_LEVEL_ZERO=0 falls back to SYCL path cleanly

On discrete Intel GPUs (Xe2 and later), the xe kernel driver creates a
1:1 host-side memory mirror for every sycl::aligned_alloc_device call
via the DMA-buf/TTM path. This can consume all available host RAM when
the device has large VRAM (e.g. 32 GB on Arc B-series).

Replace the SYCL allocation path with direct Level Zero zeMemAllocDevice
calls, which use the SVM/P2P path and do not create host-side mirrors.
The implementation uses a callback registration pattern so that c10
(which cannot depend on ATen headers) delegates to ATen at init time.

Opt out by setting PYTORCH_XPU_ALLOC_LEVEL_ZERO=0. Linux only; Windows
is unaffected (no xe kernel driver).
…queries

The oneAPI Unified Runtime Level Zero adapter dereferences a NULL
extension function pointer (ze_device_vector_width_properties_ext_t)
when querying half_fp_config, preferred_vector_width_*, or
native_vector_width_* on Battlemage G31 devices. This causes a segfault
on the first kernel dispatch since getDeviceProperties() is called
before every kernel launch.

Replace the AT_FORALL_XPU_DEVICE_PROPERTIES macro expansion with
individual property assignments, using safe defaults for the affected
queries.
@pytorch-bot

pytorch-bot Bot commented Apr 11, 2026

Copy link
Copy Markdown

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/180145

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There are 1 currently active SEVs. If your PR is affected, please view them below:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@linux-foundation-easycla

linux-foundation-easycla Bot commented Apr 11, 2026

Copy link
Copy Markdown

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: Conradzz / name: Aelryic & Nathan (d00c2df, d9e8160)
  • ✅ login: Conradzz / name: Nathan Sharlaw (321eb95)

@pytorch-bot

pytorch-bot Bot commented Apr 11, 2026

Copy link
Copy Markdown

This PR needs a release notes: label

If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@Conradzz

Copy link
Copy Markdown
Author

@pytorchbot label "release notes: xpu"

@pytorch-bot pytorch-bot Bot added the release notes: xpu release notes category label Apr 11, 2026
@Conradzz

Copy link
Copy Markdown
Author

@pytorchbot label "topic: not user facing"

@pytorch-bot pytorch-bot Bot added the topic: not user facing topic category label Apr 11, 2026
@gujinghui

Copy link
Copy Markdown
Collaborator

@Conradzz Does this happen on multi-GPU system only? Or, it happens on both single and multi-GPU systems?

@Conradzz

Conradzz commented Apr 12, 2026

Copy link
Copy Markdown
Author

@Conradzz Does this happen on multi-GPU system only? Or, it happens on both single and multi-GPU systems?

I wish I had two available to let you know, however unfortunately I cannot answer that question.

@Conradzz Conradzz force-pushed the fix-xpu-battlemage-alloc branch from 8b24da9 to f55a390 Compare April 12, 2026 22:56
@Conradzz Conradzz force-pushed the fix-xpu-battlemage-alloc branch from f55a390 to 321eb95 Compare April 12, 2026 23:18
@EikanWang EikanWang requested a review from guangyey April 13, 2026 05:43
@gujinghui

Copy link
Copy Markdown
Collaborator

Thanks for the PR. I’ll need some time to review it and confirm. I’ll update here if anything comes up.

@guangyey

guangyey commented Apr 13, 2026

Copy link
Copy Markdown
Collaborator

Hi @Conradzz
I can't reproduce the host memory shadow behavior. I use the following script running on BMG580, and the results are expected.

import torch

cache = []

def get_mem_available():
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemAvailable"):
                return int(line.split()[1]) / 1024 / 1024  # GB

def allocate_2G_cpu_tensor():
    global cache
    before = get_mem_available()
    a = torch.zeros(1024*1024*512, device='cpu')
    cache.append(a)
    after = get_mem_available()
    print(f"Allocate 2GB on CPU, Before {before:.2f}GB, After {after:.2f}GB, Changed {before-after:.2f}GB")


def allocate_2G_gpu_tensor():
    global cache
    before = get_mem_available()
    a = torch.zeros(1024*1024*512, device='xpu')
    cache.append(a)
    after = get_mem_available()
    print(f"Allocate 2GB on GPU, Before {before:.2f}GB, After {after:.2f}GB, Changed {before-after:.2f}GB")


print(torch.xpu.get_device_properties())
allocate_2G_cpu_tensor()
allocate_2G_gpu_tensor()
allocate_2G_gpu_tensor()
allocate_2G_gpu_tensor()
allocate_2G_gpu_tensor()
allocate_2G_gpu_tensor()
allocate_2G_cpu_tensor()
image

Could you please provide your reproducer, or anything I am missing?

@Conradzz

Copy link
Copy Markdown
Author

Dug into this more — I think the driver upgrade was the actual fix, not the allocation path change. The process I measured had been running on compute-runtime 25.18, and the patched binary loaded 26.09. Two things changed at once.

Tested both paths on 26.09 with 264 allocations (4.7GB) — zero host RAM shadow either way. Can't reproduce it. Closing this out, thanks for the review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

4 participants