[XPU] Use Level Zero zeMemAllocDevice to avoid host memory shadowing by Conradzz · Pull Request #180145 · pytorch/pytorch

Conradzz · 2026-04-11T20:41:51Z

Summary

On discrete Intel GPUs using the xe kernel driver (Battlemage/Xe2 and later),
sycl::aligned_alloc_device creates device memory through a DMA-buf/TTM code
path that allocates a 1:1 host-side memory mirror for every device allocation.
On a 32 GB card this can exhaust all available host RAM.

This PR replaces the SYCL device allocation path with direct Level Zero
zeMemAllocDevice / zeMemFree calls, which use the SVM/P2P allocation path
and do not create host-side mirrors. The same approach was validated in
llama.cpp (ggml-org/llama.cpp#21597).

What's in the patch

121 lines across 5 files:

| File | Change |
| --- | --- |
| `ATenLevelZero.h` | Add `zeMemAllocDevice` and `zeMemFree` to the Level Zero function pointer table |
| `LazyLevelZero.cpp` | Add 6-arg lazy stub macro + stubs for the two new functions |
| `XPUCachingAllocator.h` | Declare callback function types and registration API |
| `XPUCachingAllocator.cpp` | Check for registered callbacks in `allocPrimitive` / `deletePrimitive`, fall back to SYCL |
| `XPUHooks.cpp` | Implement Level Zero alloc/free, register during `XPUHooks::init()` |

The c10 layer cannot include ATen headers, so the Level Zero implementation
lives in ATen and is registered into c10 via function pointer callbacks at init
time.
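The registration pattern described above can be illustrated with a small language-neutral sketch. It is written in Python for brevity and is not the PR's C++ code: every name here (`register_alloc_hooks`, `alloc_primitive`, `delete_primitive`, the tuple-based `Handle`) is hypothetical and only mirrors the idea that the lower layer exposes a registration API and falls back to its built-in path when nothing is registered.

```python
from typing import Callable, List, Optional, Tuple

# Stand-in for a device pointer: (backend name, size in bytes). Hypothetical.
Handle = Tuple[str, int]

# "Lower layer" state: optional hooks installed by a higher layer at init time.
_alloc_hook: Optional[Callable[[int], Handle]] = None
_free_hook: Optional[Callable[[Handle], None]] = None


def register_alloc_hooks(
    alloc: Callable[[int], Handle], free: Callable[[Handle], None]
) -> None:
    """Install alloc/free callbacks; the lower layer never imports the caller."""
    global _alloc_hook, _free_hook
    _alloc_hook, _free_hook = alloc, free


def alloc_primitive(size: int) -> Handle:
    """Use the registered hook if present, otherwise the built-in path."""
    if _alloc_hook is not None:
        return _alloc_hook(size)
    return ("sycl", size)  # toy model of the default allocation path


def delete_primitive(handle: Handle) -> None:
    """Free through the registered hook if present; the toy default is a no-op."""
    if _free_hook is not None:
        _free_hook(handle)


# "Higher layer" implementation that would be registered during init.
freed: List[Handle] = []


def level_zero_alloc(size: int) -> Handle:
    return ("level_zero", size)


def level_zero_free(handle: Handle) -> None:
    freed.append(handle)
```

Before `register_alloc_hooks(level_zero_alloc, level_zero_free)` is called, `alloc_primitive` returns handles from the default path; afterwards it delegates to the registered functions. A real allocator would also need to make sure each block is freed by the same path that allocated it, which this toy model does not track.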

Measured impact (Arc B70, 32 GB GDDR6, xe driver 1.14.37435+1)

| Metric | Before (SYCL path) | After (Level Zero path) |
| --- | --- | --- |
| Host RAM consumed per GB of VRAM allocated | ~1 GB | 0 |
| `MemAvailable` after allocating 24 GB on device | ~4 GB | ~28 GB |
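A ratio like "host RAM consumed per GB of VRAM allocated" can be derived from two `MemAvailable` readings taken before and after a device allocation. The following is a minimal sketch of that arithmetic, assuming the standard `/proc/meminfo` line format; the helper names are hypothetical and the numbers in the usage note are illustrative, not measurements from this PR.

```python
def parse_mem_available_kib(meminfo_text: str) -> int:
    """Extract the MemAvailable value (reported in kB) from /proc/meminfo text."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1])
    raise ValueError("MemAvailable not found in meminfo text")


def host_gib_per_device_gib(
    before_kib: int, after_kib: int, device_gib: float
) -> float:
    """Host memory consumed (GiB) per GiB allocated on the device."""
    consumed_gib = (before_kib - after_kib) / (1024 * 1024)
    return consumed_gib / device_gib
```

For example, if `MemAvailable` dropped from 33554432 kB to 8388608 kB while 24 GiB was allocated on the device, the ratio would be 1.0, i.e. a full host-side mirror; a ratio near 0 would indicate no mirroring.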

Opt-out

Set PYTORCH_XPU_ALLOC_LEVEL_ZERO=0 to revert to the SYCL allocation path.
Enabled by default on Linux. Windows is unaffected (no xe kernel driver).
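The opt-out rule can be sketched as a small decision function. This is an illustration in Python, not the PR's C++ logic: the function name is hypothetical, and it assumes that only the exact value `0` disables the Level Zero path, since the description does not say how other values are handled.

```python
import os
import sys
from typing import Mapping, Optional

ENV_VAR = "PYTORCH_XPU_ALLOC_LEVEL_ZERO"


def use_level_zero_alloc(
    env: Optional[Mapping[str, str]] = None,
    platform: Optional[str] = None,
) -> bool:
    """Return True when the Level Zero allocation path would be selected."""
    env = os.environ if env is None else env
    platform = sys.platform if platform is None else platform
    if not platform.startswith("linux"):
        return False  # the description says the change applies on Linux only
    # Assumption: unset or any value other than "0" keeps the default (enabled).
    return env.get(ENV_VAR) != "0"
```

Passing `env` and `platform` explicitly keeps the function easy to exercise without touching the real process environment.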

Test plan

- torch.empty(N, device='xpu') + tensor.zero_(): allocation and kernel dispatch
- BF16 matmul, Conv2d, LayerNorm, SDPA: all pass on Arc B70
- torch.xpu.memory_allocated() reports correct values
- GPU ↔ CPU transfers round-trip with zero diff
- PYTORCH_XPU_ALLOC_LEVEL_ZERO=0 falls back to SYCL path cleanly

On discrete Intel GPUs (Xe2 and later), the xe kernel driver creates a 1:1 host-side memory mirror for every sycl::aligned_alloc_device call via the DMA-buf/TTM path. This can consume all available host RAM when the device has large VRAM (e.g. 32 GB on Arc B-series). Replace the SYCL allocation path with direct Level Zero zeMemAllocDevice calls, which use the SVM/P2P path and do not create host-side mirrors. The implementation uses a callback registration pattern so that c10 (which cannot depend on ATen headers) delegates to ATen at init time. Opt out by setting PYTORCH_XPU_ALLOC_LEVEL_ZERO=0. Linux only; Windows is unaffected (no xe kernel driver).

…queries The oneAPI Unified Runtime Level Zero adapter dereferences a NULL extension function pointer (ze_device_vector_width_properties_ext_t) when querying half_fp_config, preferred_vector_width_*, or native_vector_width_* on Battlemage G31 devices. This causes a segfault on the first kernel dispatch since getDeviceProperties() is called before every kernel launch. Replace the AT_FORALL_XPU_DEVICE_PROPERTIES macro expansion with individual property assignments, using safe defaults for the affected queries.
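The idea in this commit message, assigning properties one by one and never issuing the queries known to crash, can be sketched as follows. This is a Python illustration rather than the actual C++ change: `build_properties`, `UNSAFE_PREFIXES`, and the placeholder default of `0` are all assumptions, since the real default values are not shown here.

```python
from typing import Any, Callable, Dict

# Property names (or name prefixes) listed in the commit message as affected.
UNSAFE_PREFIXES = (
    "half_fp_config",
    "preferred_vector_width_",
    "native_vector_width_",
)
SAFE_DEFAULT = 0  # assumed placeholder; the actual defaults are not given


def build_properties(queries: Dict[str, Callable[[], Any]]) -> Dict[str, Any]:
    """Fill each property individually, skipping queries known to be unsafe."""
    props: Dict[str, Any] = {}
    for name, query in queries.items():
        if name.startswith(UNSAFE_PREFIXES):
            props[name] = SAFE_DEFAULT  # never call the crashing query
        else:
            props[name] = query()
    return props
```

Skipping the call matters more than catching an error, because a NULL function pointer dereference in native code is a segfault and cannot be handled as an exception.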

pytorch-bot · 2026-04-11T20:41:56Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/180145

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There are 1 currently active SEVs. If your PR is affected, please view them below:

Rolling out OSDC (ARC) runners on pull workflow for PyTorch trunk commits

This comment was automatically generated by Dr. CI and updates every 15 minutes.

linux-foundation-easycla · 2026-04-11T20:41:57Z

The committers listed above are authorized under a signed CLA.

✅ login: Conradzz / name: Aelryic & Nathan (d00c2df, d9e8160)
✅ login: Conradzz / name: Nathan Sharlaw (321eb95)

pytorch-bot · 2026-04-11T20:41:59Z

This PR needs a `release notes:` label

If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Conradzz · 2026-04-11T20:52:58Z

@pytorchbot label "release notes: xpu"

Conradzz · 2026-04-11T20:54:46Z

@pytorchbot label "topic: not user facing"

gujinghui · 2026-04-12T08:51:58Z

@Conradzz Does this happen only on multi-GPU systems, or on both single-GPU and multi-GPU systems?

Conradzz · 2026-04-12T22:52:50Z

> @Conradzz Does this happen only on multi-GPU systems, or on both single-GPU and multi-GPU systems?

I wish I had two GPUs available to test this, but unfortunately I can't answer that question.

gujinghui · 2026-04-13T07:04:41Z

Thanks for the PR. I’ll need some time to review it and confirm. I’ll update here if anything comes up.

guangyey · 2026-04-13T10:27:33Z

Hi @Conradzz
I can't reproduce the host memory shadow behavior. I ran the following script on a BMG580, and the results are as expected.

import torch

# Keep references so the tensors stay allocated between measurements.
cache = []

def get_mem_available():
    """Return MemAvailable from /proc/meminfo, converted from kB to GB."""
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemAvailable"):
                return int(line.split()[1]) / 1024 / 1024  # GB

def allocate_2G_cpu_tensor():
    before = get_mem_available()
    # 512M float32 elements = 2 GB
    a = torch.zeros(1024*1024*512, device='cpu')
    cache.append(a)
    after = get_mem_available()
    print(f"Allocate 2GB on CPU, Before {before:.2f}GB, After {after:.2f}GB, Changed {before-after:.2f}GB")


def allocate_2G_gpu_tensor():
    before = get_mem_available()
    # Same size on the XPU; host MemAvailable should stay flat if there is no mirror.
    a = torch.zeros(1024*1024*512, device='xpu')
    cache.append(a)
    after = get_mem_available()
    print(f"Allocate 2GB on GPU, Before {before:.2f}GB, After {after:.2f}GB, Changed {before-after:.2f}GB")


print(torch.xpu.get_device_properties())
allocate_2G_cpu_tensor()
allocate_2G_gpu_tensor()
allocate_2G_gpu_tensor()
allocate_2G_gpu_tensor()
allocate_2G_gpu_tensor()
allocate_2G_gpu_tensor()
allocate_2G_cpu_tensor()

Could you please share your reproducer, or point out anything I am missing?

Conradzz · 2026-04-14T01:17:53Z

Dug into this more. I think the driver upgrade was the actual fix, not the allocation path change. The process I measured had been running on compute-runtime 25.18, while the patched binary loaded 26.09, so two things changed at once.

Tested both paths on 26.09 with 264 allocations (4.7 GB): zero host RAM shadow either way. Can't reproduce it. Closing this out, thanks for the review.

Conradzz added 2 commits April 11, 2026 20:40

Conradzz requested review from EikanWang and gujinghui as code owners April 11, 2026 20:41

pytorchbot added the open source label Apr 11, 2026

pytorch-bot Bot added the release notes: xpu release notes category label Apr 11, 2026

github-project-automation Bot added this to PyTorch Intel Apr 11, 2026

pytorch-bot Bot added the topic: not user facing topic category label Apr 11, 2026

Conradzz force-pushed the fix-xpu-battlemage-alloc branch from 8b24da9 to f55a390 Compare April 12, 2026 22:56

style: apply clang-format to fix CI lint warnings

321eb95

Conradzz force-pushed the fix-xpu-battlemage-alloc branch from f55a390 to 321eb95 Compare April 12, 2026 23:18

EikanWang requested a review from guangyey April 13, 2026 05:43

Conradzz closed this Apr 14, 2026

NeoZhangJianyu mentioned this pull request Apr 15, 2026

SYCL: fix multi-GPU system RAM exhaustion by using Level Zero allocations ggml-org/llama.cpp#21597

Merged

7 tasks

Conradzz wants to merge 3 commits into pytorch:main from Conradzz:fix-xpu-battlemage-alloc


4 participants
