[RFC] Intel GPU Runtime Upstreaming for Allocator #116322
Description
Motivation
As mentioned in [RFC] Intel GPU Runtime Upstreaming, we will elaborate on our plan and design for upstreaming Allocator here. In PyTorch, Allocator is used to manage the memory pool, reducing memory footprint and system allocation/deallocation overhead. The memory caching algorithm in PyTorch is universal for any type of device; however, the current implementation is designed specifically for CUDA. To share the mechanism with other backends, we propose a design that makes the memory caching mechanism device-agnostic.
Design
Device Allocator
In the current design, CUDA uses a CUDAAllocator class, inherited from c10::Allocator, as an interface. Its derivative class NativeCachingAllocator is a manager to handle DeviceCachingAllocator, which is a per-device implementation of the caching mechanism to manage the already cached or newly allocated memory. We can visualize this design as below.
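Since the figure may not render here, the current layering can also be sketched in code. The following is a hypothetical, simplified C++ sketch of the three roles described above (the real PyTorch classes have many more members; the bodies below are illustrative only, and raw_allocs()/device_allocator() are made-up helpers for demonstration):

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Stand-in for c10::Allocator: the abstract allocation interface.
struct Allocator {
  virtual ~Allocator() = default;
  virtual void* allocate(std::size_t nbytes, int device) = 0;
};

// Per-device caching logic (stands in for DeviceCachingAllocator).
class DeviceCachingAllocator {
 public:
  void* malloc(std::size_t nbytes) {
    // The real code searches a cached block pool first; here we just
    // count raw allocations to show the call path.
    ++raw_allocs_;
    blocks_.push_back(std::make_unique<char[]>(nbytes));
    return blocks_.back().get();
  }
  int raw_allocs() const { return raw_allocs_; }

 private:
  int raw_allocs_ = 0;
  std::vector<std::unique_ptr<char[]>> blocks_;
};

// Manager that routes requests to the right per-device allocator
// (stands in for NativeCachingAllocator).
class NativeCachingAllocator : public Allocator {
 public:
  explicit NativeCachingAllocator(int device_count)
      : per_device_(device_count) {}
  void* allocate(std::size_t nbytes, int device) override {
    return per_device_[device].malloc(nbytes);  // route by device index
  }
  DeviceCachingAllocator& device_allocator(int device) {
    return per_device_[device];
  }

 private:
  std::vector<DeviceCachingAllocator> per_device_;
};
```

The key point of the layering is that the caching policy lives entirely in the per-device class, while the manager only owns one instance per device and dispatches to it.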
Essentially, the caching mechanism is implemented in DeviceCachingAllocator, and NativeCachingAllocator is a manager associated with the device/stream runtime. A flexible design is necessary to make DeviceCachingAllocator and NativeCachingAllocator device-agnostic. With such a design, we can share most of the code between CUDA, XPU, and even other out-of-tree backends.
For this upstream, we focus only on supporting the caching allocator on Intel GPU. To reuse the caching mechanism code, we generalize two abstractions at two levels.
- DeviceCachingAllocator is an abstraction that contains the base per-device logic of the caching mechanism. Meanwhile, each type of device may support a special feature, like CUDA Graph. To satisfy this condition, CUDADeviceCachingAllocator, inheriting from DeviceCachingAllocator, has the opportunity to override the base logic to implement the required feature.
- NativeCachingAllocator is the other abstraction, a manager that handles DeviceCachingAllocator, the per-device implementation of the base caching mechanism. XPUNativeCachingAllocator inherits from NativeCachingAllocator as an interface that can fetch the corresponding DeviceCachingAllocator via the member function get_allocator(). Some runtime functionality is required in XPUNativeCachingAllocator to control which stream and device the allocated memory is used on.
We can illustrate our design below.
This design is flexible to extend the special feature for each type of device. It is easy to equip a new caching allocator on any other backend to leverage the base caching mechanism.
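The two-level abstraction above can be sketched as follows. This is a hypothetical, simplified C++ sketch using the class names from the design; the hook names (pre_alloc_hook, backend) and all bodies are illustrative assumptions, not the actual upstream code:

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Level 1: base per-device caching logic, with an extension point that
// a backend (e.g. CUDA, for CUDA Graph capture) can override.
class DeviceCachingAllocator {
 public:
  virtual ~DeviceCachingAllocator() = default;
  void* malloc(std::size_t nbytes) {
    pre_alloc_hook();                 // backend-specific extension point
    return do_cached_alloc(nbytes);   // shared caching logic lives here
  }
  virtual const char* backend() const { return "generic"; }

 protected:
  virtual void pre_alloc_hook() {}    // CUDA could override this for graphs

  void* do_cached_alloc(std::size_t nbytes) {
    // Placeholder for the shared block-pool search and allocation.
    last_ = std::make_unique<char[]>(nbytes);
    return last_.get();
  }

 private:
  std::unique_ptr<char[]> last_;
};

class XPUDeviceCachingAllocator : public DeviceCachingAllocator {
 public:
  const char* backend() const override { return "xpu"; }
};

// Level 2: abstract manager exposing get_allocator().
class NativeCachingAllocator {
 public:
  virtual ~NativeCachingAllocator() = default;
  virtual DeviceCachingAllocator* get_allocator(int device) = 0;
};

class XPUNativeCachingAllocator : public NativeCachingAllocator {
 public:
  DeviceCachingAllocator* get_allocator(int /*device*/) override {
    return &impl_;  // the real code would also bind stream/device runtime
  }

 private:
  XPUDeviceCachingAllocator impl_;
};
```

With this shape, a new backend only derives the two leaf classes; the caching policy in the base class is reused unchanged.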
Host Allocator
Host Allocator is simpler than Device Allocator. We can illustrate its design below.
For Host Allocator, we could adopt a similar design to Device Allocator. As shown in the picture, this design can easily be extended so other backends leverage the host caching mechanism.
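The idea of a shared host caching allocator can be sketched as follows. This is a hypothetical sketch under the assumption that a base class caches freed host blocks by size (so repeated allocations avoid pinning overhead) and a backend subclass supplies the actual pinned-allocation call (e.g. cudaHostAlloc or a SYCL USM host allocation); plain std::malloc stands in for those runtime calls here, and release()/cache_hits() are made-up names:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <map>

class HostCachingAllocator {
 public:
  virtual ~HostCachingAllocator() = default;  // real code frees cached blocks

  void* allocate(std::size_t nbytes) {
    auto it = cache_.find(nbytes);
    if (it != cache_.end()) {        // reuse a cached block: no re-pinning
      void* p = it->second;
      cache_.erase(it);
      ++cache_hits_;
      return p;
    }
    return raw_allocate(nbytes);     // fall back to backend pinned alloc
  }

  void release(std::size_t nbytes, void* p) {
    cache_.emplace(nbytes, p);       // keep the block for future reuse
  }

  int cache_hits() const { return cache_hits_; }

 protected:
  // Backend override point: a CUDA or XPU subclass would pin memory here.
  virtual void* raw_allocate(std::size_t nbytes) { return std::malloc(nbytes); }

 private:
  std::multimap<std::size_t, void*> cache_;
  int cache_hits_ = 0;
};
```

The design choice mirrors the device side: the caching policy is shared in the base class, and only the raw allocation primitive differs per backend.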
Additional Context
We're going to implement our design gradually. A non-negligible effort and considerable time are necessary because the current implementation is strongly coupled with CUDA. We expect to separate our Allocator into two stages:
- upstream the key functionality of Allocator dedicated for XPU to PyTorch, implementing the device Allocator in approximately 200 lines and the host Allocator in another 150 lines.
- following our design, prepare to generalize Allocator in parallel.
Besides the caching allocator, there is another allocator with async behavior in PyTorch, CudaMallocAsyncAllocator. We categorize features such as the async allocator and CUDA Graph as optional, since they may not be supported by other backends. We will generalize them in a later step.
cc @frank-wei @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10