
[RFC] Intel GPU Runtime Upstreaming for Allocator #116322

@guangyey

Description


Motivation

As mentioned in [RFC] Intel GPU Runtime Upstreaming, we elaborate here on our plan and design for upstreaming the Allocator. In PyTorch, the Allocator manages the memory pool to reduce memory footprint and system allocation/deallocation overhead. The algorithm of the memory caching mechanism in PyTorch is universal for any type of device; however, the current implementation is designed specifically for CUDA. To share the mechanism with other backends, we propose a design that makes the memory caching mechanism device-agnostic.

Design

Device Allocator

In the current design, CUDA uses a CUDAAllocator class, inherited from c10::Allocator, as an interface. Its derived class NativeCachingAllocator is a manager that handles DeviceCachingAllocator, a per-device implementation of the caching mechanism that manages already-cached and newly allocated memory. We can visualize this design as below.

image
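To make the layering concrete, here is a minimal sketch of the split between the per-device caching logic and its manager. The class names follow the description above, but the signatures and the size-keyed free list are illustrative simplifications, not the actual PyTorch implementation (which tracks streams, block splitting, and more); std::malloc stands in for a real device allocation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <unordered_map>
#include <vector>

// Per-device caching logic: keeps freed blocks and reuses them,
// avoiding repeated expensive device allocations.
struct DeviceCachingAllocator {
  std::unordered_map<std::size_t, std::vector<void*>> free_blocks;

  void* malloc(std::size_t size) {
    auto& pool = free_blocks[size];
    if (!pool.empty()) {            // cache hit: reuse a cached block
      void* p = pool.back();
      pool.pop_back();
      return p;
    }
    return std::malloc(size);       // cache miss: fresh allocation
  }

  void free(void* ptr, std::size_t size) {
    free_blocks[size].push_back(ptr);  // return the block to the cache
  }
};

// Manager: owns one DeviceCachingAllocator per device and routes
// allocation requests to the right one.
struct NativeCachingAllocator {
  std::vector<DeviceCachingAllocator> device_allocator;

  explicit NativeCachingAllocator(int num_devices)
      : device_allocator(num_devices) {}

  void* allocate(int device, std::size_t size) {
    return device_allocator[device].malloc(size);
  }
  void deallocate(int device, void* ptr, std::size_t size) {
    device_allocator[device].free(ptr, size);
  }
};
```

A freed block is handed back on the next same-size request, which is the essence of the caching mechanism the RFC wants to share across backends.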

Essentially, the caching mechanism is implemented in DeviceCachingAllocator, and NativeCachingAllocator is a manager associated with the device/stream runtime. A flexible design is necessary to make DeviceCachingAllocator and NativeCachingAllocator device-agnostic, so that most of the code can be shared between CUDA, XPU, and even other out-of-tree backends.
For this upstreaming, we focus only on supporting the caching allocator on Intel GPU. To reuse the caching mechanism code, we introduce two abstractions at two levels.

  • DeviceCachingAllocator is an abstract class that contains the per-device base logic of the caching mechanism. Meanwhile, each type of device may support a special feature, like CUDA Graph. To satisfy this, CUDADeviceCachingAllocator, inheriting from DeviceCachingAllocator, can override the base logic to implement the required feature.
  • NativeCachingAllocator is the other abstract class, a manager that handles DeviceCachingAllocator, the per-device implementation of the base caching mechanism. XPUNativeCachingAllocator inherits from NativeCachingAllocator as an interface that can fetch the corresponding DeviceCachingAllocator via the member function get_allocator(). Some runtime functionality is required in XPUNativeCachingAllocator to control which stream and device the allocated memory is used on.
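The two-level abstraction above can be sketched as follows. Class and method names (including get_allocator()) follow the RFC, but the members and signatures are illustrative assumptions; the real classes carry the full caching and runtime state.

```cpp
#include <cassert>
#include <cstring>
#include <memory>
#include <vector>

// Level 1: device-agnostic base class holding the shared caching logic.
struct DeviceCachingAllocator {
  virtual ~DeviceCachingAllocator() = default;
  // A backend may override parts of the base logic to add a special
  // feature (e.g. CUDA Graph support in CUDADeviceCachingAllocator).
  virtual const char* backend() const { return "generic"; }
};

// Backend specialization of the per-device caching logic.
struct XPUDeviceCachingAllocator : DeviceCachingAllocator {
  const char* backend() const override { return "xpu"; }
};

// Level 2: device-agnostic manager of the per-device allocators.
struct NativeCachingAllocator {
  virtual ~NativeCachingAllocator() = default;
  std::vector<std::unique_ptr<DeviceCachingAllocator>> allocators;

  // Fetch the per-device allocator, as described in the RFC.
  DeviceCachingAllocator* get_allocator(int device) {
    return allocators[device].get();
  }
};

// Backend interface that wires in device/stream runtime handling.
struct XPUNativeCachingAllocator : NativeCachingAllocator {
  explicit XPUNativeCachingAllocator(int num_devices) {
    for (int i = 0; i < num_devices; ++i)
      allocators.push_back(std::make_unique<XPUDeviceCachingAllocator>());
  }
};
```

The virtual base lets each backend override only what it needs, while the shared manager and caching logic live in the device-agnostic layers.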

We can illustrate our design below.

image

This design is flexible enough to extend the special features of each type of device, and it makes it easy to equip any other backend with a new caching allocator that leverages the base caching mechanism.

Host Allocator

The Host Allocator is simpler than the Device Allocator. We can illustrate its design below.

image

For the Host Allocator, we can adopt a design similar to the Device Allocator's. As shown in the picture below, this design can easily be extended so that other backends leverage the host caching mechanism.

image
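A host allocator following the same pattern might look like the sketch below. This is an assumption-laden illustration, not the RFC's final API: the raw_allocate hook stands in for a backend-specific pinned (page-locked) host allocation (e.g. via the CUDA or SYCL runtime), with std::malloc as a placeholder.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <unordered_map>
#include <vector>

// Device-agnostic host caching allocator: caches freed host blocks
// so expensive pinned allocations are not repeated.
struct HostAllocator {
  virtual ~HostAllocator() = default;
  std::unordered_map<std::size_t, std::vector<void*>> cache;

  // Backend-specific pinned-memory hook; std::malloc is a stand-in here.
  virtual void* raw_allocate(std::size_t size) { return std::malloc(size); }

  void* allocate(std::size_t size) {
    auto& pool = cache[size];
    if (!pool.empty()) {            // reuse a cached host block
      void* p = pool.back();
      pool.pop_back();
      return p;
    }
    return raw_allocate(size);      // fall back to the backend hook
  }

  void release(void* ptr, std::size_t size) {
    cache[size].push_back(ptr);     // return the block to the cache
  }
};

// A backend would override only the raw allocation hook; the caching
// logic above is shared. (Hypothetical class for illustration.)
struct XPUHostAllocator : HostAllocator {
  // Would call the SYCL host-memory runtime in a real implementation.
};
```

Only the raw allocation hook is backend-specific, which is why the host caching mechanism extends to other backends so easily.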

Additional Context

We're going to implement our design gradually. Non-negligible effort and considerable time are necessary because the current implementation is strongly coupled with CUDA. We expect to upstream our Allocator in two stages:

  1. Upstream the key Allocator functionality dedicated to XPU to PyTorch, implementing the device Allocator in approximately 200 lines and the host Allocator in another 150 lines.
  2. Generalize the Allocator following our design, in parallel with other upstreaming work.

Besides the caching allocator, PyTorch has another allocator with async behavior, CudaMallocAsyncAllocator. We categorize features such as the async allocator and CUDA Graph as optional, since they may not be supported by other backends; we will generalize them in a later step.

cc @frank-wei @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10
