[RFC] Intel GPU Runtime Upstreaming for Allocator #116322
Description
Motivation
As mentioned in [RFC] Intel GPU Runtime Upstreaming, we will elaborate on our plan and design for upstreaming Allocator here. In PyTorch, Allocator is used to manage the memory pool, reducing memory footprint and system allocation/deallocation overhead. The memory caching algorithm in PyTorch is universal for any type of device; however, the current implementation is designed specifically for CUDA. To share the mechanism with other backends, we propose a design that makes the memory caching mechanism device-agnostic.
Design
Device Allocator
In the current design, CUDA uses a CUDAAllocator class, inherited from c10::Allocator, as an interface. Its derivative class NativeCachingAllocator is a manager to handle DeviceCachingAllocator, which is a per-device implementation of the caching mechanism to manage the already cached or newly allocated memory. We can visualize this design as below.
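Since the figure may not render here, the current layering can also be sketched in code. The following is a hypothetical, simplified C++ sketch of the three roles described above (the real PyTorch classes have many more members; the bodies below are illustrative only, and raw_allocs()/device_allocator() are made-up helpers for demonstration):

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Stand-in for c10::Allocator: the abstract allocation interface.
struct Allocator {
  virtual ~Allocator() = default;
  virtual void* allocate(std::size_t nbytes, int device) = 0;
};

// Per-device caching logic (stands in for DeviceCachingAllocator).
class DeviceCachingAllocator {
 public:
  void* malloc(std::size_t nbytes) {
    // The real code searches a cached block pool first; here we just
    // count raw allocations to show the call path.
    ++raw_allocs_;
    blocks_.push_back(std::make_unique<char[]>(nbytes));
    return blocks_.back().get();
  }
  int raw_allocs() const { return raw_allocs_; }

 private:
  int raw_allocs_ = 0;
  std::vector<std::unique_ptr<char[]>> blocks_;
};

// Manager that routes requests to the right per-device allocator
// (stands in for NativeCachingAllocator).
class NativeCachingAllocator : public Allocator {
 public:
  explicit NativeCachingAllocator(int device_count)
      : per_device_(device_count) {}
  void* allocate(std::size_t nbytes, int device) override {
    return per_device_[device].malloc(nbytes);  // route by device index
  }
  DeviceCachingAllocator& device_allocator(int device) {
    return per_device_[device];
  }

 private:
  std::vector<DeviceCachingAllocator> per_device_;
};
```

The key point of the layering is that the caching policy lives entirely in the per-device class, while the manager only owns one instance per device and dispatches to it.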
Essentially, the caching mechanism is implemented in DeviceCachingAllocator, and NativeCachingAllocator is a manager associated with the device/stream runtime. A flexible design is necessary to make DeviceCachingAllocator and NativeCachingAllocator device-agnostic. With such a design, we can share most of the code between CUDA, XPU, and even other out-of-tree backends.
For this upstream, we focus only on supporting the caching allocator on Intel GPU. To reuse the caching mechanism code, we generalize two abstractions at two levels.
- DeviceCachingAllocator is an abstraction that contains the base per-device logic of the caching mechanism. Meanwhile, each type of device may support a special feature, like CUDA Graph. To satisfy this condition, CUDADeviceCachingAllocator, inheriting from DeviceCachingAllocator, has the opportunity to override the base logic to implement the required feature.
- NativeCachingAllocator is the other abstraction, a manager that handles DeviceCachingAllocator, the per-device implementation of the base caching mechanism. XPUNativeCachingAllocator inherits from NativeCachingAllocator as an interface that can fetch the corresponding DeviceCachingAllocator via the member function get_allocator(). Some runtime functionality is required in XPUNativeCachingAllocator to control which stream and device the allocated memory is used on.
We can illustrate our design below.
This design is flexible to extend the special feature for each type of device. It is easy to equip a new caching allocator on any other backend to leverage the base caching mechanism.
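The two-level abstraction above can be sketched as follows. This is a hypothetical, simplified C++ sketch using the class names from the design; the hook names (pre_alloc_hook, backend) and all bodies are illustrative assumptions, not the actual upstream code:

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Level 1: base per-device caching logic, with an extension point that
// a backend (e.g. CUDA, for CUDA Graph capture) can override.
class DeviceCachingAllocator {
 public:
  virtual ~DeviceCachingAllocator() = default;
  void* malloc(std::size_t nbytes) {
    pre_alloc_hook();                 // backend-specific extension point
    return do_cached_alloc(nbytes);   // shared caching logic lives here
  }
  virtual const char* backend() const { return "generic"; }

 protected:
  virtual void pre_alloc_hook() {}    // CUDA could override this for graphs

  void* do_cached_alloc(std::size_t nbytes) {
    // Placeholder for the shared block-pool search and allocation.
    last_ = std::make_unique<char[]>(nbytes);
    return last_.get();
  }

 private:
  std::unique_ptr<char[]> last_;
};

class XPUDeviceCachingAllocator : public DeviceCachingAllocator {
 public:
  const char* backend() const override { return "xpu"; }
};

// Level 2: abstract manager exposing get_allocator().
class NativeCachingAllocator {
 public:
  virtual ~NativeCachingAllocator() = default;
  virtual DeviceCachingAllocator* get_allocator(int device) = 0;
};

class XPUNativeCachingAllocator : public NativeCachingAllocator {
 public:
  DeviceCachingAllocator* get_allocator(int /*device*/) override {
    return &impl_;  // the real code would also bind stream/device runtime
  }

 private:
  XPUDeviceCachingAllocator impl_;
};
```

With this shape, a new backend only derives the two leaf classes; the caching policy in the base class is reused unchanged.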
Host Allocator
Host Allocator is simpler than Device Allocator. We can illustrate its design below.
For Host Allocator, we could adopt a similar design to Device Allocator. As shown in the picture, this design can easily be extended so other backends leverage the host caching mechanism.
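The idea of a shared host caching allocator can be sketched as follows. This is a hypothetical sketch under the assumption that a base class caches freed host blocks by size (so repeated allocations avoid pinning overhead) and a backend subclass supplies the actual pinned-allocation call (e.g. cudaHostAlloc or a SYCL USM host allocation); plain std::malloc stands in for those runtime calls here, and release()/cache_hits() are made-up names:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <map>

class HostCachingAllocator {
 public:
  virtual ~HostCachingAllocator() = default;  // real code frees cached blocks

  void* allocate(std::size_t nbytes) {
    auto it = cache_.find(nbytes);
    if (it != cache_.end()) {        // reuse a cached block: no re-pinning
      void* p = it->second;
      cache_.erase(it);
      ++cache_hits_;
      return p;
    }
    return raw_allocate(nbytes);     // fall back to backend pinned alloc
  }

  void release(std::size_t nbytes, void* p) {
    cache_.emplace(nbytes, p);       // keep the block for future reuse
  }

  int cache_hits() const { return cache_hits_; }

 protected:
  // Backend override point: a CUDA or XPU subclass would pin memory here.
  virtual void* raw_allocate(std::size_t nbytes) { return std::malloc(nbytes); }

 private:
  std::multimap<std::size_t, void*> cache_;
  int cache_hits_ = 0;
};
```

The design choice mirrors the device side: the caching policy is shared in the base class, and only the raw allocation primitive differs per backend.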
Additional Context
We're going to implement our design gradually. A non-negligible effort and considerable time are necessary because the current implementation is strongly coupled with CUDA. We expect to separate our Allocator into two stages:
- upstream the key functionality of Allocator dedicated for XPU to PyTorch, implementing the device Allocator in approximately 200 lines and the host Allocator in another 150 lines.
- following our design, prepare to generalize Allocator in parallel.
Besides the caching allocator, there is another allocator with async behavior in PyTorch, CudaMallocAsyncAllocator. We categorize features such as the async allocator and CUDA Graph as optional, since they may not be supported by other backends. We will generalize them in a later step.
cc @frank-wei @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10