[RFC] Intel GPU Inductor backend upstreaming #114856
🚀 The feature, motivation and pitch
Motivation
As mentioned in the Intel GPU Upstreaming RFC, providing an Intel GPU backend for Inductor is crucial to ensuring that torch.compile supports Intel GPU.
Design
The existing Triton backend within Inductor already supports GPUs, and our integration enables Triton to function on Intel GPUs. Consequently, extending Inductor to support Intel GPU becomes a streamlined process that builds on the current Triton backend, requiring only minimal design work and code changes in the Inductor codebase itself. The design comprises three components:
- Backend registration.
- Code generation.
- Graph fusion.
1. Backend registration
Inductor has provided a clear and simple mechanism for backend integration by registering two essential classes at runtime - BaseScheduling and WrapperCodegen:
- `BaseScheduling` is the interface for kernel code generation.
- `WrapperCodegen` generates the wrapper code that glues, compiles, and benchmarks the kernels.
For the Intel GPU backend, we will reuse `WrapperCodegen` and `TritonScheduling`, since we have already enabled Triton to support Intel GPU, and then register them with Inductor through the `register_backend_for_device` interface:
```python
register_backend_for_device("xpu", TritonScheduling, WrapperCodeGen)
```
2. Code generation
From the design perspective, `WrapperCodegen` and `TritonScheduling` are device-agnostic: they generate the Python wrapper code and the Triton kernel code. In the detailed implementation, however, there is still some device-biased code. For instance, `WrapperCodegen` invariably embeds the `synchronize()` literal directly into the wrapper code, irrespective of whether the current device backend actually supports this feature.
To address these frictions, we intend to provide a general design for device-backend-specific implementations. We would like to abstract the device-biased code generation into a common class interface, such as `DeviceOpOverrides`, and leave the flexibility to each device backend in its derived class.
Besides the device-biased code in the generated wrapper and kernel code, Inductor relies on particular per-device runtime functions during the code generation process. Take `CachingAutotuner` in `triton_heuristics.py` as an example: it invokes `current_device()` and `synchronize()` directly with a hard-coded device type.
Since the general device interface is already available on the Dynamo side, it is straightforward to generalize the device-specific code by utilizing that interface. The device-biased code can be revised as:
```python
torch_device = get_interface_for_device(device_type)
with torch_device.device(compile_meta["device"]):
    torch_device.synchronize(torch_device.current_device())
```
3. Graph fusion
We will extend the CPU fusion patterns defined in `fx_passes/mkldnn_fusion.py` with minimal code changes and reuse them for Intel GPU.
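The fusion idea behind those patterns can be illustrated with a toy example: match an op sequence in a graph (here, conv followed by relu) and replace it with a single fused op. The list-of-ops graph representation and function names below are simplifications for illustration, not the actual FX pattern-matcher API.

```python
# Toy illustration of graph fusion: rewrite consecutive ("conv", "relu")
# pairs into one fused op, analogous to the conv+relu patterns in
# fx_passes/mkldnn_fusion.py (representation simplified for illustration).

def fuse_conv_relu(ops):
    """Return a new op list with conv->relu pairs replaced by a fused op."""
    fused, i = [], 0
    while i < len(ops):
        if i + 1 < len(ops) and ops[i] == "conv" and ops[i + 1] == "relu":
            fused.append("fused_conv_relu")  # one kernel instead of two
            i += 2
        else:
            fused.append(ops[i])
            i += 1
    return fused

print(fuse_conv_relu(["conv", "relu", "add", "conv", "relu"]))
# -> ['fused_conv_relu', 'add', 'fused_conv_relu']
```

Because these patterns are expressed over the FX graph rather than over device-specific kernels, the same matching logic can be reused for Intel GPU with only the fused-op lowering differing per device.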
Alternatives
No response
Additional context
No response
cc @frank-wei @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @ezyang @msaroufim @wconstab @bdhirsh @anijain2305 @zou3519 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler