
[RFC] Intel GPU Inductor backend upstreaming #114856

@etaf

Description

🚀 The feature, motivation and pitch

Motivation

As the RFC Intel GPU Upstreaming mentions, providing an Intel GPU backend for Inductor is crucial for torch.compile to support Intel GPU.

Design

The existing Triton backend within Inductor already supports GPUs, and our integration enables Triton to run on Intel GPUs. Consequently, extending Inductor to support Intel GPUs becomes a streamlined process by building on the current Triton backend: only minimal design work and code changes are required in the Inductor codebase itself. The design consists of three components:

  1. Backend registration.
  2. Code generation.
  3. Graph fusion.

1. Backend registration

[figure: backend registration diagram]

Inductor provides a clear and simple mechanism for backend integration: registering two essential classes at runtime, BaseScheduling and WrapperCodegen.

  • BaseScheduling is the interface for kernel code generation.
  • WrapperCodegen generates the wrapper code that glues, compiles, and benchmarks the kernels.

For the Intel GPU backend, we will reuse WrapperCodegen and TritonScheduling, since we have already enabled Triton to support Intel GPU, and register them with Inductor through the register_backend_for_device interface:

register_backend_for_device("xpu", TritonScheduling, WrapperCodeGen)

2. Codegen

From a design perspective, WrapperCodegen and TritonScheduling are device-agnostic: they generate Python wrapper code and Triton kernel code. In the detailed implementation, however, some device-biased code remains. For instance, WrapperCodegen invariably embeds the synchronize() literal directly into the generated code, irrespective of whether the current device backend actually supports this feature.
To address these frictions, we intend to provide a general design for device-backend-specific implementations: abstract the device-biased code generation into a common class interface such as DeviceOpOverrides, and leave the flexibility to each device backend in its inherited class, as shown in the following diagram:

[figure: DeviceOpOverrides class diagram]
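The abstraction described above could look roughly like the following sketch. The class and method names here are illustrative assumptions based on the proposal, not the final API: the base class names the device-biased snippets the wrapper codegen needs, and each backend subclass returns its own literal.

```python
# Hedged sketch of the proposed DeviceOpOverrides abstraction: the base
# class declares the device-biased code snippets, and each backend
# overrides them. Method names are illustrative, not the final API.
class DeviceOpOverrides:
    def synchronize(self) -> str:
        """Return the source literal that synchronizes the device."""
        raise NotImplementedError

    def set_device(self, index: int) -> str:
        """Return the source literal that selects a device."""
        raise NotImplementedError


class CUDADeviceOpOverrides(DeviceOpOverrides):
    def synchronize(self) -> str:
        return "torch.cuda.synchronize()"

    def set_device(self, index: int) -> str:
        return f"torch.cuda.set_device({index})"


class XPUDeviceOpOverrides(DeviceOpOverrides):
    def synchronize(self) -> str:
        return "torch.xpu.synchronize()"

    def set_device(self, index: int) -> str:
        return f"torch.xpu.set_device({index})"


# The wrapper codegen asks the registered overrides for the right
# snippet instead of hard-coding a torch.cuda.* literal.
overrides = {"cuda": CUDADeviceOpOverrides(), "xpu": XPUDeviceOpOverrides()}
print(overrides["xpu"].synchronize())  # torch.xpu.synchronize()
```

This keeps the shared codegen path free of per-device branches: a new backend only supplies a subclass, and the wrapper generator stays unchanged.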

Besides the device-biased code in the generated wrapper and kernel code, Inductor relies on particular per-device runtime functions during the generation process. Take CachingAutotuner in triton_heuristics.py as an example: it invokes current_device() and synchronize() directly with a hard-coded device type.
Since the general device interface is already available on the Dynamo side, it is straightforward to generalize such device-specific code by using that interface. The device-biased code can be revised as:

torch_device = get_interface_for_device(device_type)
with torch_device.device(compile_meta["device"]):
    torch_device.synchronize(torch_device.current_device())
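The dispatch pattern behind that snippet can be sketched with a small mock. The real get_interface_for_device lives on the Dynamo side; the interface class and registry below are simplified assumptions that only illustrate how a device string resolves to a uniform set of runtime functions.

```python
# Minimal mock of a Dynamo-style device interface lookup; the real
# implementation lives in torch._dynamo. This only shows the dispatch
# pattern, not actual device control.
from contextlib import contextmanager


class XPUInterface:
    """Illustrative interface exposing uniform runtime functions."""

    @staticmethod
    def current_device():
        return 0  # a real backend would query the active device index

    @staticmethod
    def synchronize(device=None):
        pass  # a real backend would block until queued kernels finish

    @staticmethod
    @contextmanager
    def device(index):
        yield  # a real backend would switch the active device here


_interfaces = {"xpu": XPUInterface}


def get_interface_for_device(device_type):
    """Resolve a device string to its interface class."""
    return _interfaces[device_type]


# Same shape as the revised Inductor code above, with no hard-coded
# torch.cuda.* calls anywhere in the generic path.
torch_device = get_interface_for_device("xpu")
with torch_device.device(0):
    torch_device.synchronize(torch_device.current_device())
```

Because the generic code only ever touches the interface, supporting a new device amounts to registering one more interface class.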

3. Graph fusion

We will extend the CPU fusion patterns defined in fx_passes/mkldnn_fusion.py with minimal code changes and reuse them for Intel GPU.
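One way to read "extend with minimal code change" is to parameterize the existing pattern registrations over the device list. The decorator and registry below are purely illustrative (not the actual mkldnn_fusion.py code) and sketch how a CPU pattern could be reused for "xpu" with a one-line change.

```python
# Illustrative sketch: fusion patterns keyed by (device, name), so the
# same pattern definition can serve both "cpu" and "xpu". Not the real
# fx_passes/mkldnn_fusion.py implementation.
fusion_patterns = {}  # (device, pattern_name) -> pattern function


def register_fusion(pattern_name, devices=("cpu",)):
    def decorator(fn):
        for device in devices:
            fusion_patterns[(device, pattern_name)] = fn
        return fn
    return decorator


# A CPU pattern extended to Intel GPU by widening its device list.
@register_fusion("conv_relu", devices=("cpu", "xpu"))
def fuse_conv_relu(graph):
    """Replace conv followed by relu with a single fused node (stub)."""
    return graph
```

Under this scheme the pattern bodies stay untouched; only the registration line mentions the extra device.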

Alternatives

No response

Additional context

No response

cc @frank-wei @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @ezyang @msaroufim @wconstab @bdhirsh @anijain2305 @zou3519 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler

Labels: feature, module: inductor, module: intel, oncall: pt2, triaged
