
[RFC] Intel GPU Inductor backend upstreaming #114856

@etaf

Description

🚀 The feature, motivation and pitch

Motivation

As the RFC Intel GPU Upstreaming mentions, providing an Intel GPU backend for Inductor is crucial for torch.compile to support Intel GPU.

Design

The existing Triton backend within Inductor already supports GPUs, and our integration enables Triton to run on Intel GPUs. Consequently, extending Inductor to support Intel GPUs becomes a streamlined process by building on the current Triton backend: only minimal design work and code changes are required in the Inductor codebase itself. The design consists of three components:

  1. Backend registration.
  2. Code generation.
  3. Graph fusion.

1. Backend registration

[figure: backend registration diagram]

Inductor provides a clear and simple mechanism for backend integration: registering two essential classes at runtime, BaseScheduling and WrapperCodegen.

  • BaseScheduling is the interface for kernel code generation.
  • WrapperCodegen generates the wrapper code that glues, compiles, and benchmarks the kernels.

For the Intel GPU backend, we will reuse WrapperCodegen and TritonScheduling, since we have already enabled Triton to support Intel GPU, and register them with Inductor through the register_backend_for_device interface:

register_backend_for_device("xpu", TritonScheduling, WrapperCodeGen)

2. Codegen

From a design perspective, WrapperCodegen and TritonScheduling are device-agnostic: they generate Python wrapper code and Triton kernel code. In the detailed implementation, however, some device-biased code remains. For instance, WrapperCodegen invariably embeds the synchronize() literal directly into the generated code, irrespective of whether the current device backend actually supports this feature.
To address these frictions, we intend to provide a general design for device-backend-specific implementations: abstract the device-biased code generation into a common class interface such as DeviceOpOverrides, and leave the flexibility to each device backend in its inherited class, as shown in the following diagram:

[figure: DeviceOpOverrides class diagram]
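The abstraction described above could look roughly like the following sketch. The class and method names here are illustrative assumptions based on the proposal, not the final API: the base class names the device-biased snippets the wrapper codegen needs, and each backend subclass returns its own literal.

```python
# Hedged sketch of the proposed DeviceOpOverrides abstraction: the base
# class declares the device-biased code snippets, and each backend
# overrides them. Method names are illustrative, not the final API.
class DeviceOpOverrides:
    def synchronize(self) -> str:
        """Return the source literal that synchronizes the device."""
        raise NotImplementedError

    def set_device(self, index: int) -> str:
        """Return the source literal that selects a device."""
        raise NotImplementedError


class CUDADeviceOpOverrides(DeviceOpOverrides):
    def synchronize(self) -> str:
        return "torch.cuda.synchronize()"

    def set_device(self, index: int) -> str:
        return f"torch.cuda.set_device({index})"


class XPUDeviceOpOverrides(DeviceOpOverrides):
    def synchronize(self) -> str:
        return "torch.xpu.synchronize()"

    def set_device(self, index: int) -> str:
        return f"torch.xpu.set_device({index})"


# The wrapper codegen asks the registered overrides for the right
# snippet instead of hard-coding a torch.cuda.* literal.
overrides = {"cuda": CUDADeviceOpOverrides(), "xpu": XPUDeviceOpOverrides()}
print(overrides["xpu"].synchronize())  # torch.xpu.synchronize()
```

This keeps the shared codegen path free of per-device branches: a new backend only supplies a subclass, and the wrapper generator stays unchanged.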

Besides the device-biased code in the generated wrapper and kernel code, Inductor relies on particular per-device runtime functions during the generation process. Take CachingAutotuner in triton_heuristics.py as an example: it invokes current_device() and synchronize() directly with a hard-coded device type.
Since the general device interface is already available on the Dynamo side, it is straightforward to generalize such device-specific code by using that interface. The device-biased code can be revised as:

torch_device = get_interface_for_device(device_type)
with torch_device.device(compile_meta["device"]):
    torch_device.synchronize(torch_device.current_device())
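The dispatch pattern behind that snippet can be sketched with a small mock. The real get_interface_for_device lives on the Dynamo side; the interface class and registry below are simplified assumptions that only illustrate how a device string resolves to a uniform set of runtime functions.

```python
# Minimal mock of a Dynamo-style device interface lookup; the real
# implementation lives in torch._dynamo. This only shows the dispatch
# pattern, not actual device control.
from contextlib import contextmanager


class XPUInterface:
    """Illustrative interface exposing uniform runtime functions."""

    @staticmethod
    def current_device():
        return 0  # a real backend would query the active device index

    @staticmethod
    def synchronize(device=None):
        pass  # a real backend would block until queued kernels finish

    @staticmethod
    @contextmanager
    def device(index):
        yield  # a real backend would switch the active device here


_interfaces = {"xpu": XPUInterface}


def get_interface_for_device(device_type):
    """Resolve a device string to its interface class."""
    return _interfaces[device_type]


# Same shape as the revised Inductor code above, with no hard-coded
# torch.cuda.* calls anywhere in the generic path.
torch_device = get_interface_for_device("xpu")
with torch_device.device(0):
    torch_device.synchronize(torch_device.current_device())
```

Because the generic code only ever touches the interface, supporting a new device amounts to registering one more interface class.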

3. Graph fusion

We will extend the CPU fusion patterns defined in fx_passes/mkldnn_fusion.py with minimal code changes and reuse them for Intel GPU.
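One way to read "extend with minimal code change" is to parameterize the existing pattern registrations over the device list. The decorator and registry below are purely illustrative (not the actual mkldnn_fusion.py code) and sketch how a CPU pattern could be reused for "xpu" with a one-line change.

```python
# Illustrative sketch: fusion patterns keyed by (device, name), so the
# same pattern definition can serve both "cpu" and "xpu". Not the real
# fx_passes/mkldnn_fusion.py implementation.
fusion_patterns = {}  # (device, pattern_name) -> pattern function


def register_fusion(pattern_name, devices=("cpu",)):
    def decorator(fn):
        for device in devices:
            fusion_patterns[(device, pattern_name)] = fn
        return fn
    return decorator


# A CPU pattern extended to Intel GPU by widening its device list.
@register_fusion("conv_relu", devices=("cpu", "xpu"))
def fuse_conv_relu(graph):
    """Replace conv followed by relu with a single fused node (stub)."""
    return graph
```

Under this scheme the pattern bodies stay untouched; only the registration line mentions the extra device.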

Alternatives

No response

Additional context

No response

cc @frank-wei @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @ezyang @msaroufim @wconstab @bdhirsh @anijain2305 @zou3519 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler

Labels: feature, module: inductor, module: intel, oncall: pt2, triaged
