
[RFC] Intel GPU oneDNN Upstreaming #114848

@ZhiweiYan-96

Description


🚀 The feature, motivation and pitch

Motivation

Intel GPUs can significantly improve workload performance. As described in [RFC] Intel GPU Upstreaming, we intend to leverage Intel's advancements in GPU technology to enhance PyTorch's performance and versatility. The oneDNN library (Intel® oneAPI Deep Neural Network Library) already has mature support for Intel GPUs and is a key component of the Intel GPU software stack.

This RFC proposes introducing oneDNN GPU support in PyTorch, including Conv, GEMM, and other highly optimized kernels. With this support, users can more easily achieve high performance on Intel GPUs. Moreover, Intel GPU support in PyTorch can evolve quickly alongside the development of the community.

Proposal for Library Structure

PyTorch has integrated oneDNN as a git submodule for CPU support. For Intel GPU support, we will reuse the same oneDNN codebase. To minimize the integration effort, we intend to separately build oneDNN as a static library for Intel CPU and GPU, respectively.

Compiling the CPU source code in oneDNN at pytorch/third_party/ideep/mkl-dnn generates a static library, libdnnl.a, which is linked into libtorch_cpu.so. This is the existing behavior in PyTorch; no modification will be introduced on this side.

Compiling the GPU source code in oneDNN also generates a libdnnl.a static library. GPU and CPU share the same oneDNN codebase at pytorch/third_party/ideep/mkl-dnn to minimize the integration effort. The GPU libdnnl.a would be linked into libtorch_xpu.so.
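The dual-build idea can be sketched as a hypothetical build fragment. The build directories and flag combinations below are illustrative assumptions, not PyTorch's actual build scripts; `DNNL_LIBRARY_TYPE` and `DNNL_GPU_RUNTIME` are real oneDNN CMake options.

```shell
# Hypothetical sketch: build oneDNN twice from the same submodule sources.

# CPU-only build -> libdnnl.a to be linked into libtorch_cpu.so
cmake -S third_party/ideep/mkl-dnn -B build/dnnl_cpu \
      -DDNNL_LIBRARY_TYPE=STATIC -DDNNL_GPU_RUNTIME=NONE
cmake --build build/dnnl_cpu

# GPU build with the SYCL runtime -> a separate libdnnl.a for libtorch_xpu.so
cmake -S third_party/ideep/mkl-dnn -B build/dnnl_gpu \
      -DDNNL_LIBRARY_TYPE=STATIC -DDNNL_GPU_RUNTIME=SYCL
cmake --build build/dnnl_gpu
```

Building two static libraries from one codebase keeps the submodule single-sourced while letting each libtorch component link only the runtime it needs.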

Operator Scope

We suggest that the operator coverage grow step by step, driven by the dynamo benchmarks: HF, TIMM, and TorchBench. High-priority ops can be supported first; for example, Conv- and GEMM-related ops should have the highest priority. ATen operators like _convolution, linear, and addmm can be implemented first. The implementations can be registered under the XPU dispatch key for Intel GPUs, which already exists in PyTorch. With these operators, most benchmark models can run on Intel GPUs.
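The dispatch-key mechanism mentioned above can be illustrated with a small, self-contained sketch. This is a toy Python model, not the real ATen dispatcher (which is implemented in C++ and registered via TORCH_LIBRARY_IMPL); every function and key name below is a stand-in for illustration only.

```python
# Toy model of the dispatch-key idea: one operator name, multiple kernels,
# selected by the device's dispatch key (e.g. CPU vs. XPU for Intel GPUs).

_dispatch_table = {}  # (op_name, dispatch_key) -> kernel function

def register_kernel(op_name, dispatch_key):
    """Register a kernel implementation for an operator under a dispatch key."""
    def decorator(fn):
        _dispatch_table[(op_name, dispatch_key)] = fn
        return fn
    return decorator

def dispatch(op_name, dispatch_key, *args):
    """Route an operator call to the kernel registered for that device."""
    return _dispatch_table[(op_name, dispatch_key)](*args)

@register_kernel("addmm", "CPU")
def addmm_cpu(bias, mat1, mat2):
    return "addmm: CPU oneDNN kernel"

@register_kernel("addmm", "XPU")
def addmm_xpu(bias, mat1, mat2):
    return "addmm: oneDNN GPU kernel via XPU dispatch key"
```

The point is that callers keep using one operator name (addmm); registering GPU kernels under the existing XPU key routes tensors on Intel GPUs to the oneDNN implementations without new user-facing APIs.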

Then, optimized fusion operators like convolution_pointwise, linear_pointwise, convolution_pointwise_binary, etc. can be supported. These operators share the same schemas as the ones already registered under the MkldnnCPU dispatch key, and they are registered as implementations under the XPU dispatch key. The entry points for this series of operators are convolution_pointwise and linear_pointwise. Fusion ops are beneficial for delivering better performance in graph mode, such as with torch.compile.
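To make the benefit of pointwise fusion concrete, here is a toy, self-contained 1-D sketch in plain Python (no torch). conv1d_pointwise is a hypothetical stand-in for the real convolution_pointwise op, and the naive conv1d/relu pair stands in for the unfused eager path; the real ops operate on tensors via oneDNN post-ops.

```python
# Unfused path: convolution produces the full output, then a second pass
# over memory applies the pointwise activation.

def conv1d(xs, w):
    """Naive 'valid' 1-D convolution (cross-correlation) of xs with kernel w."""
    k = len(w)
    return [sum(xs[i + j] * w[j] for j in range(k))
            for i in range(len(xs) - k + 1)]

def relu(ys):
    """Separate pointwise pass over the conv output."""
    return [max(0.0, y) for y in ys]

def conv1d_pointwise(xs, w, attr="relu"):
    """Fused sketch: the pointwise epilogue (here ReLU) is applied as each
    output element is produced, avoiding the extra pass over memory."""
    k = len(w)
    out = []
    for i in range(len(xs) - k + 1):
        acc = sum(xs[i + j] * w[j] for j in range(k))
        out.append(max(0.0, acc) if attr == "relu" else acc)
    return out
```

Both paths compute the same result; the fused kernel saves a round trip through memory for the intermediate, which is where the graph-mode speedup comes from.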

Finally, other operators can be supported gradually (not targeting PyTorch 2.5), for example baddbmm and deconvolution. We can complete the coverage over time based on user requests.

| Description | Details | Priority |
| --- | --- | --- |
| ATen ops | _convolution, convolution_backward, linear, addmm, mm, bmm | P0 |
| Fusion ops | convolution_pointwise, convolution_pointwise_binary, convolution_pointwise_binary_, linear_pointwise, linear_pointwise_binary | P1 |
| Less frequently used ops | addbmm, baddbmm, deconvolution, ... | P2 |

We propose to put the GPU code in the same directory as the CPU code, as the figure below shows. The directories are:

  • CPU: aten/src/ATen/native/mkldnn/cpu
  • GPU: aten/src/ATen/native/mkldnn/xpu

(Figure: file_structure)

Source files (Conv.cpp, Linear.cpp) in this directory implement the operators above, along with their registration. A subdirectory named detail contains glue code that calls oneDNN primitives, including the oneDNN GPU runtime abstraction, convolution, matmul, etc.

  • Glue code: aten/src/ATen/native/mkldnn/xpu/detail

We currently focus on integrating the oneDNN GPU primitives, and no modifications will be introduced to the CPU-side code. In the long term, CPU and GPU may share a unified method for calling oneDNN primitives: both would call the primitives directly, the detail directory would be shared by CPU and GPU, and the ATen operator implementations would live in unified source files. The file structure would then be more concise:

  • CPU&GPU: aten/src/ATen/native/mkldnn/Conv.cpp, aten/src/ATen/native/mkldnn/Linear.cpp, etc.
  • Glue code: aten/src/ATen/native/mkldnn/detail

As the figure below shows:

(Figure: file_structure_future)

PR Plan

  • Convolution operators (convolution, convolution_backward, ...)
  • GEMM operators (addmm, linear, matmul, ...)
  • Conv & GEMM fusion support (convolution_pointwise, linear_pointwise)
  • Other operators (mv, baddbmm, ...)

cc @frank-wei @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10
