
[RFC] Intel GPU oneDNN Upstreaming #114848

@ZhiweiYan-96

Description


🚀 The feature, motivation and pitch

Motivation

Intel GPUs can significantly improve workload performance. As described in [RFC] Intel GPU Upstreaming, we intend to leverage Intel's advancements in GPU technology to enhance PyTorch's performance and versatility. The oneDNN library (Intel® oneAPI Deep Neural Network Library) already has mature support for Intel GPUs and is a key component of the Intel GPU software stack.

This RFC proposes introducing oneDNN GPU support in PyTorch, including Conv, GEMM, and other highly optimized kernels. With this support, users can more easily achieve high performance on Intel GPUs. Moreover, Intel GPU support in PyTorch can evolve quickly alongside the development of the community.

Proposal for Library Structure

PyTorch has integrated oneDNN as a git submodule for CPU support. For Intel GPU support, we will reuse the same oneDNN codebase. To minimize the integration effort, we intend to separately build oneDNN as a static library for Intel CPU and GPU, respectively.

Compiling the CPU source code in oneDNN at pytorch/third_party/ideep/mkl-dnn generates a static library, libdnnl.a, which is linked into libtorch_cpu.so. This is the existing behavior in PyTorch; no modification will be introduced on this side.

Compiling the GPU source code in oneDNN also generates a libdnnl.a static library. GPU and CPU share the same oneDNN codebase at pytorch/third_party/ideep/mkl-dnn to minimize the integration effort. The GPU libdnnl.a would be linked into libtorch_xpu.so.
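The dual-build idea can be sketched as a hypothetical build fragment. The build directories and flag combinations below are illustrative assumptions, not PyTorch's actual build scripts; `DNNL_LIBRARY_TYPE` and `DNNL_GPU_RUNTIME` are real oneDNN CMake options.

```shell
# Hypothetical sketch: build oneDNN twice from the same submodule sources.

# CPU-only build -> libdnnl.a to be linked into libtorch_cpu.so
cmake -S third_party/ideep/mkl-dnn -B build/dnnl_cpu \
      -DDNNL_LIBRARY_TYPE=STATIC -DDNNL_GPU_RUNTIME=NONE
cmake --build build/dnnl_cpu

# GPU build with the SYCL runtime -> a separate libdnnl.a for libtorch_xpu.so
cmake -S third_party/ideep/mkl-dnn -B build/dnnl_gpu \
      -DDNNL_LIBRARY_TYPE=STATIC -DDNNL_GPU_RUNTIME=SYCL
cmake --build build/dnnl_gpu
```

Building two static libraries from one codebase keeps the submodule single-sourced while letting each libtorch component link only the runtime it needs.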

Operator Scope

We suggest that the operator coverage grow step by step, driven by the dynamo benchmarks: HF, TIMM, and TorchBench. High-priority ops can be supported first; for example, Conv- and GEMM-related ops should have the highest priority. ATen operators like _convolution, linear, and addmm can be implemented first. The implementations can be registered under the XPU dispatch key for Intel GPUs, which already exists in PyTorch. With these operators, most benchmark models can run on Intel GPUs.
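The dispatch-key mechanism mentioned above can be illustrated with a small, self-contained sketch. This is a toy Python model, not the real ATen dispatcher (which is implemented in C++ and registered via TORCH_LIBRARY_IMPL); every function and key name below is a stand-in for illustration only.

```python
# Toy model of the dispatch-key idea: one operator name, multiple kernels,
# selected by the device's dispatch key (e.g. CPU vs. XPU for Intel GPUs).

_dispatch_table = {}  # (op_name, dispatch_key) -> kernel function

def register_kernel(op_name, dispatch_key):
    """Register a kernel implementation for an operator under a dispatch key."""
    def decorator(fn):
        _dispatch_table[(op_name, dispatch_key)] = fn
        return fn
    return decorator

def dispatch(op_name, dispatch_key, *args):
    """Route an operator call to the kernel registered for that device."""
    return _dispatch_table[(op_name, dispatch_key)](*args)

@register_kernel("addmm", "CPU")
def addmm_cpu(bias, mat1, mat2):
    return "addmm: CPU oneDNN kernel"

@register_kernel("addmm", "XPU")
def addmm_xpu(bias, mat1, mat2):
    return "addmm: oneDNN GPU kernel via XPU dispatch key"
```

The point is that callers keep using one operator name (addmm); registering GPU kernels under the existing XPU key routes tensors on Intel GPUs to the oneDNN implementations without new user-facing APIs.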

Then, optimized fusion operators like convolution_pointwise, linear_pointwise, convolution_pointwise_binary, etc. can be supported. These operators share the same schemas as the ones already registered under the MkldnnCPU dispatch key, and they are registered as implementations under the XPU dispatch key. The entry points for this series of operators are convolution_pointwise and linear_pointwise. Fusion ops are beneficial for delivering better performance in graph mode, such as with torch.compile.
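To make the benefit of pointwise fusion concrete, here is a toy, self-contained 1-D sketch in plain Python (no torch). conv1d_pointwise is a hypothetical stand-in for the real convolution_pointwise op, and the naive conv1d/relu pair stands in for the unfused eager path; the real ops operate on tensors via oneDNN post-ops.

```python
# Unfused path: convolution produces the full output, then a second pass
# over memory applies the pointwise activation.

def conv1d(xs, w):
    """Naive 'valid' 1-D convolution (cross-correlation) of xs with kernel w."""
    k = len(w)
    return [sum(xs[i + j] * w[j] for j in range(k))
            for i in range(len(xs) - k + 1)]

def relu(ys):
    """Separate pointwise pass over the conv output."""
    return [max(0.0, y) for y in ys]

def conv1d_pointwise(xs, w, attr="relu"):
    """Fused sketch: the pointwise epilogue (here ReLU) is applied as each
    output element is produced, avoiding the extra pass over memory."""
    k = len(w)
    out = []
    for i in range(len(xs) - k + 1):
        acc = sum(xs[i + j] * w[j] for j in range(k))
        out.append(max(0.0, acc) if attr == "relu" else acc)
    return out
```

Both paths compute the same result; the fused kernel saves a round trip through memory for the intermediate, which is where the graph-mode speedup comes from.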

Finally, other operators can be supported gradually (not targeting PyTorch 2.5), for example baddbmm and deconvolution. We can complete the coverage over time based on user requests.

| Description | Details | Priority |
| --- | --- | --- |
| ATen ops | _convolution, convolution_backward, linear, addmm, mm, bmm | P0 |
| Fusion ops | convolution_pointwise, convolution_pointwise_binary, convolution_pointwise_binary_, linear_pointwise, linear_pointwise_binary | P1 |
| Less frequently used ops | addbmm, baddbmm, deconvolution, ... | P2 |

We propose to put the GPU code in the same directory as the CPU code, as the figure below shows. The directories are:

  • CPU: aten/src/ATen/native/mkldnn/cpu
  • GPU: aten/src/ATen/native/mkldnn/xpu

(Figure: file_structure)

Source files (Conv.cpp, Linear.cpp) in this directory implement the operators above, along with their registration. A subdirectory named detail contains glue code that calls oneDNN primitives, including the oneDNN GPU runtime abstraction, convolution, matmul, etc.

  • Glue code: aten/src/ATen/native/mkldnn/xpu/detail

We currently focus on integrating the oneDNN GPU primitives, and no modifications will be introduced to the CPU-side code. In the long term, CPU and GPU may share a unified method for calling oneDNN primitives: both would call the primitives directly, the detail directory would be shared by CPU and GPU, and the ATen operator implementations would live in unified source files. The file structure would then be more concise:

  • CPU&GPU: aten/src/ATen/native/mkldnn/Conv.cpp, aten/src/ATen/native/mkldnn/Linear.cpp, etc.
  • Glue code: aten/src/ATen/native/mkldnn/detail

As the figure below shows:

(Figure: file_structure_future)

PR Plan

  • Convolution operators (convolution, convolution_backward, ...)
  • GEMM operators (addmm, linear, matmul, ...)
  • Conv & GEMM fusion support (convolution_pointwise, linear_pointwise)
  • Other operators (mv, baddbmm, ...)

cc @frank-wei @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10
