[RFC] Add XPU support into PyTorch CI/CD #1

@chuanqi129

Motivation

XPU is being upstreamed into PyTorch as a new device. To support upstreaming the device and its features, we need to enable XPU tests in PyTorch CI/CD so that they gate the quality of development pull requests.

Design philosophy

For the new CI/CD work, we want to reuse the existing PyTorch CI/CD infrastructure as much as possible and follow the workflows already used for other devices. XPU-related builds and tests will be dispatched to self-hosted runners on Intel Developer Cloud (IDC) instances, which have XPU hardware.

  • Docker-based tests
  • A unified entrance workflow for XPU tests
  • Multi-stage tests
    • Base Docker image build on AWS instance runners
    • Wheel build with base Docker image on IDC instance runners
    • Tests can be sharded and dispatched on IDC instance runners

Detail

Entrance workflow

XPU tests will be triggered either by a PR carrying a specific label or periodically by a scheduled timer. We plan to add a new workflow, .github/workflows/xpu.yml:

name: xpu

on:
  push:
    branches:
      - main
      - release/*
    tags:
      - ciflow/xpu/*
  workflow_dispatch:
  schedule:
    - cron: 29 8 * * *  # about 1:29am PDT

concurrency:
  group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}-${{ github.event_name == 'schedule' }}
  cancel-in-progress: true

jobs:
  linux-focal-xpu-py3_8-build:
    name: linux-focal-xpu-py3.8
    uses: ./.github/workflows/_linux-build.yml
    with:
      build-environment: linux-focal-xpu-py3.8
      docker-image-name: pytorch-linux-focal-xpu-n-py3
      sync-tag: xpu-build
      test-matrix: |
        { include: [
          { config: "default", shard: 1, num_shards: 2, runner: "linux.idc.xpu" },
          { config: "default", shard: 2, num_shards: 2, runner: "linux.idc.xpu" },
        ]}

  linux-focal-xpu-py3_8-test:
    name: linux-focal-xpu-py3.8
    uses: ./.github/workflows/_xpu-test.yml
    needs: linux-focal-xpu-py3_8-build
    with:
      build-environment: linux-focal-xpu-py3.8
      docker-image: ${{ needs.linux-focal-xpu-py3_8-build.outputs.docker-image }}
      test-matrix: ${{ needs.linux-focal-xpu-py3_8-build.outputs.test-matrix }}

Build & Test

We will add an XPU-specific base image Dockerfile, .ci/docker/ubuntu-xpu/Dockerfile, and an XPU section in the image build script .ci/docker/build.sh, so that the XPU base image can be built on linux.2xlarge runners.

Unlike other devices, the PyTorch wheel build for XPU currently needs to be dispatched to IDC instance runners. We will reuse .github/workflows/_linux-build.yml with an XPU-specific build-environment and add an XPU section to the PyTorch build script .ci/pytorch/build.sh.
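One way a shared build workflow could be pointed at an IDC runner is through a workflow input that selects the runner. The fragment below is only a sketch of that idea: the `runner` input and its default are assumptions for illustration, not something the existing _linux-build.yml is confirmed to expose.

```yaml
# Hypothetical excerpt of a reusable build workflow.
# The `runner` input is an assumption made for this sketch.
on:
  workflow_call:
    inputs:
      build-environment:
        required: true
        type: string
      runner:
        required: false
        type: string
        default: linux.2xlarge  # assumed default for the other devices

jobs:
  build:
    # An XPU caller could pass runner: linux.idc.xpu to build on IDC.
    runs-on: ${{ inputs.runner }}
    env:
      BUILD_ENVIRONMENT: ${{ inputs.build-environment }}
    steps:
      - name: Checkout PyTorch
        uses: actions/checkout@v4

      - name: Build
        run: bash .ci/pytorch/build.sh
```

With such an input, the XPU build job would keep using the shared workflow and differ from other devices only in the values it passes.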

For testing, we will add a new XPU test workflow, .github/workflows/_xpu-test.yml, along with the necessary GitHub Actions (GHA) such as setup-xpu and teardown-xpu. We will also add an XPU section to the test script .ci/pytorch/test.sh and a set of utility scripts for XPU.
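As a rough illustration of how the test workflow could fan out over shards, here is a minimal sketch of a reusable workflow. Several details are assumptions rather than settled design: the action paths under .github/actions/, the environment variable names, the /dev/dri device mapping, and the test-matrix input arriving as a JSON string. Fetching and installing the wheel is left out because it depends on the open question about artifact sharing.

```yaml
# Hypothetical sketch of .github/workflows/_xpu-test.yml; not the final workflow.
name: xpu-test

on:
  workflow_call:
    inputs:
      build-environment:
        required: true
        type: string
      docker-image:
        required: true
        type: string
      test-matrix:
        required: true
        type: string  # assumed to be a JSON string with an "include" list

jobs:
  test:
    strategy:
      fail-fast: false
      # One job per matrix entry; each entry names its shard and runner.
      matrix: ${{ fromJSON(inputs.test-matrix) }}
    runs-on: ${{ matrix.runner }}
    env:
      # Variable names are assumptions about what the test script reads.
      BUILD_ENVIRONMENT: ${{ inputs.build-environment }}
      TEST_CONFIG: ${{ matrix.config }}
      SHARD_NUMBER: ${{ matrix.shard }}
      NUM_TEST_SHARDS: ${{ matrix.num_shards }}
    steps:
      - name: Checkout PyTorch
        uses: actions/checkout@v4

      # Assumed location of the setup-xpu action.
      - name: Setup XPU
        uses: ./.github/actions/setup-xpu

      # Wheel download and install are omitted from this sketch.
      - name: Test
        run: |
          docker run --rm \
            --device=/dev/dri \
            -e BUILD_ENVIRONMENT -e TEST_CONFIG -e SHARD_NUMBER -e NUM_TEST_SHARDS \
            -v "${GITHUB_WORKSPACE}:/workspace" -w /workspace \
            "${{ inputs.docker-image }}" \
            bash .ci/pytorch/test.sh

      # Assumed location of the teardown-xpu action; runs even if tests fail.
      - name: Teardown XPU
        if: always()
        uses: ./.github/actions/teardown-xpu
```

Driving runs-on from the matrix keeps runner selection in the entrance workflow, so changing the shard count or runner label would not require touching the test workflow.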

Other tests, for example Inductor-related ones, are also under consideration and will be implemented with a similar philosophy in the future.

Open

PyTorch CI/CD infrastructure currently relies on the AWS Docker registry and the S3 service to share Docker images, wheels, and test results among workflows and test jobs. If the wheel build and tests are dispatched to IDC instance runners, how can those runners interact with these services?
