Add a CI workflow for tests that requires pytorch CUDA. #7140
vanbasten23 merged 14 commits into master
Conversation
The original comment is at #7073
will-cromar
left a comment
Overall LGTM. Thanks!
run: |
  echo "PATH=$PATH:/usr/local/cuda-12.1/bin" >> $GITHUB_ENV
  echo "LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-12.1/lib64" >> $GITHUB_ENV
- name: Check GPU
You don't need a GPU present for the GPU build
  echo "Check if CUDA is available for PyTorch..."
  python -c "import torch; assert torch.cuda.is_available()"
  echo "CUDA is available for PyTorch."
- name: Record PyTorch commit
Just be aware, there's a small chance of version skew in this workflow. You could end up with a broken workflow if there's a breaking change to pytorch between when _build_torch_xla and _build_torch_with_cuda launch:
- _build_torch_xla launches, pulls pytorch, etc.
- breaking change merged to pytorch
- _build_torch_with_cuda launches, pulls pytorch, etc.
It's up to you if you want to find a way to fix that now or just wait to see if it's an issue in practice. There should really only be a window of seconds between when the workflows launch.
The permanent fix would probably be to add a parent workflow step that "records" the PyTorch commit hash before either of the builds runs. Maybe something to address for whoever factors out the setup steps.
It's a great point. I'll see if it's an issue in practice. If so, I'll fix it accordingly.
Actually I saw it in this PR:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/lib/python3.10/site-packages/torch_xla/__init__.py", line 7, in <module>
    import _XLAC
ImportError: /usr/local/lib/python3.10/site-packages/_XLAC.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN3c1015SmallVectorBaseIjE8grow_podEPvmm

(log: https://github.com/pytorch/xla/actions/runs/9308352858/job/25626741718?pr=7140#step:11:102)

suggesting the torch wheel and torch_xla wheel are not compatible.
So I added a parent job to get the commit and pass it to the reusable workflows build-torch-xla and build-torch-with-cuda, following https://docs.github.com/en/actions/using-workflows/reusing-workflows#using-outputs-from-a-reusable-workflow
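A minimal sketch of that pattern, assuming a `get-torch-commit` job name and workflow file paths that may not match the PR's actual ones. The key point is that both reusable builds consume the same recorded commit, closing the skew window:

```yaml
# Sketch only: job names and workflow file paths are assumptions
# based on the discussion above, not the PR's exact code.
jobs:
  get-torch-commit:
    runs-on: ubuntu-latest
    outputs:
      torch_commit: ${{ steps.commit.outputs.torch_commit }}
    steps:
      - uses: actions/checkout@v4
        with:
          repository: pytorch/pytorch
          path: pytorch
      - name: Record PyTorch commit
        id: commit
        run: |
          cd pytorch
          torch_commit=$(git rev-parse HEAD)
          echo "torch_commit=$torch_commit" >> "$GITHUB_OUTPUT"

  # Both builds pin the same commit, so a pytorch change merged
  # between their launches can no longer cause version skew.
  build-torch-xla:
    needs: get-torch-commit
    uses: ./.github/workflows/_build_torch_xla.yml
    with:
      torch-commit: ${{ needs.get-torch-commit.outputs.torch_commit }}

  build-torch-with-cuda:
    needs: get-torch-commit
    uses: ./.github/workflows/_build_torch_with_cuda.yml
    with:
      torch-commit: ${{ needs.get-torch-commit.outputs.torch_commit }}
```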
Actually, if I build torch with CUDA enabled on a machine without a GPU, it fails with segfaults and OOMs (https://gist.github.com/vanbasten23/faba932d31baad28196176ebd5b80948), even after I added the CUDA directories to PATH and LD_LIBRARY_PATH (without that, building torch with CUDA fails with https://gist.github.com/vanbasten23/338017e9b182cdf6fba49c75aabd2268).
will-cromar
left a comment
Thanks for adding the _get_torch_commit step! That's much cleaner than the way I was getting the torch commit for _test.yml 😅
run: |
  cd pytorch
  torch_commit=$(git rev-parse HEAD)
  echo "torch_commit=$torch_commit" >> "$GITHUB_OUTPUT"
I just learned about the git ls-remote command, which looks perfect here. git ls-remote https://github.com/pytorch/pytorch.git HEAD | awk '{print $1}' should work, and then you don't have to clone.
If this ends up being a single step and we don't re-use it, is it possible to actually put that directly into build_and_test.yaml?
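A quick way to sanity-check that suggestion locally, using a throwaway repo as a stand-in for the pytorch remote so nothing depends on network access (the repo path here is a placeholder, not anything from the PR):

```shell
# git ls-remote resolves a ref on a remote without cloning it.
# Stand-in repo; in the workflow this would be the real URL:
#   torch_commit=$(git ls-remote https://github.com/pytorch/pytorch.git HEAD | awk '{print $1}')
repo=$(mktemp -d)
git init -q "$repo"
git -C "$repo" -c user.email=ci@example.com -c user.name=ci \
  commit -q --allow-empty -m init

# ls-remote prints "<sha>\tHEAD"; awk keeps only the sha.
torch_commit=$(git ls-remote "$repo" HEAD | awk '{print $1}')
echo "torch_commit=$torch_commit"
```

No checkout step is needed, which is exactly why it fits a tiny "record the commit" parent job.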
with:
  repository: pytorch/pytorch
  path: pytorch
  ref: ${{ inputs.torch-commit }}
Can you make the same change to _test.yml? This is way cleaner than what I did:
(xla/.github/workflows/_test.yml, lines 131 to 144 in 8471826)
Do you mind if I do it in a follow up PR?
# note that to build a torch wheel with CUDA enabled, we do not need a GPU runner.
dev-image: us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/development:3.10_cuda_12.1
torch-commit: ${{ needs.get-torch-commit.outputs.torch_commit }}
runner: linux.8xlarge.nvidia.gpu
A non-GPU machine should work here, as long as you have the CUDA docker image.
Yeah, I tried it on a non-GPU machine but it didn't work. I put the details here: #7140 (comment).
Thanks Will for the review!
cc @bhavya01 this should unblock you.
This PR adds a new CI workflow that builds PyTorch with CUDA enabled from source, then runs the tests that require torch CUDA. Today, the tests requiring torch CUDA are skipped.
In detail, this PR adds 2 more jobs to .github/workflows/build_and_test.yml.