Add a CI workflow for tests that requires pytorch CUDA. #7140
vanbasten23 merged 14 commits into master
Conversation
The original comment is at #7073
will-cromar
left a comment
Overall LGTM. Thanks!
run: |
  echo "PATH=$PATH:/usr/local/cuda-12.1/bin" >> $GITHUB_ENV
  echo "LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-12.1/lib64" >> $GITHUB_ENV
- name: Check GPU
You don't need a GPU present for the GPU build
  echo "Check if CUDA is available for PyTorch..."
  python -c "import torch; assert torch.cuda.is_available()"
  echo "CUDA is available for PyTorch."
- name: Record PyTorch commit
Just be aware, there's a small chance of version skew in this workflow. You could end up with a broken workflow if there's a breaking change to pytorch between when _build_torch_xla and _build_torch_with_cuda launch:
- _build_torch_xla launches, pulls pytorch, etc.
- breaking change merged to pytorch
- _build_torch_with_cuda launches, pulls pytorch, etc.
It's up to you if you want to find a way to fix that now or just wait to see if it's an issue in practice. There should really only be a window of seconds between when the workflows launch.
The permanent fix would probably be to add a parent workflow step that "records" the PyTorch commit hash before either of the builds runs. Maybe something to address for whoever factors out the setup steps.
It's a great point. I'll see if it's an issue in practice. If so, I'll fix it accordingly.
Actually I saw it in this PR:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/lib/python3.10/site-packages/torch_xla/__init__.py", line 7, in <module>
    import _XLAC
ImportError: /usr/local/lib/python3.10/site-packages/_XLAC.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN3c1015SmallVectorBaseIjE8grow_podEPvmm

(log: https://github.com/pytorch/xla/actions/runs/9308352858/job/25626741718?pr=7140#step:11:102)

suggesting the torch wheel and torch_xla wheel are not compatible.
So I added a parent job to get the commit and pass it to the reusable workflows build-torch-xla and build-torch-with-cuda, following https://docs.github.com/en/actions/using-workflows/reusing-workflows#using-outputs-from-a-reusable-workflow
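A minimal sketch of that pattern, assuming a `get-torch-commit` job name and workflow file paths that may not match the PR's actual ones. The key point is that both reusable builds consume the same recorded commit, closing the skew window:

```yaml
# Sketch only: job names and workflow file paths are assumptions
# based on the discussion above, not the PR's exact code.
jobs:
  get-torch-commit:
    runs-on: ubuntu-latest
    outputs:
      torch_commit: ${{ steps.commit.outputs.torch_commit }}
    steps:
      - uses: actions/checkout@v4
        with:
          repository: pytorch/pytorch
          path: pytorch
      - name: Record PyTorch commit
        id: commit
        run: |
          cd pytorch
          torch_commit=$(git rev-parse HEAD)
          echo "torch_commit=$torch_commit" >> "$GITHUB_OUTPUT"

  # Both builds pin the same commit, so a pytorch change merged
  # between their launches can no longer cause version skew.
  build-torch-xla:
    needs: get-torch-commit
    uses: ./.github/workflows/_build_torch_xla.yml
    with:
      torch-commit: ${{ needs.get-torch-commit.outputs.torch_commit }}

  build-torch-with-cuda:
    needs: get-torch-commit
    uses: ./.github/workflows/_build_torch_with_cuda.yml
    with:
      torch-commit: ${{ needs.get-torch-commit.outputs.torch_commit }}
```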
Actually, if I build torch with CUDA enabled on a machine without a GPU, it fails with segfaults and OOMs (https://gist.github.com/vanbasten23/faba932d31baad28196176ebd5b80948), even after I added the CUDA directories to PATH and LD_LIBRARY_PATH (without that, building torch with CUDA fails with https://gist.github.com/vanbasten23/338017e9b182cdf6fba49c75aabd2268).
will-cromar
left a comment
Thanks for adding the _get_torch_commit step! That's much cleaner than the way I was getting the torch commit for _test.yml 😅
run: |
  cd pytorch
  torch_commit=$(git rev-parse HEAD)
  echo "torch_commit=$torch_commit" >> "$GITHUB_OUTPUT"
I just learned about the git ls-remote command, which looks perfect here. git ls-remote https://github.com/pytorch/pytorch.git HEAD | awk '{print $1}' should work, and then you don't have to clone.
If this ends up being a single step and we don't re-use it, is it possible to actually put that directly into build_and_test.yaml?
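A quick way to sanity-check that suggestion locally, using a throwaway repo as a stand-in for the pytorch remote so nothing depends on network access (the repo path here is a placeholder, not anything from the PR):

```shell
# git ls-remote resolves a ref on a remote without cloning it.
# Stand-in repo; in the workflow this would be the real URL:
#   torch_commit=$(git ls-remote https://github.com/pytorch/pytorch.git HEAD | awk '{print $1}')
repo=$(mktemp -d)
git init -q "$repo"
git -C "$repo" -c user.email=ci@example.com -c user.name=ci \
  commit -q --allow-empty -m init

# ls-remote prints "<sha>\tHEAD"; awk keeps only the sha.
torch_commit=$(git ls-remote "$repo" HEAD | awk '{print $1}')
echo "torch_commit=$torch_commit"
```

No checkout step is needed, which is exactly why it fits a tiny "record the commit" parent job.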
with:
  repository: pytorch/pytorch
  path: pytorch
  ref: ${{ inputs.torch-commit }}
Can you make the same change to _test.yml? This is way cleaner than what I did:
(xla/.github/workflows/_test.yml, lines 131 to 144 in 8471826)
Do you mind if I do it in a follow up PR?
# note that to build a torch wheel with CUDA enabled, we do not need a GPU runner.
dev-image: us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/development:3.10_cuda_12.1
torch-commit: ${{ needs.get-torch-commit.outputs.torch_commit }}
runner: linux.8xlarge.nvidia.gpu
A non-GPU machine should work here, as long as you have the CUDA docker image.
Yeah, I tried it on a non-GPU machine but it didn't work. I put the details here: #7140 (comment).
Thanks Will for the review!
cc @bhavya01 this should unblock you.
This PR adds a new CI workflow that builds PyTorch with CUDA enabled from source, then runs the tests that require torch CUDA. Today, the tests requiring torch CUDA are skipped.
In detail, this PR adds 2 more jobs to .github/workflows/build_and_test.yml.