Skip to content

Add a CI workflow for tests that requires pytorch CUDA.#7140

Merged
vanbasten23 merged 14 commits intomasterfrom
xiowei/addCIWorkflow2
May 31, 2024
Merged

Add a CI workflow for tests that requires pytorch CUDA.#7140
vanbasten23 merged 14 commits intomasterfrom
xiowei/addCIWorkflow2

Conversation

@vanbasten23
Copy link
Copy Markdown
Collaborator

@vanbasten23 vanbasten23 commented May 29, 2024

This PR adds a new CI workflow that build pytorch with CUDA enabled from source, then run tests that requires torch CUDA. Today, the tests requiring torch CUDA are skipped.

In detail, this PR add 2 more jobs to .github/workflows/build_and_test.yml

  1. build-torch-with-cuda: build pytorch with cuda
  2. test-cuda-with-pytorch-cuda-enabled: only run the tests that requires pytorch CUDA.

@vanbasten23 vanbasten23 marked this pull request as ready for review May 30, 2024 14:57
@vanbasten23 vanbasten23 changed the title [DO NOT REVIEW YET]add workflow for test that requires pytorch CUDA. Add a CI workflow for tests that requires pytorch CUDA. May 30, 2024
@vanbasten23
Copy link
Copy Markdown
Collaborator Author

The original comment is at #7073

Copy link
Copy Markdown
Collaborator

@will-cromar will-cromar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM. Thanks!

Comment thread .github/workflows/_build_torch_with_cuda.yml Outdated
run: |
echo "PATH=$PATH:/usr/local/cuda-12.1/bin" >> $GITHUB_ENV
echo "LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-12.1/lib64" >> $GITHUB_ENV
- name: Check GPU
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need a GPU present for the GPU build

Comment thread .github/workflows/_build_torch_with_cuda.yml Outdated
Comment thread .github/workflows/_build_torch_with_cuda.yml Outdated
Comment thread .github/workflows/_test_requiring_torch_cuda.yml Outdated
echo "Check if CUDA is available for PyTorch..."
python -c "import torch; assert torch.cuda.is_available()"
echo "CUDA is available for PyTorch."
- name: Record PyTorch commit
Copy link
Copy Markdown
Collaborator

@will-cromar will-cromar May 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just be aware, there's a small chance of version skew in this workflow. You could end up with a broken workflow if there's a breaking change between when _bulid_torch_xla and _build_torch_with_cuda launch:

  • _build_torch_xla launches, pulls pytorch, etc
  • breaking change merged to pytorch
  • _build_torch_with_cuda, pulls pytorch, etc

It's up to you if you want to find a way to fix that now or just wait to see if it's an issue in practice. There should really only a window of seconds between when the workflows launch.

The permanent fix would probably be to add a parent workflow step that "records" the PyTorch commit hash before either of the builds run. Maybe something to address for whoever factors out the setup steps.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's great point. I'll see if it's an issue in practice. If so, I'll fix it accordingly.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I saw it in this PR:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/lib/python3.10/site-packages/torch_xla/__init__.py", line 7, in <module>
    import _XLAC
ImportError: /usr/local/lib/python3.10/site-packages/_XLAC.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN3c[101](https://github.com/pytorch/xla/actions/runs/9308352858/job/25626741718?pr=7140#step:11:102)5SmallVectorBaseIjE8grow_podEPvmm

suggesting the torch wheel and torch_xla wheel are not compatible.

So I added a parent job to get the commit and pass it to the reusable workflow build-torch-xla and build-torch-with-cuda according to https://docs.github.com/en/actions/using-workflows/reusing-workflows#using-outputs-from-a-reusable-workflow

Comment thread infra/ansible/roles/build_srcs/tasks/main.yaml Outdated
Comment thread .github/workflows/_test_requiring_torch_cuda.yml
@vanbasten23 vanbasten23 force-pushed the xiowei/addCIWorkflow2 branch from 958796d to 0f4b2f1 Compare May 30, 2024 18:19
@vanbasten23
Copy link
Copy Markdown
Collaborator Author

actually, if I build torch with CUDA enabled on a machine without GPU, it fails with error segfault and OOM (https://gist.github.com/vanbasten23/faba932d31baad28196176ebd5b80948), even if I added the CUDA info to the PATH and LD_LIBRARY_PATH (otherwise, building torch with CUDA fails with https://gist.github.com/vanbasten23/338017e9b182cdf6fba49c75aabd2268)

@vanbasten23 vanbasten23 force-pushed the xiowei/addCIWorkflow2 branch from bd755b4 to 35ee20d Compare May 31, 2024 02:07
@vanbasten23 vanbasten23 force-pushed the xiowei/addCIWorkflow2 branch from 35ee20d to e39da47 Compare May 31, 2024 02:09
Copy link
Copy Markdown
Collaborator

@will-cromar will-cromar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding the _get_torch_commit step! That's much cleaner than the way I was getting the torch commit for _test.yml 😅

run: |
cd pytorch
torch_commit=$(git rev-parse HEAD)
echo "torch_commit=$torch_commit" >> "$GITHUB_OUTPUT"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just learned about the git ls-remote command, which looks perfect here. git ls-remote https://github.com/pytorch/pytorch.git HEAD | awk '{print $1}' should work, and then you don't have to clone.

If this ends up being a single step and we don't re-use it, is it possible to actually put that directly into build_and_test.yaml?

with:
repository: pytorch/pytorch
path: pytorch
ref: ${{ inputs.torch-commit }}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you make the same change to _test.yml? This is way cleaner than what I did:

- name: Record PyTorch commit
run: |
# Don't just pipe output in shell because imports may do extra logging
python -c "
import torch_xla.version
with open('$GITHUB_ENV', 'a') as f:
f.write(f'PYTORCH_COMMIT={torch_xla.version.__torch_gitrev__}\n')
"
- name: Checkout PyTorch Repo
uses: actions/checkout@v4
with:
repository: pytorch/pytorch
path: pytorch
ref: ${{ env.PYTORCH_COMMIT }}

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mind if I do it in a follow up PR?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, works for me.

# note that to build a torch wheel with CUDA enabled, we do not need a GPU runner.
dev-image: us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/development:3.10_cuda_12.1
torch-commit: ${{needs.get-torch-commit.outputs.torch_commit}}
runner: linux.8xlarge.nvidia.gpu
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A non-GPU machine should work here, as long as you have the CUDA docker image.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I tried it on a non-GPU machine but it didn't work. I put the detail here #7140 (comment).

@vanbasten23
Copy link
Copy Markdown
Collaborator Author

Thanks Will for the review!

@vanbasten23 vanbasten23 merged commit 6fadbf5 into master May 31, 2024
@vanbasten23
Copy link
Copy Markdown
Collaborator Author

cc @bhavya01 this should unblock you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants