[Nova] Add GHA Linux CPU Unittests for Torchvision #6759
osalpekar merged 18 commits into pytorch:main
@osalpekar I understand this PR is still WIP. Shall we mark as draft until you are ready? Feel free to ping us when you are done to help you review. :)
@datumbox - Sure thing, marking this as a draft :)
On a linux.4xlarge instance the job runs to completion, but a handful of tests fail because they cannot allocate memory to fork a process: https://github.com/pytorch/vision/actions/runs/3244683126/jobs/5321174303.
.github/workflows/test-linux-cpu.yml (outdated)
  if: ${{ (github.event_name == 'pull_request' && startsWith(github.base_ref, 'release')) || startsWith(github.ref, 'refs/heads/release') }}
  run: |
    echo "CHANNEL=test" >> "$GITHUB_ENV"
- name: Install TorchVision
Nit: you might want to try out https://github.com/pytorch/test-infra/tree/main/.github/actions/setup-miniconda here. I think it would hide all the complex logic, like ENV_NAME.
Seeing errors like these when using the setup-miniconda action: https://github.com/pytorch/vision/actions/runs/3252067256/jobs/5337858164. Perhaps this is because conda-build is installed in the same conda env? Reverting to the local conda env for now.
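As a reading aid for the `if:` condition in the workflow snippet under review, here is a minimal Python sketch of the same predicate. The function name `is_release_build` and its parameters are hypothetical and only mirror `github.event_name`, `github.base_ref`, and `github.ref`. GitHub's expression comparisons and `startsWith` ignore case, so the sketch lowercases its inputs first. It is not part of the workflow itself.

```python
def is_release_build(event_name: str, base_ref: str, ref: str) -> bool:
    """Hypothetical mirror of the workflow's `if:` expression.

    True when the run is a pull request targeting a release branch,
    or a run on a release branch itself.
    """
    # Pull request whose base branch name starts with "release".
    is_release_pr = (
        event_name.lower() == "pull_request"
        and base_ref.lower().startswith("release")
    )
    # Run whose ref is a branch under refs/heads/release.
    is_release_branch = ref.lower().startswith("refs/heads/release")
    return is_release_pr or is_release_branch
```

When the predicate holds, the step in the snippet writes `CHANNEL=test` to `$GITHUB_ENV`; what the channel is otherwise is not shown in this thread.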
lol, from what I see in https://circleci.com/docs/configuration-reference/, CircleCI 2xlarge+ is the largest Docker tier that CircleCI has, and it's not the same as our self-hosted AWS linux.2xlarge. CircleCI 2xlarge+ has 20 vCPUs and 40GB of memory, which puts it somewhere between AWS c5.4xlarge and c5.12xlarge (https://aws.amazon.com/ec2/instance-types/c5). This explains why 4xlarge still fails, given that it has only 32GB of memory.
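The sizing argument can be checked with a few lines of Python. This is a rough sketch, not a capacity model: the CircleCI figures are the ones quoted in the comment, the c5 figures are the vCPU and memory values AWS publishes for those instance types (in GiB, treated here as comparable to GB), and the `Machine`, `covers`, and `smallest_covering` names are made up for illustration. It also assumes the self-hosted 4xlarge and 12xlarge runners are c5 instances of the matching size, as the comment implies.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Machine:
    name: str
    vcpus: int
    memory_gb: int


# Figures quoted in the comment for the CircleCI resource class.
CIRCLECI_2XLARGE_PLUS = Machine("circleci 2xlarge+", vcpus=20, memory_gb=40)

# Published AWS c5 specs for the two sizes named in the comment.
AWS_C5 = [
    Machine("c5.4xlarge", vcpus=16, memory_gb=32),
    Machine("c5.12xlarge", vcpus=48, memory_gb=96),
]


def covers(candidate: Machine, target: Machine) -> bool:
    """True if the candidate has at least the target's vCPUs and memory."""
    return (
        candidate.vcpus >= target.vcpus
        and candidate.memory_gb >= target.memory_gb
    )


def smallest_covering(
    candidates: Iterable[Machine], target: Machine
) -> Optional[Machine]:
    """Pick the smallest candidate that covers the target, or None."""
    fitting = [m for m in candidates if covers(m, target)]
    return min(fitting, key=lambda m: (m.memory_gb, m.vcpus), default=None)


if __name__ == "__main__":
    choice = smallest_covering(AWS_C5, CIRCLECI_2XLARGE_PLUS)
    print(choice.name if choice else "no candidate is large enough")
```

Of the two sizes compared, only c5.12xlarge matches or exceeds the CircleCI machine on both axes, which is consistent with the 4xlarge run hitting memory limits.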
Thanks @huydhn! 12xlarge does the trick: https://github.com/pytorch/vision/actions/runs/3251225589/jobs/5335938671. Next I will clean up, use the conda-setup job, and use the build matrix to cover all the configs we want.
huydhn left a comment:
The workflow looks good to me 💯 Let's get a stamp from vision folks.
@huydhn @osalpekar I was trying to setup in a similar way as here |
Add my thoughts on #6665 on why |
Hey @osalpekar! You merged this PR, but no labels were added. The list of valid labels is available at https://github.com/pytorch/vision/blob/main/.github/process_commit.py
Summary:
* [Nova][WIP] Add Linux CPU Unittests for Torchvision
* use conda-builder image since conda installation is needed
* install torch dep with conda instead
* use circleCI command to run tests
* larger instance to avoid OOM issues
* proper syntax for self-hosted runners
* 4xlarge instance
* 8xlarge
* 12xlarge
* use setup-miniconda job
* add back PATH change to help setup py detect conda
* run conda shell script
* install other deps up front
* git config and undo path change
* revert to local conda install
* conda-builder image
* support for whole python version matrix
* clean up the conda env once we are done with the job
Reviewed By: YosuaMichael
Differential Revision: D40588169
fbshipit-source-id: 515b12daa84d1707f6b700782fade13f8532ff05
Adding a GitHub Action to run Linux CPU Unittests.
For context, it seems like standard/2xlarge instances cause this job to OOM. 4xlarge runs most of the tests, but a few still fail due to OOM. 8xlarge instances seem to have much longer queueing times. CircleCI simply requests 2xlarge+ in its resource config: https://github.com/pytorch/vision/blob/main/.circleci/config.yml#L739