[Nova] Add GHA Linux CPU Unittests for Torchvision by osalpekar · Pull Request #6759 · pytorch/vision

osalpekar · 2022-10-12T17:54:36Z

Adding a GitHub Action to run Linux CPU Unittests.

For context, it seems that standard/2xlarge instances cause this job to OOM. 4xlarge runs most of the tests, but a few fail due to OOM as well. 8xlarge instances seem to have much longer queueing times. CircleCI simply requests 2xlarge+ in its resource config: https://github.com/pytorch/vision/blob/main/.circleci/config.yml#L739

datumbox · 2022-10-13T11:42:14Z

@osalpekar I understand this PR is still WIP. Shall we mark as draft until you are ready? Feel free to ping us when you are done to help you review. :)

osalpekar · 2022-10-13T15:29:09Z

> @osalpekar I understand this PR is still WIP. Shall we mark as draft until you are ready? Feel free to ping us when you are done to help you review. :)

@datumbox - Sure thing, marking this as a draft :)

osalpekar · 2022-10-13T19:17:12Z

On a Linux.4xlarge instance the job runs to completion, but a handful of tests fail because memory cannot be allocated to fork a new process: https://github.com/pytorch/vision/actions/runs/3244683126/jobs/5321174303.

huydhn · 2022-10-14T01:09:27Z

.github/workflows/test-linux-cpu.yml

+        if: ${{ (github.event_name == 'pull_request' && startsWith(github.base_ref, 'release')) || startsWith(github.ref, 'refs/heads/release') }}
+        run: |
+          echo "CHANNEL=test" >> "$GITHUB_ENV"
+      - name: Install TorchVision


Nit: you might want to try out https://github.com/pytorch/test-infra/tree/main/.github/actions/setup-miniconda here. I think it would hide all the complex logic, like ENV_NAME.

osalpekar · 2022-10-14

I'm seeing errors like these when using the setup-miniconda action: https://github.com/pytorch/vision/actions/runs/3252067256/jobs/5337858164. Perhaps this is because conda-build is installed in the same conda env? Reverting to the local conda env for now.
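For readers following along, here is a minimal sketch of what a job-local conda env with cleanup could look like. It is not the workflow from this PR: the runner label, the `pytorch/conda-builder:cpu` tag, the Python version, and the way `ENV_NAME` is derived are all assumptions made for illustration.

```yaml
# Sketch only: names and values below are illustrative assumptions.
name: Local conda env sketch
on: [pull_request]
jobs:
  tests:
    runs-on: linux.12xlarge              # assumed self-hosted runner label
    container:
      image: pytorch/conda-builder:cpu   # assumed image tag
    steps:
      - uses: actions/checkout@v3
      - name: Create a job-local conda env
        run: |
          # ENV_NAME is a hypothetical per-run prefix path, exported for later steps
          ENV_NAME="${RUNNER_TEMP}/conda-env-${GITHUB_RUN_ID}"
          echo "ENV_NAME=${ENV_NAME}" >> "$GITHUB_ENV"
          conda create --yes --prefix "${ENV_NAME}" python=3.8
      - name: Run a command inside the env
        run: conda run -p "${ENV_NAME}" python --version
      - name: Clean up the conda env
        # always() runs the cleanup even when an earlier step failed
        if: always()
        run: conda env remove --yes -p "${ENV_NAME}"
```

Using a prefix path rather than a named env keeps the env out of any shared conda installation on a self-hosted runner, which is one way to avoid leftovers between jobs.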

huydhn · 2022-10-14T01:17:21Z

lol, from what I see at https://circleci.com/docs/configuration-reference/, CircleCI 2xlarge+ is the largest tier that CircleCI has for Docker, and it's not the same as our self-hosted AWS linux.2xlarge.

CircleCI 2xlarge+ has 20 vCPUs and 40GB of memory, which is somewhere in between AWS c5.4xlarge and c5.12xlarge (https://aws.amazon.com/ec2/instance-types/c5). This explains why 4xlarge still fails, given that it has only 32GB of memory.
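To make the sizing comparison concrete, here is a hedged sketch of a job that requests a larger runner and prints what it actually got. It assumes the self-hosted `linux.<size>` labels map to AWS c5 instances of the same size, as the comment above implies; the workflow, job, and step names are illustrative.

```yaml
# Sketch only: the label-to-instance mapping is an assumption.
name: Runner sizing sketch
on: [pull_request]
jobs:
  tests:
    # c5.4xlarge:        16 vCPU / 32 GiB  (some tests hit OOM, per this thread)
    # CircleCI 2xlarge+: 20 vCPU / 40 GB
    # c5.12xlarge:       48 vCPU / 96 GiB
    runs-on: linux.12xlarge
    steps:
      - name: Report available resources
        run: |
          nproc      # number of CPUs visible to the job
          free -g    # total and available memory in GiB
```

Printing the resources at the start of a job makes it easier to tell a memory limit apart from a flaky test when a run fails.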

osalpekar · 2022-10-14T16:14:30Z

> lol, from what I see at https://circleci.com/docs/configuration-reference/, CircleCI 2xlarge+ is the largest tier that CircleCI has for Docker, and it's not the same as our self-hosted AWS linux.2xlarge.
>
> CircleCI 2xlarge+ has 20 vCPUs and 40GB of memory, which is somewhere in between AWS c5.4xlarge and c5.12xlarge (https://aws.amazon.com/ec2/instance-types/c5). This explains why 4xlarge still fails, given that it has only 32GB of memory.

Thanks @huydhn! 12xlarge does the trick: https://github.com/pytorch/vision/actions/runs/3251225589/jobs/5335938671. Next, I will clean up, use the conda-setup job, and use the build matrix to cover all the configs we want.
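As an illustration of the build-matrix plan, a sketch along these lines would fan the job out over Python versions. The version list, the `pytorch/conda-builder:cpu` tag, and the `PYTHON_VERSION` variable name are assumptions, not values taken from the merged workflow.

```yaml
# Sketch only: matrix values and names are illustrative assumptions.
name: Python version matrix sketch
on:
  pull_request:
  push:
    branches:
      - main
jobs:
  tests:
    strategy:
      fail-fast: false    # let the other versions finish if one fails
      matrix:
        python_version: ["3.7", "3.8", "3.9", "3.10"]   # assumed version list
    runs-on: linux.12xlarge
    container:
      image: pytorch/conda-builder:cpu                  # assumed image tag
    env:
      PYTHON_VERSION: ${{ matrix.python_version }}
    steps:
      - uses: actions/checkout@v3
      - name: Show the selected Python version
        run: echo "Testing with Python ${PYTHON_VERSION}"
```

Each matrix entry becomes its own job on its own runner, so a large instance type multiplies queueing pressure by the number of versions.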

huydhn

The workflow looks good to me 💯 Let's get a stamp from vision folks.

datumbox

LGTM, thanks!

vfdev-5 · 2022-10-17T08:20:25Z

@huydhn @osalpekar I was trying to set up the pytorch/conda-builder:cuda116 container for CUDA tests in a similar way (#6665). But somehow nvidia-smi, which is in the container, is not visible in the CI step. Could you please take a look and help enable the CUDA tests? Thanks

huydhn · 2022-10-17T17:07:11Z

> @huydhn @osalpekar I was trying to set up the pytorch/conda-builder:cuda116 container for CUDA tests in a similar way (#6665). But somehow nvidia-smi, which is in the container, is not visible in the CI step. Could you please take a look and help enable the CUDA tests? Thanks

I added my thoughts on #6665 about why nvidia-smi doesn't show up there.
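For context only, and not necessarily the explanation given in #6665: a job container generally sees the host GPU only when the container is started with GPU access. A sketch of that setting, assuming a hypothetical GPU runner label and a host with the NVIDIA container runtime installed:

```yaml
# Sketch only: the runner label is hypothetical; the actual cause is discussed in #6665.
name: GPU visibility sketch
on: [pull_request]
jobs:
  check-gpu:
    runs-on: linux.4xlarge.nvidia.gpu     # hypothetical GPU runner label
    container:
      image: pytorch/conda-builder:cuda116
      options: --gpus all                 # passed to docker when the job container is created
    steps:
      - name: Check that the GPU is visible
        run: nvidia-smi
```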

github-actions · 2022-10-17T18:01:43Z

Hey @osalpekar!

You merged this PR, but no labels were added. The list of valid labels is available at https://github.com/pytorch/vision/blob/main/.github/process_commit.py

Summary:
* [Nova][WIP] Add Linux CPU Unittests for Torchvision
* use conda-builder image since conda installation is needed
* install torch dep with conda instead
* use circleCI command to run tests
* larger instance to avoid OOM issues
* proper syntax for self-hosted runners
* 4xlarge instance
* 8xlarge
* 12xlarge
* use setup-miniconda job
* add back PATH change to help setup py detect conda
* run conda shell script
* install other deps up front
* git config and undo path change
* revert to local conda install
* conda-builder image
* support for whole python version matrix
* clean up the conda env once we are done with the job

Reviewed By: YosuaMichael

Differential Revision: D40588169

fbshipit-source-id: 515b12daa84d1707f6b700782fade13f8532ff05

facebook-github-bot added the cla signed label Oct 12, 2022

osalpekar changed the title ~~[Nova][WIP] Add Linux CPU Unittests for Torchvision~~ [Nova][WIP] Add GHA Linux CPU Unittests for Torchvision Oct 12, 2022

osalpekar marked this pull request as draft October 13, 2022 15:29

osalpekar force-pushed the linux_cpu_unittests branch from e7a1862 to ab0e710 Compare October 13, 2022 15:47

huydhn reviewed Oct 14, 2022

huydhn reviewed Oct 14, 2022

osalpekar force-pushed the linux_cpu_unittests branch from 901d159 to 5133175 Compare October 14, 2022 15:48

osalpekar force-pushed the linux_cpu_unittests branch from 21c4223 to 4003322 Compare October 14, 2022 21:08

osalpekar marked this pull request as ready for review October 14, 2022 21:08

osalpekar changed the title ~~[Nova][WIP] Add GHA Linux CPU Unittests for Torchvision~~ [Nova] Add GHA Linux CPU Unittests for Torchvision Oct 14, 2022

huydhn reviewed Oct 14, 2022

huydhn approved these changes Oct 14, 2022

osalpekar requested a review from datumbox October 14, 2022 21:34

datumbox approved these changes Oct 15, 2022

osalpekar added 9 commits October 17, 2022 13:57

[Nova][WIP] Add Linux CPU Unittests for Torchvision

4f9a4a9

use conda-builder image since conda installation is needed

038a1c3

install torch dep with conda instead

8770800

use circleCI command to run tests

f12310b

larger instance to avoid OOM issues

360069e

proper syntax for self-hosted runners

e894142

4xlarge instance

8c51fd5

8xlarge

20d30e2

12xlarge

02b1dd4

osalpekar added 9 commits October 17, 2022 13:57

use setup-miniconda job

7b0e1a5

add back PATH change to help setup py detect conda

86a8314

run conda shell script

5b36d4d

install other deps up front

9a6115c

git config and undo path change

2336c08

revert to local conda install

f3061eb

conda-builder image

3262441

support for whole python version matrix

a37225b

clean up the conda env once we are done with the job

2167e0b

osalpekar force-pushed the linux_cpu_unittests branch from 2351d3d to 2167e0b Compare October 17, 2022 17:57

osalpekar merged commit 0610b13 into pytorch:main Oct 17, 2022

vfdev-5 added module: tests module: ci labels Oct 17, 2022

5 participants
