
pytest to run test_ops, test_ops_gradients, test_ops_jit in non linux cuda environments#79898

Closed
clee2000 wants to merge 13 commits into master from csl/pytest-new

Conversation

Contributor

@clee2000 clee2000 commented Jun 20, 2022

This PR uses pytest to run test_ops, test_ops_gradients, and test_ops_jit in parallel in non-Linux CUDA environments to decrease TTS. Linux CUDA is excluded because running in parallel there results in out-of-memory errors.

Notes:

  • update the hypothesis version for compatibility with pytest
  • use pytest-rerunfailures to rerun failing tests (similar to the flaky-test handling, although these test files generally don't have flaky tests)
    • reruns are denoted by a rerun tag in the XML. Failed reruns also carry the failure tag; successful reruns (meaning the test is flaky) do not.
  • see https://docs.google.com/spreadsheets/d/1aO0Rbg3y3ch7ghipt63PG2KNEUppl9a5b18Hmv2CZ4E/edit#gid=602543594 for info on speedup (or slowdown in the case of slow tests)
    • expecting Windows tests to decrease by 60 minutes total
  • slow test infra is expected to stay the same - verified by running pytest and unittest on the same job and checking the number of skipped/run tests
  • test reports to S3 changed - an entirely new table was added to keep track of invoking_file times
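As a rough sketch, the parallel invocation enabled here might look like the lines below. The shard count, rerun count, and report paths are illustrative assumptions, not the actual flags in run_test.py; `-n` is provided by the pytest-xdist plugin and `--reruns` by pytest-rerunfailures.

```shell
# Hypothetical command lines for the three files run under pytest in parallel.
# The exact flags used by run_test.py may differ; these are illustrative.
FILES="test_ops test_ops_gradients test_ops_jit"
for f in $FILES; do
  # -n <N>: pytest-xdist worker count; --reruns <N>: pytest-rerunfailures retries
  CMD="python -m pytest ${f}.py -v -n 3 --reruns 2 --junit-xml=test-reports/${f}.xml"
  echo "$CMD"
done
```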

@clee2000 clee2000 added the ciflow/trunk Trigger trunk jobs on your pull request label Jun 20, 2022
@clee2000 clee2000 force-pushed the csl/pytest-new branch 2 times, most recently from 1d116c8 to cf4524b on June 20, 2022 21:19
@facebook-github-bot
Contributor

facebook-github-bot commented Jun 20, 2022

🔗 Helpful links

✅ No Failures (0 Pending)

As of commit 1332182 (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI (expand for details).

Please report bugs/suggestions to the (internal) Dr. CI Users group.


@clee2000 clee2000 force-pushed the csl/pytest-new branch 5 times, most recently from 136af30 to 6b31474 on June 21, 2022 18:51
@clee2000 clee2000 force-pushed the csl/pytest-new branch 2 times, most recently from 95dfee5 to 38e67dd on June 30, 2022 19:45
@clee2000 clee2000 force-pushed the csl/pytest-new branch 13 times, most recently from f40ddce to c93465e on July 12, 2022 17:05
@clee2000 clee2000 changed the title -n=1 pytest to run test_ops, test_ops_gradients, and test_ops_jit in non linux cuda environments Jul 13, 2022
@clee2000 clee2000 changed the title pytest to run test_ops, test_ops_gradients, and test_ops_jit in non linux cuda environments pytest to run test_ops, test_ops_gradients, test_ops_jit in non linux cuda environments Jul 13, 2022
@clee2000 clee2000 added the ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR label Jul 13, 2022
@clee2000 clee2000 marked this pull request as ready for review July 13, 2022 20:44
@clee2000 clee2000 requested a review from mruberry as a code owner July 13, 2022 20:44
# when running in parallel in pytest, adding the test times will not give the correct
# time used to run the file, which will make the sharding incorrect, so if the test is
# run in parallel, we take the time reported by the testsuite
if key in pytest_parallel_times:
Contributor

Ah, so you track the pytest times, and then for these test summaries, and these test summaries ONLY, you override the time corresponding to the invoking file?

Contributor

oh wait you're not modifying the existing stats, you're just adding a new table for file times!
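The logic the quoted snippet gates on can be sketched as follows. `pytest_parallel_times` mirrors the dict in the snippet; the function name and the other data structures are hypothetical, for illustration only.

```python
# Sketch of the sharding-time logic discussed above: when a file was run in
# parallel under pytest, summing individual test times overstates wall-clock
# time, so the suite-reported wall time is used instead.

def file_time(key, per_test_times, pytest_parallel_times):
    """Return the wall-clock time to attribute to an invoking file."""
    if key in pytest_parallel_times:
        # Parallel run: tests overlapped, so take the testsuite's own wall time.
        return pytest_parallel_times[key]
    # Serial run: summing per-test times is an accurate wall-clock estimate.
    return sum(per_test_times.get(key, []))

serial = {("linux", "test_nn"): [1.0, 2.0, 3.0]}
parallel = {("win", "test_ops"): 4.5}

print(file_time(("linux", "test_nn"), serial, parallel))  # 6.0 (sum of tests)
print(file_time(("win", "test_ops"), serial, parallel))   # 4.5 (suite time)
```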

Catherine Lee added 2 commits July 18, 2022 11:35
@clee2000 clee2000 requested a review from janeyx99 July 18, 2022 18:38
upload_to_s3(
args.workflow_run_id,
args.workflow_run_attempt,
"invoking_file_times",
Contributor

Would be good to verify that these stats are what you expect after landing

Contributor Author

like in s3?
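A minimal sketch of what the new invoking_file_times records might contain. Only `workflow_run_id`, `workflow_run_attempt`, and the table name come from the snippet above; the helper name and field names are assumptions, not the actual schema.

```python
# Illustrative record builder for the new "invoking_file_times" S3 table,
# mapping each invoking file to its wall-clock time for a given run attempt.
def build_invoking_file_rows(workflow_run_id, workflow_run_attempt, times):
    """times: dict mapping invoking file name -> seconds taken."""
    return [
        {
            "workflow_run_id": workflow_run_id,
            "workflow_run_attempt": workflow_run_attempt,
            "invoking_file": f,
            "time_in_seconds": t,
        }
        for f, t in times.items()
    ]

rows = build_invoking_file_rows(123, 1, {"test_ops": 4.5, "test_ops_jit": 3.0})
```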

@clee2000
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

@pytorchbot successfully started a merge job. Check the current status here

@github-actions
Contributor

Hey @clee2000.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

facebook-github-bot pushed a commit that referenced this pull request Jul 20, 2022
… cuda environments (#79898) (#79898)

Summary:
This PR uses pytest to run test_ops, test_ops_gradients, and test_ops_jit in parallel in non-Linux CUDA environments to decrease TTS. Linux CUDA is excluded because running in parallel there results in out-of-memory errors.

Notes:
* update the hypothesis version for compatibility with pytest
* use pytest-rerunfailures to rerun failing tests (similar to the flaky-test handling, although these test files generally don't have flaky tests)
  * reruns are denoted by a rerun tag in the XML. Failed reruns also carry the failure tag; successful reruns (meaning the test is flaky) do not.
* see https://docs.google.com/spreadsheets/d/1aO0Rbg3y3ch7ghipt63PG2KNEUppl9a5b18Hmv2CZ4E/edit#gid=602543594 for info on speedup (or slowdown in the case of slow tests)
  * expecting Windows tests to decrease by 60 minutes total
* slow test infra is expected to stay the same - verified by running pytest and unittest on the same job and checking the number of skipped/run tests
* test reports to S3 changed - an entirely new table was added to keep track of invoking_file times

Pull Request resolved: #79898
Approved by: https://github.com/malfet, https://github.com/janeyx99

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/06a0cfc0ea0cc703a1ebc8148181ac3e3cb80ab5

Reviewed By: jeanschmidt

Differential Revision: D37990830

Pulled By: clee2000

fbshipit-source-id: bf781f39829c03f167470e2222ed0496a54fca72
verbosity=2 if verbose else 1,
resultclass=XMLTestResultVerbose))
if test_filename in PYTEST_FILES and not IS_SANDCASTLE and not (
"cuda" in os.environ["BUILD_ENVIRONMENT"] and "linux" in os.environ["BUILD_ENVIRONMENT"]
Collaborator

Hi @malfet @janeyx99 , it seems like our test_ops tests were completely skipped after this PR because we don't have BUILD_ENVIRONMENT in our environment. Is this check really necessary for non-github-CI builds? Can you provide a fix to re-enable test_ops and other PYTEST_FILES tests?

cc @ptrblck

Running test_ops ... [2022-07-28 03:25:21.888236]
Executing ['/opt/conda/bin/python', '-bb', 'test_ops.py', '-v', '--save-xml', '--import-slow-tests', '--import-disabled-tests'] ... [2022-07-28 03:25:21.888297]
Traceback (most recent call last):
  File "test_ops.py", line 1736, in <module>
    run_tests()
  File "/opt/pytorch/pytorch/torch/testing/_internal/common_utils.py", line 716, in run_tests
    "cuda" in os.environ["BUILD_ENVIRONMENT"] and "linux" in os.environ["BUILD_ENVIRONMENT"]
  File "/opt/conda/lib/python3.8/os.py", line 675, in __getitem__
    raise KeyError(key) from None
KeyError: 'BUILD_ENVIRONMENT'
test_ops failed!

pytorchmergebot pushed a commit that referenced this pull request Jul 29, 2022
### Description
quick fix for #79898 (comment)


Pull Request resolved: #82452
Approved by: https://github.com/huydhn
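The failure mode above comes from indexing `os.environ` directly, which raises `KeyError` when the variable is unset. A minimal sketch of a guarded check along the lines of the #82452 fix (the function name is hypothetical):

```python
# Use os.environ.get with a default so the check degrades gracefully when
# BUILD_ENVIRONMENT is unset (as in non-GitHub-CI builds) instead of raising.
import os

def is_linux_cuda_build():
    build_env = os.environ.get("BUILD_ENVIRONMENT", "")
    return "cuda" in build_env and "linux" in build_env

os.environ.pop("BUILD_ENVIRONMENT", None)  # simulate a non-CI environment
print(is_linux_cuda_build())  # False when the variable is unset
```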
@clee2000 clee2000 deleted the csl/pytest-new branch September 28, 2022 17:15

Labels

ciflow/trunk Trigger trunk jobs on your pull request cla signed Merged

6 participants