
pytest to run test_ops, test_ops_gradients, test_ops_jit in non linux cuda environments#79898

Closed
clee2000 wants to merge 13 commits into master from csl/pytest-new

Conversation

Contributor

@clee2000 clee2000 commented Jun 20, 2022

This PR uses pytest to run test_ops, test_ops_gradients, and test_ops_jit in parallel in non-Linux CUDA environments to decrease TTS. Linux CUDA is excluded because running in parallel there results in out-of-memory errors.

Notes:

  • update the hypothesis version for compatibility with pytest
  • use pytest-rerunfailures to rerun failing tests (similar to the flaky-test handling, although these test files generally don't have flaky tests)
    • reruns are denoted by a rerun tag in the XML. Failed reruns also carry the failure tag; successful reruns (meaning the test is flaky) do not.
  • see https://docs.google.com/spreadsheets/d/1aO0Rbg3y3ch7ghipt63PG2KNEUppl9a5b18Hmv2CZ4E/edit#gid=602543594 for info on speedup (or slowdown in the case of slow tests)
    • expecting Windows tests to decrease by 60 minutes total
  • slow test infra is expected to stay the same - verified by running pytest and unittest on the same job and checking the number of skipped/run tests
  • test reports to S3 changed - an entirely new table was added to keep track of invoking_file times
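As a rough sketch, the parallel invocation enabled here might look like the lines below. The shard count, rerun count, and report paths are illustrative assumptions, not the actual flags in run_test.py; `-n` is provided by the pytest-xdist plugin and `--reruns` by pytest-rerunfailures.

```shell
# Hypothetical command lines for the three files run under pytest in parallel.
# The exact flags used by run_test.py may differ; these are illustrative.
FILES="test_ops test_ops_gradients test_ops_jit"
for f in $FILES; do
  # -n <N>: pytest-xdist worker count; --reruns <N>: pytest-rerunfailures retries
  CMD="python -m pytest ${f}.py -v -n 3 --reruns 2 --junit-xml=test-reports/${f}.xml"
  echo "$CMD"
done
```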

@clee2000 clee2000 added the ciflow/trunk Trigger trunk jobs on your pull request label Jun 20, 2022
@clee2000 clee2000 force-pushed the csl/pytest-new branch 2 times, most recently from 1d116c8 to cf4524b on June 20, 2022 21:19
@facebook-github-bot
Contributor

facebook-github-bot commented Jun 20, 2022

🔗 Helpful links

✅ No Failures (0 Pending)

As of commit 1332182 (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI (expand for details).

Please report bugs/suggestions to the (internal) Dr. CI Users group.


@clee2000 clee2000 force-pushed the csl/pytest-new branch 5 times, most recently from 136af30 to 6b31474 on June 21, 2022 18:51
@clee2000 clee2000 force-pushed the csl/pytest-new branch 2 times, most recently from 95dfee5 to 38e67dd on June 30, 2022 19:45
@clee2000 clee2000 force-pushed the csl/pytest-new branch 13 times, most recently from f40ddce to c93465e on July 12, 2022 17:05
@clee2000 clee2000 changed the title -n=1 pytest to run test_ops, test_ops_gradients, and test_ops_jit in non linux cuda environments Jul 13, 2022
@clee2000 clee2000 changed the title pytest to run test_ops, test_ops_gradients, and test_ops_jit in non linux cuda environments pytest to run test_ops, test_ops_gradients, test_ops_jit in non linux cuda environments Jul 13, 2022
@clee2000 clee2000 added the ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR label Jul 13, 2022
@clee2000 clee2000 marked this pull request as ready for review July 13, 2022 20:44
@clee2000 clee2000 requested a review from mruberry as a code owner July 13, 2022 20:44
# when running in parallel in pytest, adding the test times will not give the correct
# time used to run the file, which will make the sharding incorrect, so if the test is
# run in parallel, we take the time reported by the testsuite
if key in pytest_parallel_times:
Contributor

Ah, so you track the pytest times, and then for these test summaries, and these test summaries ONLY, you override the time corresponding to the invoking file?

Contributor

oh wait you're not modifying the existing stats, you're just adding a new table for file times!
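The logic the quoted snippet gates on can be sketched as follows. `pytest_parallel_times` mirrors the dict in the snippet; the function name and the other data structures are hypothetical, for illustration only.

```python
# Sketch of the sharding-time logic discussed above: when a file was run in
# parallel under pytest, summing individual test times overstates wall-clock
# time, so the suite-reported wall time is used instead.

def file_time(key, per_test_times, pytest_parallel_times):
    """Return the wall-clock time to attribute to an invoking file."""
    if key in pytest_parallel_times:
        # Parallel run: tests overlapped, so take the testsuite's own wall time.
        return pytest_parallel_times[key]
    # Serial run: summing per-test times is an accurate wall-clock estimate.
    return sum(per_test_times.get(key, []))

serial = {("linux", "test_nn"): [1.0, 2.0, 3.0]}
parallel = {("win", "test_ops"): 4.5}

print(file_time(("linux", "test_nn"), serial, parallel))  # 6.0 (sum of tests)
print(file_time(("win", "test_ops"), serial, parallel))   # 4.5 (suite time)
```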

Catherine Lee added 2 commits July 18, 2022 11:35
@clee2000 clee2000 requested a review from janeyx99 July 18, 2022 18:38
upload_to_s3(
args.workflow_run_id,
args.workflow_run_attempt,
"invoking_file_times",
Contributor

Would be good to verify that these stats are what you expect after landing

Contributor Author

like in s3?
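A minimal sketch of what the new invoking_file_times records might contain. Only `workflow_run_id`, `workflow_run_attempt`, and the table name come from the snippet above; the helper name and field names are assumptions, not the actual schema.

```python
# Illustrative record builder for the new "invoking_file_times" S3 table,
# mapping each invoking file to its wall-clock time for a given run attempt.
def build_invoking_file_rows(workflow_run_id, workflow_run_attempt, times):
    """times: dict mapping invoking file name -> seconds taken."""
    return [
        {
            "workflow_run_id": workflow_run_id,
            "workflow_run_attempt": workflow_run_attempt,
            "invoking_file": f,
            "time_in_seconds": t,
        }
        for f, t in times.items()
    ]

rows = build_invoking_file_rows(123, 1, {"test_ops": 4.5, "test_ops_jit": 3.0})
```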

@clee2000
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

@pytorchbot successfully started a merge job. Check the current status here

@github-actions
Contributor

Hey @clee2000.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

facebook-github-bot pushed a commit that referenced this pull request Jul 20, 2022
… cuda environments (#79898) (#79898)

Summary:
This PR uses pytest to run test_ops, test_ops_gradients, and test_ops_jit in parallel in non-Linux CUDA environments to decrease TTS. Linux CUDA is excluded because running in parallel there results in out-of-memory errors.

Notes:
* update the hypothesis version for compatibility with pytest
* use pytest-rerunfailures to rerun failing tests (similar to the flaky-test handling, although these test files generally don't have flaky tests)
  * reruns are denoted by a rerun tag in the XML. Failed reruns also carry the failure tag; successful reruns (meaning the test is flaky) do not.
* see https://docs.google.com/spreadsheets/d/1aO0Rbg3y3ch7ghipt63PG2KNEUppl9a5b18Hmv2CZ4E/edit#gid=602543594 for info on speedup (or slowdown in the case of slow tests)
  * expecting Windows tests to decrease by 60 minutes total
* slow test infra is expected to stay the same - verified by running pytest and unittest on the same job and checking the number of skipped/run tests
* test reports to S3 changed - an entirely new table was added to keep track of invoking_file times

Pull Request resolved: #79898
Approved by: https://github.com/malfet, https://github.com/janeyx99

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/06a0cfc0ea0cc703a1ebc8148181ac3e3cb80ab5

Reviewed By: jeanschmidt

Differential Revision: D37990830

Pulled By: clee2000

fbshipit-source-id: bf781f39829c03f167470e2222ed0496a54fca72
verbosity=2 if verbose else 1,
resultclass=XMLTestResultVerbose))
if test_filename in PYTEST_FILES and not IS_SANDCASTLE and not (
"cuda" in os.environ["BUILD_ENVIRONMENT"] and "linux" in os.environ["BUILD_ENVIRONMENT"]
Collaborator

Hi @malfet @janeyx99 , it seems like our test_ops tests were completely skipped after this PR because we don't have BUILD_ENVIRONMENT in our environment. Is this check really necessary for non-github-CI builds? Can you provide a fix to re-enable test_ops and other PYTEST_FILES tests?

cc @ptrblck

Running test_ops ... [2022-07-28 03:25:21.888236]
Executing ['/opt/conda/bin/python', '-bb', 'test_ops.py', '-v', '--save-xml', '--import-slow-tests', '--import-disabled-tests'] ... [2022-07-28 03:25:21.888297]
Traceback (most recent call last):
  File "test_ops.py", line 1736, in <module>
    run_tests()
  File "/opt/pytorch/pytorch/torch/testing/_internal/common_utils.py", line 716, in run_tests
    "cuda" in os.environ["BUILD_ENVIRONMENT"] and "linux" in os.environ["BUILD_ENVIRONMENT"]
  File "/opt/conda/lib/python3.8/os.py", line 675, in __getitem__
    raise KeyError(key) from None
KeyError: 'BUILD_ENVIRONMENT'
test_ops failed!

pytorchmergebot pushed a commit that referenced this pull request Jul 29, 2022
### Description
quick fix for #79898 (comment)


Pull Request resolved: #82452
Approved by: https://github.com/huydhn
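The failure mode above comes from indexing `os.environ` directly, which raises `KeyError` when the variable is unset. A minimal sketch of a guarded check along the lines of the #82452 fix (the function name is hypothetical):

```python
# Use os.environ.get with a default so the check degrades gracefully when
# BUILD_ENVIRONMENT is unset (as in non-GitHub-CI builds) instead of raising.
import os

def is_linux_cuda_build():
    build_env = os.environ.get("BUILD_ENVIRONMENT", "")
    return "cuda" in build_env and "linux" in build_env

os.environ.pop("BUILD_ENVIRONMENT", None)  # simulate a non-CI environment
print(is_linux_cuda_build())  # False when the variable is unset
```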
@clee2000 clee2000 deleted the csl/pytest-new branch September 28, 2022 17:15

Labels

ciflow/trunk Trigger trunk jobs on your pull request cla signed Merged

6 participants