[LTS] CherryPick: Add multi gpu checker for TestZeroRedundancyOptimizer.test_collect_shards #72923
Merged
malfet merged 2 commits into pytorch:lts/release/1.8 from jambayk:jambayk/lts-fix/multi-gpu on Mar 4, 2022
💊 CI failures summary and remediations
As of commit 0f238cd (more details on the Dr. CI page):
🕵️ 5 new failures recognized by patterns
The following CI failures do not appear to be due to upstream breakages:
| Job | Step | Action |
|---|---|---|
| | Add annotations | 🔁 rerun |
| | Shellcheck Jenkins scripts | 🔁 rerun |
| | Checkout pytorch/builder repo | 🔁 rerun |
| | Checkout pytorch/builder repo | 🔁 rerun |
❄️ 1 failure tentatively classified as flaky
but reruns have not yet been triggered to confirm:
pytorch_linux_xenial_py3_6_gcc5_4_test (1/1)
Step: "Run tests" (full log | diagnosis details | 🔁 rerun) ❄️
Mar 04 03:08:42 RuntimeError: Process 0 terminated or timed out after 100.05959582328796 seconds
Mar 04 03:08:42 ======================================================================
Mar 04 03:08:42 ERROR [100.107s]: test_multiple_backward (__main__.TensorPipeDistAutogradTestWithSpawn)
Mar 04 03:08:42 ----------------------------------------------------------------------
Mar 04 03:08:42 Traceback (most recent call last):
Mar 04 03:08:42 File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 282, in wrapper
Mar 04 03:08:42 self._join_processes(fn)
Mar 04 03:08:42 File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 399, in _join_processes
Mar 04 03:08:42 self._check_return_codes(elapsed_time)
Mar 04 03:08:42 File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 440, in _check_return_codes
Mar 04 03:08:42 raise RuntimeError('Process {} terminated or timed out after {} seconds'.format(i, elapsed_time))
Mar 04 03:08:42 RuntimeError: Process 0 terminated or timed out after 100.05959582328796 seconds
Mar 04 03:08:42
Mar 04 03:08:42 ----------------------------------------------------------------------
Mar 04 03:08:42 Ran 411 tests in 1289.906s
Mar 04 03:08:42
Mar 04 03:08:42 FAILED (errors=1, skipped=66)
Mar 04 03:08:42
Mar 04 03:08:42 Generating XML reports...
Mar 04 03:08:42 Generated XML report: test-reports/dist-gloo/TEST-TensorPipeDdpComparisonTestWithSpawn-20220304024712.xml
Mar 04 03:08:42 Generated XML report: test-reports/dist-gloo/TEST-TensorPipeDdpUnderDistAutogradTestWithSpawn-20220304024712.xml
Mar 04 03:08:42 Generated XML report: test-reports/dist-gloo/TEST-TensorPipeDistAutogradTestWithSpawn-20220304024712.xml
This comment was automatically generated by Dr. CI (expand for details).
Please report bugs/suggestions to the (internal) Dr. CI Users group.
Summary: The test test_collect_shards fails on single GPU setup. Enabling the multi gpu checker.
Signed-off-by: Jagadish Krishnamoorthy <jagdish.krishna@gmail.com>
Pull Request resolved: #53564
Reviewed By: H-Huang
Differential Revision: D26952325
Pulled By: rohan-varma
fbshipit-source-id: e8956f9277c7320024bece129767e83fbdf02b2c
This has been rebased onto the |
seemethere approved these changes on Mar 4, 2022
malfet approved these changes on Mar 4, 2022
This PR cherry-picks commit 2cf9098 from the master branch, which skips the test
TestZeroRedundancyOptimizer.test_collect_shards when it is not running on multiple GPUs.
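For readers unfamiliar with this kind of guard, here is a minimal sketch of how a multi-GPU check can skip a test. It is not the code from commit 2cf9098, which is not shown on this page. The names requires_multi_gpu, available_gpu_count, and ExampleShardTest are hypothetical and exist only for illustration; the sketch relies on unittest.SkipTest from the standard library, and on torch.cuda.device_count() when torch is installed.

```python
import functools
import unittest


def available_gpu_count():
    """Return the number of visible CUDA devices, or 0 if torch is not installed."""
    try:
        import torch
    except ImportError:
        return 0
    return torch.cuda.device_count() if torch.cuda.is_available() else 0


def requires_multi_gpu(min_gpus=2, gpu_count=available_gpu_count):
    """Hypothetical decorator: skip the test unless min_gpus devices are visible.

    gpu_count is injectable so the behaviour can be demonstrated without hardware.
    """
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            found = gpu_count()
            if found < min_gpus:
                # unittest reports this as a skip rather than a failure.
                raise unittest.SkipTest(
                    "needs at least {} GPUs, found {}".format(min_gpus, found)
                )
            return fn(*args, **kwargs)
        return wrapper
    return decorator


class ExampleShardTest(unittest.TestCase):
    # Simulated single-GPU machine: the body must never run.
    @requires_multi_gpu(gpu_count=lambda: 1)
    def test_collect_shards_single_gpu(self):
        self.fail("should have been skipped")

    # Simulated two-GPU machine: the body runs normally.
    @requires_multi_gpu(gpu_count=lambda: 2)
    def test_collect_shards_two_gpus(self):
        self.assertTrue(True)


def run_example():
    """Run the example tests and return the unittest.TestResult."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ExampleShardTest)
    result = unittest.TestResult()
    suite.run(result)
    return result
```

Calling run_example() runs both tests; the simulated single-GPU case is recorded under result.skipped instead of failing, which is the outcome this PR aims for on single-GPU CI machines.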