[LTS] CherryPick: Add multi gpu checker for `TestZeroRedundancyOptimizer.test_collect_shards` by jambayk · Pull Request #72923 · pytorch/pytorch

jambayk · 2022-02-16T17:56:03Z

This PR cherry-picks commit 2cf9098 from the master branch, which skips the test `TestZeroRedundancyOptimizer.test_collect_shards` unless multiple GPUs are available.
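
For readers unfamiliar with this kind of guard, here is a minimal sketch of a "skip unless multi-GPU" decorator. It is not the decorator used in the PyTorch test suite; the names `available_gpu_count`, `skip_unless_multi_gpu`, and the `count_fn` parameter are hypothetical and exist only for illustration. The sketch assumes that fewer than two visible CUDA devices should cause the test to be skipped rather than fail.

```python
import functools
import unittest


def available_gpu_count() -> int:
    """Return the number of visible CUDA devices, or 0 if torch is not installed."""
    try:
        import torch
    except ImportError:
        return 0
    return torch.cuda.device_count() if torch.cuda.is_available() else 0


def skip_unless_multi_gpu(min_gpus: int = 2, count_fn=available_gpu_count):
    """Hypothetical decorator: skip the wrapped test when fewer than min_gpus are found."""

    def decorator(test_fn):
        @functools.wraps(test_fn)
        def wrapper(*args, **kwargs):
            found = count_fn()
            if found < min_gpus:
                # unittest reports SkipTest as a skip, not as a failure or error.
                raise unittest.SkipTest(f"needs {min_gpus} GPUs, found {found}")
            return test_fn(*args, **kwargs)

        return wrapper

    return decorator


class ExampleTest(unittest.TestCase):
    @skip_unless_multi_gpu()
    def test_collect_shards(self):
        # Placeholder body; the real test exercises optimizer state collection.
        self.assertTrue(True)
```

Passing `count_fn` explicitly makes the guard testable on machines without GPUs; on a single-GPU host the example test above would be reported as skipped.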

facebook-github-bot · 2022-02-16T17:56:08Z

🔗 Helpful links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/72923
🔧 Opt-in to CIFlow to control what jobs run on your PRs

💊 CI failures summary and remediations

As of commit 0f238cd (more details on the Dr. CI page):

9/10 failures introduced in this PR
1/10 tentatively recognized as flaky ❄️

🕵️ 5 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

pytorch_windows_vs2019_py36_cpu_build (1/5)

Step: "Build" (full log | diagnosis details | 🔁 rerun)

ModuleNotFoundError: No module named 'yaml'

Building wheel torch-1.8.3a0+0f238cd
-- Building version 1.8.3a0+0f238cd
Traceback (most recent call last):
  File "C:\Users\circleci\project\setup.py", line 368, in check_pydep
    importlib.import_module(importname)
  File "C:\Jenkins\Miniconda3\lib\importlib\__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 984, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'yaml'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\circleci\project\setup.py", line 818, in <module>
    build_deps()
  File "C:\Users\circleci\project\setup.py", line 313, in build_deps
    check_pydep('yaml', 'pyyaml')
  File "C:\Users\circleci\project\setup.py", line 370, in check_pydep
    raise RuntimeError(missing_pydep.format(importname=importname, module=module))
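
The traceback above comes from a build-time dependency check in `setup.py`. The following is a rough sketch of that pattern, reconstructed only from the function name, arguments, and `raise` statement visible in the log. The wording of `MISSING_PYDEP` is an assumption, since the log does not show the real message.

```python
import importlib

# Assumed wording: the actual message template in setup.py is not shown in the log.
MISSING_PYDEP = (
    "Missing build dependency: unable to import `{importname}`. "
    "Try installing the `{module}` package."
)


def check_pydep(importname: str, module: str) -> None:
    """Raise RuntimeError if the Python module `importname` cannot be imported."""
    try:
        importlib.import_module(importname)
    except ImportError:
        # ModuleNotFoundError is a subclass of ImportError, so it is caught here.
        raise RuntimeError(
            MISSING_PYDEP.format(importname=importname, module=module)
        )


if __name__ == "__main__":
    # Mirrors the call in the traceback; fails if PyYAML is not installed.
    check_pydep("yaml", "pyyaml")
```

Under this reading, the Windows build jobs most likely fail because the `yaml` module (provided by the `pyyaml` package) is missing from the build environment, not because of the change in this PR.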

pytorch_windows_vs2019_py36_cuda10.1_build (2/5)

Step: "Build" (full log | diagnosis details | 🔁 rerun)

ModuleNotFoundError: No module named 'yaml'

Building wheel torch-1.8.3a0+0f238cd
-- Building version 1.8.3a0+0f238cd
Traceback (most recent call last):
  File "C:\Users\circleci\project\setup.py", line 368, in check_pydep
    importlib.import_module(importname)
  File "C:\Jenkins\Miniconda3\lib\importlib\__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 984, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'yaml'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\circleci\project\setup.py", line 818, in <module>
    build_deps()
  File "C:\Users\circleci\project\setup.py", line 313, in build_deps
    check_pydep('yaml', 'pyyaml')
  File "C:\Users\circleci\project\setup.py", line 370, in check_pydep
    raise RuntimeError(missing_pydep.format(importname=importname, module=module))

docker-pytorch-linux-xenial-cuda11.0-cudnn8-py3-gcc7 (3/5)

Step: "Check if image should be built" (full log | diagnosis details | 🔁 rerun)

ERROR: Something has gone wrong and the previou... isn't available for the merge-base of your branch

+ docker manifest inspect 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-cuda11.0-cudnn8-py3-gcc7:f7b39adecf174c662ad36b226ce859d2feafc4d8
no such manifest: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-cuda11.0-cudnn8-py3-gcc7:f7b39adecf174c662ad36b226ce859d2feafc4d8
++ git merge-base HEAD 4a1a8b285e207e35b063e863af1e1d23c918842f
+ git rev-parse 4a1a8b285e207e35b063e863af1e1d23c918842f:.circleci/docker
f7b39adecf174c662ad36b226ce859d2feafc4d8
+++ git merge-base HEAD 4a1a8b285e207e35b063e863af1e1d23c918842f
++ git rev-parse 4a1a8b285e207e35b063e863af1e1d23c918842f:.circleci/docker
+ PREVIOUS_DOCKER_TAG=f7b39adecf174c662ad36b226ce859d2feafc4d8
+ [[ f7b39adecf174c662ad36b226ce859d2feafc4d8 = \f\7\b\3\9\a\d\e\c\f\1\7\4\c\6\6\2\a\d\3\6\b\2\2\6\c\e\8\5\9\d\2\f\e\a\f\c\4\d\8 ]]
+ echo 'ERROR: Something has gone wrong and the previous image isn'\''t available for the merge-base of your branch'
ERROR: Something has gone wrong and the previous image isn't available for the merge-base of your branch
+ echo '       contact the PyTorch team to restore the original images'
       contact the PyTorch team to restore the original images
+ exit 1


Exited with code exit status 1

pytorch_windows_vs2019_py36_cuda11.1_build (4/5)

Step: "Build" (full log | diagnosis details | 🔁 rerun)

ModuleNotFoundError: No module named 'yaml'

Building wheel torch-1.8.3a0+0f238cd
-- Building version 1.8.3a0+0f238cd
Traceback (most recent call last):
  File "C:\Users\circleci\project\setup.py", line 368, in check_pydep
    importlib.import_module(importname)
  File "C:\Jenkins\Miniconda3\lib\importlib\__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 984, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'yaml'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\circleci\project\setup.py", line 818, in <module>
    build_deps()
  File "C:\Users\circleci\project\setup.py", line 313, in build_deps
    check_pydep('yaml', 'pyyaml')
  File "C:\Users\circleci\project\setup.py", line 370, in check_pydep
    raise RuntimeError(missing_pydep.format(importname=importname, module=module))

docker-pytorch-linux-xenial-cuda11.1-cudnn8-py3-gcc7 (5/5)

Step: "Check if image should be built" (full log | diagnosis details | 🔁 rerun)

ERROR: Something has gone wrong and the previou... isn't available for the merge-base of your branch

+ docker manifest inspect 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-cuda11.1-cudnn8-py3-gcc7:f7b39adecf174c662ad36b226ce859d2feafc4d8
no such manifest: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-cuda11.1-cudnn8-py3-gcc7:f7b39adecf174c662ad36b226ce859d2feafc4d8
++ git merge-base HEAD 4a1a8b285e207e35b063e863af1e1d23c918842f
+ git rev-parse 4a1a8b285e207e35b063e863af1e1d23c918842f:.circleci/docker
f7b39adecf174c662ad36b226ce859d2feafc4d8
+++ git merge-base HEAD 4a1a8b285e207e35b063e863af1e1d23c918842f
++ git rev-parse 4a1a8b285e207e35b063e863af1e1d23c918842f:.circleci/docker
+ PREVIOUS_DOCKER_TAG=f7b39adecf174c662ad36b226ce859d2feafc4d8
+ [[ f7b39adecf174c662ad36b226ce859d2feafc4d8 = \f\7\b\3\9\a\d\e\c\f\1\7\4\c\6\6\2\a\d\3\6\b\2\2\6\c\e\8\5\9\d\2\f\e\a\f\c\4\d\8 ]]
+ echo 'ERROR: Something has gone wrong and the previous image isn'\''t available for the merge-base of your branch'
ERROR: Something has gone wrong and the previous image isn't available for the merge-base of your branch
+ echo '       contact the PyTorch team to restore the original images'
       contact the PyTorch team to restore the original images
+ exit 1


Exited with code exit status 1

4 failures not recognized by patterns:

Job	Step	Action
Lint / flake8-py3	Add annotations	🔁 rerun
Lint / quick-checks	Shellcheck Jenkins scripts	🔁 rerun
binary_linux_libtorch_3_7m_cpu_gcc5_4_cxx11-abi_shared-with-deps_test	Checkout pytorch/builder repo	🔁 rerun
binary_linux_libtorch_3_7m_cpu_devtoolset7_shared-with-deps_test	Checkout pytorch/builder repo	🔁 rerun

❄️ 1 failure tentatively classified as flaky

but reruns have not yet been triggered to confirm:

pytorch_linux_xenial_py3_6_gcc5_4_test (1/1)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun) ❄️

Mar 04 03:08:42 RuntimeError: Process 0 terminated or timed out after 100.05959582328796 seconds

Mar 04 03:08:42 ======================================================================
Mar 04 03:08:42 ERROR [100.107s]: test_multiple_backward (__main__.TensorPipeDistAutogradTestWithSpawn)
Mar 04 03:08:42 ----------------------------------------------------------------------
Mar 04 03:08:42 Traceback (most recent call last):
Mar 04 03:08:42   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 282, in wrapper
Mar 04 03:08:42     self._join_processes(fn)
Mar 04 03:08:42   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 399, in _join_processes
Mar 04 03:08:42     self._check_return_codes(elapsed_time)
Mar 04 03:08:42   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 440, in _check_return_codes
Mar 04 03:08:42     raise RuntimeError('Process {} terminated or timed out after {} seconds'.format(i, elapsed_time))
Mar 04 03:08:42 RuntimeError: Process 0 terminated or timed out after 100.05959582328796 seconds
Mar 04 03:08:42 
Mar 04 03:08:42 ----------------------------------------------------------------------
Mar 04 03:08:42 Ran 411 tests in 1289.906s
Mar 04 03:08:42 
Mar 04 03:08:42 FAILED (errors=1, skipped=66)
Mar 04 03:08:42 
Mar 04 03:08:42 Generating XML reports...
Mar 04 03:08:42 Generated XML report: test-reports/dist-gloo/TEST-TensorPipeDdpComparisonTestWithSpawn-20220304024712.xml
Mar 04 03:08:42 Generated XML report: test-reports/dist-gloo/TEST-TensorPipeDdpUnderDistAutogradTestWithSpawn-20220304024712.xml
Mar 04 03:08:42 Generated XML report: test-reports/dist-gloo/TEST-TensorPipeDistAutogradTestWithSpawn-20220304024712.xml

This comment was automatically generated by Dr. CI (expand for details).

Please report bugs/suggestions to the (internal) Dr. CI Users group.


Summary: The test test_collect_shards fails on single GPU setup. Enabling the multi gpu checker.

Signed-off-by: Jagadish Krishnamoorthy <jagdish.krishna@gmail.com>

Pull Request resolved: #53564

Reviewed By: H-Huang

Differential Revision: D26952325

Pulled By: rohan-varma

fbshipit-source-id: e8956f9277c7320024bece129767e83fbdf02b2c

jambayk · 2022-03-03T23:19:27Z

This has been rebased onto the lts/release/1.8 branch (at commit 4a1a8b2)

jambayk requested review from mrshenli, pritamdamania87, rohan-varma and zhaojuanmao as code owners February 16, 2022 17:56

pytorch-bot Bot added the ciflow/default label Feb 16, 2022

facebook-github-bot added the cla signed label Feb 16, 2022

facebook-github-bot added the oncall: distributed label Feb 16, 2022

jambayk requested review from malfet and seemethere and removed request for mrshenli, pritamdamania87, rohan-varma and zhaojuanmao February 16, 2022 17:56

pytorchbot added the open source label Feb 16, 2022

jambayk removed the oncall: distributed label Feb 23, 2022

jaglinux and others added 2 commits March 3, 2022 15:15

fix skip_if_not_multigpu

0f238cd

seemethere approved these changes Mar 4, 2022


malfet approved these changes Mar 4, 2022


malfet merged commit 133673e into pytorch:lts/release/1.8 Mar 4, 2022

jambayk deleted the jambayk/lts-fix/multi-gpu branch March 4, 2022 19:26

malfet merged 2 commits into pytorch:lts/release/1.8 from jambayk:jambayk/lts-fix/multi-gpu
