[LTS] CherryPick: Add multi gpu checker for TestZeroRedundancyOptimizer.test_collect_shards #72923

Merged
malfet merged 2 commits into pytorch:lts/release/1.8 from jambayk:jambayk/lts-fix/multi-gpu on Mar 4, 2022

Collaborator

jambayk commented Feb 16, 2022

This PR cherry-picks commit 2cf9098 from the master branch, which skips the test TestZeroRedundancyOptimizer.test_collect_shards when multiple GPUs are not available.
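
For readers unfamiliar with the pattern, a multi-GPU guard of this kind can be sketched with the standard library alone. The helper names below (`visible_gpu_count`, `skip_if_fewer_gpus_than`, `run_sketch`) and the `FAKE_GPU_COUNT` environment variable are hypothetical and exist only for illustration; this is not the decorator used in the PyTorch test suite, where the count would presumably come from something like `torch.cuda.device_count()`.

```python
import functools
import os
import unittest


def visible_gpu_count() -> int:
    """Hypothetical stand-in for a real device query such as torch.cuda.device_count()."""
    return int(os.environ.get("FAKE_GPU_COUNT", "0"))


def skip_if_fewer_gpus_than(required: int):
    """Return a decorator that skips the wrapped test unless `required` GPUs are visible.

    The count is read when the test runs, not when the module is imported.
    """
    def decorator(test_fn):
        @functools.wraps(test_fn)
        def wrapper(self, *args, **kwargs):
            available = visible_gpu_count()
            if available < required:
                # unittest records this as a skip, not as a failure or error.
                raise unittest.SkipTest(f"needs {required} GPUs, found {available}")
            return test_fn(self, *args, **kwargs)
        return wrapper
    return decorator


class TestZeroRedundancyOptimizerSketch(unittest.TestCase):
    @skip_if_fewer_gpus_than(2)
    def test_collect_shards(self):
        # Placeholder body; the real test exercises sharded optimizer state.
        self.assertGreaterEqual(visible_gpu_count(), 2)


def run_sketch(gpu_count: int) -> unittest.TestResult:
    """Run the sketch test case while pretending `gpu_count` GPUs are visible."""
    os.environ["FAKE_GPU_COUNT"] = str(gpu_count)
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(
        TestZeroRedundancyOptimizerSketch
    )
    result = unittest.TestResult()
    suite.run(result)
    return result
```

With one simulated GPU the test is reported as skipped instead of failing; with two it runs normally.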

Contributor

facebook-github-bot commented Feb 16, 2022

💊 CI failures summary and remediations

As of commit 0f238cd (more details on the Dr. CI page):



🕵️ 5 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_windows_vs2019_py36_cpu_build (1/5)

Step: "Build"

ModuleNotFoundError: No module named 'yaml'
Building wheel torch-1.8.3a0+0f238cd
-- Building version 1.8.3a0+0f238cd
Traceback (most recent call last):
  File "C:\Users\circleci\project\setup.py", line 368, in check_pydep
    importlib.import_module(importname)
  File "C:\Jenkins\Miniconda3\lib\importlib\__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 984, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'yaml'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\circleci\project\setup.py", line 818, in <module>
    build_deps()
  File "C:\Users\circleci\project\setup.py", line 313, in build_deps
    check_pydep('yaml', 'pyyaml')
  File "C:\Users\circleci\project\setup.py", line 370, in check_pydep
    raise RuntimeError(missing_pydep.format(importname=importname, module=module))
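
The traceback above shows the shape of the dependency check: `check_pydep` tries to import a module and turns a failed import into a `RuntimeError`. A minimal sketch of that pattern follows. The wording of `MISSING_PYDEP` is an assumption, since the log does not show the actual message template from `setup.py`.

```python
import importlib

# Assumed wording; the real template in setup.py is not shown in the log.
MISSING_PYDEP = (
    "Missing build dependency: unable to `import {importname}`. "
    "Try installing the `{module}` package."
)


def check_pydep(importname: str, module: str) -> None:
    """Raise RuntimeError if `importname` cannot be imported.

    `module` is the name of the installable package that provides it,
    e.g. check_pydep('yaml', 'pyyaml') as in the traceback.
    """
    try:
        importlib.import_module(importname)
    except ImportError:
        # ModuleNotFoundError is a subclass of ImportError. Raising here, inside
        # the except block, chains both exceptions ("During handling of the
        # above exception, another exception occurred").
        raise RuntimeError(
            MISSING_PYDEP.format(importname=importname, module=module)
        )
```

On the failing Windows jobs, `check_pydep('yaml', 'pyyaml')` raised because `yaml` was not importable in the build environment, so the failure appears to be an environment problem and not something introduced by the test change.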

See CircleCI build pytorch_windows_vs2019_py36_cuda10.1_build (2/5)

Step: "Build"

ModuleNotFoundError: No module named 'yaml'
Building wheel torch-1.8.3a0+0f238cd
-- Building version 1.8.3a0+0f238cd
Traceback (most recent call last):
  File "C:\Users\circleci\project\setup.py", line 368, in check_pydep
    importlib.import_module(importname)
  File "C:\Jenkins\Miniconda3\lib\importlib\__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 984, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'yaml'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\circleci\project\setup.py", line 818, in <module>
    build_deps()
  File "C:\Users\circleci\project\setup.py", line 313, in build_deps
    check_pydep('yaml', 'pyyaml')
  File "C:\Users\circleci\project\setup.py", line 370, in check_pydep
    raise RuntimeError(missing_pydep.format(importname=importname, module=module))

See CircleCI build docker-pytorch-linux-xenial-cuda11.0-cudnn8-py3-gcc7 (3/5)

Step: "Check if image should be built"

ERROR: Something has gone wrong and the previou... isn't available for the merge-base of your branch
+ docker manifest inspect 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-cuda11.0-cudnn8-py3-gcc7:f7b39adecf174c662ad36b226ce859d2feafc4d8
no such manifest: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-cuda11.0-cudnn8-py3-gcc7:f7b39adecf174c662ad36b226ce859d2feafc4d8
++ git merge-base HEAD 4a1a8b285e207e35b063e863af1e1d23c918842f
+ git rev-parse 4a1a8b285e207e35b063e863af1e1d23c918842f:.circleci/docker
f7b39adecf174c662ad36b226ce859d2feafc4d8
+++ git merge-base HEAD 4a1a8b285e207e35b063e863af1e1d23c918842f
++ git rev-parse 4a1a8b285e207e35b063e863af1e1d23c918842f:.circleci/docker
+ PREVIOUS_DOCKER_TAG=f7b39adecf174c662ad36b226ce859d2feafc4d8
+ [[ f7b39adecf174c662ad36b226ce859d2feafc4d8 = \f\7\b\3\9\a\d\e\c\f\1\7\4\c\6\6\2\a\d\3\6\b\2\2\6\c\e\8\5\9\d\2\f\e\a\f\c\4\d\8 ]]
+ echo 'ERROR: Something has gone wrong and the previous image isn'\''t available for the merge-base of your branch'
ERROR: Something has gone wrong and the previous image isn't available for the merge-base of your branch
+ echo '       contact the PyTorch team to restore the original images'
       contact the PyTorch team to restore the original images
+ exit 1


Exited with code exit status 1

See CircleCI build pytorch_windows_vs2019_py36_cuda11.1_build (4/5)

Step: "Build"

ModuleNotFoundError: No module named 'yaml'
Building wheel torch-1.8.3a0+0f238cd
-- Building version 1.8.3a0+0f238cd
Traceback (most recent call last):
  File "C:\Users\circleci\project\setup.py", line 368, in check_pydep
    importlib.import_module(importname)
  File "C:\Jenkins\Miniconda3\lib\importlib\__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 984, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'yaml'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\circleci\project\setup.py", line 818, in <module>
    build_deps()
  File "C:\Users\circleci\project\setup.py", line 313, in build_deps
    check_pydep('yaml', 'pyyaml')
  File "C:\Users\circleci\project\setup.py", line 370, in check_pydep
    raise RuntimeError(missing_pydep.format(importname=importname, module=module))

See CircleCI build docker-pytorch-linux-xenial-cuda11.1-cudnn8-py3-gcc7 (5/5)

Step: "Check if image should be built"

ERROR: Something has gone wrong and the previou... isn't available for the merge-base of your branch
+ docker manifest inspect 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-cuda11.1-cudnn8-py3-gcc7:f7b39adecf174c662ad36b226ce859d2feafc4d8
no such manifest: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-cuda11.1-cudnn8-py3-gcc7:f7b39adecf174c662ad36b226ce859d2feafc4d8
++ git merge-base HEAD 4a1a8b285e207e35b063e863af1e1d23c918842f
+ git rev-parse 4a1a8b285e207e35b063e863af1e1d23c918842f:.circleci/docker
f7b39adecf174c662ad36b226ce859d2feafc4d8
+++ git merge-base HEAD 4a1a8b285e207e35b063e863af1e1d23c918842f
++ git rev-parse 4a1a8b285e207e35b063e863af1e1d23c918842f:.circleci/docker
+ PREVIOUS_DOCKER_TAG=f7b39adecf174c662ad36b226ce859d2feafc4d8
+ [[ f7b39adecf174c662ad36b226ce859d2feafc4d8 = \f\7\b\3\9\a\d\e\c\f\1\7\4\c\6\6\2\a\d\3\6\b\2\2\6\c\e\8\5\9\d\2\f\e\a\f\c\4\d\8 ]]
+ echo 'ERROR: Something has gone wrong and the previous image isn'\''t available for the merge-base of your branch'
ERROR: Something has gone wrong and the previous image isn't available for the merge-base of your branch
+ echo '       contact the PyTorch team to restore the original images'
       contact the PyTorch team to restore the original images
+ exit 1


Exited with code exit status 1


4 failures not recognized by patterns:

| Job | Step |
| --- | --- |
| GitHub Actions Lint / flake8-py3 | Add annotations |
| GitHub Actions Lint / quick-checks | Shellcheck Jenkins scripts |
| CircleCI binary_linux_libtorch_3_7m_cpu_gcc5_4_cxx11-abi_shared-with-deps_test | Checkout pytorch/builder repo |
| CircleCI binary_linux_libtorch_3_7m_cpu_devtoolset7_shared-with-deps_test | Checkout pytorch/builder repo |

❄️ 1 failure tentatively classified as flaky, but reruns have not yet been triggered to confirm:

See CircleCI build pytorch_linux_xenial_py3_6_gcc5_4_test (1/1)

Step: "Run tests" ❄️

Mar 04 03:08:42 RuntimeError: Process 0 terminated or timed out after 100.05959582328796 seconds
Mar 04 03:08:42 ======================================================================
Mar 04 03:08:42 ERROR [100.107s]: test_multiple_backward (__main__.TensorPipeDistAutogradTestWithSpawn)
Mar 04 03:08:42 ----------------------------------------------------------------------
Mar 04 03:08:42 Traceback (most recent call last):
Mar 04 03:08:42   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 282, in wrapper
Mar 04 03:08:42     self._join_processes(fn)
Mar 04 03:08:42   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 399, in _join_processes
Mar 04 03:08:42     self._check_return_codes(elapsed_time)
Mar 04 03:08:42   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 440, in _check_return_codes
Mar 04 03:08:42     raise RuntimeError('Process {} terminated or timed out after {} seconds'.format(i, elapsed_time))
Mar 04 03:08:42 RuntimeError: Process 0 terminated or timed out after 100.05959582328796 seconds
Mar 04 03:08:42 
Mar 04 03:08:42 ----------------------------------------------------------------------
Mar 04 03:08:42 Ran 411 tests in 1289.906s
Mar 04 03:08:42 
Mar 04 03:08:42 FAILED (errors=1, skipped=66)
Mar 04 03:08:42 
Mar 04 03:08:42 Generating XML reports...
Mar 04 03:08:42 Generated XML report: test-reports/dist-gloo/TEST-TensorPipeDdpComparisonTestWithSpawn-20220304024712.xml
Mar 04 03:08:42 Generated XML report: test-reports/dist-gloo/TEST-TensorPipeDdpUnderDistAutogradTestWithSpawn-20220304024712.xml
Mar 04 03:08:42 Generated XML report: test-reports/dist-gloo/TEST-TensorPipeDistAutogradTestWithSpawn-20220304024712.xml

This comment was automatically generated by Dr. CI.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

facebook-github-bot added the "oncall: distributed" label on Feb 16, 2022
jambayk requested review from malfet and seemethere, and removed the request for mrshenli, pritamdamania87, rohan-varma and zhaojuanmao, on February 16, 2022 17:56
jambayk removed the "oncall: distributed" label on Feb 23, 2022
jaglinux and others added 2 commits March 3, 2022 15:15
Summary:
The test test_collect_shards fails on a single-GPU setup.
Enable the multi-GPU checker for it.

Signed-off-by: Jagadish Krishnamoorthy <jagdish.krishna@gmail.com>

Pull Request resolved: #53564

Reviewed By: H-Huang

Differential Revision: D26952325

Pulled By: rohan-varma

fbshipit-source-id: e8956f9277c7320024bece129767e83fbdf02b2c
Collaborator Author

jambayk commented Mar 3, 2022

This has been rebased onto the lts/release/1.8 branch (at commit 4a1a8b2).

malfet merged commit 133673e into pytorch:lts/release/1.8 on Mar 4, 2022
jambayk deleted the jambayk/lts-fix/multi-gpu branch on March 4, 2022 19:26

6 participants