
ci: Move multigpu to periodic #79894

Closed
seemethere wants to merge 1 commit into gh/seemethere/253/base from gh/seemethere/253/head

Conversation

Member

seemethere commented Jun 20, 2022

Stack from ghstack:

We have hard limitations on the number of linux.16xlarge.nvidia.gpu
machines we can spin up, and the TTS (time-to-signal) for this specific
job has increased 2x over the past 7 days.

[image]

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

seemethere requested a review from a team as a code owner June 20, 2022 20:42
seemethere added a commit that referenced this pull request Jun 20, 2022
ghstack-source-id: 3c49505
Pull Request resolved: #79894
Contributor

malfet left a comment


Do we have any signal on how frequently we had to revert due to that?

@facebook-github-bot
Contributor

facebook-github-bot commented Jun 20, 2022


❌ 4 New Failures, 1 Base Failure, 2 Pending

As of commit 5302e30 (more details on the Dr. CI page):

  • 4/5 failures introduced in this PR
  • 1/5 broken upstream at merge base f3665dd on Jun 20 from 12:50pm to 4:47pm
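The 4/5 vs 1/5 split above comes from comparing the PR's failures against the merge base. A minimal sketch of that classification rule (hypothetical code, not Dr. CI's actual implementation: a job that also fails at the merge base is counted as broken upstream, everything else as introduced by the PR):

```python
# Hypothetical sketch, not Dr. CI's real code: classify a PR's failing
# jobs by whether the same job already fails at the merge base.
def classify_failures(pr_failures, base_failures):
    """Return (introduced, upstream) lists of failing job names."""
    upstream = sorted(set(pr_failures) & set(base_failures))
    introduced = sorted(set(pr_failures) - set(base_failures))
    return introduced, upstream

# Example mirroring the 4/5 PR-introduced vs 1/5 upstream split above
# (job names are placeholders, not the real job names):
introduced, upstream = classify_failures(
    {"job_a", "job_b", "job_c", "job_d", "job_e"}, {"job_e"}
)
```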

🕵️ 4 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages

See GitHub Actions build pull / linux-bionic-py3.7-clang9 / test (crossref, 1, 2, linux.2xlarge) (1/4)

Step: "Test"

2022-06-20T21:54:41.3144534Z Generated XML report: test-reports/python-unittest/test_ops/TEST-TestCompositeComplianceCPU-20220620212409.xml
2022-06-20T21:54:41.3587383Z Generated XML report: test-reports/python-unittest/test_ops/TEST-TestFakeTensorNonErroringCPU-20220620212409.xml
2022-06-20T21:54:41.4714564Z Generated XML report: test-reports/python-unittest/test_ops/TEST-TestMathBitsCPU-20220620212409.xml
2022-06-20T21:54:41.4834344Z Generated XML report: test-reports/python-unittest/test_ops/TEST-TestRefsOpsInfoCPU-20220620212409.xml
2022-06-20T21:54:41.5376523Z Generated XML report: test-reports/python-unittest/test_ops/TEST-TestTagsCPU-20220620212409.xml
2022-06-20T21:54:42.1955935Z Traceback (most recent call last):
2022-06-20T21:54:42.1956614Z   File "test/run_test.py", line 946, in <module>
2022-06-20T21:54:42.1959253Z     main()
2022-06-20T21:54:42.1959638Z   File "test/run_test.py", line 924, in main
2022-06-20T21:54:42.1961778Z     raise RuntimeError(err_message)
2022-06-20T21:54:42.1962397Z RuntimeError: test_ops failed!
2022-06-20T21:54:42.4327156Z 
2022-06-20T21:54:42.4327533Z real	30m38.236s
2022-06-20T21:54:42.4327881Z user	82m3.897s
2022-06-20T21:54:42.4328200Z sys	3m41.939s
2022-06-20T21:54:42.4328475Z + cleanup
2022-06-20T21:54:42.4328634Z + retcode=1
2022-06-20T21:54:42.4328793Z + set +x
2022-06-20T21:54:42.4371435Z ##[error]Process completed with exit code 1.
2022-06-20T21:54:42.4427699Z Prepare all required actions
2022-06-20T21:54:42.4428011Z Getting action download info

See GitHub Actions build pull / linux-bionic-py3.7-clang9 / test (default, 1, 2, linux.2xlarge) (2/4)

Step: "Test"

2022-06-20T21:44:20.7645478Z Generated XML report: test-reports/python-unittest/test_ops/TEST-TestCompositeComplianceCPU-20220620212402.xml
2022-06-20T21:44:21.0180586Z Generated XML report: test-reports/python-unittest/test_ops/TEST-TestFakeTensorNonErroringCPU-20220620212402.xml
2022-06-20T21:44:21.1267430Z Generated XML report: test-reports/python-unittest/test_ops/TEST-TestMathBitsCPU-20220620212402.xml
2022-06-20T21:44:21.1385193Z Generated XML report: test-reports/python-unittest/test_ops/TEST-TestRefsOpsInfoCPU-20220620212402.xml
2022-06-20T21:44:21.1903983Z Generated XML report: test-reports/python-unittest/test_ops/TEST-TestTagsCPU-20220620212402.xml
2022-06-20T21:44:21.7036645Z Traceback (most recent call last):
2022-06-20T21:44:21.7037131Z   File "test/run_test.py", line 946, in <module>
2022-06-20T21:44:21.7038754Z     main()
2022-06-20T21:44:21.7038981Z   File "test/run_test.py", line 924, in main
2022-06-20T21:44:21.7041198Z     raise RuntimeError(err_message)
2022-06-20T21:44:21.7041653Z RuntimeError: test_ops failed!
2022-06-20T21:44:21.9340096Z 
2022-06-20T21:44:21.9340320Z real	20m24.539s
2022-06-20T21:44:21.9340571Z user	52m37.056s
2022-06-20T21:44:21.9340840Z sys	2m56.206s
2022-06-20T21:44:21.9341077Z + cleanup
2022-06-20T21:44:21.9342064Z + retcode=1
2022-06-20T21:44:21.9342319Z + set +x
2022-06-20T21:44:21.9384841Z ##[error]Process completed with exit code 1.
2022-06-20T21:44:21.9438628Z Prepare all required actions
2022-06-20T21:44:21.9438929Z Getting action download info

See GitHub Actions build pull / linux-focal-py3.7-gcc7 / test (default, 1, 2, linux.2xlarge) (3/4)

Step: "Test"

2022-06-20T21:46:46.3198596Z Generated XML report: test-reports/python-unittest/test_ops/TEST-TestCompositeComplianceCPU-20220620212543.xml
2022-06-20T21:46:46.5996713Z Generated XML report: test-reports/python-unittest/test_ops/TEST-TestFakeTensorNonErroringCPU-20220620212543.xml
2022-06-20T21:46:46.7086929Z Generated XML report: test-reports/python-unittest/test_ops/TEST-TestMathBitsCPU-20220620212543.xml
2022-06-20T21:46:46.7205833Z Generated XML report: test-reports/python-unittest/test_ops/TEST-TestRefsOpsInfoCPU-20220620212543.xml
2022-06-20T21:46:46.7724888Z Generated XML report: test-reports/python-unittest/test_ops/TEST-TestTagsCPU-20220620212543.xml
2022-06-20T21:46:47.4358058Z Traceback (most recent call last):
2022-06-20T21:46:47.4358494Z   File "test/run_test.py", line 946, in <module>
2022-06-20T21:46:47.4359446Z     main()
2022-06-20T21:46:47.4359757Z   File "test/run_test.py", line 924, in main
2022-06-20T21:46:47.4361176Z     raise RuntimeError(err_message)
2022-06-20T21:46:47.4361581Z RuntimeError: test_ops failed!
2022-06-20T21:46:47.7485236Z 
2022-06-20T21:46:47.7485555Z real	21m9.211s
2022-06-20T21:46:47.7485919Z user	39m3.753s
2022-06-20T21:46:47.7486232Z sys	1m28.142s
2022-06-20T21:46:47.7486481Z + cleanup
2022-06-20T21:46:47.7486644Z + retcode=1
2022-06-20T21:46:47.7487354Z + set +x
2022-06-20T21:46:47.7524885Z ##[error]Process completed with exit code 1.
2022-06-20T21:46:47.7560141Z Prepare all required actions
2022-06-20T21:46:47.7560450Z Getting action download info

See GitHub Actions build pull / linux-xenial-cuda11.3-py3.7-gcc7 / test (default, 1, 4, linux.4xlarge.nvidia.gpu) (4/4)

Step: "Test"

2022-06-20T23:29:29.5928457Z Generated XML report: test-reports/python-unittest/test_ops/TEST-TestCommonCUDA-20220620214015.xml
2022-06-20T23:29:29.7682039Z Generated XML report: test-reports/python-unittest/test_ops/TEST-TestCompositeComplianceCUDA-20220620214015.xml
2022-06-20T23:29:29.9315853Z Generated XML report: test-reports/python-unittest/test_ops/TEST-TestMathBitsCUDA-20220620214015.xml
2022-06-20T23:29:30.0103022Z Generated XML report: test-reports/python-unittest/test_ops/TEST-TestFakeTensorNonErroringCUDA-20220620214015.xml
2022-06-20T23:29:30.1097214Z Generated XML report: test-reports/python-unittest/test_ops/TEST-TestTagsCUDA-20220620214015.xml
2022-06-20T23:29:31.3757303Z Traceback (most recent call last):
2022-06-20T23:29:31.3757690Z   File "test/run_test.py", line 946, in <module>
2022-06-20T23:29:31.3761019Z     main()
2022-06-20T23:29:31.3761378Z   File "test/run_test.py", line 924, in main
2022-06-20T23:29:31.3763783Z     raise RuntimeError(err_message)
2022-06-20T23:29:31.3764111Z RuntimeError: test_ops failed!
2022-06-20T23:29:32.0854816Z + cleanup
2022-06-20T23:29:32.0855087Z + retcode=1
2022-06-20T23:29:32.0855333Z + set +x
2022-06-20T23:29:32.0905081Z ##[error]Process completed with exit code 1.
2022-06-20T23:29:32.0945807Z Prepare all required actions
2022-06-20T23:29:32.0946232Z Getting action download info
2022-06-20T23:29:32.3059392Z ##[group]Run ./.github/actions/get-workflow-job-id
2022-06-20T23:29:32.3059694Z with:
2022-06-20T23:29:32.3060568Z   github-token: ***
2022-06-20T23:29:32.3060842Z env:
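For readers unfamiliar with these logs: the repeated `RuntimeError: test_ops failed!` is raised by the test driver after all suites have run, so a single failing suite turns into a nonzero exit code for the whole CI step. A minimal sketch of that pattern (hypothetical, not PyTorch's actual test/run_test.py; `run_suite` stands in for the real per-suite runner):

```python
# Hypothetical sketch of a test driver that aggregates per-suite
# results and raises at the end, as seen in the tracebacks above.
def main(suites, run_suite):
    # run_suite(name) returns the suite's exit code (0 == success)
    failed = [name for name in suites if run_suite(name) != 0]
    if failed:
        # Produces a message like "test_ops failed!" and makes the
        # CI step exit nonzero via the uncaught exception.
        raise RuntimeError(f"{', '.join(failed)} failed!")
```

The `+ cleanup`, `+ retcode=1`, `+ set +x` lines that follow each traceback are shell tracing from the wrapper script around the Python driver, not part of the Python failure itself.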

🚧 1 fixed upstream failure:

These were probably caused by upstream breakages that were already fixed.

Please rebase on the viable/strict branch.

If your commit is older than viable/strict, run these commands:

git fetch https://github.com/pytorch/pytorch viable/strict
git rebase FETCH_HEAD

This comment was automatically generated by Dr. CI.

Please report bugs/suggestions to the (internal) Dr. CI Users group.


@janeyx99
Contributor

@rohan-varma The code that increased the TTS is #77947. FYI this change will move all multigpu tests to periodic (every 4 hours) instead of every commit. If this sounds like something the distributed team would prefer not to do, the other alternative is removing tests from the multigpu config (like reverting the fsdp change).
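For context, moving a job to "periodic" means replacing its per-commit triggers with a cron schedule. A minimal sketch of what that looks like in a GitHub Actions workflow (illustrative only: the cron offset, job name, and script path are hypothetical, not the actual pytorch/pytorch configuration):

```yaml
# Illustrative sketch, not the real .github/workflows file.
name: periodic
on:
  schedule:
    # Runs every 4 hours instead of on every commit.
    - cron: "45 0,4,8,12,16,20 * * *"
jobs:
  multigpu-test:
    runs-on: linux.16xlarge.nvidia.gpu   # self-hosted runner label
    steps:
      - uses: actions/checkout@v3
      - name: Run multigpu tests
        run: .ci/run_multigpu_tests.sh   # hypothetical script path
```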

@seemethere
Member Author

@rohan-varma I remember you said you had an opinion on moving this?

@rohan-varma
Contributor

@seemethere @janeyx99

Spoke with @janeyx99 offline. I think that for now we have to go ahead with the move to periodic until we can decrease TTS to < 75 min, after which it can be moved to master-only.

@seemethere
Member Author

@pytorchbot merge -f

@pytorchmergebot
Collaborator

@pytorchbot successfully started a merge job. Check the current status here

@pytorchmergebot
Collaborator

@seemethere your PR has been successfully merged.

@github-actions
Contributor

Hey @seemethere.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

facebook-github-bot pushed a commit that referenced this pull request Jun 22, 2022
Summary:
We have hard limitations on the number of linux.16xlarge.nvidia.gpu
machines we can spin up. Considering that the TTS for this specific job
has increased 2x over the past 7 days.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Pull Request resolved: #79894

Approved by: https://github.com/malfet, https://github.com/janeyx99

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/02d01707a6f40b7a189c3353010798766609df71

Reviewed By: atalman

Differential Revision: D37327474

Pulled By: seemethere

fbshipit-source-id: 525846ea9da645075f7778e7186753005f637f09
facebook-github-bot deleted the gh/seemethere/253/head branch June 25, 2022 14:16
miladm pushed a commit to miladm/pytorch that referenced this pull request Jun 27, 2022
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 25, 2026