
ci: Move multigpu to periodic #79894

Closed
seemethere wants to merge 1 commit into gh/seemethere/253/base from gh/seemethere/253/head

Conversation

Member

seemethere commented Jun 20, 2022

Stack from ghstack:

We have hard limitations on the number of linux.16xlarge.nvidia.gpu
machines we can spin up, and the TTS (time-to-signal) for this specific
job has increased 2x over the past 7 days.

[image]

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

seemethere requested a review from a team as a code owner June 20, 2022 20:42
seemethere added a commit that referenced this pull request Jun 20, 2022
ghstack-source-id: 3c49505
Pull Request resolved: #79894
Contributor

malfet left a comment


Do we have any signal on how frequently we had to revert due to that?

@facebook-github-bot
Contributor

facebook-github-bot commented Jun 20, 2022


❌ 4 New Failures, 1 Base Failure, 2 Pending

As of commit 5302e30 (more details on the Dr. CI page):

  • 4/5 failures introduced in this PR
  • 1/5 broken upstream at merge base f3665dd on Jun 20 from 12:50pm to 4:47pm
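The 4/5 vs 1/5 split above comes from comparing the PR's failures against the merge base. A minimal sketch of that classification rule (hypothetical code, not Dr. CI's actual implementation: a job that also fails at the merge base is counted as broken upstream, everything else as introduced by the PR):

```python
# Hypothetical sketch, not Dr. CI's real code: classify a PR's failing
# jobs by whether the same job already fails at the merge base.
def classify_failures(pr_failures, base_failures):
    """Return (introduced, upstream) lists of failing job names."""
    upstream = sorted(set(pr_failures) & set(base_failures))
    introduced = sorted(set(pr_failures) - set(base_failures))
    return introduced, upstream

# Example mirroring the 4/5 PR-introduced vs 1/5 upstream split above
# (job names are placeholders, not the real job names):
introduced, upstream = classify_failures(
    {"job_a", "job_b", "job_c", "job_d", "job_e"}, {"job_e"}
)
```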

🕵️ 4 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages

See GitHub Actions build pull / linux-bionic-py3.7-clang9 / test (crossref, 1, 2, linux.2xlarge) (1/4)

Step: "Test"

2022-06-20T21:54:41.3144534Z Generated XML report: test-reports/python-unittest/test_ops/TEST-TestCompositeComplianceCPU-20220620212409.xml
2022-06-20T21:54:41.3587383Z Generated XML report: test-reports/python-unittest/test_ops/TEST-TestFakeTensorNonErroringCPU-20220620212409.xml
2022-06-20T21:54:41.4714564Z Generated XML report: test-reports/python-unittest/test_ops/TEST-TestMathBitsCPU-20220620212409.xml
2022-06-20T21:54:41.4834344Z Generated XML report: test-reports/python-unittest/test_ops/TEST-TestRefsOpsInfoCPU-20220620212409.xml
2022-06-20T21:54:41.5376523Z Generated XML report: test-reports/python-unittest/test_ops/TEST-TestTagsCPU-20220620212409.xml
2022-06-20T21:54:42.1955935Z Traceback (most recent call last):
2022-06-20T21:54:42.1956614Z   File "test/run_test.py", line 946, in <module>
2022-06-20T21:54:42.1959253Z     main()
2022-06-20T21:54:42.1959638Z   File "test/run_test.py", line 924, in main
2022-06-20T21:54:42.1961778Z     raise RuntimeError(err_message)
2022-06-20T21:54:42.1962397Z RuntimeError: test_ops failed!
2022-06-20T21:54:42.4327156Z 
2022-06-20T21:54:42.4327533Z real	30m38.236s
2022-06-20T21:54:42.4327881Z user	82m3.897s
2022-06-20T21:54:42.4328200Z sys	3m41.939s
2022-06-20T21:54:42.4328475Z + cleanup
2022-06-20T21:54:42.4328634Z + retcode=1
2022-06-20T21:54:42.4328793Z + set +x
2022-06-20T21:54:42.4371435Z ##[error]Process completed with exit code 1.
2022-06-20T21:54:42.4427699Z Prepare all required actions
2022-06-20T21:54:42.4428011Z Getting action download info

See GitHub Actions build pull / linux-bionic-py3.7-clang9 / test (default, 1, 2, linux.2xlarge) (2/4)

Step: "Test"

2022-06-20T21:44:20.7645478Z Generated XML report: test-reports/python-unittest/test_ops/TEST-TestCompositeComplianceCPU-20220620212402.xml
2022-06-20T21:44:21.0180586Z Generated XML report: test-reports/python-unittest/test_ops/TEST-TestFakeTensorNonErroringCPU-20220620212402.xml
2022-06-20T21:44:21.1267430Z Generated XML report: test-reports/python-unittest/test_ops/TEST-TestMathBitsCPU-20220620212402.xml
2022-06-20T21:44:21.1385193Z Generated XML report: test-reports/python-unittest/test_ops/TEST-TestRefsOpsInfoCPU-20220620212402.xml
2022-06-20T21:44:21.1903983Z Generated XML report: test-reports/python-unittest/test_ops/TEST-TestTagsCPU-20220620212402.xml
2022-06-20T21:44:21.7036645Z Traceback (most recent call last):
2022-06-20T21:44:21.7037131Z   File "test/run_test.py", line 946, in <module>
2022-06-20T21:44:21.7038754Z     main()
2022-06-20T21:44:21.7038981Z   File "test/run_test.py", line 924, in main
2022-06-20T21:44:21.7041198Z     raise RuntimeError(err_message)
2022-06-20T21:44:21.7041653Z RuntimeError: test_ops failed!
2022-06-20T21:44:21.9340096Z 
2022-06-20T21:44:21.9340320Z real	20m24.539s
2022-06-20T21:44:21.9340571Z user	52m37.056s
2022-06-20T21:44:21.9340840Z sys	2m56.206s
2022-06-20T21:44:21.9341077Z + cleanup
2022-06-20T21:44:21.9342064Z + retcode=1
2022-06-20T21:44:21.9342319Z + set +x
2022-06-20T21:44:21.9384841Z ##[error]Process completed with exit code 1.
2022-06-20T21:44:21.9438628Z Prepare all required actions
2022-06-20T21:44:21.9438929Z Getting action download info

See GitHub Actions build pull / linux-focal-py3.7-gcc7 / test (default, 1, 2, linux.2xlarge) (3/4)

Step: "Test"

2022-06-20T21:46:46.3198596Z Generated XML report: test-reports/python-unittest/test_ops/TEST-TestCompositeComplianceCPU-20220620212543.xml
2022-06-20T21:46:46.5996713Z Generated XML report: test-reports/python-unittest/test_ops/TEST-TestFakeTensorNonErroringCPU-20220620212543.xml
2022-06-20T21:46:46.7086929Z Generated XML report: test-reports/python-unittest/test_ops/TEST-TestMathBitsCPU-20220620212543.xml
2022-06-20T21:46:46.7205833Z Generated XML report: test-reports/python-unittest/test_ops/TEST-TestRefsOpsInfoCPU-20220620212543.xml
2022-06-20T21:46:46.7724888Z Generated XML report: test-reports/python-unittest/test_ops/TEST-TestTagsCPU-20220620212543.xml
2022-06-20T21:46:47.4358058Z Traceback (most recent call last):
2022-06-20T21:46:47.4358494Z   File "test/run_test.py", line 946, in <module>
2022-06-20T21:46:47.4359446Z     main()
2022-06-20T21:46:47.4359757Z   File "test/run_test.py", line 924, in main
2022-06-20T21:46:47.4361176Z     raise RuntimeError(err_message)
2022-06-20T21:46:47.4361581Z RuntimeError: test_ops failed!
2022-06-20T21:46:47.7485236Z 
2022-06-20T21:46:47.7485555Z real	21m9.211s
2022-06-20T21:46:47.7485919Z user	39m3.753s
2022-06-20T21:46:47.7486232Z sys	1m28.142s
2022-06-20T21:46:47.7486481Z + cleanup
2022-06-20T21:46:47.7486644Z + retcode=1
2022-06-20T21:46:47.7487354Z + set +x
2022-06-20T21:46:47.7524885Z ##[error]Process completed with exit code 1.
2022-06-20T21:46:47.7560141Z Prepare all required actions
2022-06-20T21:46:47.7560450Z Getting action download info

See GitHub Actions build pull / linux-xenial-cuda11.3-py3.7-gcc7 / test (default, 1, 4, linux.4xlarge.nvidia.gpu) (4/4)

Step: "Test"

2022-06-20T23:29:29.5928457Z Generated XML report: test-reports/python-unittest/test_ops/TEST-TestCommonCUDA-20220620214015.xml
2022-06-20T23:29:29.7682039Z Generated XML report: test-reports/python-unittest/test_ops/TEST-TestCompositeComplianceCUDA-20220620214015.xml
2022-06-20T23:29:29.9315853Z Generated XML report: test-reports/python-unittest/test_ops/TEST-TestMathBitsCUDA-20220620214015.xml
2022-06-20T23:29:30.0103022Z Generated XML report: test-reports/python-unittest/test_ops/TEST-TestFakeTensorNonErroringCUDA-20220620214015.xml
2022-06-20T23:29:30.1097214Z Generated XML report: test-reports/python-unittest/test_ops/TEST-TestTagsCUDA-20220620214015.xml
2022-06-20T23:29:31.3757303Z Traceback (most recent call last):
2022-06-20T23:29:31.3757690Z   File "test/run_test.py", line 946, in <module>
2022-06-20T23:29:31.3761019Z     main()
2022-06-20T23:29:31.3761378Z   File "test/run_test.py", line 924, in main
2022-06-20T23:29:31.3763783Z     raise RuntimeError(err_message)
2022-06-20T23:29:31.3764111Z RuntimeError: test_ops failed!
2022-06-20T23:29:32.0854816Z + cleanup
2022-06-20T23:29:32.0855087Z + retcode=1
2022-06-20T23:29:32.0855333Z + set +x
2022-06-20T23:29:32.0905081Z ##[error]Process completed with exit code 1.
2022-06-20T23:29:32.0945807Z Prepare all required actions
2022-06-20T23:29:32.0946232Z Getting action download info
2022-06-20T23:29:32.3059392Z ##[group]Run ./.github/actions/get-workflow-job-id
2022-06-20T23:29:32.3059694Z with:
2022-06-20T23:29:32.3060568Z   github-token: ***
2022-06-20T23:29:32.3060842Z env:
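For readers unfamiliar with these logs: the repeated `RuntimeError: test_ops failed!` is raised by the test driver after all suites have run, so a single failing suite turns into a nonzero exit code for the whole CI step. A minimal sketch of that pattern (hypothetical, not PyTorch's actual test/run_test.py; `run_suite` stands in for the real per-suite runner):

```python
# Hypothetical sketch of a test driver that aggregates per-suite
# results and raises at the end, as seen in the tracebacks above.
def main(suites, run_suite):
    # run_suite(name) returns the suite's exit code (0 == success)
    failed = [name for name in suites if run_suite(name) != 0]
    if failed:
        # Produces a message like "test_ops failed!" and makes the
        # CI step exit nonzero via the uncaught exception.
        raise RuntimeError(f"{', '.join(failed)} failed!")
```

The `+ cleanup`, `+ retcode=1`, `+ set +x` lines that follow each traceback are shell tracing from the wrapper script around the Python driver, not part of the Python failure itself.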

🚧 1 fixed upstream failure:

These were probably caused by upstream breakages that were already fixed.

Please rebase on the viable/strict branch.

If your commit is older than viable/strict, run these commands:

git fetch https://github.com/pytorch/pytorch viable/strict
git rebase FETCH_HEAD

This comment was automatically generated by Dr. CI.

Please report bugs/suggestions to the (internal) Dr. CI Users group.


@janeyx99
Contributor

@rohan-varma The code that increased the TTS is #77947. FYI this change will move all multigpu tests to periodic (every 4 hours) instead of every commit. If this sounds like something the distributed team would prefer not to do, the other alternative is removing tests from the multigpu config (like reverting the fsdp change).
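For context, moving a job to "periodic" means replacing its per-commit triggers with a cron schedule. A minimal sketch of what that looks like in a GitHub Actions workflow (illustrative only: the cron offset, job name, and script path are hypothetical, not the actual pytorch/pytorch configuration):

```yaml
# Illustrative sketch, not the real .github/workflows file.
name: periodic
on:
  schedule:
    # Runs every 4 hours instead of on every commit.
    - cron: "45 0,4,8,12,16,20 * * *"
jobs:
  multigpu-test:
    runs-on: linux.16xlarge.nvidia.gpu   # self-hosted runner label
    steps:
      - uses: actions/checkout@v3
      - name: Run multigpu tests
        run: .ci/run_multigpu_tests.sh   # hypothetical script path
```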

@seemethere
Member Author

@rohan-varma I remember you said you had an opinion on moving this?

@rohan-varma
Contributor

@seemethere @janeyx99

Spoke with @janeyx99 offline. I think that for now we have to go ahead with the move to periodic until we can decrease TTS to < 75 min, after which it can be moved to master-only.

@seemethere
Member Author

@pytorchbot merge -f

@pytorchmergebot
Collaborator

@pytorchbot successfully started a merge job. Check the current status here

@pytorchmergebot
Collaborator

@seemethere your PR has been successfully merged.

@github-actions
Contributor

Hey @seemethere.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

facebook-github-bot pushed a commit that referenced this pull request Jun 22, 2022
Summary:
We have hard limitations on the number of linux.16xlarge.nvidia.gpu
machines we can spin up. Considering that the TTS for this specific job
has increased 2x over the past 7 days.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Pull Request resolved: #79894

Approved by: https://github.com/malfet, https://github.com/janeyx99

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/02d01707a6f40b7a189c3353010798766609df71

Reviewed By: atalman

Differential Revision: D37327474

Pulled By: seemethere

fbshipit-source-id: 525846ea9da645075f7778e7186753005f637f09
facebook-github-bot deleted the gh/seemethere/253/head branch June 25, 2022 14:16
miladm pushed a commit to miladm/pytorch that referenced this pull request Jun 27, 2022
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 25, 2026