ci: Move multigpu to periodic #79894
seemethere wants to merge 1 commit into gh/seemethere/253/base
Conversation
We have hard limitations on the number of linux.16xlarge.nvidia.gpu machines we can spin up, and the TTS for this specific job has increased 2x over the past 7 days. Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
malfet left a comment:
Do we have any signal on how frequently we had to revert due to that?
❌ 4 New Failures, 1 Base Failure, 2 Pending as of commit 5302e30 (more details on the Dr. CI page).
🕵️ 4 new failures recognized by patterns; the following CI failures do not appear to be due to upstream breakages.
@rohan-varma The code that increased the TTS is #77947. FYI, this change will move all multigpu tests to periodic (every 4 hours) instead of running on every commit. If this sounds like something the distributed team would prefer not to do, the other alternative is removing tests from the multigpu config (e.g. reverting the FSDP change).
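For context, a periodic cadence like the one described above is typically expressed as a cron trigger in a GitHub Actions workflow. The sketch below is illustrative only: the workflow name, job name, step names, and script path are hypothetical assumptions, not the actual pytorch/pytorch periodic workflow; the runner label matches the one discussed in this PR.

```yaml
# Hypothetical sketch of moving a job from per-commit to periodic.
# Not the real pytorch/pytorch config; names and paths are illustrative.
name: periodic

on:
  schedule:
    # Run every 4 hours instead of on every push/commit.
    - cron: "0 */4 * * *"

jobs:
  linux-multigpu-test:
    # Scarce runner type mentioned in this PR.
    runs-on: linux.16xlarge.nvidia.gpu
    steps:
      - uses: actions/checkout@v3
      - name: Run multigpu tests
        run: ./run_multigpu_tests.sh  # hypothetical test entry point
```

The trade-off is standard: periodic jobs stop consuming a scarce runner pool on every commit, at the cost of breakages on that job surfacing hours later, which is why reverting tests out of the multigpu config is mentioned as the alternative.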
@rohan-varma I remember you said you had an opinion on moving this?
Spoke with @janeyx99 offline. I think right now we have to go ahead with the move to periodic until we can decrease TTS to < 75 min, after which it can be moved to master-only.
@pytorchbot merge -f
@pytorchbot successfully started a merge job. Check the current status here.
@seemethere your PR has been successfully merged.
Hey @seemethere. |
Summary: We have hard limitations on the number of linux.16xlarge.nvidia.gpu machines we can spin up, and the TTS for this specific job has increased 2x over the past 7 days.
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Pull Request resolved: #79894
Approved by: https://github.com/malfet, https://github.com/janeyx99
Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/02d01707a6f40b7a189c3353010798766609df71
Reviewed By: atalman
Differential Revision: D37327474
Pulled By: seemethere
fbshipit-source-id: 525846ea9da645075f7778e7186753005f637f09