[ci] move rocm jobs from pull to trunk workflow #77989
Closed
suo wants to merge 1 commit into gh/suo/521/base from
This makes the rocm jobs run on master-only. We've been battling queue times for a few months now (#73039). So far we have tried or investigated:

1. Moving distributed builds to master
2. Moving distributed builds to periodic
3. Only running rocm on a specific set of paths
4. Running multiple jobs on a single rocm host

Unfortunately, we haven't been able to reduce queuing times to good levels. As a result, ROCm jobs are the "weightiest" job in PR CI, with an average TTS of 3.3h (see https://hud.pytorch.org/metrics, panel name "Job time-to-signal, all branches").

There are two things we haven't tried so far:

1. Running "smoke tests" only on PR
2. Switching rocm builds to master

Since #2 is easiest, let's give it a try. For now, the policy would be the same as what we do for other capacity-constrained configurations (Win and Mac): run on master only, but revert if there is a breakage introduced.

[skip ci]

[ghstack-poisoned]
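For readers unfamiliar with how a job ends up "master-only", here is a minimal sketch of the GitHub Actions mechanism, assuming the job is simply defined in a workflow that triggers on pushes to master rather than on `pull_request`. The job name, runner labels, and test command are hypothetical placeholders, not the contents of the actual PyTorch workflow files.

```yaml
# Hypothetical sketch of a "trunk"-style workflow.
# Job names, runner labels, and commands are placeholders for illustration only.
name: trunk

on:
  push:
    branches:
      - master          # runs only for commits that land on master, not for PRs

jobs:
  rocm-build-and-test:                # hypothetical job name
    runs-on: [self-hosted, rocm]      # hypothetical labels for a capacity-constrained runner pool
    steps:
      - uses: actions/checkout@v3
      - name: Run ROCm tests
        run: ./run_rocm_tests.sh      # placeholder command
```

Under this assumption, a workflow that triggers on `pull_request` would no longer define the job, so PRs stop waiting on the constrained runners and a breakage only shows up after the commit lands, which is why the policy pairs the move with reverts.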
Contributor
✅ No Failures (0 Pending) as of commit e9429ad (more details on the Dr. CI page). 💚 Looks good so far! There are no failures yet. 💚 This comment was automatically generated by Dr. CI.
suo added a commit that referenced this pull request May 20, 2022
ghstack-source-id: e5faeb3
Pull Request resolved: #77989
Member
Author
cc @jeffdaily
malfet approved these changes May 20, 2022
Contributor
malfet left a comment
LGTM, although we need to investigate what can be done to meet queueing expectations.
Member
Author
Heads up @jeffdaily @jithunnair-amd: I'm planning to merge this at the end of the day today, so if we have any other ideas about reducing queueing times that should prompt us to reconsider this change, please raise them before then. Thanks!
janeyx99 approved these changes May 20, 2022
Member
Author
@pytorchbot merge -f
Member
Author
alright let's give it a shot
facebook-github-bot pushed a commit that referenced this pull request May 24, 2022
Pull Request resolved: #77989
Approved by: https://github.com/malfet, https://github.com/janeyx99
Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/1d845253d82c16a79c0737087f524a0896985a4c
Reviewed By: seemethere
Differential Revision: D36603002
Pulled By: seemethere
fbshipit-source-id: fac619553e6d7819e1a58154570edf69f79bbcef
swang392 pushed a commit that referenced this pull request May 25, 2022
Pull Request resolved: #77989
Approved by: https://github.com/malfet, https://github.com/janeyx99