Mark job as unstable dynamically#102426

Closed
huydhn wants to merge 9 commits into pytorch:main from huydhn:add-filter-support-unstable-mode

@huydhn huydhn commented May 27, 2023

Allow CI jobs to be marked as unstable dynamically. This uses the same mechanism as disabling a job, but with a different issue title: UNSTABLE JOB_NAME.

The action will output an is-unstable flag to let CI know whether the job it is currently running is unstable. This is similar to the way the keep-going flag is exposed. Once this is merged, I will follow up with another PR to actually use the is-unstable flag in CI.
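
To make the mechanism above concrete, here is a minimal sketch, not the code from this PR. It assumes the check is an exact, case-insensitive match between an open issue title and `UNSTABLE JOB_NAME`; the function names `is_job_unstable` and `format_output` and the sample job name are hypothetical.

```python
from typing import Iterable

# Title prefix taken from the PR description; the matching rule below is an assumption.
UNSTABLE_PREFIX = "UNSTABLE"


def is_job_unstable(job_name: str, open_issue_titles: Iterable[str]) -> bool:
    """Return True if some open issue is titled 'UNSTABLE <job_name>'."""
    target = f"{UNSTABLE_PREFIX} {job_name}".strip().lower()
    return any(title.strip().lower() == target for title in open_issue_titles)


def format_output(job_name: str, open_issue_titles: Iterable[str]) -> str:
    """Build a key=value line of the kind a workflow step could emit as an output."""
    return f"is-unstable={is_job_unstable(job_name, open_issue_titles)}"


if __name__ == "__main__":
    # Hypothetical issue titles and job name, for illustration only.
    titles = ["UNSTABLE linux-build / test", "DISABLED some_test"]
    print(format_output("linux-build / test", titles))
```

A later workflow step could then read the flag and decide how to treat a failure of that job, in the same way the keep-going flag is consumed.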

Testing

@huydhn huydhn added the keep-going Don't stop on first failure, keep running tests until the end label May 27, 2023
pytorch-bot bot commented May 27, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/102426

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures

As of commit 9181bd5:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@huydhn huydhn requested a review from clee2000 May 29, 2023 18:41
@huydhn huydhn marked this pull request as ready for review May 29, 2023 19:07
@huydhn huydhn requested a review from a team as a code owner May 29, 2023 19:07
@huydhn huydhn requested a review from ZainRizvi May 30, 2023 19:29
huydhn commented May 31, 2023

Per the discussion with @clee2000, there will be another PR to update trymerge to handle unstable failures:

  • We still want unstable jobs to fail to gather their signals
  • Unstable failure shouldn't block merge

An unstable test job will have unstable in its name. A build job will be treated as unstable if all the test jobs depending on it are unstable. This is similar to the way build and test jobs are copied over to the unstable workflow: if a test job is unstable, we copy both the job and its build job.
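
The build-job rule above can be sketched as follows. This is an illustration under assumptions, not the trymerge implementation: the mapping from test jobs to their build jobs, the function name `unstable_build_jobs`, and the job names are all hypothetical.

```python
from typing import Dict, List, Set


def unstable_build_jobs(test_to_build: Dict[str, str]) -> Set[str]:
    """Return build jobs whose dependent test jobs are all unstable.

    A test job counts as unstable when 'unstable' appears in its name.
    """
    tests_by_build: Dict[str, List[str]] = {}
    for test_name, build_name in test_to_build.items():
        tests_by_build.setdefault(build_name, []).append(test_name)

    return {
        build_name
        for build_name, tests in tests_by_build.items()
        if all("unstable" in test_name for test_name in tests)
    }


if __name__ == "__main__":
    # Hypothetical job names: build-1 has only unstable tests, build-2 is mixed.
    jobs = {
        "test-a (unstable)": "build-1",
        "test-b (unstable)": "build-1",
        "test-c (unstable)": "build-2",
        "test-d": "build-2",
    }
    print(unstable_build_jobs(jobs))
```

A build job with even one stable dependent test job stays stable, so its failure would still block the merge.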

huydhn commented Jun 1, 2023

@pytorchbot merge

@pytorchmergebot

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

huydhn added a commit to pytorch/test-infra that referenced this pull request Jun 2, 2023
After pytorch/pytorch#102426 and
pytorch/pytorch#102784 landed, unstable jobs are
now hidden correctly on HUD https://hud.pytorch.org and also won't block
PR. Previously, this was done by moving unstable jobs to an unstable
workflow. Now unstable jobs will stay in the same workflow, but have
`unstable` in their names.

This is very similar to how `rerun_disabled_tests` are ignored atm.

### Testing


https://torchci-git-fork-huydhn-ignore-unstable-jobs-fbopensource.vercel.app/metrics
pytorchmergebot pushed a commit that referenced this pull request Jun 4, 2023
Per title, after #102426 landed, it makes sense to have a new category for UNSTABLE jobs and handle them accordingly in trymerge.

* The simple approach is to check for `unstable` in the check (job) name. I plan to roll this out first and then see whether we need to cover the more complicated, but less common, case of an unstable build job. Specifically, an unstable build job has no `unstable` in its name.
* An unstable job is ignored by trymerge. This is the same behavior we have atm when a job is moved to the unstable workflow: it's completely ignored.
* The update to Dr. CI will come later, so that unstable failures are also hidden like broken trunk or flaky failures.
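
The simple name-based approach in the list above might look like the sketch below. It is only an illustration: the function name `split_failures` and the check names are hypothetical, and the substring test is assumed to be case-sensitive. As noted above, it does not cover an unstable build job, whose name has no `unstable` in it.

```python
from typing import List, Tuple


def split_failures(failed_checks: List[str]) -> Tuple[List[str], List[str]]:
    """Split failed check names into (blocking, ignored).

    A failed check is ignored when 'unstable' appears in its name;
    every other failure still blocks the merge.
    """
    blocking = [name for name in failed_checks if "unstable" not in name]
    ignored = [name for name in failed_checks if "unstable" in name]
    return blocking, ignored


if __name__ == "__main__":
    # Hypothetical check names, for illustration only.
    failures = [
        "pull / linux / test (default, unstable)",
        "pull / win / build",
    ]
    blocking, ignored = split_failures(failures)
    print("blocking:", blocking)
    print("ignored:", ignored)
```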

### Testing

Leverage the broken trunk Windows CPU job atm and mark Windows CPU jobs as unstable #102297
Pull Request resolved: #102784
Approved by: https://github.com/clee2000
PaliC pushed a commit to pytorch/test-infra that referenced this pull request Jun 5, 2023
@huydhn huydhn deleted the add-filter-support-unstable-mode branch May 28, 2025 01:56

Labels

  • ciflow/trunk: Trigger trunk jobs on your pull request
  • keep-going: Don't stop on first failure, keep running tests until the end
  • Merged
  • test-config/backwards_compat
  • test-config/default
  • test-config/dynamo: Use this label to run only dynamo tests
  • topic: not user facing (topic category)
