[core] ignore jobs deleted by the platform in the long_running_many_jobs.py test #61690
rueian wants to merge 1 commit into ray-project:master
Code Review
This pull request modifies a long-running test to be more resilient against jobs being deleted by the platform, which could cause 'not found' errors and test failures. The changes introduce helper functions to gracefully handle these errors by treating a deleted job as a success or by ignoring the error when fetching logs/info. The overall approach is sound and directly addresses the problem described. I have one suggestion to reduce code duplication.
…obs.py test Signed-off-by: Rueian Huang <rueiancsie@gmail.com>
64e4e64 to f0b3d3f
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f0b3d3f42b
MengjinYan left a comment
From what I see here: https://github.com/anyscale/product/blob/master/go/infra/activityprobe/ray/job_utils.go#L55-L65, the job will be deleted regardless of whether it succeeded or failed. If that's the case, I think the test might miss some failure cases.
I think a better idea, if possible, is to configure the test environment to not clean up jobs, or to clean up only successful jobs, instead of changing the test to ignore deleted jobs.
What do you think?
Sounds good. Let me ask the team if that is possible internally.
Yep, they said they have raised the cleanup threshold to 100k.
Description
One of the reasons that the long_running_many_jobs test is flaky is that there is an external process (Go-http-client/1.1) on the platform, which we have no control over, that kills jobs and causes the test script to fail.
This PR fixes it by ignoring the HTTP 404 error.
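For illustration only, the idea could look roughly like the sketch below. It is not the actual diff: the helper names (is_not_found, ignore_not_found, job_finished) and the "DELETED" placeholder status are hypothetical, and it assumes the job client raises a RuntimeError whose message contains the HTTP status code.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def is_not_found(err: Exception) -> bool:
    """Heuristic check. Assumes the HTTP status appears in the error text;
    a plain substring match like this could misfire on unrelated messages."""
    msg = str(err).lower()
    return "404" in msg or "not found" in msg


def ignore_not_found(
    fetch: Callable[[], T], default: Optional[T] = None
) -> Optional[T]:
    """Call fetch(); return `default` if the job no longer exists.

    Any other error is re-raised so real failures still surface.
    """
    try:
        return fetch()
    except RuntimeError as err:
        if is_not_found(err):
            return default
        raise


def job_finished(get_status: Callable[[], str]) -> bool:
    """Treat a job deleted by the platform as finished (hypothetical policy)."""
    status = ignore_not_found(get_status, default="DELETED")
    return status in ("SUCCEEDED", "DELETED")
```

In the test, the status, log, and info lookups would each be wrapped this way, for example job_finished(lambda: client.get_job_status(job_id)). As noted in the review, a deleted job may have failed before it was removed, so this approach can hide real failures.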
Related issues
Fixes anyscale#700