[core] ignore jobs deleted by the platform in the long_running_many_jobs.py test #61690

Open
rueian wants to merge 1 commit into ray-project:master from rueian:ignore-deleted-jobs-in-tests

rueian commented Mar 12, 2026

Description

One reason the long_running_many_jobs test is flaky is that an external process on the platform (Go-http-client/1.1), which we have no control over, kills jobs:

[image]

which causes the test script to fail with:

[image]

This PR fixes the flakiness by ignoring the resulting HTTP 404 error.
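
For readers without the diff at hand, here is a minimal sketch of the idea, not the actual change in this PR: a polling loop that treats a "not found" response as a terminal state instead of an error. `JobNotFoundError`, `FakeClient`, and `wait_for_job` are hypothetical names introduced for illustration, and how the real job client surfaces an HTTP 404 is an assumption.

```python
import time


class JobNotFoundError(Exception):
    """Hypothetical stand-in for however the job client reports an HTTP 404."""


class FakeClient:
    """Hypothetical client: returns scripted statuses, then reports the job as deleted."""

    def __init__(self, statuses):
        self._statuses = list(statuses)

    def get_job_status(self, job_id):
        if not self._statuses:
            raise JobNotFoundError(f"job {job_id} not found (HTTP 404)")
        return self._statuses.pop(0)


def wait_for_job(client, job_id, timeout_s=5.0, poll_s=0.0):
    """Poll until the job reaches a terminal state or disappears.

    Returns the final status, or "DELETED" if the job no longer exists.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            status = client.get_job_status(job_id)
        except JobNotFoundError:
            # The job was deleted out from under the test; do not fail on it.
            return "DELETED"
        if status in ("SUCCEEDED", "FAILED", "STOPPED"):
            return status
        time.sleep(poll_s)
    raise TimeoutError(f"job {job_id} did not finish within {timeout_s}s")
```

In this sketch only the "not found" case is swallowed; any other exception from the client still propagates and fails the test.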

Related issues

Fixes anyscale#700

gemini-code-assist bot left a comment

Code Review

This pull request modifies a long-running test to be more resilient against jobs being deleted by the platform, which could cause 'not found' errors and test failures. The changes introduce helper functions to gracefully handle these errors by treating a deleted job as a success or by ignoring the error when fetching logs/info. The overall approach is sound and directly addresses the problem described. I have one suggestion to reduce code duplication.
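
One way such helpers could avoid duplication is a single wrapper that every logs/info call goes through. This is a hedged sketch under assumptions, not the code in the PR: `JobNotFoundError`, `ignore_not_found`, and `fetch_logs` are hypothetical names, and the in-memory `fetch_logs` only stands in for a real client call.

```python
class JobNotFoundError(Exception):
    """Hypothetical stand-in for a 'job not found' (HTTP 404) error."""


def ignore_not_found(fn, *args, default=None, **kwargs):
    """Call fn and return its result, or `default` if the job no longer exists."""
    try:
        return fn(*args, **kwargs)
    except JobNotFoundError:
        # Only the 'not found' case is ignored; other errors still propagate.
        return default


def fetch_logs(job_id):
    """Hypothetical log fetcher backed by an in-memory dict."""
    logs = {"job-1": "hello"}
    if job_id not in logs:
        raise JobNotFoundError(job_id)
    return logs[job_id]


print(ignore_not_found(fetch_logs, "job-1"))
print(ignore_not_found(fetch_logs, "deleted-job", default=""))
```

Keeping the exception handling in one place means the decision about what counts as "ignorable" is made once rather than repeated at each call site.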

…obs.py test

Signed-off-by: Rueian Huang <rueiancsie@gmail.com>
rueian force-pushed the ignore-deleted-jobs-in-tests branch from 64e4e64 to f0b3d3f on March 12, 2026 21:32
rueian marked this pull request as ready for review on March 12, 2026 23:39
rueian added the core (Issues that should be addressed in Ray Core) and go (add ONLY when ready to merge, run all tests) labels on Mar 12, 2026
chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f0b3d3f42b

MengjinYan left a comment

From what I see here: https://github.com/anyscale/product/blob/master/go/infra/activityprobe/ray/job_utils.go#L55-L65, the job is deleted regardless of whether it succeeded or failed. If that's the case, I think the test might miss some failure cases.

I think a better approach, if possible, is to configure the test environment either not to clean up jobs or to clean up only successful jobs, rather than changing the test to ignore deleted jobs.

What do you think?
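
The concern can be made concrete with a small sketch. The job names and states below are invented for illustration, and `summarize` is a hypothetical helper: if a job fails and is then cleaned up before the test polls it, the test only ever observes "deleted", so a check that ignores deleted jobs passes even though a job failed.

```python
def summarize(final_states):
    """Count outcomes, keeping deleted jobs separate from successes."""
    counts = {"SUCCEEDED": 0, "FAILED": 0, "DELETED": 0}
    for state in final_states.values():
        counts[state] += 1
    return counts


# Hypothetical run: job-2 actually failed, but it was deleted before the
# test polled it, so the test only ever observed "DELETED".
observed = {"job-1": "SUCCEEDED", "job-2": "DELETED", "job-3": "SUCCEEDED"}
counts = summarize(observed)

# Ignoring deleted jobs: passes, and the failure of job-2 goes unnoticed.
lenient_pass = counts["FAILED"] == 0
# Requiring every job to be observed in a real terminal state: fails.
strict_pass = counts["FAILED"] == 0 and counts["DELETED"] == 0
print(counts, lenient_pass, strict_pass)
```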

rueian commented Mar 23, 2026

> From what I see here: https://github.com/anyscale/product/blob/master/go/infra/activityprobe/ray/job_utils.go#L55-L65, the job is deleted regardless of whether it succeeded or failed. If that's the case, I think the test might miss some failure cases.
>
> I think a better approach, if possible, is to configure the test environment either not to clean up jobs or to clean up only successful jobs, rather than changing the test to ignore deleted jobs.
>
> What do you think?

Sounds good. Let me ask the team if that is possible internally.

rueian commented Mar 24, 2026

Yep, they said they have raised the cleanup threshold to 100k.

Development

Successfully merging this pull request may close these issues.

Release test long_running_many_jobs.aws failed

2 participants