Skip to content

[FIX] RayJob doc test flaky#52170

Merged
jjyao merged 1 commit intoray-project:masterfrom
machichima:fix-rayjob-flaky-test
Apr 9, 2025
Merged

[FIX] RayJob doc test flaky#52170
jjyao merged 1 commit intoray-project:masterfrom
machichima:fix-rayjob-flaky-test

Conversation

@machichima
Copy link
Copy Markdown
Contributor

@machichima machichima commented Apr 9, 2025

Why are these changes needed?

As mentioned here: #51756 (comment), the doc test for rayjob is flaky.

Sort get pods output with create time to prevent flaky. Ensure the doc test passed locally twice.

image

Doc link: https://anyscale-ray--52170.com.readthedocs.build/en/52170/cluster/kubernetes/getting-started/rayjob-quick-start.html

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

-e
Signed-off-by: machichima <nary12321@gmail.com>
@machichima machichima requested review from a team, kevin85421 and pcmoritz as code owners April 9, 2025 11:17
@MortalHappiness MortalHappiness added the go add ONLY when ready to merge, run all tests label Apr 9, 2025
@kevin85421
Copy link
Copy Markdown
Member

Maybe next time, @MortalHappiness, you could ask contributors to run the tests ten times before approving the PR, in case some of us are not familiar with the doc tests and might miss something during review.

@MortalHappiness
Copy link
Copy Markdown
Member

MortalHappiness commented Apr 9, 2025

@kevin85421 Okay no problem. @machichima Could you post your screenshot of running ten times? Thanks.

"# Step 4.3: List all Pods in the `default` namespace.\n",
"# The Pod created by the Kubernetes Job will be terminated after the Kubernetes Job finishes.\n",
"kubectl get pods --sort-by=.metadata.name"
"kubectl get pods --sort-by='.metadata.creationTimestamp'"
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@machichima, can you help me understand where the logic for validating this cell's output is located? Thanks!

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

image

I think it's because the last 5 characters are randomized, so if we sort by name, sometimes rayjob-sample-xxxxx comes before rayjob-sample-raycluster-yyyyy-head-zzzzz.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, make sense.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the explanation!

@kevin85421
Copy link
Copy Markdown
Member

@kevin85421 Okay no problem. @machichima Could you post your screenshot of running ten times? Thanks.

Not necessarily for this PR. The test is flaky in CI and blocks multiple PRs. My point is that we can improve the review process by requesting contributors to run the tests multiple times locally to avoid this situation in the future.

Thank @MortalHappiness for the review and @machichima for the quick fix!

@jjyao jjyao merged commit 78b5967 into ray-project:master Apr 9, 2025
5 checks passed
@machichima
Copy link
Copy Markdown
Contributor Author

@machichima Seems like still flaky.

I will run more times locally and see if I can spot the reason for this flaky behaviour

@machichima
Copy link
Copy Markdown
Contributor Author

machichima commented Apr 10, 2025

Hi @MortalHappiness

I tried multiple times (7 times) and they all passed...

I don't actually know what might be the case, As the CI log seems to logging more info then I did. I am wondering if it might caused by using different version of something?

Attached my running log here: logging.log

UPDATE: I got kubectl version v1.32.3 while CI use version v1.28.4. Let me downgrade and try again

UPDATE: I use versions same as CI

  • kubectl version v1.28.4
  • python version 3.10.16
  • pytest-7.4.4
  • pluggy-1.3.0

and tried 7 times locally and they all passed. I couldn't reproduce the flaky behaviour locally. New logging is here: logging-new.log

han-steve pushed a commit to han-steve/ray that referenced this pull request Apr 11, 2025
Signed-off-by: Steve Han <stevehan2001@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-backlog go add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants