[core] Use a signal actor instead of sleep to avoid flakiness in extreme cases #52121

Merged
jjyao merged 6 commits into master from laptop-ray3-20250408 on Apr 10, 2025

kevin85421 (Member) commented on Apr 9, 2025

Why are these changes needed?

Follow-up to a comment on #51904: use a signal actor instead of a sleep so the tests do not become flaky in extreme cases.

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com>
kevin85421 added the "go" label (add ONLY when ready to merge, run all tests) on Apr 9, 2025
kevin85421 (Member, Author) commented:

CI failures seem to be unrelated.

kevin85421 marked this pull request as ready for review on April 9, 2025, 04:45
Comment on lines +1177 to +1178
ray.get(signal_actor.send.remote(clear=True))
ray.get(signal_actor.wait.remote())
A contributor commented:

I think we need to use two different signal actors; otherwise, if signal_actor.send.remote(clear=True) happens before ray.get(signal_actor.wait.remote()), the signal is lost.
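
The race described above can be reproduced without a cluster. The following is a minimal sketch, assuming the signal actor behaves like an event that `send(clear=True)` sets and then immediately clears. `LocalSignal` and `wait_observed` are hypothetical in-process stand-ins written for this illustration; they are not Ray's test utility and do not use the Ray API.

```python
import asyncio


class LocalSignal:
    """Hypothetical in-process stand-in for a signal actor (not Ray's class)."""

    def __init__(self):
        self._event = asyncio.Event()

    def send(self, clear=False):
        # Wake everyone currently waiting; optionally reset right away.
        self._event.set()
        if clear:
            self._event.clear()

    async def wait(self):
        await self._event.wait()


async def wait_observed(signal, timeout=0.2):
    """Return True if wait() completes within the timeout, else False."""
    try:
        await asyncio.wait_for(signal.wait(), timeout)
        return True
    except asyncio.TimeoutError:
        return False


async def demo():
    # Case 1: send(clear=True) runs before anyone waits, so the signal is lost.
    lost = LocalSignal()
    lost.send(clear=True)
    late_waiter_woke = await wait_observed(lost)

    # Case 2: a waiter that is already blocked is woken by send(clear=True).
    parked = LocalSignal()
    waiter = asyncio.create_task(wait_observed(parked, timeout=1.0))
    await asyncio.sleep(0.05)  # give the waiter time to block in wait()
    parked.send(clear=True)
    parked_waiter_woke = await waiter

    # Case 3: send() without clear leaves the signal set for late waiters.
    sticky = LocalSignal()
    sticky.send()
    sticky_waiter_woke = await wait_observed(sticky)

    return late_waiter_woke, parked_waiter_woke, sticky_waiter_woke


results = asyncio.run(demo())
print(results)
```

Under these assumptions only the first case times out: a cleared signal reaches waiters that are already blocked, but a waiter that arrives afterwards never sees it.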

kevin85421 (Member, Author) replied on Apr 9, 2025:

I removed clear=True from the last send in these two tests instead.

kevin85421 (Member, Author) replied:

I eventually decided to use two signal actors because I found that I would still need two actors even if I used semaphores.
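
The thread does not show how the two actors are wired in the tests. One plausible arrangement, sketched here purely as an assumption, gives each direction its own signal: one tells the test that the worker has started, the other tells the worker it may continue. `LocalSignal`, `worker`, `started`, and `proceed` are hypothetical names for an in-process stand-in, not Ray's API.

```python
import asyncio


class LocalSignal:
    """Hypothetical in-process stand-in for a signal actor (not Ray's class)."""

    def __init__(self):
        self._event = asyncio.Event()

    def send(self):
        # No clear: the signal stays set, so a late wait() still returns.
        self._event.set()

    async def wait(self):
        await self._event.wait()


async def worker(started, proceed, log):
    log.append("worker started")
    started.send()        # tell the test the worker reached this point
    await proceed.wait()  # block until the test allows it to continue
    log.append("worker finished")


async def demo():
    log = []
    started, proceed = LocalSignal(), LocalSignal()
    task = asyncio.create_task(worker(started, proceed, log))
    await started.wait()  # takes the place of a fixed sleep
    log.append("test observed start")
    proceed.send()
    await task
    return log


log = asyncio.run(demo())
print(log)
```

Because each signal is sent once and never cleared, neither side can miss it regardless of which side runs first, so the ordering no longer depends on timing.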

Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com>
kevin85421 (Member, Author) commented:

CI failures seem to be unrelated.

Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com>
Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com>
jjyao merged commit b3c6fda into master on Apr 10, 2025
5 checks passed
jjyao deleted the laptop-ray3-20250408 branch on Apr 10, 2025, 05:08
han-steve pushed a commit to han-steve/ray that referenced this pull request Apr 11, 2025
…eme cases (ray-project#52121)

Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com>
Signed-off-by: Steve Han <stevehan2001@gmail.com>

Labels: community-backlog, go (add ONLY when ready to merge, run all tests)

3 participants