Skip to content

[core] Remove perf test in shutdown_coordinator_test#56033

Merged
dayshah merged 2 commits intoray-project:masterfrom
codope:fix-shutdown-coordinator-perf-test
Aug 28, 2025
Merged

[core] Remove perf test in shutdown_coordinator_test#56033
dayshah merged 2 commits intoray-project:masterfrom
codope:fix-shutdown-coordinator-perf-test

Conversation

@codope
Copy link
Copy Markdown
Contributor

@codope codope commented Aug 28, 2025

Why are these changes needed?

The test microbenchmarks a method (ShutdownCoordinator::ShouldEarlyExit) which uses a lock. It's checked at task boundaries and event loop posts. That's orders of magnitude less frequent than per-object/serialization loops. But, it is not very useful. So, removing the test.

Related issue number

Closes #55801

Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
@codope codope requested a review from a team as a code owner August 28, 2025 04:50
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request correctly disables a performance test under TSAN to prevent spurious failures. The change to use std::chrono::steady_clock instead of std::chrono::high_resolution_clock is an excellent improvement for ensuring monotonic time measurement in the benchmark. The code is clean and the changes are well-justified.

volatile bool result = false;

for (int i = 0; i < iterations; ++i) {
result = coordinator->ShouldEarlyExit();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we want to microbenchmark this? If we want to do proper microbenchmark, we should use libraries like https://github.com/google/benchmark since writing good and reliable benchmark is pretty hard.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree especially not in our ci env

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ShouldEarlyExit() sits on hot paths (e.g., CoreWorker::IsExiting() checks inside tight loops). In an earlier (unmerged) implementation of this method, I was using more complex atomic flag and memory ordering. It was simplified to simply take a lock but there was a perf concern. So, I added this test to get a sense of how much time it takes. I can remove it form unit test. Don't see a very good utility in running this now.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ya i'd say kill it

Copy link
Copy Markdown
Contributor

@dayshah dayshah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i still don't get why this doesn't pass tsan, is there inherently something unsafe with ShouldEarlyExit. I think I'm just missing context on this test and why we need to test should exit perf

auto start = std::chrono::high_resolution_clock::now();
auto start = std::chrono::steady_clock::now();
constexpr int iterations = 1000000;
volatile bool result = false;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need volatile

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To prevent the compiler from optimizing away the calls/loop. Without volatile (or another optimizer barrier), the compiler may elide repeated reads with no observable side effects.

volatile bool result = false;

for (int i = 0; i < iterations; ++i) {
result = coordinator->ShouldEarlyExit();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should just wrap this in void( or RAY_UNUSED instead of having the unused result var

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah can do, however, void/RAY_UNUSED only silence "unused variable" warnings; they don’t prevent call elision. So, i can keep volatile to force evaluation; keep (void)result or RAY_UNUSED(result); to silence the warning.
But, I am considering removing the test altogether -- #56033 (comment)

Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Copy link
Copy Markdown
Contributor

@jjyao jjyao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update PR title and description to reflect the latest

@codope codope added the go add ONLY when ready to merge, run all tests label Aug 28, 2025
@codope codope changed the title [core] Disbale perf test in shutdown_coordinator_test under tsan [core] Remove perf test in shutdown_coordinator_test Aug 28, 2025
@dayshah dayshah enabled auto-merge (squash) August 28, 2025 06:09
@dayshah dayshah merged commit 235b43f into ray-project:master Aug 28, 2025
6 of 7 checks passed
Comment on lines -296 to -299
// Should be very fast (less than 100ns per call on modern hardware)
double ns_per_call = static_cast<double>(duration.count()) / iterations;
EXPECT_LT(ns_per_call, 100.0)
<< "ShouldEarlyExit too slow: " << ns_per_call << "ns per call";
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should not write this type of test in CI as a general rule; it's destined to be flaky.

if we want to measure performance, we should do it as a release test where we actually track the metrics over time.

testing this specific call also doesn't make much sense to me.

tohtana pushed a commit to tohtana/ray that referenced this pull request Aug 29, 2025
…6033)

Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
tohtana pushed a commit to tohtana/ray that referenced this pull request Aug 29, 2025
…6033)

Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
gangsf pushed a commit to gangsf/ray that referenced this pull request Sep 2, 2025
…6033)

Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Signed-off-by: Gang Zhao <gang@gang-JQ62HD2C37.local>
sampan-s-nayak pushed a commit to sampan-s-nayak/ray that referenced this pull request Sep 8, 2025
…6033)

Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Signed-off-by: sampan <sampan@anyscale.com>
jugalshah291 pushed a commit to jugalshah291/ray_fork that referenced this pull request Sep 11, 2025
…6033)

Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Signed-off-by: jugalshah291 <shah.jugal291@gmail.com>
wyhong3103 pushed a commit to wyhong3103/ray that referenced this pull request Sep 12, 2025
…6033)

Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Signed-off-by: yenhong.wong <yenhong.wong@grabtaxi.com>
dstrodtman pushed a commit to dstrodtman/ray that referenced this pull request Oct 6, 2025
…6033)

Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Signed-off-by: Douglas Strodtman <douglas@anyscale.com>
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
…6033)

Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

go add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

CI test linux://src/ray/core_worker/tests:shutdown_coordinator_test is flaky

4 participants