Skip to content

[core] Fix flaky ShutdownCoordinator test under tsan#56152

Merged
jjyao merged 3 commits intoray-project:masterfrom
codope:flaky_shutdown_coord
Sep 3, 2025
Merged

[core] Fix flaky ShutdownCoordinator test under tsan#56152
jjyao merged 3 commits intoray-project:masterfrom
codope:flaky_shutdown_coord

Conversation

@codope
Copy link
Copy Markdown
Contributor

@codope codope commented Sep 2, 2025

Why are these changes needed?

TSAN failure is a data race only in the test's FakeShutdownExecutor, not production code. Fake was writing shared std::string fields from two threads without synchronization -- https://buildkite.com/ray-project/postmerge/builds/12666#019907c0-3bdd-4401-9aa8-6f13215ce819/176-796

  • Added a std::mutex to FakeShutdownExecutor and guarded assignments to last_exit_type and last_detail in all Execute* methods. No production code changes.
  • Added mutex-guarded getters in FakeShutdownExecutor and used them in the assertion to eliminate remaining unsynchronized reads in the TSAN test.

Related issue number

Closes #55801

@codope codope requested a review from a team as a code owner September 2, 2025 06:50
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request correctly addresses a data race in FakeShutdownExecutor by introducing a std::mutex to protect concurrent writes to shared member variables. This resolves the write-write race condition. However, a read-write data race still exists in the Concurrent_DoubleForce_ForceExecutesOnce test due to unprotected reads. I've added a comment with a suggestion on how to fully resolve this by using thread-safe getter methods.

@ray-gardener ray-gardener bot added the core Issues that should be addressed in Ray Core label Sep 2, 2025
@codope codope added the go add ONLY when ready to merge, run all tests label Sep 2, 2025
Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
@codope codope force-pushed the flaky_shutdown_coord branch from 1a7882d to be1f2f7 Compare September 3, 2025 02:55
@jjyao jjyao merged commit bd684c7 into ray-project:master Sep 3, 2025
5 checks passed
sampan-s-nayak pushed a commit to sampan-s-nayak/ray that referenced this pull request Sep 8, 2025
…6152)

Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Signed-off-by: sampan <sampan@anyscale.com>
jugalshah291 pushed a commit to jugalshah291/ray_fork that referenced this pull request Sep 11, 2025
…6152)

Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Signed-off-by: jugalshah291 <shah.jugal291@gmail.com>
wyhong3103 pushed a commit to wyhong3103/ray that referenced this pull request Sep 12, 2025
…6152)

Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Signed-off-by: yenhong.wong <yenhong.wong@grabtaxi.com>
dstrodtman pushed a commit that referenced this pull request Oct 6, 2025
Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Signed-off-by: Douglas Strodtman <douglas@anyscale.com>
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
…6152)

Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Issues that should be addressed in Ray Core go add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

CI test linux://src/ray/core_worker/tests:shutdown_coordinator_test is flaky

3 participants