Skip to content

[core] Fix GCS subscribers map race condition#53781

Merged
edoakes merged 4 commits intoray-project:masterfrom
dayshah:remove-subscriber-race
Jun 16, 2025
Merged

[core] Fix GCS subscribers map race condition#53781
edoakes merged 4 commits intoray-project:masterfrom
dayshah:remove-subscriber-race

Conversation

@dayshah
Copy link
Copy Markdown
Contributor

@dayshah dayshah commented Jun 12, 2025

Problem

The GCS's pubsub service is created with it's own io context. This means that all handlers will be posted to it's own thread.

auto &io_context = io_context_provider_.GetIOContext<GcsPublisher>();
pubsub_handler_ = std::make_unique<InternalPubSubHandler>(io_context, *gcs_publisher_);
rpc_server_.RegisterService(
std::make_unique<rpc::InternalPubSubGrpcService>(io_context, *pubsub_handler_));

RemoveSubscriberFrom is called from the main thread on the GCS though. And sender_to_subscribers_ is a map that is modified there.

sender_to_subscribers_.erase(iter);

sender_to_subscribers_ is also modified in HandleGcsSubscriberCommandBatch which will run on the GCSPublisher thread.

iter = sender_to_subscribers_.insert({sender_id, {}}).first;

Obvious races here.

Solution

We can just post the RemoveSubscriberFrom call onto the GCSPublisher io_context so everything runs on a single thread. This keeps us from trying to make the pub sub handler thread safe. All usage of the pub sub handler should always be on the GCSPublisher io_context.

Signed-off-by: dayshah <dhyey2019@gmail.com>
Copilot AI review requested due to automatic review settings June 12, 2025 21:44
@dayshah dayshah requested a review from a team as a code owner June 12, 2025 21:44
@dayshah dayshah added the go add ONLY when ready to merge, run all tests label Jun 12, 2025
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR addresses a race condition in the GCS subscribers map by ensuring that all modifications occur on the GCSPublisher IOContext.

  • Updated pubsub_handler documentation to clarify thread safety and IOContext expectations.
  • Modified GcsServer event listeners to post RemoveSubscriberFrom calls onto the GCSPublisher thread, thereby eliminating race conditions.

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
src/ray/gcs/gcs_server/pubsub_handler.h Added documentation clarifying that the pubsub handler is not thread safe.
src/ray/gcs/gcs_server/gcs_server.cc Updated subscriber removal calls to post on the GCSPublisher IOContext.

@dayshah dayshah marked this pull request as draft June 12, 2025 22:40
@dayshah dayshah marked this pull request as ready for review June 13, 2025 06:11
Copy link
Copy Markdown
Collaborator

@edoakes edoakes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

expanding more usage of io-service-post-as-thread-safety makes me sad, but LGTM as patch fix

@edoakes
Copy link
Copy Markdown
Collaborator

edoakes commented Jun 13, 2025

Actually, can we enforce this more explicitly by moving the Post call inside the implementation of RemoveSubscriberFrom?

void InternalPubSubHandler::RemoveSubscriberFrom(const std::string &sender_id) {

It's only ever called in the contexts you've fixed here.

Signed-off-by: dayshah <dhyey2019@gmail.com>
@dayshah
Copy link
Copy Markdown
Contributor Author

dayshah commented Jun 13, 2025

Actually, can we enforce this more explicitly by moving the Post call inside the implementation of RemoveSubscriberFrom?

void InternalPubSubHandler::RemoveSubscriberFrom(const std::string &sender_id) {

It's only ever called in the contexts you've fixed here.

Done, I made it dispatch inside the function so that if it is ever called from the same io_context, it'll execute immediately and not post.

I think there's some argument here on what level you want the thread safety to live at. Now it's less clear at the caller site that this thing won't be executed immediately, but you get the benefit of safety that nobody could make this mistake again.

Signed-off-by: dayshah <dhyey2019@gmail.com>
@dayshah
Copy link
Copy Markdown
Contributor Author

dayshah commented Jun 13, 2025

Actually, can we enforce this more explicitly by moving the Post call inside the implementation of RemoveSubscriberFrom?

void InternalPubSubHandler::RemoveSubscriberFrom(const std::string &sender_id) {

It's only ever called in the contexts you've fixed here.

Done, I made it dispatch inside the function so that if it is ever called from the same io_context, it'll execute immediately and not post.

I think there's some argument here on what level you want the thread safety to live at. Now it's less clear at the caller site that this thing won't be executed immediately, but you get the benefit of safety that nobody could make this mistake again.

Actually going with calling it Async and always posting to make the behavior clear and consistent. This should only ever be called externally. Internally should just erase from the map directly. Added a comment noting that.

@edoakes edoakes merged commit 861b319 into ray-project:master Jun 16, 2025
5 checks passed
@dayshah dayshah deleted the remove-subscriber-race branch June 16, 2025 14:41
elliot-barn pushed a commit that referenced this pull request Jun 18, 2025
## Problem

The GCS's pubsub service is created with it's own io context. This means
that all handlers will be posted to it's own thread.

https://github.com/ray-project/ray/blob/a1b9dc627c7ab0b31ae5df03892f4d49192463e4/src/ray/gcs/gcs_server/gcs_server.cc#L614-L617

`RemoveSubscriberFrom` is called from the main thread on the GCS though.
And `sender_to_subscribers_` is a map that is modified there.

https://github.com/ray-project/ray/blob/a1b9dc627c7ab0b31ae5df03892f4d49192463e4/src/ray/gcs/gcs_server/pubsub_handler.cc#L124

`sender_to_subscribers_` is also modified in
`HandleGcsSubscriberCommandBatch` which will run on the GCSPublisher
thread.

https://github.com/ray-project/ray/blob/a1b9dc627c7ab0b31ae5df03892f4d49192463e4/src/ray/gcs/gcs_server/pubsub_handler.cc#L82

Obvious races here.

## Solution
We can just post the `RemoveSubscriberFrom` call onto the GCSPublisher
io_context so everything runs on a single thread. This keeps us from
trying to make the pub sub handler thread safe. All usage of the pub sub
handler should always be on the GCSPublisher io_context.

---------

Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
minerharry pushed a commit to minerharry/ray that referenced this pull request Jun 27, 2025
## Problem

The GCS's pubsub service is created with it's own io context. This means
that all handlers will be posted to it's own thread.

https://github.com/ray-project/ray/blob/a1b9dc627c7ab0b31ae5df03892f4d49192463e4/src/ray/gcs/gcs_server/gcs_server.cc#L614-L617

`RemoveSubscriberFrom` is called from the main thread on the GCS though.
And `sender_to_subscribers_` is a map that is modified there.

https://github.com/ray-project/ray/blob/a1b9dc627c7ab0b31ae5df03892f4d49192463e4/src/ray/gcs/gcs_server/pubsub_handler.cc#L124

`sender_to_subscribers_` is also modified in
`HandleGcsSubscriberCommandBatch` which will run on the GCSPublisher
thread.

https://github.com/ray-project/ray/blob/a1b9dc627c7ab0b31ae5df03892f4d49192463e4/src/ray/gcs/gcs_server/pubsub_handler.cc#L82

Obvious races here.

## Solution
We can just post the `RemoveSubscriberFrom` call onto the GCSPublisher
io_context so everything runs on a single thread. This keeps us from
trying to make the pub sub handler thread safe. All usage of the pub sub
handler should always be on the GCSPublisher io_context.

---------

Signed-off-by: dayshah <dhyey2019@gmail.com>
elliot-barn pushed a commit that referenced this pull request Jul 2, 2025
## Problem

The GCS's pubsub service is created with it's own io context. This means
that all handlers will be posted to it's own thread.

https://github.com/ray-project/ray/blob/a1b9dc627c7ab0b31ae5df03892f4d49192463e4/src/ray/gcs/gcs_server/gcs_server.cc#L614-L617

`RemoveSubscriberFrom` is called from the main thread on the GCS though.
And `sender_to_subscribers_` is a map that is modified there.

https://github.com/ray-project/ray/blob/a1b9dc627c7ab0b31ae5df03892f4d49192463e4/src/ray/gcs/gcs_server/pubsub_handler.cc#L124

`sender_to_subscribers_` is also modified in
`HandleGcsSubscriberCommandBatch` which will run on the GCSPublisher
thread.

https://github.com/ray-project/ray/blob/a1b9dc627c7ab0b31ae5df03892f4d49192463e4/src/ray/gcs/gcs_server/pubsub_handler.cc#L82

Obvious races here.

## Solution
We can just post the `RemoveSubscriberFrom` call onto the GCSPublisher
io_context so everything runs on a single thread. This keeps us from
trying to make the pub sub handler thread safe. All usage of the pub sub
handler should always be on the GCSPublisher io_context.

---------

Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

go add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants