[core] Fix GCS subscribers map race condition#53781
[core] Fix GCS subscribers map race condition#53781edoakes merged 4 commits intoray-project:masterfrom
Conversation
Signed-off-by: dayshah <dhyey2019@gmail.com>
There was a problem hiding this comment.
Pull Request Overview
This PR addresses a race condition in the GCS subscribers map by ensuring that all modifications occur on the GCSPublisher IOContext.
- Updated pubsub_handler documentation to clarify thread safety and IOContext expectations.
- Modified GcsServer event listeners to post RemoveSubscriberFrom calls onto the GCSPublisher thread, thereby eliminating race conditions.
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/ray/gcs/gcs_server/pubsub_handler.h | Added documentation clarifying that the pubsub handler is not thread safe. |
| src/ray/gcs/gcs_server/gcs_server.cc | Updated subscriber removal calls to post on the GCSPublisher IOContext. |
edoakes
left a comment
There was a problem hiding this comment.
expanding more usage of io-service-post-as-thread-safety makes me sad, but LGTM as patch fix
|
Actually, can we enforce this more explicitly by moving the ray/src/ray/gcs/gcs_server/pubsub_handler.cc Line 116 in 149e152 It's only ever called in the contexts you've fixed here. |
Signed-off-by: dayshah <dhyey2019@gmail.com>
Done, I made it dispatch inside the function so that if it is ever called from the same io_context, it'll execute immediately and not post. I think there's some argument here on what level you want the thread safety to live at. Now it's less clear at the caller site that this thing won't be executed immediately, but you get the benefit of safety that nobody could make this mistake again. |
Signed-off-by: dayshah <dhyey2019@gmail.com>
Actually going with calling it Async and always posting to make the behavior clear and consistent. This should only ever be called externally. Internally should just erase from the map directly. Added a comment noting that. |
## Problem The GCS's pubsub service is created with it's own io context. This means that all handlers will be posted to it's own thread. https://github.com/ray-project/ray/blob/a1b9dc627c7ab0b31ae5df03892f4d49192463e4/src/ray/gcs/gcs_server/gcs_server.cc#L614-L617 `RemoveSubscriberFrom` is called from the main thread on the GCS though. And `sender_to_subscribers_` is a map that is modified there. https://github.com/ray-project/ray/blob/a1b9dc627c7ab0b31ae5df03892f4d49192463e4/src/ray/gcs/gcs_server/pubsub_handler.cc#L124 `sender_to_subscribers_` is also modified in `HandleGcsSubscriberCommandBatch` which will run on the GCSPublisher thread. https://github.com/ray-project/ray/blob/a1b9dc627c7ab0b31ae5df03892f4d49192463e4/src/ray/gcs/gcs_server/pubsub_handler.cc#L82 Obvious races here. ## Solution We can just post the `RemoveSubscriberFrom` call onto the GCSPublisher io_context so everything runs on a single thread. This keeps us from trying to make the pub sub handler thread safe. All usage of the pub sub handler should always be on the GCSPublisher io_context. --------- Signed-off-by: dayshah <dhyey2019@gmail.com> Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
## Problem The GCS's pubsub service is created with it's own io context. This means that all handlers will be posted to it's own thread. https://github.com/ray-project/ray/blob/a1b9dc627c7ab0b31ae5df03892f4d49192463e4/src/ray/gcs/gcs_server/gcs_server.cc#L614-L617 `RemoveSubscriberFrom` is called from the main thread on the GCS though. And `sender_to_subscribers_` is a map that is modified there. https://github.com/ray-project/ray/blob/a1b9dc627c7ab0b31ae5df03892f4d49192463e4/src/ray/gcs/gcs_server/pubsub_handler.cc#L124 `sender_to_subscribers_` is also modified in `HandleGcsSubscriberCommandBatch` which will run on the GCSPublisher thread. https://github.com/ray-project/ray/blob/a1b9dc627c7ab0b31ae5df03892f4d49192463e4/src/ray/gcs/gcs_server/pubsub_handler.cc#L82 Obvious races here. ## Solution We can just post the `RemoveSubscriberFrom` call onto the GCSPublisher io_context so everything runs on a single thread. This keeps us from trying to make the pub sub handler thread safe. All usage of the pub sub handler should always be on the GCSPublisher io_context. --------- Signed-off-by: dayshah <dhyey2019@gmail.com>
## Problem The GCS's pubsub service is created with it's own io context. This means that all handlers will be posted to it's own thread. https://github.com/ray-project/ray/blob/a1b9dc627c7ab0b31ae5df03892f4d49192463e4/src/ray/gcs/gcs_server/gcs_server.cc#L614-L617 `RemoveSubscriberFrom` is called from the main thread on the GCS though. And `sender_to_subscribers_` is a map that is modified there. https://github.com/ray-project/ray/blob/a1b9dc627c7ab0b31ae5df03892f4d49192463e4/src/ray/gcs/gcs_server/pubsub_handler.cc#L124 `sender_to_subscribers_` is also modified in `HandleGcsSubscriberCommandBatch` which will run on the GCSPublisher thread. https://github.com/ray-project/ray/blob/a1b9dc627c7ab0b31ae5df03892f4d49192463e4/src/ray/gcs/gcs_server/pubsub_handler.cc#L82 Obvious races here. ## Solution We can just post the `RemoveSubscriberFrom` call onto the GCSPublisher io_context so everything runs on a single thread. This keeps us from trying to make the pub sub handler thread safe. All usage of the pub sub handler should always be on the GCSPublisher io_context. --------- Signed-off-by: dayshah <dhyey2019@gmail.com> Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Problem
The GCS's pubsub service is created with it's own io context. This means that all handlers will be posted to it's own thread.
ray/src/ray/gcs/gcs_server/gcs_server.cc
Lines 614 to 617 in a1b9dc6
RemoveSubscriberFromis called from the main thread on the GCS though. Andsender_to_subscribers_is a map that is modified there.ray/src/ray/gcs/gcs_server/pubsub_handler.cc
Line 124 in a1b9dc6
sender_to_subscribers_is also modified inHandleGcsSubscriberCommandBatchwhich will run on the GCSPublisher thread.ray/src/ray/gcs/gcs_server/pubsub_handler.cc
Line 82 in a1b9dc6
Obvious races here.
Solution
We can just post the
RemoveSubscriberFromcall onto the GCSPublisher io_context so everything runs on a single thread. This keeps us from trying to make the pub sub handler thread safe. All usage of the pub sub handler should always be on the GCSPublisher io_context.