[Serve] Avoid concurrent modification of LongPollClient.snapshot_ids#52793
Conversation
Signed-off-by: Josh Karpel <josh.karpel@gmail.com>
Signed-off-by: Josh Karpel <josh.karpel@gmail.com>
| # (from the SharedRouterLongPollClient registering a new router). | ||
| # See https://github.com/ray-project/ray/pull/52793 for more details. | ||
| self.snapshot_ids.copy() | ||
| ) |
There was a problem hiding this comment.
So the register call in the shared long poll client, which modifies self.snapshot_ids, is called in the main event loop, whereas the infinite poll loop which serializes self.snapshot_ids runs in the router event loop.
Seems like the issue comes from iterating through the dictionary self.snapshot_ids when trying to serialize it, and if the serializer happens to be in the middle of iterating through the dict when the update happens, we run into this race condition. However copy will probably also need to iterate through the dictionary, so would this not run into the same issue?
There was a problem hiding this comment.
It seems like we should either add a lock around relevant reads/writes of self.snapshot_ids, or submit add_key_listeners to be called in the router event loop.
There was a problem hiding this comment.
Ooo, that's a really good point, I missed that! I wonder why we haven't seen the problem since we added the .copy() - perhaps because it just further reduce the time window that it can happen in, and thus reduces the probability that it will happen?
It seems like we should either add a lock around relevant reads/writes of
self.snapshot_ids, or submitadd_key_listenersto be called in the router event loop.
I will ponder these options!
There was a problem hiding this comment.
I wonder why we haven't seen the problem since we added the .copy() - perhaps because it just further reduce the time window that it can happen in, and thus reduces the probability that it will happen?
Yeah, that seems very likely - copy is probably faster than serialization, and so it's much harder to run into the race condition.
There was a problem hiding this comment.
I think I know what's happening with the .copy() - because the underlying object is the built-in dict, the .copy() method is actually a built-in function:
Python 3.12.6 (main, Sep 9 2024, 21:36:32) [Clang 18.1.8 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> d = dict()
>>> d.copy
<built-in method copy of dict object at 0x100d0b180>
and the trick with built-in methods is that they ultimately execute as a single CPython VM bytecode instruction, and therefore cannot be interrupted by another Python thread if they choose to hold the GIL for the duration of execution.
So, it works, but it won't if the underlying data structure ever changes to one with a non-built-in .copy() method! I will poke at the lock or event loop submission solutions today for a more future-proof solution.
There was a problem hiding this comment.
I went with the call_soon_threadsafe approach - that seems more elegant to me than the lock.
Signed-off-by: Josh Karpel <josh.karpel@gmail.com>
LongPollClient.snapshot_ids
zcin
left a comment
There was a problem hiding this comment.
Nice, LGTM! OOC @JoshKarpel were you able to test this in the environment where you were previously seeing issues?
Good point, no, we have the |
|
@zcin we've had this deployed for 2 days now and haven't seen the error - happy to let it bake longer if you want though! |
|
@JoshKarpel sounds good, I think this is good to merge in then! I'm going to update the branch one more time to prevent merge issues. |
|
Thanks @zcin ! |
#53613) ## Why are these changes needed? We found a second bug with the shared long poll client, similar to #52793 - this time it's for a different part of the registration process. The error we saw looked like: ``` 2025-06-06 14:29:14 ERROR Exception in callback LongPollClient._process_update.<locals>.chained() at /app/virtualenv/lib64/python3.12/site-packages/ray/serve/_private/long_poll.py:215 handle: <Handle LongPollClient._process_update.<locals>.chained() at /app/virtualenv/lib64/python3.12/site-packages/ray/serve/_private/long_poll.py:215> Traceback (most recent call last): File "/usr/lib64/python3.12/asyncio/events.py", line 88, in _run self._context.run(self._callback, *self._args) File "/app/virtualenv/lib64/python3.12/site-packages/ray/serve/_private/long_poll.py", line 216, in chained callback(arg) File "/app/virtualenv/lib64/python3.12/site-packages/ray/serve/_private/router.py", line 833, in update_deployment_config for router in self.routers[deployment_id]: ~~~~~~~~~~~~^^^^^^^^^^^^^^^ File "/usr/lib64/python3.12/_weakrefset.py", line 65, in __iter__ for itemref in self.data: ^^^^^^^^^ RuntimeError: Set changed size during iteration ``` This is getting pretty gnarly. I feel like there's a potential future refactor of the long poll client code that makes it truly `async` so that we don't need to worry about this stuff? But I'm not too familiar with how/why the background threads are being used here. ## Checks - [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [x] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [x] This PR is not tested :( --------- Signed-off-by: Josh Karpel <josh.karpel@gmail.com>
#53613) ## Why are these changes needed? We found a second bug with the shared long poll client, similar to #52793 - this time it's for a different part of the registration process. The error we saw looked like: ``` 2025-06-06 14:29:14 ERROR Exception in callback LongPollClient._process_update.<locals>.chained() at /app/virtualenv/lib64/python3.12/site-packages/ray/serve/_private/long_poll.py:215 handle: <Handle LongPollClient._process_update.<locals>.chained() at /app/virtualenv/lib64/python3.12/site-packages/ray/serve/_private/long_poll.py:215> Traceback (most recent call last): File "/usr/lib64/python3.12/asyncio/events.py", line 88, in _run self._context.run(self._callback, *self._args) File "/app/virtualenv/lib64/python3.12/site-packages/ray/serve/_private/long_poll.py", line 216, in chained callback(arg) File "/app/virtualenv/lib64/python3.12/site-packages/ray/serve/_private/router.py", line 833, in update_deployment_config for router in self.routers[deployment_id]: ~~~~~~~~~~~~^^^^^^^^^^^^^^^ File "/usr/lib64/python3.12/_weakrefset.py", line 65, in __iter__ for itemref in self.data: ^^^^^^^^^ RuntimeError: Set changed size during iteration ``` This is getting pretty gnarly. I feel like there's a potential future refactor of the long poll client code that makes it truly `async` so that we don't need to worry about this stuff? But I'm not too familiar with how/why the background threads are being used here. ## Checks - [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [x] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [x] This PR is not tested :( --------- Signed-off-by: Josh Karpel <josh.karpel@gmail.com> Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
#53613) ## Why are these changes needed? We found a second bug with the shared long poll client, similar to #52793 - this time it's for a different part of the registration process. The error we saw looked like: ``` 2025-06-06 14:29:14 ERROR Exception in callback LongPollClient._process_update.<locals>.chained() at /app/virtualenv/lib64/python3.12/site-packages/ray/serve/_private/long_poll.py:215 handle: <Handle LongPollClient._process_update.<locals>.chained() at /app/virtualenv/lib64/python3.12/site-packages/ray/serve/_private/long_poll.py:215> Traceback (most recent call last): File "/usr/lib64/python3.12/asyncio/events.py", line 88, in _run self._context.run(self._callback, *self._args) File "/app/virtualenv/lib64/python3.12/site-packages/ray/serve/_private/long_poll.py", line 216, in chained callback(arg) File "/app/virtualenv/lib64/python3.12/site-packages/ray/serve/_private/router.py", line 833, in update_deployment_config for router in self.routers[deployment_id]: ~~~~~~~~~~~~^^^^^^^^^^^^^^^ File "/usr/lib64/python3.12/_weakrefset.py", line 65, in __iter__ for itemref in self.data: ^^^^^^^^^ RuntimeError: Set changed size during iteration ``` This is getting pretty gnarly. I feel like there's a potential future refactor of the long poll client code that makes it truly `async` so that we don't need to worry about this stuff? But I'm not too familiar with how/why the background threads are being used here. ## Checks - [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [x] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [x] This PR is not tested :( --------- Signed-off-by: Josh Karpel <josh.karpel@gmail.com> Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Why are these changes needed?
This is a followup to our previous PR #48807 , which changed the way that the
LongPollClientis managed byRouters such that there is, in the long-term, only a single activeLongPollClientper client process.To make that happen, that PR added a way to change the set of
snapshot_idsthat theLongPollClientwas listening for. Unfortunately, I didn't realize at the time that the registration code that changes thesnapshot_idswould be running in a different thread from theLongPollClient's polling loop, which means that the snapshot ID dictionary could be concurrently mutated from one thread while the other thread is trying to serialize it, leading to a stack trace like:If this happens (it's rare; we've been running the shared long poll code for months but only noticed this when we started making a lot more Serve deployments per unit time recently), the
LongPollClientnever callspoll_next()again, and thus no longer receives updates from the Serve Controller. Depending on how fast your actor replicas come and go, this eventually leads to poor scheduling, and then no scheduling if all the replicas you had at the time of the error go away.A simple fix for this is to shallow copy the snapshot IDs just before passing them to the
.remote()call. The shallow copy is then not modified if the original snapshot IDs held by theLongPollClientare modified. We've tested this fix in our system and it did prevent the problem from occurring.Since it would be difficult to arrange this to happen in a deterministic test, I've added a comment describing the problem case that the copy solves for.
Related issue number
Checks
git commit -s) in this PR.scripts/format.shto lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/under thecorresponding
.rstfile.