[Serve] Avoid concurrent modification of `LongPollClient.snapshot_ids` by JoshKarpel · Pull Request #52793 · ray-project/ray

JoshKarpel · 2025-05-05T20:54:23Z

Why are these changes needed?

This is a followup to our previous PR #48807 , which changed the way that the LongPollClient is managed by Routers such that there is, in the long-term, only a single active LongPollClient per client process.

To make that happen, that PR added a way to change the set of snapshot_ids that the LongPollClient was listening for. Unfortunately, I didn't realize at the time that the registration code that changes the snapshot_ids would be running in a different thread from the LongPollClient's polling loop, which means that the snapshot ID dictionary could be concurrently mutated from one thread while the other thread is trying to serialize it, leading to a stack trace like:

Exception in callback <function LongPollClient._process_update.<locals>.chained at 0x7f1d4afbc540>
handle: <Handle LongPollClient._process_update.<locals>.chained>
Traceback (most recent call last):
  File "uvloop/cbhandles.pyx", line 61, in uvloop.loop.Handle._run
  File ".../site-packages/ray/serve/_private/long_poll.py", line 199, in chained
    self._on_callback_completed(trigger_at=len(updates))
  File ".../site-packages/ray/serve/_private/long_poll.py", line 133, in _on_callback_completed
    self._poll_next()
  File ".../site-packages/ray/serve/_private/long_poll.py", line 143, in _poll_next
    self._current_ref = self.host_actor.listen_for_change.remote(self.snapshot_ids)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../site-packages/ray/actor.py", line 206, in remote
    return self._remote(args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File ".../site-packages/ray/util/tracing/tracing_helper.py", line 422, in _start_span
    return method(self, args, kwargs, *_args, **_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../site-packages/ray/actor.py", line 366, in _remote
    return invocation(args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../site-packages/ray/actor.py", line 347, in invocation
    return actor._actor_method_call(
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../site-packages/ray/actor.py", line 1503, in _actor_method_call
    object_refs = worker.core_worker.submit_actor_task(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "python/ray/_raylet.pyx", line 3972, in ray._raylet.CoreWorker.submit_actor_task
  File "python/ray/_raylet.pyx", line 3977, in ray._raylet.CoreWorker.submit_actor_task
  File "python/ray/_raylet.pyx", line 859, in ray._raylet.prepare_args_and_increment_put_refs
  File "python/ray/_raylet.pyx", line 850, in ray._raylet.prepare_args_and_increment_put_refs
  File "python/ray/_raylet.pyx", line 900, in ray._raylet.prepare_args_internal
  File "/app/virtualenv/lib/python3.12/site-packages/ray/_private/serialization.py", line 556, in serialize
    return self._serialize_to_msgpack(value)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../site-packages/ray/_private/serialization.py", line 530, in _serialize_to_msgpack
    msgpack_data = MessagePackSerializer.dumps(value, _python_serializer)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "python/ray/includes/serialization.pxi", line 175, in ray._raylet.MessagePackSerializer.dumps
  File ".../site-packages/msgpack/__init__.py", line 36, in packb
    return Packer(**kwargs).pack(o)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "msgpack/_packer.pyx", line 279, in msgpack._cmsgpack.Packer.pack
  File "msgpack/_packer.pyx", line 276, in msgpack._cmsgpack.Packer.pack
  File "msgpack/_packer.pyx", line 265, in msgpack._cmsgpack.Packer._pack
  File "msgpack/_packer.pyx", line 211, in msgpack._cmsgpack.Packer._pack_inner
RuntimeError: dictionary changed size during iteration

If this happens (it's rare; we've been running the shared long poll code for months but only noticed this when we started making a lot more Serve deployments per unit time recently), the LongPollClient never calls poll_next() again, and thus no longer receives updates from the Serve Controller. Depending on how fast your actor replicas come and go, this eventually leads to poor scheduling, and then no scheduling if all the replicas you had at the time of the error go away.

A simple fix for this is to shallow copy the snapshot IDs just before passing them to the .remote() call. The shallow copy is then not modified if the original snapshot IDs held by the LongPollClient are modified. We've tested this fix in our system and it did prevent the problem from occurring.

Since it would be difficult to arrange this to happen in a deterministic test, I've added a comment describing the problem case that the copy solves for.

Related issue number

Checks

I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've added any new APIs to the API Reference. For example, if I added a
  method in Tune, I've added it in doc/source/tune/api/ under the
  corresponding .rst file.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

Signed-off-by: Josh Karpel <josh.karpel@gmail.com>

zcin · 2025-05-05T22:29:22Z

python/ray/serve/_private/long_poll.py

+            # (from the SharedRouterLongPollClient registering a new router).
+            # See https://github.com/ray-project/ray/pull/52793 for more details.
+            self.snapshot_ids.copy()
+        )


So the register call in the shared long poll client, which modifies self.snapshot_ids, is called in the main event loop, whereas the infinite poll loop which serializes self.snapshot_ids runs in the router event loop.

Seems like the issue comes from iterating through the dictionary self.snapshot_ids when trying to serialize it, and if the serializer happens to be in the middle of iterating through the dict when the update happens, we run into this race condition. However copy will probably also need to iterate through the dictionary, so would this not run into the same issue?

It seems like we should either add a lock around relevant reads/writes of self.snapshot_ids, or submit add_key_listeners to be called in the router event loop.

Ooo, that's a really good point, I missed that! I wonder why we haven't seen the problem since we added the .copy() - perhaps because it just further reduce the time window that it can happen in, and thus reduces the probability that it will happen?

It seems like we should either add a lock around relevant reads/writes of self.snapshot_ids, or submit add_key_listeners to be called in the router event loop.

I will ponder these options!

I wonder why we haven't seen the problem since we added the .copy() - perhaps because it just further reduce the time window that it can happen in, and thus reduces the probability that it will happen?

Yeah, that seems very likely - copy is probably faster than serialization, and so it's much harder to run into the race condition.

I think I know what's happening with the .copy() - because the underlying object is the built-in dict, the .copy() method is actually a built-in function:

Python 3.12.6 (main, Sep 9 2024, 21:36:32) [Clang 18.1.8 ] on darwin Type "help", "copyright", "credits" or "license" for more information. >>> d = dict() >>> d.copy <built-in method copy of dict object at 0x100d0b180>

and the trick with built-in methods is that they ultimately execute as a single CPython VM bytecode instruction, and therefore cannot be interrupted by another Python thread if they choose to hold the GIL for the duration of execution.

So, it works, but it won't if the underlying data structure ever changes to one with a non-built-in .copy() method! I will poke at the lock or event loop submission solutions today for a more future-proof solution.

I went with the call_soon_threadsafe approach - that seems more elegant to me than the lock.

Signed-off-by: Josh Karpel <josh.karpel@gmail.com>

zcin

Nice, LGTM! OOC @JoshKarpel were you able to test this in the environment where you were previously seeing issues?

JoshKarpel · 2025-05-07T15:46:16Z

Nice, LGTM! OOC @JoshKarpel were you able to test this in the environment where you were previously seeing issues?

Good point, no, we have the .copy() method in place - I'll switch it and get back to you.

JoshKarpel · 2025-05-16T18:11:57Z

@zcin we've had this deployed for 2 days now and haven't seen the error - happy to let it bake longer if you want though!

zcin · 2025-05-19T17:18:49Z

@JoshKarpel sounds good, I think this is good to merge in then! I'm going to update the branch one more time to prevent merge issues.

JoshKarpel · 2025-05-19T18:40:41Z

Thanks @zcin !

#53613) ## Why are these changes needed? We found a second bug with the shared long poll client, similar to #52793 - this time it's for a different part of the registration process. The error we saw looked like: ``` 2025-06-06 14:29:14 ERROR Exception in callback LongPollClient._process_update.<locals>.chained() at /app/virtualenv/lib64/python3.12/site-packages/ray/serve/_private/long_poll.py:215 handle: <Handle LongPollClient._process_update.<locals>.chained() at /app/virtualenv/lib64/python3.12/site-packages/ray/serve/_private/long_poll.py:215> Traceback (most recent call last): File "/usr/lib64/python3.12/asyncio/events.py", line 88, in _run self._context.run(self._callback, *self._args) File "/app/virtualenv/lib64/python3.12/site-packages/ray/serve/_private/long_poll.py", line 216, in chained callback(arg) File "/app/virtualenv/lib64/python3.12/site-packages/ray/serve/_private/router.py", line 833, in update_deployment_config for router in self.routers[deployment_id]: ~~~~~~~~~~~~^^^^^^^^^^^^^^^ File "/usr/lib64/python3.12/_weakrefset.py", line 65, in __iter__ for itemref in self.data: ^^^^^^^^^ RuntimeError: Set changed size during iteration ``` This is getting pretty gnarly. I feel like there's a potential future refactor of the long poll client code that makes it truly `async` so that we don't need to worry about this stuff? But I'm not too familiar with how/why the background threads are being used here. ## Checks - [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [x] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [x] This PR is not tested :( --------- Signed-off-by: Josh Karpel <josh.karpel@gmail.com>

#53613) ## Why are these changes needed? We found a second bug with the shared long poll client, similar to #52793 - this time it's for a different part of the registration process. The error we saw looked like: ``` 2025-06-06 14:29:14 ERROR Exception in callback LongPollClient._process_update.<locals>.chained() at /app/virtualenv/lib64/python3.12/site-packages/ray/serve/_private/long_poll.py:215 handle: <Handle LongPollClient._process_update.<locals>.chained() at /app/virtualenv/lib64/python3.12/site-packages/ray/serve/_private/long_poll.py:215> Traceback (most recent call last): File "/usr/lib64/python3.12/asyncio/events.py", line 88, in _run self._context.run(self._callback, *self._args) File "/app/virtualenv/lib64/python3.12/site-packages/ray/serve/_private/long_poll.py", line 216, in chained callback(arg) File "/app/virtualenv/lib64/python3.12/site-packages/ray/serve/_private/router.py", line 833, in update_deployment_config for router in self.routers[deployment_id]: ~~~~~~~~~~~~^^^^^^^^^^^^^^^ File "/usr/lib64/python3.12/_weakrefset.py", line 65, in __iter__ for itemref in self.data: ^^^^^^^^^ RuntimeError: Set changed size during iteration ``` This is getting pretty gnarly. I feel like there's a potential future refactor of the long poll client code that makes it truly `async` so that we don't need to worry about this stuff? But I'm not too familiar with how/why the background threads are being used here. ## Checks - [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [x] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [x] This PR is not tested :( --------- Signed-off-by: Josh Karpel <josh.karpel@gmail.com> Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>

JoshKarpel added 2 commits May 5, 2025 15:46

copy snapshot IDs before serializing in the LongPollClient

d68f90c

Signed-off-by: Josh Karpel <josh.karpel@gmail.com>

link to PR in comment

0ccbb45

Signed-off-by: Josh Karpel <josh.karpel@gmail.com>

JoshKarpel marked this pull request as ready for review May 5, 2025 20:57

Merge branch 'master' into fix-shared-long-poll-concurrency-bug

09198d5

zcin reviewed May 5, 2025

View reviewed changes

use call_soon_threadsafe instead

444ba86

Signed-off-by: Josh Karpel <josh.karpel@gmail.com>

JoshKarpel changed the title ~~[Serve] Copy snapshot IDs before serializing in the LongPollClient to avoid concurrent modification~~ [Serve] Avoid concurrent modification of LongPollClient.snapshot_ids May 6, 2025

JoshKarpel requested a review from zcin May 6, 2025 20:48

zcin approved these changes May 6, 2025

View reviewed changes

zcin added the go add ONLY when ready to merge, run all tests label May 6, 2025

Merge branch 'master' into fix-shared-long-poll-concurrency-bug

6142dc8

hainesmichaelc added the community-contribution Contributed by the community label May 7, 2025

Merge branch 'master' into fix-shared-long-poll-concurrency-bug

94ad49e

mascharkh added serve Ray Serve Related Issue stability labels May 12, 2025

Merge branch 'master' into fix-shared-long-poll-concurrency-bug

85c06cb

zcin enabled auto-merge (squash) May 19, 2025 17:20

zcin merged commit 1d7b01f into ray-project:master May 19, 2025
6 checks passed

JoshKarpel deleted the fix-shared-long-poll-concurrency-bug branch May 19, 2025 18:40

hainesmichaelc added the community-backlog label May 22, 2025

JoshKarpel mentioned this pull request Jun 6, 2025

[Serve] Call shared long poll client router registration in event loop #53613

Merged

8 tasks

JoshKarpel mentioned this pull request Jul 10, 2025

[Serve] Calls to a Serve Deployment's .remote(), hang after some amount of time / requests. #47870

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Serve] Avoid concurrent modification of `LongPollClient.snapshot_ids`#52793

[Serve] Avoid concurrent modification of `LongPollClient.snapshot_ids`#52793
zcin merged 7 commits intoray-project:masterfrom
JoshKarpel:fix-shared-long-poll-concurrency-bug

JoshKarpel commented May 5, 2025 •

edited

Loading

Uh oh!

zcin May 5, 2025

Uh oh!

zcin May 5, 2025

Uh oh!

JoshKarpel May 5, 2025

Uh oh!

zcin May 5, 2025

Uh oh!

JoshKarpel May 6, 2025

Uh oh!

JoshKarpel May 6, 2025

Uh oh!

zcin left a comment

Uh oh!

JoshKarpel commented May 7, 2025

Uh oh!

JoshKarpel commented May 16, 2025

Uh oh!

zcin commented May 19, 2025

Uh oh!

Uh oh!

JoshKarpel commented May 19, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

JoshKarpel commented May 5, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Why are these changes needed?

Related issue number

Checks

Uh oh!

zcin May 5, 2025

Choose a reason for hiding this comment

Uh oh!

zcin May 5, 2025

Choose a reason for hiding this comment

Uh oh!

JoshKarpel May 5, 2025

Choose a reason for hiding this comment

Uh oh!

zcin May 5, 2025

Choose a reason for hiding this comment

Uh oh!

JoshKarpel May 6, 2025

Choose a reason for hiding this comment

Uh oh!

JoshKarpel May 6, 2025

Choose a reason for hiding this comment

Uh oh!

zcin left a comment

Choose a reason for hiding this comment

Uh oh!

JoshKarpel commented May 7, 2025

Uh oh!

JoshKarpel commented May 16, 2025

Uh oh!

zcin commented May 19, 2025

Uh oh!

Uh oh!

JoshKarpel commented May 19, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

JoshKarpel commented May 5, 2025 •

edited

Loading