Skip to content

[Serve] Avoid concurrent modification of LongPollClient.snapshot_ids#52793

Merged
zcin merged 7 commits intoray-project:masterfrom
JoshKarpel:fix-shared-long-poll-concurrency-bug
May 19, 2025
Merged

[Serve] Avoid concurrent modification of LongPollClient.snapshot_ids#52793
zcin merged 7 commits intoray-project:masterfrom
JoshKarpel:fix-shared-long-poll-concurrency-bug

Conversation

@JoshKarpel
Copy link
Copy Markdown
Contributor

@JoshKarpel JoshKarpel commented May 5, 2025

Why are these changes needed?

This is a followup to our previous PR #48807 , which changed the way that the LongPollClient is managed by Routers such that there is, in the long-term, only a single active LongPollClient per client process.

To make that happen, that PR added a way to change the set of snapshot_ids that the LongPollClient was listening for. Unfortunately, I didn't realize at the time that the registration code that changes the snapshot_ids would be running in a different thread from the LongPollClient's polling loop, which means that the snapshot ID dictionary could be concurrently mutated from one thread while the other thread is trying to serialize it, leading to a stack trace like:

Exception in callback <function LongPollClient._process_update.<locals>.chained at 0x7f1d4afbc540>
handle: <Handle LongPollClient._process_update.<locals>.chained>
Traceback (most recent call last):
  File "uvloop/cbhandles.pyx", line 61, in uvloop.loop.Handle._run
  File ".../site-packages/ray/serve/_private/long_poll.py", line 199, in chained
    self._on_callback_completed(trigger_at=len(updates))
  File ".../site-packages/ray/serve/_private/long_poll.py", line 133, in _on_callback_completed
    self._poll_next()
  File ".../site-packages/ray/serve/_private/long_poll.py", line 143, in _poll_next
    self._current_ref = self.host_actor.listen_for_change.remote(self.snapshot_ids)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../site-packages/ray/actor.py", line 206, in remote
    return self._remote(args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File ".../site-packages/ray/util/tracing/tracing_helper.py", line 422, in _start_span
    return method(self, args, kwargs, *_args, **_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../site-packages/ray/actor.py", line 366, in _remote
    return invocation(args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../site-packages/ray/actor.py", line 347, in invocation
    return actor._actor_method_call(
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../site-packages/ray/actor.py", line 1503, in _actor_method_call
    object_refs = worker.core_worker.submit_actor_task(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "python/ray/_raylet.pyx", line 3972, in ray._raylet.CoreWorker.submit_actor_task
  File "python/ray/_raylet.pyx", line 3977, in ray._raylet.CoreWorker.submit_actor_task
  File "python/ray/_raylet.pyx", line 859, in ray._raylet.prepare_args_and_increment_put_refs
  File "python/ray/_raylet.pyx", line 850, in ray._raylet.prepare_args_and_increment_put_refs
  File "python/ray/_raylet.pyx", line 900, in ray._raylet.prepare_args_internal
  File "/app/virtualenv/lib/python3.12/site-packages/ray/_private/serialization.py", line 556, in serialize
    return self._serialize_to_msgpack(value)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../site-packages/ray/_private/serialization.py", line 530, in _serialize_to_msgpack
    msgpack_data = MessagePackSerializer.dumps(value, _python_serializer)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "python/ray/includes/serialization.pxi", line 175, in ray._raylet.MessagePackSerializer.dumps
  File ".../site-packages/msgpack/__init__.py", line 36, in packb
    return Packer(**kwargs).pack(o)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "msgpack/_packer.pyx", line 279, in msgpack._cmsgpack.Packer.pack
  File "msgpack/_packer.pyx", line 276, in msgpack._cmsgpack.Packer.pack
  File "msgpack/_packer.pyx", line 265, in msgpack._cmsgpack.Packer._pack
  File "msgpack/_packer.pyx", line 211, in msgpack._cmsgpack.Packer._pack_inner
RuntimeError: dictionary changed size during iteration

If this happens (it's rare; we've been running the shared long poll code for months but only noticed this when we started making a lot more Serve deployments per unit time recently), the LongPollClient never calls poll_next() again, and thus no longer receives updates from the Serve Controller. Depending on how fast your actor replicas come and go, this eventually leads to poor scheduling, and then no scheduling if all the replicas you had at the time of the error go away.

A simple fix for this is to shallow copy the snapshot IDs just before passing them to the .remote() call. The shallow copy is then not modified if the original snapshot IDs held by the LongPollClient are modified. We've tested this fix in our system and it did prevent the problem from occurring.

Since it would be difficult to arrange this to happen in a deterministic test, I've added a comment describing the problem case that the copy solves for.

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

JoshKarpel added 2 commits May 5, 2025 15:46
Signed-off-by: Josh Karpel <josh.karpel@gmail.com>
Signed-off-by: Josh Karpel <josh.karpel@gmail.com>
@JoshKarpel JoshKarpel marked this pull request as ready for review May 5, 2025 20:57
# (from the SharedRouterLongPollClient registering a new router).
# See https://github.com/ray-project/ray/pull/52793 for more details.
self.snapshot_ids.copy()
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the register call in the shared long poll client, which modifies self.snapshot_ids, is called in the main event loop, whereas the infinite poll loop which serializes self.snapshot_ids runs in the router event loop.

Seems like the issue comes from iterating through the dictionary self.snapshot_ids when trying to serialize it, and if the serializer happens to be in the middle of iterating through the dict when the update happens, we run into this race condition. However copy will probably also need to iterate through the dictionary, so would this not run into the same issue?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like we should either add a lock around relevant reads/writes of self.snapshot_ids, or submit add_key_listeners to be called in the router event loop.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ooo, that's a really good point, I missed that! I wonder why we haven't seen the problem since we added the .copy() - perhaps because it just further reduce the time window that it can happen in, and thus reduces the probability that it will happen?

It seems like we should either add a lock around relevant reads/writes of self.snapshot_ids, or submit add_key_listeners to be called in the router event loop.

I will ponder these options!

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder why we haven't seen the problem since we added the .copy() - perhaps because it just further reduce the time window that it can happen in, and thus reduces the probability that it will happen?

Yeah, that seems very likely - copy is probably faster than serialization, and so it's much harder to run into the race condition.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I know what's happening with the .copy() - because the underlying object is the built-in dict, the .copy() method is actually a built-in function:

Python 3.12.6 (main, Sep  9 2024, 21:36:32) [Clang 18.1.8 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> d = dict()
>>> d.copy
<built-in method copy of dict object at 0x100d0b180>

and the trick with built-in methods is that they ultimately execute as a single CPython VM bytecode instruction, and therefore cannot be interrupted by another Python thread if they choose to hold the GIL for the duration of execution.

So, it works, but it won't if the underlying data structure ever changes to one with a non-built-in .copy() method! I will poke at the lock or event loop submission solutions today for a more future-proof solution.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went with the call_soon_threadsafe approach - that seems more elegant to me than the lock.

Signed-off-by: Josh Karpel <josh.karpel@gmail.com>
@JoshKarpel JoshKarpel changed the title [Serve] Copy snapshot IDs before serializing in the LongPollClient to avoid concurrent modification [Serve] Avoid concurrent modification of LongPollClient.snapshot_ids May 6, 2025
@JoshKarpel JoshKarpel requested a review from zcin May 6, 2025 20:48
Copy link
Copy Markdown
Contributor

@zcin zcin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, LGTM! OOC @JoshKarpel were you able to test this in the environment where you were previously seeing issues?

@zcin zcin added the go add ONLY when ready to merge, run all tests label May 6, 2025
@JoshKarpel
Copy link
Copy Markdown
Contributor Author

Nice, LGTM! OOC @JoshKarpel were you able to test this in the environment where you were previously seeing issues?

Good point, no, we have the .copy() method in place - I'll switch it and get back to you.

@hainesmichaelc hainesmichaelc added the community-contribution Contributed by the community label May 7, 2025
@mascharkh mascharkh added serve Ray Serve Related Issue stability labels May 12, 2025
@JoshKarpel
Copy link
Copy Markdown
Contributor Author

@zcin we've had this deployed for 2 days now and haven't seen the error - happy to let it bake longer if you want though!

@zcin
Copy link
Copy Markdown
Contributor

zcin commented May 19, 2025

@JoshKarpel sounds good, I think this is good to merge in then! I'm going to update the branch one more time to prevent merge issues.

@zcin zcin enabled auto-merge (squash) May 19, 2025 17:20
@zcin zcin merged commit 1d7b01f into ray-project:master May 19, 2025
6 checks passed
@JoshKarpel JoshKarpel deleted the fix-shared-long-poll-concurrency-bug branch May 19, 2025 18:40
@JoshKarpel
Copy link
Copy Markdown
Contributor Author

Thanks @zcin !

zcin pushed a commit that referenced this pull request Jun 13, 2025
#53613)

## Why are these changes needed?

We found a second bug with the shared long poll client, similar to
#52793 - this time it's for a
different part of the registration process. The error we saw looked
like:
```
 2025-06-06 14:29:14 ERROR Exception in callback LongPollClient._process_update.<locals>.chained() at /app/virtualenv/lib64/python3.12/site-packages/ray/serve/_private/long_poll.py:215
 handle: <Handle LongPollClient._process_update.<locals>.chained() at /app/virtualenv/lib64/python3.12/site-packages/ray/serve/_private/long_poll.py:215>
 Traceback (most recent call last):
   File "/usr/lib64/python3.12/asyncio/events.py", line 88, in _run
     self._context.run(self._callback, *self._args)
   File "/app/virtualenv/lib64/python3.12/site-packages/ray/serve/_private/long_poll.py", line 216, in chained
     callback(arg)
   File "/app/virtualenv/lib64/python3.12/site-packages/ray/serve/_private/router.py", line 833, in update_deployment_config
     for router in self.routers[deployment_id]:
                   ~~~~~~~~~~~~^^^^^^^^^^^^^^^
   File "/usr/lib64/python3.12/_weakrefset.py", line 65, in __iter__
     for itemref in self.data:
                    ^^^^^^^^^
 RuntimeError: Set changed size during iteration
```

This is getting pretty gnarly. I feel like there's a potential future
refactor of the long poll client code that makes it truly `async` so
that we don't need to worry about this stuff? But I'm not too familiar
with how/why the background threads are being used here.

## Checks

- [x] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [x] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [x] This PR is not tested :(

---------

Signed-off-by: Josh Karpel <josh.karpel@gmail.com>
elliot-barn pushed a commit that referenced this pull request Jun 18, 2025
#53613)

## Why are these changes needed?

We found a second bug with the shared long poll client, similar to
#52793 - this time it's for a
different part of the registration process. The error we saw looked
like:
```
 2025-06-06 14:29:14 ERROR Exception in callback LongPollClient._process_update.<locals>.chained() at /app/virtualenv/lib64/python3.12/site-packages/ray/serve/_private/long_poll.py:215
 handle: <Handle LongPollClient._process_update.<locals>.chained() at /app/virtualenv/lib64/python3.12/site-packages/ray/serve/_private/long_poll.py:215>
 Traceback (most recent call last):
   File "/usr/lib64/python3.12/asyncio/events.py", line 88, in _run
     self._context.run(self._callback, *self._args)
   File "/app/virtualenv/lib64/python3.12/site-packages/ray/serve/_private/long_poll.py", line 216, in chained
     callback(arg)
   File "/app/virtualenv/lib64/python3.12/site-packages/ray/serve/_private/router.py", line 833, in update_deployment_config
     for router in self.routers[deployment_id]:
                   ~~~~~~~~~~~~^^^^^^^^^^^^^^^
   File "/usr/lib64/python3.12/_weakrefset.py", line 65, in __iter__
     for itemref in self.data:
                    ^^^^^^^^^
 RuntimeError: Set changed size during iteration
```

This is getting pretty gnarly. I feel like there's a potential future
refactor of the long poll client code that makes it truly `async` so
that we don't need to worry about this stuff? But I'm not too familiar
with how/why the background threads are being used here.

## Checks

- [x] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [x] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [x] This PR is not tested :(

---------

Signed-off-by: Josh Karpel <josh.karpel@gmail.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
elliot-barn pushed a commit that referenced this pull request Jul 2, 2025
#53613)

## Why are these changes needed?

We found a second bug with the shared long poll client, similar to
#52793 - this time it's for a
different part of the registration process. The error we saw looked
like:
```
 2025-06-06 14:29:14 ERROR Exception in callback LongPollClient._process_update.<locals>.chained() at /app/virtualenv/lib64/python3.12/site-packages/ray/serve/_private/long_poll.py:215
 handle: <Handle LongPollClient._process_update.<locals>.chained() at /app/virtualenv/lib64/python3.12/site-packages/ray/serve/_private/long_poll.py:215>
 Traceback (most recent call last):
   File "/usr/lib64/python3.12/asyncio/events.py", line 88, in _run
     self._context.run(self._callback, *self._args)
   File "/app/virtualenv/lib64/python3.12/site-packages/ray/serve/_private/long_poll.py", line 216, in chained
     callback(arg)
   File "/app/virtualenv/lib64/python3.12/site-packages/ray/serve/_private/router.py", line 833, in update_deployment_config
     for router in self.routers[deployment_id]:
                   ~~~~~~~~~~~~^^^^^^^^^^^^^^^
   File "/usr/lib64/python3.12/_weakrefset.py", line 65, in __iter__
     for itemref in self.data:
                    ^^^^^^^^^
 RuntimeError: Set changed size during iteration
```

This is getting pretty gnarly. I feel like there's a potential future
refactor of the long poll client code that makes it truly `async` so
that we don't need to worry about this stuff? But I'm not too familiar
with how/why the background threads are being used here.

## Checks

- [x] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [x] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [x] This PR is not tested :(

---------

Signed-off-by: Josh Karpel <josh.karpel@gmail.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-backlog community-contribution Contributed by the community go add ONLY when ready to merge, run all tests serve Ray Serve Related Issue stability

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants