Skip to content

[Serve] Call shared long poll client router registration in event loop#53613

Merged
zcin merged 3 commits intoray-project:masterfrom
JoshKarpel:fix-second-shared-long-poll-concurreny-bug
Jun 13, 2025
Merged

[Serve] Call shared long poll client router registration in event loop#53613
zcin merged 3 commits intoray-project:masterfrom
JoshKarpel:fix-second-shared-long-poll-concurreny-bug

Conversation

@JoshKarpel
Copy link
Copy Markdown
Contributor

Why are these changes needed?

We found a second bug with the shared long poll client, similar to #52793 - this time it's for a different part of the registration process. The error we saw looked like:

 2025-06-06 14:29:14 ERROR Exception in callback LongPollClient._process_update.<locals>.chained() at /app/virtualenv/lib64/python3.12/site-packages/ray/serve/_private/long_poll.py:215
 handle: <Handle LongPollClient._process_update.<locals>.chained() at /app/virtualenv/lib64/python3.12/site-packages/ray/serve/_private/long_poll.py:215>
 Traceback (most recent call last):
   File "/usr/lib64/python3.12/asyncio/events.py", line 88, in _run
     self._context.run(self._callback, *self._args)
   File "/app/virtualenv/lib64/python3.12/site-packages/ray/serve/_private/long_poll.py", line 216, in chained
     callback(arg)
   File "/app/virtualenv/lib64/python3.12/site-packages/ray/serve/_private/router.py", line 833, in update_deployment_config
     for router in self.routers[deployment_id]:
                   ~~~~~~~~~~~~^^^^^^^^^^^^^^^
   File "/usr/lib64/python3.12/_weakrefset.py", line 65, in __iter__
     for itemref in self.data:
                    ^^^^^^^^^
 RuntimeError: Set changed size during iteration

This is getting pretty gnarly. I feel like there's a potential future refactor of the long poll client code that makes it truly async so that we don't need to worry about this stuff? But I'm not too familiar with how/why the background threads are being used here.

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Josh Karpel <josh.karpel@gmail.com>
Signed-off-by: Josh Karpel <josh.karpel@gmail.com>
@JoshKarpel JoshKarpel changed the title [Serve] Call shared long poll router registration in event loop [Serve] Call shared long poll client router registration in event loop Jun 6, 2025
@JoshKarpel JoshKarpel marked this pull request as ready for review June 6, 2025 17:28
Copilot AI review requested due to automatic review settings June 6, 2025 17:28
@JoshKarpel
Copy link
Copy Markdown
Contributor Author

cc @zcin for review, we found another one!

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Updates the shared long-poll client to schedule router registrations on the provided asyncio event loop, preventing concurrent mutations of the router set.

  • Stores the event loop in SharedRouterLongPollClient
  • Switches LongPollClient to use the stored loop
  • Wraps register calls with call_soon_threadsafe and introduces a private _register method
Comments suppressed due to low confidence (3)

python/ray/serve/_private/router.py:904

  • Add unit tests that simulate concurrent long-poll callbacks and registrations to verify that call_soon_threadsafe prevents RuntimeError during iteration.
self.event_loop.call_soon_threadsafe(self._register, router)

python/ray/serve/_private/router.py:906

  • [nitpick] Add a docstring for _register explaining that it runs on the event loop to safely mutate self.routers.
def _register(self, router: AsyncioRouter) -> None:

python/ray/serve/_private/router.py:904

  • Consider applying the same call_soon_threadsafe scheduling to any deregistration or cleanup paths to avoid similar race conditions on removal.
self.event_loop.call_soon_threadsafe(self._register, router)

@JoshKarpel
Copy link
Copy Markdown
Contributor Author

Gentle bump on this one @zcin , adding @edoakes as well

@zcin zcin added the go add ONLY when ready to merge, run all tests label Jun 12, 2025
@JoshKarpel JoshKarpel requested a review from a team as a code owner June 12, 2025 17:11
@JoshKarpel
Copy link
Copy Markdown
Contributor Author

@zcin looks like CI passed, I think this is ready to merge 🤞🏻

@zcin zcin merged commit df59880 into ray-project:master Jun 13, 2025
5 checks passed
@JoshKarpel
Copy link
Copy Markdown
Contributor Author

Thanks!

@JoshKarpel JoshKarpel deleted the fix-second-shared-long-poll-concurreny-bug branch June 13, 2025 16:51
elliot-barn pushed a commit that referenced this pull request Jun 18, 2025
#53613)

## Why are these changes needed?

We found a second bug with the shared long poll client, similar to
#52793 - this time it's for a
different part of the registration process. The error we saw looked
like:
```
 2025-06-06 14:29:14 ERROR Exception in callback LongPollClient._process_update.<locals>.chained() at /app/virtualenv/lib64/python3.12/site-packages/ray/serve/_private/long_poll.py:215
 handle: <Handle LongPollClient._process_update.<locals>.chained() at /app/virtualenv/lib64/python3.12/site-packages/ray/serve/_private/long_poll.py:215>
 Traceback (most recent call last):
   File "/usr/lib64/python3.12/asyncio/events.py", line 88, in _run
     self._context.run(self._callback, *self._args)
   File "/app/virtualenv/lib64/python3.12/site-packages/ray/serve/_private/long_poll.py", line 216, in chained
     callback(arg)
   File "/app/virtualenv/lib64/python3.12/site-packages/ray/serve/_private/router.py", line 833, in update_deployment_config
     for router in self.routers[deployment_id]:
                   ~~~~~~~~~~~~^^^^^^^^^^^^^^^
   File "/usr/lib64/python3.12/_weakrefset.py", line 65, in __iter__
     for itemref in self.data:
                    ^^^^^^^^^
 RuntimeError: Set changed size during iteration
```

This is getting pretty gnarly. I feel like there's a potential future
refactor of the long poll client code that makes it truly `async` so
that we don't need to worry about this stuff? But I'm not too familiar
with how/why the background threads are being used here.

## Checks

- [x] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [x] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [x] This PR is not tested :(

---------

Signed-off-by: Josh Karpel <josh.karpel@gmail.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
elliot-barn pushed a commit that referenced this pull request Jul 2, 2025
#53613)

## Why are these changes needed?

We found a second bug with the shared long poll client, similar to
#52793 - this time it's for a
different part of the registration process. The error we saw looked
like:
```
 2025-06-06 14:29:14 ERROR Exception in callback LongPollClient._process_update.<locals>.chained() at /app/virtualenv/lib64/python3.12/site-packages/ray/serve/_private/long_poll.py:215
 handle: <Handle LongPollClient._process_update.<locals>.chained() at /app/virtualenv/lib64/python3.12/site-packages/ray/serve/_private/long_poll.py:215>
 Traceback (most recent call last):
   File "/usr/lib64/python3.12/asyncio/events.py", line 88, in _run
     self._context.run(self._callback, *self._args)
   File "/app/virtualenv/lib64/python3.12/site-packages/ray/serve/_private/long_poll.py", line 216, in chained
     callback(arg)
   File "/app/virtualenv/lib64/python3.12/site-packages/ray/serve/_private/router.py", line 833, in update_deployment_config
     for router in self.routers[deployment_id]:
                   ~~~~~~~~~~~~^^^^^^^^^^^^^^^
   File "/usr/lib64/python3.12/_weakrefset.py", line 65, in __iter__
     for itemref in self.data:
                    ^^^^^^^^^
 RuntimeError: Set changed size during iteration
```

This is getting pretty gnarly. I feel like there's a potential future
refactor of the long poll client code that makes it truly `async` so
that we don't need to worry about this stuff? But I'm not too familiar
with how/why the background threads are being used here.

## Checks

- [x] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [x] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [x] This PR is not tested :(

---------

Signed-off-by: Josh Karpel <josh.karpel@gmail.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

go add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants