[Serve] Get ServeHandle on the same node#11477
Conversation
| router_node_ids = ray.get([ | ||
| check_handle_router_id.options(resources={ | ||
| node_id: 0.01 | ||
| }).remote() for node_id in ray.state.node_ids() |
There was a problem hiding this comment.
What's the difference between ray.state.node_ids() here and node_ids on line 146 above?
There was a problem hiding this comment.
ray.state.node_ids() is a list of resource key in form of ["NODE-IP"]
node_ids above is a list of hex ids ["aabbccdd12312312", ...]
| }).remote() for node_id in ray.state.node_ids() | ||
| ]) | ||
|
|
||
| assert set(router_node_ids) == set(node_ids) |
There was a problem hiding this comment.
Because we're converting to set, it seems like this assertion could pass even if the routers somehow got swapped. Is this something to worry about?
edoakes
left a comment
There was a problem hiding this comment.
Looks good but please remove the missing_ok change unless it's required for this
| missing_ok (bool): If true, then Serve won't check the endpoint is | ||
| registered. False by default. |
There was a problem hiding this comment.
Are we re-adding this? Is it necessary for this change?
| router_chosen = next( | ||
| filter(lambda r: get_node_id_for_actor(r) == current_node_id, | ||
| routers)) | ||
| except StopIteration: | ||
| logger.warning( | ||
| f"When getting a handle for {endpoint_name}, Serve can't find " | ||
| "a router on the same node. Serve will use a random router.") | ||
| router_chosen = random.choice(routers) |
There was a problem hiding this comment.
this logic is a lot of "magic" -- any reason not to just iterate over the routers in a for loop? would be more readable
There was a problem hiding this comment.
I will just leave this logic as is. they will be gone really really soon with upcoming refactoring ;-)
|
Richard needs the missing_ok flag because there's a circular dependency between endpoint and the s3 downloaded. Namely:
|
|
@simon-mo @edoakes if we get #11444 (comment) I believe the circular dependency would be resolved, because then I could init the S3 clients in a separate step after the backends are fully initialized. |
Why are these changes needed?
@crypdick ran into this issue running Serve on 20+ machines. This PR resolves a tech debt we had in the past. It will now retrieve the router from the same node, if we can't find a node, it will use a random node. This behavior should be better than offloading all requests to a single node.
Related issue number
Checks
scripts/format.shto lint the changes in this PR.