
[Serve] Get ServeHandle on the same node#11477

Merged
simon-mo merged 4 commits into ray-project:master from simon-mo:serve/handle-affinity
Oct 20, 2020

Conversation

@simon-mo
Contributor

Why are these changes needed?

@crypdick ran into this issue running Serve on 20+ machines. This PR resolves a piece of tech debt we had in the past: the handle will now retrieve the router from the same node, and if no router is found on the same node, it will fall back to a random one. This behavior should be better than offloading all requests to a single node.

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

router_node_ids = ray.get([
    check_handle_router_id.options(resources={
        node_id: 0.01
    }).remote() for node_id in ray.state.node_ids()
])
Contributor


What's the difference between ray.state.node_ids() here and node_ids on line 146 above?

Contributor Author


ray.state.node_ids() is a list of resource keys in the form ["NODE-IP"], while node_ids above is a list of hex ids ["aabbccdd12312312", ...].

}).remote() for node_id in ray.state.node_ids()
])

assert set(router_node_ids) == set(node_ids)
Contributor


Because we're converting to set, it seems like this assertion could pass even if the routers somehow got swapped. Is this something to worry about?
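To spell out the concern with hypothetical values (not taken from the actual test):

```python
# Two nodes "a" and "b"; suppose the routers' node reports got swapped.
node_ids = ["a", "b"]
router_node_ids = ["b", "a"]  # swapped order

assert set(router_node_ids) == set(node_ids)  # the set check passes anyway
assert router_node_ids != node_ids            # an order-sensitive check notices the swap
```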

Collaborator

@edoakes edoakes left a comment


Looks good but please remove the missing_ok change unless it's required for this

Comment on lines +328 to +329
missing_ok (bool): If true, then Serve won't check the endpoint is
registered. False by default.
Collaborator


Are we re-adding this? Is it necessary for this change?

Comment on lines +342 to +349
    router_chosen = next(
        filter(lambda r: get_node_id_for_actor(r) == current_node_id,
               routers))
except StopIteration:
    logger.warning(
        f"When getting a handle for {endpoint_name}, Serve can't find "
        "a router on the same node. Serve will use a random router.")
    router_chosen = random.choice(routers)
Collaborator


this logic is a lot of "magic" -- any reason not to just iterate over the routers in a for loop? would be more readable

Contributor Author


I will just leave this logic as is; it will be gone really soon with the upcoming refactoring ;-)
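For illustration, the for-loop version suggested above might look roughly like this (a sketch with stand-in names, not the actual Serve code):

```python
import random

def choose_router(routers, current_node_id, get_node_id_for_actor):
    """Prefer a router co-located on the current node.

    Falls back to a random router when none is found, mirroring
    the next()/filter() logic quoted above.
    """
    for router in routers:
        if get_node_id_for_actor(router) == current_node_id:
            return router
    # No router on this node; pick one at random.
    return random.choice(routers)
```

Both versions implement the same policy; the for loop just makes the fallback path explicit.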

@simon-mo
Contributor Author

Richard needs the missing_ok flag because there's a circular dependency between the endpoint and the S3 downloader. Namely:

  • S3 clients are started in the backend's init method, which needs to get a handle for the endpoint.
  • Creating the backend blocks on the init method.
  • Creating the endpoint requires the backend to be created already.
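To make the flag's role concrete, here is a toy sketch of how missing_ok breaks that cycle (a hypothetical get_handle, not the real Serve API):

```python
def get_handle(endpoint_name, registered_endpoints, missing_ok=False):
    """Toy model: with missing_ok=True, skip the registration check.

    This lets a backend's init method grab a handle for an endpoint
    that will only be registered after the backend finishes starting.
    """
    if not missing_ok and endpoint_name not in registered_endpoints:
        raise ValueError(f"Endpoint {endpoint_name} is not registered.")
    return f"handle:{endpoint_name}"
```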

@crypdick
Contributor

crypdick commented Oct 20, 2020

@simon-mo @edoakes if we get #11444 (comment) I believe the circular dependency would be resolved, because then I could init the S3 clients in a separate step after the backends are fully initialized.

@simon-mo
Contributor Author

Hi @crypdick, #11444 will take longer to develop. Let's merge this one and test it out.

@simon-mo simon-mo merged commit 2c5cb95 into ray-project:master Oct 20, 2020