
[RFC] cluster: introduce futureThreadLocalCluster interface to cluster manager #15053

Closed
lambdai wants to merge 1 commit into envoyproxy:main from lambdai:futureclusterproto

Conversation

@lambdai
Contributor

@lambdai lambdai commented Feb 15, 2021

Partially address #15026

This is a strawman that explains how ClusterManager embraces on-demand clusters by adding a futureThreadLocalCluster() method. ClusterManager users (e.g. a downstream-terminating filter) need to switch from getThreadLocalCluster(). I rewrote the tcp proxy filter and added a basic test.

Note that tcp proxy does not need the details of why a FutureCluster is not ready.
In TcpProxyFutureClusterTest I introduced a DelayedFutureCluster to emulate an on-demand cluster config update.
In the traditional TcpProxyTest I introduced a ReadyFutureCluster to show how to migrate from getThreadLocalCluster() to futureThreadLocalCluster().

Notes for reviewers:
I refactored the tcp proxy tests to reduce compile and test run time. I can create a separate PR to land that in master if you like.
The newly added tests live in test/common/tcp_proxy/tcp_proxy_cluster_test.cc.


Signed-off-by: Yuchen Dai <silentdai@gmail.com>

DelayedFutureCluster and refactor tcp proxy tests

Signed-off-by: Yuchen Dai <silentdai@gmail.com>

fix LocalClosedBeforeClusterIsReady

Signed-off-by: Yuchen Dai <silentdai@gmail.com>

fix TcpProxyFutureClusterTest

Signed-off-by: Yuchen Dai <silentdai@gmail.com>
@bryce-anderson

This looks like what we'd need for our use case. Since some details are missing, I presume the final implementation would have a failure pathway of some sort; otherwise, its goal appears to be a fairly simple interface for registering a callback for when the cluster resolves.

@mattklein123 mattklein123 self-assigned this Feb 16, 2021
@mattklein123
Member

At a very high level this looks right to me. However, like VHDS, I think the actual lazy loading should be in its own HTTP or network filter that runs before the other filters. I think this would be a cleaner (and less risky) design. My advice would be to first build the future API into CM, and then we can write the filters that use it?

@htuch
Member

htuch commented Feb 16, 2021

+1 on separating the lazy load to a filter. By default, CM should not have to make decisions and we should have non-lazy EDS.

@lambdai
Contributor Author

lambdai commented Feb 16, 2021

@bryce-anderson Thank you for the feedback! The goal of this PR is to explain the necessary changes to the data path, including CM, beyond HTTP requests. I absolutely agree that the eventual on-demand CDS/EDS needs a fallback strategy.

@lambdai
Contributor Author

lambdai commented Feb 16, 2021

@htuch @mattklein123
There are three questions:

  1. Should CM provide a ClusterManager::futureThreadLocalCluster() API?
    My answer is yes. I also want to add that this API is independent of whether the cluster is lazy-loadable; see also question 3. futureThreadLocalCluster(), along with FutureCluster, works both when the cluster is not lazy-loadable and when the cluster has already been lazy-loaded. FutureCluster::isReady() returns whether the cluster is available and saves the effort of awaiting:
if (future->isReady()) {
  onClusterReady(*future);
} else {
  future_cluster_handle_ =
      future->await(read_callbacks_->connection().dispatcher(),
                    [this](Upstream::FutureCluster& f) { onClusterReady(f); });
}
  2. Which component should initiate the lazy load attempt if the cluster is lazy-loadable?
    In the HTTP connection manager we could use a preload HTTP filter as in the VHDS flow, because the VHDS filter knows the cluster name prior to creating the connection in the HTTP router filter.
    This PR addresses tcp_proxy, where TcpProxy is the trigger because only the TcpProxy filter knows the target cluster name (see TcpProxy::pickRoute). Not all network filters have the luxury of firing the lazy load attempt before requesting the upstream connection.
    futureThreadLocalCluster() provides the minimum API to enable a lazy-load filter.

  3. Are non-lazy-loadable EDS clusters allowed? Which API should decide whether a known cluster is lazy-loadable?
    This PR doesn't address this issue, but here are my two cents.
    Let's exclude some answers first: a filter should not decide whether a cluster is lazy-loadable. The filter can use FutureCluster::isReady() and skip FutureCluster::await() to avoid waiting blindly.
    I prefer to add a field in message Cluster to indicate whether CdsApi should send the EDS request and put the cluster in a warming state.
    For on-demand CDS there is no Cluster config before the cluster is lazy-loaded (a chicken-and-egg problem), so we probably need to provide a matcher for CM to decide whether a cluster name should be watched on the on-demand CDS stream. Luckily, a lazy-loadable EDS cluster doesn't have this problem, because we assume CDS provides the cluster info, which includes whether it is lazy-loadable.

@htuch
Member

htuch commented Feb 17, 2021

@lambdai One thing I think we would want to make sure is possible: receiving a CDS update, storing the cluster config locally as a proto, but not instantiating a full Cluster (with stats), and then doing the instantiation only on demand. This could save a ton of memory and CPU on updates for rarely used clusters. We might even fetch the EDS resources but leave them uninstantiated.

@mattklein123
Member

re: (1), yes, agreed, let's add this functionality and test it as an independent PR. It's not very complicated since all the parts are already there. I would note that we don't strictly need this to be part of the cluster manager API, since with the callback mechanisms in place I think this could be implemented on top. I don't feel strongly about it, but you might consider a wrapper class.

re: (2), for HTTP I agree this should be a filter. For TCP, I'm not crazy about making the tcp_proxy filter even more complicated, so it might be worth seeing whether we can refactor to allow TCP route selection to be done by a previous filter. I think this may be possible; it's worth investigating.

re: (3) I agree with what @htuch said, but this part is complicated enough that I would recommend doing an independent design doc on this portion to make sure we are all on the same page.

@lambdai
Contributor Author

lambdai commented Feb 23, 2021

Thank you for the feedback! I am drafting a design doc and will post it in the linked issue.

@lambdai
Contributor Author

lambdai commented Feb 24, 2021

@htuch
Member

htuch commented Feb 25, 2021

I think we can move lazy local expansion to a separate issue, but the API (and ideally the implementation) should be forward-thinking with this use case in mind, as it's likely to come up in MT scenarios where on-demand doesn't work for reliability reasons.

@github-actions

This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@github-actions github-actions bot added the stale stalebot believes this issue/PR has not been touched recently label Mar 27, 2021
@github-actions

github-actions bot commented Apr 3, 2021

This pull request has been automatically closed because it has not had activity in the last 37 days. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@github-actions github-actions bot closed this Apr 3, 2021

Labels

stale stalebot believes this issue/PR has not been touched recently waiting


4 participants