subset load balancer scalability concerns

Envoy becomes unstable (throughput drops) because of CPU spikes (>40% usage) when the subset LB is used under the following conditions:

* an upstream cluster with > 1k endpoints
* where each endpoint has 3 metadata keys
* and there are at least 6 subsets active at any time
* and healthchecks change state at a rate of ~40/sec, which happens during deploys or when the cluster is rapidly scaling up or down

Although we've done some work to make rebuilding the subset LB more efficient (https://github.com/envoyproxy/envoy/commit/17efc838016101f7607fbb9a27151da606e0bd13), the spikes are still there, albeit somewhat reduced.

Relaxing the healthcheck thresholds would help, but it could also have the undesired effect of valid healthcheck results kicking in too late.

A better solution might be to rate limit or coalesce healthcheck state changes when too many of them are being triggered. That could perhaps be done somewhere around:

https://github.com/envoyproxy/envoy/blob/master/source/common/upstream/cluster_manager_impl.cc#L583
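To make the coalescing idea concrete, here is a minimal standalone sketch. It is not Envoy code: `HealthChangeCoalescer`, `onHealthChange`, `maybeFlush`, and the millisecond window are all hypothetical names and parameters chosen for illustration. The assumption is that state changes arriving within one window can be merged, so the subset LB is rebuilt once per window instead of once per change.

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical helper (not an Envoy class): buffers per-host health
// transitions and triggers at most one rebuild per window.
class HealthChangeCoalescer {
public:
  using Batch = std::unordered_map<std::string, bool>;
  using RebuildFn = std::function<void(const Batch&)>;

  HealthChangeCoalescer(uint64_t window_ms, RebuildFn rebuild)
      : window_ms_(window_ms), rebuild_(std::move(rebuild)) {}

  // Record a transition. A later change for the same host overwrites the
  // earlier one, so a host that flaps inside the window costs one entry.
  void onHealthChange(const std::string& host, bool healthy, uint64_t now_ms) {
    if (pending_.empty()) {
      window_start_ms_ = now_ms; // The first change opens a new window.
    }
    pending_[host] = healthy;
    maybeFlush(now_ms);
  }

  // Meant to be called from a periodic timer as well, so that a burst
  // followed by silence is still flushed.
  void maybeFlush(uint64_t now_ms) {
    if (pending_.empty() || now_ms - window_start_ms_ < window_ms_) {
      return;
    }
    Batch batch = std::move(pending_);
    pending_.clear();
    rebuild_(batch); // One rebuild for the whole batch.
  }

private:
  const uint64_t window_ms_;
  RebuildFn rebuild_;
  Batch pending_;
  uint64_t window_start_ms_{0};
};
```

With a 100 ms window and ~40 changes/sec, this would cap rebuilds at roughly 10/sec. The trade-off is that a single isolated state change is delayed by up to one window; flushing immediately when no rebuild happened recently would avoid that, at the cost of a bit more state.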

There's also the possibility, previously mentioned by @mattklein123, of moving the subset LB rebuild work into the main thread and then posting the results. 
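A rough sketch of what "rebuild on the main thread, then post the results" could look like, under heavy simplification. Everything here is hypothetical and not Envoy's actual API: `Host`, `SubsetTable`, `Worker`, `buildSnapshot`, and `publish` are illustrative stand-ins, and `Worker::post` only mimics an event-loop queue without real threads. The point is that the expensive grouping runs once, and each worker only swaps a pointer to an immutable snapshot.

```cpp
#include <functional>
#include <map>
#include <memory>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified types; none of these are Envoy classes.
struct Host {
  std::string name;
  std::map<std::string, std::string> metadata;
  bool healthy;
};

// "key=value" -> names of healthy hosts in that subset.
using SubsetTable = std::map<std::string, std::vector<std::string>>;
using SubsetSnapshot = std::shared_ptr<const SubsetTable>;

// Stand-in for a worker's event loop: posted callbacks run later, on the worker.
class Worker {
public:
  void post(std::function<void()> cb) { queue_.push(std::move(cb)); }
  void runPosted() {
    while (!queue_.empty()) {
      auto cb = std::move(queue_.front());
      queue_.pop();
      cb();
    }
  }
  void setSnapshot(SubsetSnapshot s) { snapshot_ = std::move(s); }
  const SubsetSnapshot& snapshot() const { return snapshot_; }

private:
  std::queue<std::function<void()>> queue_;
  SubsetSnapshot snapshot_;
};

// Runs once on the main thread: group healthy hosts by one metadata key.
SubsetSnapshot buildSnapshot(const std::vector<Host>& hosts, const std::string& key) {
  auto table = std::make_shared<SubsetTable>();
  for (const auto& host : hosts) {
    if (!host.healthy) {
      continue;
    }
    auto it = host.metadata.find(key);
    if (it != host.metadata.end()) {
      (*table)[key + "=" + it->second].push_back(host.name);
    }
  }
  return table;
}

// Each worker receives a cheap pointer swap instead of redoing the rebuild.
// The workers vector must outlive the posted callbacks.
void publish(const SubsetSnapshot& snapshot, std::vector<Worker>& workers) {
  for (auto& worker : workers) {
    worker.post([&worker, snapshot] { worker.setSnapshot(snapshot); });
  }
}
```

In this shape the rebuild cost no longer scales with the number of worker threads, though the main thread then carries the full cost of each rebuild, so it would probably still benefit from coalescing.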

Ideally, the subset LB should be able to operate under the above conditions without any manual tuning.

Thoughts?

cc: @zuercher @trjordan @derekargueta @brian-pane 
