Description
Envoy is becoming unstable (throughput drops) due to CPU spikes (>40% usage) when using the subset LB under the following conditions:
- an upstream cluster with > 1k endpoints
- where each endpoint has 3 metadata keys
- and there are at least 6 subsets active at any time
- and healthchecks change state at a rate of ~40/sec, which happens during deploys or when the cluster is rapidly scaling up or down
Although we've done some work to make rebuilding the subset LB more efficient (17efc83), the spikes are still there, albeit somewhat reduced.
Relaxing the healthcheck thresholds would help, but it could also mean that valid healthcheck state changes take effect too late.
A better solution might be to rate limit or coalesce healthcheck state changes when too many of them are triggered at once. That could maybe be done somewhere around:
https://github.com/envoyproxy/envoy/blob/master/source/common/upstream/cluster_manager_impl.cc#L583.
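To make the coalescing idea concrete, here is a minimal sketch. It is not Envoy code: `HealthChangeCoalescer`, `onHealthChange`, `maybeFlush`, and the rebuild callback are hypothetical names, and the clock is passed in explicitly so the behavior is easy to reason about. The assumption is that a dispatcher timer would call `maybeFlush` when the window expires.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical sketch, not Envoy code: batches per-host health transitions so
// the expensive rebuild callback runs at most once per coalescing window.
class HealthChangeCoalescer {
public:
  using Clock = std::chrono::steady_clock;
  // Host address -> latest observed health state.
  using Batch = std::unordered_map<std::string, bool>;
  using RebuildFn = std::function<void(const Batch&)>;

  HealthChangeCoalescer(std::chrono::milliseconds window, RebuildFn rebuild)
      : window_(window), rebuild_(std::move(rebuild)) {
    assert(window_.count() > 0);
  }

  // Records a transition. Only the latest state per host is kept, so a host
  // that flaps several times within one window costs a single entry.
  void onHealthChange(const std::string& host, bool healthy, Clock::time_point now) {
    if (pending_.empty()) {
      // First change of a new batch starts the window.
      deadline_ = now + window_;
    }
    pending_[host] = healthy;
    maybeFlush(now);
  }

  // Runs the rebuild once for the whole batch if the window has elapsed.
  // Returns true if the rebuild callback was invoked.
  bool maybeFlush(Clock::time_point now) {
    if (pending_.empty() || now < deadline_) {
      return false;
    }
    Batch batch;
    batch.swap(pending_);
    rebuild_(batch);
    return true;
  }

  std::size_t pendingCount() const { return pending_.size(); }

private:
  const std::chrono::milliseconds window_;
  RebuildFn rebuild_;
  Batch pending_;
  Clock::time_point deadline_{};
};
```

With a 100 ms window, ~40 changes/sec would trigger at most 10 rebuilds/sec instead of 40. The trade-off is that every change is delayed by up to one window, so the window length would need to stay well below the healthcheck interval.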
There's also the possibility, previously mentioned by @mattklein123, of moving the subset LB rebuild work into the main thread and then posting the results.
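A rough sketch of that "build once, then hand out the result" pattern follows. Everything here is hypothetical and much simpler than the real subset LB: `Host`, `SubsetSnapshot`, `buildSnapshot`, and `SnapshotHolder` are illustrative names, subsets are keyed by a single `key=value` pair rather than by selector combinations, and a mutex-guarded `shared_ptr` stands in for posting the result to each worker.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical types for illustration; none of these names are Envoy's.
struct Host {
  std::string address;
  std::map<std::string, std::string> metadata;
  bool healthy;
};

// Subset key (e.g. "stage=prod") -> addresses of healthy hosts in that subset.
using SubsetSnapshot = std::map<std::string, std::vector<std::string>>;

// Runs once on the building thread: groups healthy hosts by each metadata
// key=value pair. The returned snapshot is immutable from then on.
std::shared_ptr<const SubsetSnapshot> buildSnapshot(const std::vector<Host>& hosts) {
  auto snapshot = std::make_shared<SubsetSnapshot>();
  for (const auto& host : hosts) {
    if (!host.healthy) {
      continue;
    }
    for (const auto& kv : host.metadata) {
      (*snapshot)[kv.first + "=" + kv.second].push_back(host.address);
    }
  }
  return snapshot;
}

// Holds the latest snapshot. Readers only copy a pointer under the lock, so
// they never pay for (or block on) the rebuild itself.
class SnapshotHolder {
public:
  void publish(std::shared_ptr<const SubsetSnapshot> next) {
    assert(next != nullptr);
    std::lock_guard<std::mutex> lock(mutex_);
    current_ = std::move(next);
  }

  std::shared_ptr<const SubsetSnapshot> current() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return current_;
  }

private:
  mutable std::mutex mutex_;
  std::shared_ptr<const SubsetSnapshot> current_ = std::make_shared<const SubsetSnapshot>();
};
```

The point of the design is that the rebuild cost is paid once per update rather than once per worker, and a reader that already holds an older snapshot can keep using it safely until it fetches the next one.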
Ideally, the subset LB should be able to operate under the above conditions without any manual tuning.
Thoughts?