subset load balancer scalability concerns #3929

@rgs1

Envoy is becoming unstable (throughput drops) due to CPU spikes (>40% usage) when using the subset LB under the following conditions:

  • an upstream cluster with > 1k endpoints
  • where each endpoint has 3 metadata keys
  • and there are at least 6 subsets active at any time
  • and healthchecks change state at a rate of ~40/sec, which happens during deploys or when the cluster is rapidly scaling up or down
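To put a rough number on the conditions above, here is a back-of-the-envelope sketch. It assumes that every healthcheck state change triggers a full subset rebuild that touches every endpoint once per subset, and that each worker thread does this independently. Both are assumptions made for illustration, and `rebuild_work_per_second` is a hypothetical helper, not an Envoy function.

```python
def rebuild_work_per_second(endpoints, subsets, changes_per_sec, workers=1):
    """Endpoint-to-subset evaluations per second, assuming (hypothetically)
    that every state change forces a full rebuild on every worker."""
    return endpoints * subsets * changes_per_sec * workers


# Figures from the list above: 1k endpoints, 6 subsets, ~40 changes/sec.
print(rebuild_work_per_second(1000, 6, 40))  # 240000 per worker
```

Under those assumptions the work grows linearly with the number of workers, which is why the rate of state changes matters more than the cluster size alone.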

Although we've done some work to make rebuilding the subset LB more efficient (17efc83), the spikes are still there, albeit somewhat reduced.

Relaxing the healthcheck thresholds would help. However, it could also have the undesired effect of valid healthcheck failures being acted on too late.

A better solution might be to rate limit or coalesce healthcheck state changes when too many of them are being triggered. That could maybe be done somewhere around:

https://github.com/envoyproxy/envoy/blob/master/source/common/upstream/cluster_manager_impl.cc#L583.
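To make the coalescing idea concrete, here is a minimal sketch, written in Python for brevity rather than as Envoy code. `HealthChangeCoalescer`, the 100 ms window and the `rebuild` callback are all hypothetical names and values chosen for illustration. State changes are buffered per host, and the rebuild callback fires at most once per window with the whole batch.

```python
from typing import Callable, Dict, Optional


class HealthChangeCoalescer:
    """Batches health-state changes and triggers at most one rebuild per window."""

    def __init__(self, window_ms: int, rebuild: Callable[[Dict[str, bool]], None]):
        self.window_ms = window_ms
        self.rebuild = rebuild
        self.pending: Dict[str, bool] = {}           # host -> latest observed health
        self.window_start: Optional[int] = None      # time of first change in the open window

    def on_change(self, host: str, healthy: bool, now_ms: int) -> None:
        # A later change for the same host overwrites the earlier one in the batch.
        if not self.pending:
            self.window_start = now_ms
        self.pending[host] = healthy
        self.maybe_flush(now_ms)

    def maybe_flush(self, now_ms: int) -> None:
        # Meant to be driven by a periodic timer as well, so that a batch is
        # delivered even when no further changes arrive.
        if self.pending and now_ms - self.window_start >= self.window_ms:
            batch, self.pending = self.pending, {}
            self.window_start = None
            self.rebuild(batch)


rebuilds = []
coalescer = HealthChangeCoalescer(window_ms=100, rebuild=rebuilds.append)

# 40 state changes spread over one second, i.e. one every 25 ms.
for i in range(40):
    coalescer.on_change(f"host-{i}", healthy=(i % 2 == 0), now_ms=i * 25)

print(len(rebuilds))  # 8 rebuilds instead of 40
```

The tradeoff is added latency: a host that goes unhealthy may keep receiving traffic for up to one window before the rebuild takes effect, so the window would need to stay well below the healthcheck interval.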

There's also the possibility, previously mentioned by @mattklein123, of moving the subset LB rebuild work into the main thread and then posting the results.
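A rough sketch of that second option, again in Python and not based on the actual Envoy code: the subset index is built once, frozen, and then handed to each worker, which only swaps a reference. `build_subset_index`, `Worker`, the host dictionaries and the metadata keys are hypothetical, and filtering out unhealthy hosts at build time is a simplification made for the example.

```python
from collections import defaultdict
from types import MappingProxyType


def build_subset_index(hosts, subset_keys):
    """Group healthy hosts by the values of each subset's metadata keys."""
    index = defaultdict(list)
    for host in hosts:
        if not host["healthy"]:
            continue
        md = host["metadata"]
        for keys in subset_keys:
            if all(k in md for k in keys):
                index[(keys, tuple(md[k] for k in keys))].append(host["address"])
    # Freeze the result so workers can share it without copying or locking.
    return MappingProxyType({k: tuple(v) for k, v in index.items()})


class Worker:
    """Stand-in for a worker thread: it only swaps in a prebuilt snapshot."""

    def __init__(self):
        self.snapshot = MappingProxyType({})

    def post(self, snapshot):
        self.snapshot = snapshot  # reference swap, no per-worker rebuild


hosts = [
    {"address": "10.0.0.1", "healthy": True, "metadata": {"stage": "prod", "version": "v1"}},
    {"address": "10.0.0.2", "healthy": False, "metadata": {"stage": "prod", "version": "v1"}},
    {"address": "10.0.0.3", "healthy": True, "metadata": {"stage": "prod", "version": "v2"}},
]
subset_keys = [("stage",), ("stage", "version")]

workers = [Worker() for _ in range(4)]
snapshot = build_subset_index(hosts, subset_keys)  # built once, on the "main thread"
for w in workers:
    w.post(snapshot)
```

With this shape the rebuild cost is paid once per update rather than once per worker, at the price of keeping the shared snapshot immutable.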

Ideally, the subset LB should be able to operate under the above conditions without any manual tuning.

Thoughts?

cc: @zuercher @trjordan @derekargueta @brian-pane
