Description
This is a follow-up to the previously closed issue (#5347).
It specifically addresses only the stale endpoints problem.
Title: Add support for specifying how long endpoints should be used
Description:
When disconnected from the management server, Envoy keeps using the last known information it received from that server. This is probably fine for many use cases and deployments, but there are situations where it can actually harm the overall system.
Stale endpoints:
Consider a deployment where the management server does the load balancing and sends endpoints with weights to the Envoys. If an Envoy is disconnected from the management server for a significant period, it will continue to use the stale weights and endpoints. Envoy will still route traffic to those endpoints, but we might end up with:
- Some servers having moved to a different ip:port, leaving fewer usable endpoints, which eventually overloads the remaining servers.
- Servers becoming overloaded because the assigned weights are stale.
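To make the first failure mode concrete, here is a small sketch with made-up numbers. The endpoint addresses, weights, and request rate are all hypothetical, and it assumes that traffic to unreachable endpoints ends up redistributed over the remaining ones in proportion to their (stale) weights.

```python
def per_endpoint_load(weights, reachable, total_rps):
    """Split total_rps across reachable endpoints in proportion to their weights."""
    live = {ep: w for ep, w in weights.items() if ep in reachable}
    total_weight = sum(live.values())
    if total_weight == 0:
        return {}
    return {ep: total_rps * w / total_weight for ep, w in live.items()}


# Hypothetical stale assignment: four endpoints with equal weights.
stale_weights = {
    "10.0.0.1:80": 1,
    "10.0.0.2:80": 1,
    "10.0.0.3:80": 1,
    "10.0.0.4:80": 1,
}

# All four endpoints still exist at the advertised addresses.
before = per_endpoint_load(stale_weights, set(stale_weights), total_rps=1000)

# Two servers have moved to a different ip:port that this Envoy never learns about.
after = per_endpoint_load(
    stale_weights, {"10.0.0.1:80", "10.0.0.2:80"}, total_rps=1000
)
```

In this toy case each remaining endpoint goes from a quarter of the traffic to half of it, even though the management server may already know about replacement endpoints.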
Proposal:
Since we are specifically looking to identify stale information, associating a lease with each individual endpoint is overkill.
There are two levels at which this control signal can be added.
Envoy level (Config)
Set as part of the CDS response. The value applies to all resources returned in that CDS response and changes only when a new CDS response arrives.
e.g.
message OutlierDetection {
. . .
// The max time for which an endpoint can be used after it was received as part of EDS/CDS.
// Defaults to 0, which means endpoints never become stale.
google.protobuf.Duration endpoint_stale_after = 12 [(validate.rules).duration.gt.seconds = 0];
}
Cluster level (EDS resources)
Associated with each ClusterLoadAssignment, so the value is set per assignment (that is, with every EDS response).
message ClusterLoadAssignment {
// The max time for which an endpoint of this cluster can be used after this assignment was received.
// Defaults to 0, which means endpoints never become stale.
google.protobuf.Duration endpoint_stale_after = 5 [(validate.rules).duration.gt.seconds = 0];
}
Either of the above approaches extends easily to Incremental xDS, because what we are really tracking is how long it has been since the management server last sent us endpoints.
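The tracking described above can be sketched roughly as follows. This is only an illustration of the idea, not Envoy code: the class name `EndpointStalenessTracker` and its methods are hypothetical, and what Envoy should actually do once a cluster's endpoints are considered stale is left open here.

```python
import time


class EndpointStalenessTracker:
    """Tracks, per cluster, how long ago the last assignment was received."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock          # injectable so the logic can be tested
        self._received_at = {}       # cluster name -> time of last assignment
        self._stale_after = {}       # cluster name -> seconds; 0 means never

    def on_assignment(self, cluster, stale_after_seconds=0.0):
        # Called whenever the management server sends endpoints for a cluster;
        # every new assignment resets the staleness timer.
        self._received_at[cluster] = self._clock()
        self._stale_after[cluster] = stale_after_seconds

    def is_stale(self, cluster):
        if cluster not in self._received_at:
            return False             # nothing received yet, nothing to expire
        limit = self._stale_after[cluster]
        if limit <= 0:
            return False             # 0 (the default) means never stale
        return self._clock() - self._received_at[cluster] > limit
```

Because the timer is keyed on when endpoints were last received rather than on individual endpoints, the same check works whether the update arrived as a full response or as an incremental one.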