Support for endpoint lease (how long endpoints should be used) #6420

@vishalpowar

This follows up on the previously closed issue (#5347).
This one specifically targets only the stale endpoints problem.


Title: Add support for specifying how long endpoints should be used

Description:
When disconnected from the management server, Envoy keeps using the last known information it received from the management server. This is probably fine for many use cases and deployments, but there are situations where it can be harmful to the overall system.

Stale Endpoints:
Consider a deployment where the management server does load balancing and sends endpoints with weights to the Envoys. If an Envoy is disconnected from the management server for a significant period, it will continue to use the stale weights and endpoints. Although the Envoy will still route traffic to endpoints, we might end up with:

  • Some servers moving to a different ip:port, leaving fewer usable endpoints, which eventually results in server overload.
  • Servers getting overloaded because the assigned weights are stale.
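
To make the first risk concrete, here is a small arithmetic sketch. The endpoint counts are hypothetical, and it assumes equal weights and that traffic aimed at endpoints that have moved is spread evenly across the ones that remain reachable; real retry and load-balancing behaviour will differ.

```python
def load_multiplier(total_endpoints: int, moved: int) -> float:
    """Factor by which per-endpoint load grows when `moved` endpoints vanish.

    Assumes equal weights and even redistribution (illustrative only).
    """
    usable = total_endpoints - moved
    if usable <= 0:
        raise ValueError("no usable endpoints remain")
    return total_endpoints / usable


# Hypothetical numbers: 10 endpoints known to Envoy, 4 have since moved.
print(round(load_multiplier(10, 4), 2))  # 1.67, i.e. about 67% more load each
```

The longer the disconnect lasts, the more endpoints are likely to have moved, so the multiplier keeps growing without Envoy being aware of it.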

Proposal:

Since we are specifically looking to identify stale information, associating a lease with each individual endpoint is overkill.
There are two levels at which this control signal can be added.

Envoy level (Config)
As part of the CDS response. This applies to all resources returned in the CDS response, and only changes when there is a new CDS response.
e.g.

message OutlierDetection { 
.   .   .
  // The max time for which an endpoint can be used after it was received as part of EDS/CDS
  // Defaults to 0 which means never.
  google.protobuf.Duration  endpoint_stale_after = 12 [(validate.rules).duration.gt.seconds = 0];
}

Cluster level (EDS resources)
Set per ClusterLoadAssignment, so the value is carried with each assignment (every EDS response).

message ClusterLoadAssignment {
  // The max time for which an endpoint of this cluster can be used after this assignment was received.
  // Defaults to 0 which means never.
  google.protobuf.Duration endpoint_stale_after = 5 [(validate.rules).duration.gt.seconds = 0];
}

Either of the above approaches extends easily to incremental xDS, since what we are really tracking is how long it has been since the management server last sent us endpoints.
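
As a rough illustration of that tracking, here is a minimal Python sketch of the cluster-level variant. It is not Envoy code: `EndpointLeaseTracker`, `on_assignment` and `is_stale` are hypothetical names, time is passed in explicitly as seconds, and what Envoy should do once a lease expires is deliberately left out because the proposal does not specify it.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Lease:
    received_at: float  # seconds on a monotonic clock when the assignment arrived
    stale_after: float  # seconds; 0 means the endpoints never go stale


class EndpointLeaseTracker:
    """Hypothetical helper recording when each cluster's endpoints last arrived."""

    def __init__(self) -> None:
        self._leases: Dict[str, Lease] = {}

    def on_assignment(self, cluster: str, stale_after: float, now: float) -> None:
        # Any update from the management server, full or incremental,
        # simply replaces the lease for that cluster.
        self._leases[cluster] = Lease(received_at=now, stale_after=stale_after)

    def is_stale(self, cluster: str, now: float) -> bool:
        lease = self._leases.get(cluster)
        # Assumption: a cluster with no recorded lease is treated as not stale.
        if lease is None or lease.stale_after == 0:
            return False
        return now - lease.received_at > lease.stale_after


tracker = EndpointLeaseTracker()
tracker.on_assignment("backend", stale_after=30.0, now=100.0)
print(tracker.is_stale("backend", now=120.0))  # False: 20s elapsed, lease is 30s
print(tracker.is_stale("backend", now=131.0))  # True: 31s elapsed
```

For the Envoy-level variant, the same check would apply with a single `stale_after` value taken from the latest CDS response instead of one per assignment.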

Labels: design proposal (needs design doc/proposal before implementation)