Description
When a cluster is created or updated, Envoy puts it into a warming phase, and the cluster needs a corresponding ClusterLoadAssignment response to fully initialize.
During startup, Envoy sends requests for those resources to the management server, so the management server knows it has to respond to them.
But when a Cluster is updated via CDS, no new EDS request is sent to the management server, so the management server has no signal that it should send the ClusterLoadAssignment for that Cluster again, even though the resource itself hasn't changed.
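To make the sequence above concrete, here is a minimal toy simulation. It is only a sketch and does not use any real Envoy or control-plane API: `ToyManagementServer`, `ToyEnvoy`, and the `"backend"` cluster name are hypothetical, and the server is assumed to push EDS only when the resource version differs from the one it last sent.

```python
from dataclasses import dataclass, field


@dataclass
class ToyManagementServer:
    """Hypothetical server: pushes EDS only if the version differs from the last push."""
    eds_versions: dict = field(default_factory=dict)  # cluster name -> current EDS version
    last_sent: dict = field(default_factory=dict)     # cluster name -> version last pushed

    def maybe_push_eds(self, cluster):
        version = self.eds_versions[cluster]
        if self.last_sent.get(cluster) == version:
            return None  # unchanged from the server's point of view, so no push
        self.last_sent[cluster] = version
        return version


@dataclass
class ToyEnvoy:
    """Hypothetical client: tracks only which clusters are still warming."""
    warming: set = field(default_factory=set)

    def on_cds_update(self, cluster):
        # A created or updated cluster warms until a ClusterLoadAssignment arrives.
        self.warming.add(cluster)

    def on_eds_response(self, cluster):
        self.warming.discard(cluster)


server = ToyManagementServer(eds_versions={"backend": "v1"})
envoy = ToyEnvoy()

# Startup: the cluster is created and the server answers the initial EDS request.
envoy.on_cds_update("backend")
first_push = server.maybe_push_eds("backend")
if first_push is not None:
    envoy.on_eds_response("backend")
warm_after_startup = "backend" not in envoy.warming

# Later: the Cluster is updated via CDS, but the EDS resource is unchanged.
envoy.on_cds_update("backend")
second_push = server.maybe_push_eds("backend")
if second_push is not None:
    envoy.on_eds_response("backend")
stuck_warming = "backend" in envoy.warming

print(first_push, warm_after_startup, second_push, stuck_warming)
```

In this model the second CDS update leaves the cluster warming, because nothing tells the server that the unchanged ClusterLoadAssignment is needed again.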
This behavior introduces subtle bugs in management servers. In our case, our resource versioning scheme happened to include the cluster version. Yesterday we introduced a change to remove it, and that unexpectedly broke our cluster updates, leaving some clusters without traffic.
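The effect of the versioning change can be illustrated with a small sketch. The actual scheme is not shown in this issue, so `eds_version`, its hashing approach, and the sample endpoints below are assumptions made purely for illustration.

```python
import hashlib
import json


def eds_version(endpoints, cluster_version=None):
    """Hypothetical EDS version: a hash of the endpoints, optionally mixed with the cluster version."""
    payload = {"endpoints": sorted(endpoints)}
    if cluster_version is not None:
        payload["cluster_version"] = cluster_version
    encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()[:12]


endpoints = ["10.0.0.1:8080", "10.0.0.2:8080"]

# Old scheme: the cluster version is part of the EDS version, so a CDS update
# changes the EDS version even though the endpoints are identical.
old_before = eds_version(endpoints, cluster_version="1")
old_after = eds_version(endpoints, cluster_version="2")

# New scheme: only the endpoints matter, so a CDS update leaves the EDS version
# untouched and a version-comparing server sees nothing to push.
new_before = eds_version(endpoints)
new_after = eds_version(endpoints)

print(old_before != old_after, new_before == new_after)
```

Under these assumptions, the old scheme triggered an EDS push on every cluster update as a side effect, which is why removing the cluster version exposed the problem.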
This should probably be handled by Envoy. Currently go-control-plane and java-control-plane don't handle this, since it is hard for a management server to always push an EDS update for a Cluster when the resource itself hasn't changed. When not using ADS, Envoy may also be connected to multiple management servers, which makes this even harder.
@htuch suggested that Envoy should probably unsubscribe from EDS for the updated cluster and immediately subscribe again, since this is what is actually happening inside Envoy: a new Cluster is being created and wants to subscribe to its resources, while the old one wants to unsubscribe. This idea needs more thought, in particular about the consequences of unsubscribing the old cluster from its resources.
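A rough sketch of how the unsubscribe/resubscribe idea could look from the management server's side follows. `ToyEdsServer` and its methods are hypothetical and not part of Envoy, go-control-plane, or java-control-plane; the sketch assumes the server forgets what it sent for a resource once the client unsubscribes.

```python
class ToyEdsServer:
    """Hypothetical EDS server that tracks what it has sent per subscribed resource."""

    def __init__(self, resources):
        self.resources = dict(resources)  # resource name -> (version, payload)
        self.sent = {}                    # resource name -> version last sent

    def subscribe(self, name):
        version, payload = self.resources[name]
        if self.sent.get(name) == version:
            return None  # already delivered at this version, so nothing is pushed
        self.sent[name] = version
        return payload

    def unsubscribe(self, name):
        # Forget the delivery state so that a later subscribe is answered again.
        self.sent.pop(name, None)


server = ToyEdsServer({"backend": ("v1", ["10.0.0.1:8080"])})

initial = server.subscribe("backend")        # startup: the resource is delivered
repeated = server.subscribe("backend")       # same subscription, same version: no push

# Cluster updated via CDS: the old cluster unsubscribes, the new one subscribes.
server.unsubscribe("backend")
after_resubscribe = server.subscribe("backend")

print(initial, repeated, after_resubscribe)
```

With this approach the server needs no special knowledge of CDS updates: the resubscription itself is the signal to resend the unchanged ClusterLoadAssignment.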
Some discussion already happened on Slack; I'm opening this issue to continue it.
This issue also applies to LDS updates with RDS and probably SRDS with RDS.