Title: cluster membership race when streaming eds data and active health checking
Description:
Now that Lyft has rolled out streaming xDS data to all the envoys, we are seeing a race condition with service discovery.
During shutdown, a host fails its health check and then deregisters from discovery. We are running into a problem where the downstream envoys receive the updated EDS data before they see the failed health check. This leaves cluster membership out of sync until the next membership change happens (which could be a long time).
This was not a problem before, as the sidecar envoys were polling the discovery data every 30 seconds. But now that we have moved to the streaming xDS protocol, we are seeing this race.
This is a bigger issue with Kubernetes services, as the k8s API server streams service discovery data to our control plane very quickly (and sometimes infrequently), so stale membership can linger and potentially push envoy into panic routing mode.
After talking with Matt, the proposed solution is to reconcile host membership not only when envoy receives EDS data, but also when envoy sees a failed health check from an upstream host.
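To make the ordering concrete, here is a toy model of the race and of the proposed fix. It is only a sketch, not envoy code: the `Cluster` and `Host` classes, the `pending_removal` flag, and the `reconcile_on_hc_failure` switch are all hypothetical names, and it assumes that a host dropped from EDS is kept in the cluster for as long as it is still passing active health checks.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    address: str
    healthy: bool = True
    # Hypothetical flag: host is absent from the latest EDS update but was
    # kept because it was still passing active health checks.
    pending_removal: bool = False


@dataclass
class Cluster:
    hosts: dict = field(default_factory=dict)
    # Hypothetical switch for the proposed behaviour.
    reconcile_on_hc_failure: bool = False

    def on_eds_update(self, addresses):
        # Add or refresh every host named in the update.
        for addr in addresses:
            host = self.hosts.setdefault(addr, Host(addr))
            host.pending_removal = False
        # Hosts missing from the update are removed only if already unhealthy
        # (assumed behaviour); healthy ones are kept and marked.
        for addr in list(self.hosts):
            if addr not in addresses:
                host = self.hosts[addr]
                if host.healthy:
                    host.pending_removal = True
                else:
                    del self.hosts[addr]

    def on_health_check_failure(self, addr):
        host = self.hosts.get(addr)
        if host is None:
            return
        host.healthy = False
        # Proposed fix: reconcile membership here as well, instead of
        # waiting for the next EDS update.
        if self.reconcile_on_hc_failure and host.pending_removal:
            del self.hosts[addr]


def run(reconcile):
    cluster = Cluster(reconcile_on_hc_failure=reconcile)
    cluster.on_eds_update({"10.0.0.1", "10.0.0.2"})
    cluster.on_eds_update({"10.0.0.1"})          # EDS removal arrives first
    cluster.on_health_check_failure("10.0.0.2")  # health check fails afterwards
    return sorted(cluster.hosts)


print(run(False))  # stale host lingers until the next EDS update
print(run(True))   # stale host is dropped as soon as the health check fails
```

In this model, without the extra reconciliation step the deregistered host stays in the cluster until another EDS update arrives, which matches the out-of-sync membership described above.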
Would love to get consensus from the community, particularly @snowp, on the best path forward. We would like host membership to reconcile very quickly if possible.