Support xDS partial rejections #3079

@ikatson

Hi envoy team, we've been using envoy a lot lately in all imaginable pieces of infrastructure, and we love it. One of the use cases we tried to cover is a highly customized Kubernetes ingress controller tailored to our needs.

We use gRPC ADS discovery based on https://github.com/envoyproxy/java-control-plane, which sends updates to envoy as a "snapshot", i.e. all resources (listeners/clusters) at once. We've found it surprisingly easy to break this setup: one misconfigured resource breaks the whole envoy/discovery pair. E.g. when we generate 100 listeners and 1 ends up having a bad certificate chain, none of the 100 listeners get loaded, and a few bugs happen, described below.

The backend generates configurations based on user-supplied Kubernetes Ingress resources which might contain bugs. When the dynamically generated listener or cluster configurations are broken, here's what happens:

  • envoy immediately rejects the whole configuration, not just the broken listener/cluster, and effectively does not load any configuration at all
  • envoy then re-requests the configuration from discovery in a tight "while-true" loop, without any backoff, which saturates resources on both envoy and the discovery backend. It receives the same buggy configuration every time and keeps polling
  • the error messages sometimes don't show which resource is broken

E.g. here's envoy's error message when the user uploads a broken certificate chain. Notice that it repeats in a loop and does not show which listener is broken, which increases the time it takes us to find the broken ingress configuration:

[2018-04-15 20:19:18.704][12883858][warning][config] bazel-out/darwin-fastbuild/bin/source/common/config/_virtual_includes/grpc_mux_subscription_lib/common/config/grpc_mux_subscription_impl.h:69] gRPC config for type.googleapis.com/envoy.api.v2.Listener rejected: Failed to load certificate chain from <inline>
[2018-04-15 20:19:18.760][12883858][warning][upstream] source/common/config/grpc_mux_impl.cc:188] gRPC config for type.googleapis.com/envoy.api.v2.Listener update rejected: Failed to load certificate chain from <inline>
[2018-04-15 20:19:18.760][12883858][warning][config] bazel-out/darwin-fastbuild/bin/source/common/config/_virtual_includes/grpc_mux_subscription_lib/common/config/grpc_mux_subscription_impl.h:69] gRPC config for type.googleapis.com/envoy.api.v2.Listener rejected: Failed to load certificate chain from <inline>
[2018-04-15 20:19:18.821][12883858][warning][upstream] source/common/config/grpc_mux_impl.cc:188] gRPC config for type.googleapis.com/envoy.api.v2.Listener update rejected: Failed to load certificate chain from <inline>

and it keeps looping forever without any backoffs.

Expected behavior:

  • envoy would reject only the listener/cluster that is broken for whatever reason
  • if a listener/cluster configuration is broken, envoy would print in the logs which listener/cluster is broken
  • envoy would not restart the whole discovery process (the grpc xds discovery request), as this will give it the same broken resource again. Instead, it could just keep getting the streamed updates.
  • if envoy needs to re-initiate the discovery for some reason, it would do so with some kind of backoff, or sleep, so that it does not saturate CPU for both envoy and discovery
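
To make the last point concrete, here is a minimal sketch of the kind of backoff we have in mind: capped exponential backoff with random jitter. It is an illustration only, not envoy's code; the function name `next_delay` and the `base`/`cap` values are assumptions picked for the example.

```python
import random


def next_delay(attempt, base=0.5, cap=30.0, rng=random.random):
    """Return how long to wait (seconds) before retry number `attempt`.

    Hypothetical helper: the ceiling doubles with every failed attempt,
    is capped at `cap`, and the actual delay is drawn uniformly from
    [0, ceiling) so that many clients do not retry in lockstep.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling


# The upper bound on the delay for the first few consecutive rejections.
ceilings = [next_delay(n, rng=lambda: 1.0) for n in range(8)]
```

With these assumed values the ceiling grows from 0.5s to the 30s cap within seven rejections, instead of retrying every ~60ms as in the log above.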

While implementing the discovery backend, we've seen quite a few of those cases, but the desired/expected behavior in all of the cases is the same - envoy should load all listeners/clusters that are OK, and should ignore and complain about the ones that have errors.
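
A rough sketch of the behavior we would expect, written as standalone Python rather than envoy code: each resource in the snapshot is validated independently, the valid ones are accepted, and each broken one is logged by name. All names here (`apply_snapshot`, `UpdateResult`, `check_listener`, the `cert_chain` key) are hypothetical and exist only for this illustration.

```python
import logging
from dataclasses import dataclass, field
from typing import Callable, Dict, List

log = logging.getLogger("xds-sketch")


@dataclass
class UpdateResult:
    accepted: List[str] = field(default_factory=list)
    rejected: Dict[str, str] = field(default_factory=dict)  # name -> error text


def apply_snapshot(resources: Dict[str, dict],
                   validate: Callable[[str, dict], None]) -> UpdateResult:
    """Validate every resource on its own; one failure does not reject the rest."""
    result = UpdateResult()
    for name, config in resources.items():
        try:
            validate(name, config)
        except ValueError as err:
            # Name the broken resource so it can be traced back to its source.
            log.warning("resource %s rejected: %s", name, err)
            result.rejected[name] = str(err)
        else:
            result.accepted.append(name)
    return result


def check_listener(name: str, config: dict) -> None:
    """Hypothetical validator standing in for real listener validation."""
    if not config.get("cert_chain"):
        raise ValueError("Failed to load certificate chain")


snapshot = {
    "listener-a": {"cert_chain": "-----BEGIN CERTIFICATE-----..."},
    "listener-b": {"cert_chain": ""},  # the one broken resource
    "listener-c": {"cert_chain": "-----BEGIN CERTIFICATE-----..."},
}
result = apply_snapshot(snapshot, check_listener)
```

Here `listener-a` and `listener-c` would be loaded, and only `listener-b` would be reported as rejected, together with its name and error.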

P.S. One notable scenario is when a listener is loaded that references a cluster which does not exist in envoy's internal state yet. We've seen this happen in a "race" fashion, where the cluster actually existed in the snapshot, but it took envoy a few discovery-error cycles to realize that. In this scenario
