Envoy drops CDS update at high churn rate #14598
Title: Envoy drops CDS update at high churn rate
Description:
I have a setup where I rapidly push changes to a cluster, alternately adding and removing its transport_socket. The setup is contrived, but it reproduces failures that we are seeing in real-world scenarios.
Most of the time this works fine, but occasionally an update is dropped.
Control plane logs
2021-01-07T21:21:51.982156Z info ads Push debounce stable[281] 14: 100.107504ms since last change, 418.54657ms since last push, full=true
2021-01-07T21:21:51.982783Z info ads XDS: Pushing:2021-01-07T21:21:51Z/279 Services:16 ConnectedEndpoints:16 Version:2021-01-07T21:21:51Z/279
2021-01-07T21:21:51.989311Z error howardjohn: for a-v1-6dcbd9c75c-62lgx.echo, got NO dest rule
2021-01-07T21:21:51.989391Z error howardjohn: for a-v1-6dcbd9c75c-62lgx.echo, got dest rule <nil>
2021-01-07T21:21:52.027106Z info ads CDS: PUSH for node:a-v1-6dcbd9c75c-62lgx.echo resources:123 size:96.5kB
2021-01-07T21:21:52.027558Z info ads EDS: PUSH for node:a-v1-6dcbd9c75c-62lgx.echo resources:84 size:35.8kB empty:0 cached:84/84
2021-01-07T21:21:52.103725Z info ads RDS: PUSH for node:a-v1-6dcbd9c75c-62lgx.echo resources:52 size:54.1kB
2021-01-07T21:21:52.373837Z info ads Push debounce stable[282] 2: 102.770675ms since last change, 102.77432ms since last push, full=true
2021-01-07T21:21:52.374464Z info ads XDS: Pushing:2021-01-07T21:21:52Z/280 Services:16 ConnectedEndpoints:16 Version:2021-01-07T21:21:52Z/280
2021-01-07T21:21:52.469416Z error howardjohn: for a-v1-6dcbd9c75c-62lgx.echo, got dest rule tls:<mode:SIMPLE >
2021-01-07T21:21:52.469545Z error howardjohn: for a-v1-6dcbd9c75c-62lgx.echo, got dest rule name:"envoy.transport_sockets.tls" typed_config:{[type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext]:{common_tls_context:{validation_context:{}}}}
2021-01-07T21:21:52.477093Z info ads CDS: PUSH for node:a-v1-6dcbd9c75c-62lgx.echo resources:123 size:99.3kB
2021-01-07T21:21:52.504730Z info ads EDS: PUSH for node:a-v1-6dcbd9c75c-62lgx.echo resources:84 size:35.8kB empty:0 cached:84/84
2021-01-07T21:21:52.508165Z info ads RDS: PUSH for node:a-v1-6dcbd9c75c-62lgx.echo resources:52 size:54.1kB
2021-01-07T21:21:52.788352Z debug ads ADS:EDS: REQ sidecar~10.10.0.13~a-v1-6dcbd9c75c-62lgx.echo~echo.svc.cluster.local-6 Expired nonce received 9K0/I6+SE5s=161c435c-50c9-4a62-8d08-355b1b706ad5, sent sQLmdx6oAuc=11ea4af1-17fe-4602-a268-01bee2b560d3
2021-01-07T21:21:53.065072Z debug ads ADS:RDS: REQ sidecar~10.10.0.13~a-v1-6dcbd9c75c-62lgx.echo~echo.svc.cluster.local-6 Expired nonce received 9K0/I6+SE5s=36d32e57-f097-46aa-8ede-5a9d5cda5f36, sent sQLmdx6oAuc=7efbb21c-1b0b-4eb5-a076-3b8fd2adc840
2021-01-07T21:21:53.065454Z debug ads ADS:EDS: ACK sidecar~10.10.0.13~a-v1-6dcbd9c75c-62lgx.echo~echo.svc.cluster.local-6 2021-01-07T21:21:52Z/280 sQLmdx6oAuc=11ea4af1-17fe-4602-a268-01bee2b560d3
2021-01-07T21:21:53.104101Z debug ads ADS:RDS: ACK sidecar~10.10.0.13~a-v1-6dcbd9c75c-62lgx.echo~echo.svc.cluster.local-6 2021-01-07T21:21:52Z/280 sQLmdx6oAuc=7efbb21c-1b0b-4eb5-a076-3b8fd2adc840
2021-01-07T21:21:53.135356Z debug ads ADS:CDS: ACK sidecar~10.10.0.13~a-v1-6dcbd9c75c-62lgx.echo~echo.svc.cluster.local-6 2021-01-07T21:21:52Z/280 sQLmdx6oAuc=7f72114a-fc74-4774-87f5-e9211fae920b
These logs show that we push CDS version 279 at 21:21:52.027106Z without the transport socket, then version 280 at 21:21:52.477093Z with the transport socket.
At 21:21:53.135356Z, we get an ACK for version 280. However, config_dump shows the cluster stuck at version 2021-01-07T21:21:51Z/279 (both in version_info and in the missing transport_socket).
Note that our control plane, unlike go-control-plane, does NOT wait for an ACK before sending the next update.
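For anyone who wants to repeat the config_dump comparison, here is a minimal sketch of the check. It assumes the usual JSON layout of the admin `/config_dump` output (a `configs` list containing a `ClustersConfigDump` entry with `dynamic_active_clusters`, each carrying `version_info` and `cluster`). The helper name `cluster_state` is hypothetical, not part of any Envoy or Istio tooling.

```python
def cluster_state(config_dump, cluster_name):
    """Look up one dynamic cluster in a parsed /config_dump document.

    Returns (version_info, has_transport_socket), or None if the cluster
    is not among the dynamic active clusters.
    Assumption: config_dump is the dict obtained by JSON-decoding the
    admin endpoint output.
    """
    for section in config_dump.get("configs", []):
        # Only the clusters section is relevant for a CDS check.
        if not section.get("@type", "").endswith("ClustersConfigDump"):
            continue
        for entry in section.get("dynamic_active_clusters", []):
            cluster = entry.get("cluster", {})
            if cluster.get("name") == cluster_name:
                return entry.get("version_info"), "transport_socket" in cluster
    return None
```

Feeding it the result of `json.load` on a saved dump and the name of the cluster being toggled gives the two facts compared above: the version Envoy reports for that cluster, and whether the transport socket is present. In the failing case described here, that would be version 279 and no transport socket, even though 280 was ACKed.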
Envoy logs
2021-01-07T21:21:48.261810Z debug envoy config Received gRPC message for type.googleapis.com/envoy.config.cluster.v3.Cluster at version 2021-01-07T21:21:48Z/275
2021-01-07T21:21:48.261837Z debug envoy config Pausing discovery requests for type.googleapis.com/envoy.config.cluster.v3.Cluster (previous count 0)
2021-01-07T21:21:48.366361Z debug envoy config Pausing discovery requests for type.googleapis.com/envoy.config.cluster.v3.Cluster (previous count 1)
2021-01-07T21:21:48.440945Z debug envoy config gRPC config for type.googleapis.com/envoy.config.cluster.v3.Cluster accepted with 123 resources with version 2021-01-07T21:21:48Z/275
2021-01-07T21:21:48.444100Z debug envoy config Resuming discovery requests for type.googleapis.com/envoy.config.cluster.v3.Cluster (previous count 2)
2021-01-07T21:21:48.623280Z debug envoy config Resuming discovery requests for type.googleapis.com/envoy.config.cluster.v3.Cluster (previous count 1)
2021-01-07T21:21:49.490006Z debug envoy config Received gRPC message for type.googleapis.com/envoy.config.cluster.v3.Cluster at version 2021-01-07T21:21:49Z/276
2021-01-07T21:21:49.509028Z debug envoy config Pausing discovery requests for type.googleapis.com/envoy.config.cluster.v3.Cluster (previous count 0)
2021-01-07T21:21:49.986808Z debug envoy config Pausing discovery requests for type.googleapis.com/envoy.config.cluster.v3.Cluster (previous count 1)
2021-01-07T21:21:50.090631Z debug envoy config gRPC config for type.googleapis.com/envoy.config.cluster.v3.Cluster accepted with 123 resources with version 2021-01-07T21:21:49Z/276
2021-01-07T21:21:50.094987Z debug envoy config Resuming discovery requests for type.googleapis.com/envoy.config.cluster.v3.Cluster (previous count 2)
2021-01-07T21:21:50.236496Z debug envoy config Received gRPC message for type.googleapis.com/envoy.config.cluster.v3.Cluster at version 2021-01-07T21:21:49Z/277
2021-01-07T21:21:50.236511Z debug envoy config Pausing discovery requests for type.googleapis.com/envoy.config.cluster.v3.Cluster (previous count 1)
2021-01-07T21:21:50.288315Z debug envoy config gRPC config for type.googleapis.com/envoy.config.cluster.v3.Cluster accepted with 123 resources with version 2021-01-07T21:21:49Z/277
2021-01-07T21:21:50.289054Z debug envoy config Resuming discovery requests for type.googleapis.com/envoy.config.cluster.v3.Cluster (previous count 2)
2021-01-07T21:21:50.370331Z debug envoy config Resuming discovery requests for type.googleapis.com/envoy.config.cluster.v3.Cluster (previous count 1)
2021-01-07T21:21:50.713583Z debug envoy config Received gRPC message for type.googleapis.com/envoy.config.cluster.v3.Cluster at version 2021-01-07T21:21:50Z/278
2021-01-07T21:21:50.713608Z debug envoy config Pausing discovery requests for type.googleapis.com/envoy.config.cluster.v3.Cluster (previous count 0)
2021-01-07T21:21:50.944320Z debug envoy config Pausing discovery requests for type.googleapis.com/envoy.config.cluster.v3.Cluster (previous count 1)
2021-01-07T21:21:50.975549Z debug envoy config gRPC config for type.googleapis.com/envoy.config.cluster.v3.Cluster accepted with 123 resources with version 2021-01-07T21:21:50Z/278
2021-01-07T21:21:50.976323Z debug envoy config Resuming discovery requests for type.googleapis.com/envoy.config.cluster.v3.Cluster (previous count 2)
2021-01-07T21:21:51.138577Z debug envoy config Resuming discovery requests for type.googleapis.com/envoy.config.cluster.v3.Cluster (previous count 1)
2021-01-07T21:21:52.062032Z debug envoy config Received gRPC message for type.googleapis.com/envoy.config.cluster.v3.Cluster at version 2021-01-07T21:21:51Z/279
2021-01-07T21:21:52.062062Z debug envoy config Pausing discovery requests for type.googleapis.com/envoy.config.cluster.v3.Cluster (previous count 0)
2021-01-07T21:21:52.402557Z debug envoy config Pausing discovery requests for type.googleapis.com/envoy.config.cluster.v3.Cluster (previous count 1)
2021-01-07T21:21:52.475416Z debug envoy config gRPC config for type.googleapis.com/envoy.config.cluster.v3.Cluster accepted with 123 resources with version 2021-01-07T21:21:51Z/279
2021-01-07T21:21:52.476406Z debug envoy config Resuming discovery requests for type.googleapis.com/envoy.config.cluster.v3.Cluster (previous count 2)
2021-01-07T21:21:52.790452Z debug envoy config Received gRPC message for type.googleapis.com/envoy.config.cluster.v3.Cluster at version 2021-01-07T21:21:52Z/280
2021-01-07T21:21:52.790484Z debug envoy config Pausing discovery requests for type.googleapis.com/envoy.config.cluster.v3.Cluster (previous count 1)
2021-01-07T21:21:53.037956Z debug envoy config gRPC config for type.googleapis.com/envoy.config.cluster.v3.Cluster accepted with 123 resources with version 2021-01-07T21:21:52Z/280
2021-01-07T21:21:53.039770Z debug envoy config Resuming discovery requests for type.googleapis.com/envoy.config.cluster.v3.Cluster (previous count 2)
2021-01-07T21:21:53.132343Z debug envoy config Resuming discovery requests for type.googleapis.com/envoy.config.cluster.v3.Cluster (previous count 1)
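To make the Envoy log above easier to scan, here is a rough sketch that summarizes it per CDS version: how long each update took between "Received" and "accepted", and what the pause count was when the update arrived. The regular expressions are taken from the lines quoted in this report. The function name `summarize_cds` is hypothetical, and the sketch assumes the input contains only lines for the Cluster type, as the excerpt above does.

```python
import re
from datetime import datetime

# Patterns copied from the Envoy debug lines quoted in this report.
RECEIVED = re.compile(r"Received gRPC message for \S+ at version (\S+)")
ACCEPTED = re.compile(
    r"gRPC config for \S+ accepted with \d+ resources with version (\S+)"
)
PAUSE = re.compile(
    r"(Pausing|Resuming) discovery requests for \S+ \(previous count (\d+)\)"
)


def parse_ts(line):
    """Parse the leading timestamp, e.g. 2021-01-07T21:21:52.062032Z."""
    return datetime.strptime(line.split()[0], "%Y-%m-%dT%H:%M:%S.%fZ")


def summarize_cds(lines):
    """Return {version: {"apply_s": float or None, "depth_at_receive": int}}.

    apply_s is the time from "Received" to "accepted" in seconds (None if
    the version was never accepted in the given lines); depth_at_receive
    is the pause count implied by the most recent Pausing/Resuming line.
    """
    depth = 0
    received_at = {}
    summary = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        match = PAUSE.search(line)
        if match:
            previous = int(match.group(2))
            # "previous count" is the counter before this pause/resume.
            depth = previous + 1 if match.group(1) == "Pausing" else previous - 1
            continue
        match = RECEIVED.search(line)
        if match:
            version = match.group(1)
            received_at[version] = parse_ts(line)
            summary[version] = {"apply_s": None, "depth_at_receive": depth}
            continue
        match = ACCEPTED.search(line)
        if match and match.group(1) in received_at:
            version = match.group(1)
            delta = parse_ts(line) - received_at[version]
            summary[version]["apply_s"] = delta.total_seconds()
    return summary
```

Run over the excerpt above, this reports that version 280 arrived while the pause count was still 1 (the same happened with 277), whereas 279 arrived at count 0. Whether that is related to the dropped update is not established here; the summary is only meant to make such patterns easier to spot.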
Repro steps:
So far I have only reproduced this in a fairly complex multicluster Istio environment, so unfortunately it is hard for others to replicate. I am happy to extract more info and/or try to reproduce it with a simpler setup.