Envoy drops CDS update at high churn rate #14598

@howardjohn

Title: Envoy drops CDS update at high churn rate

Description:
I have a setup where I rapidly push changes to a cluster, repeatedly adding and removing its transport_socket. The setup is contrived, but it replicates failures that we are seeing in real-world scenarios.

Most of the time this works fine; occasionally, however, I see updates being dropped.

Control plane logs

2021-01-07T21:21:51.982156Z     info    ads     Push debounce stable[281] 14: 100.107504ms since last change, 418.54657ms since last push, full=true
2021-01-07T21:21:51.982783Z     info    ads     XDS: Pushing:2021-01-07T21:21:51Z/279 Services:16 ConnectedEndpoints:16  Version:2021-01-07T21:21:51Z/279
2021-01-07T21:21:51.989311Z     error   howardjohn: for a-v1-6dcbd9c75c-62lgx.echo, got NO dest rule
2021-01-07T21:21:51.989391Z     error   howardjohn: for a-v1-6dcbd9c75c-62lgx.echo, got dest rule <nil>
2021-01-07T21:21:52.027106Z     info    ads     CDS: PUSH for node:a-v1-6dcbd9c75c-62lgx.echo resources:123 size:96.5kB
2021-01-07T21:21:52.027558Z     info    ads     EDS: PUSH for node:a-v1-6dcbd9c75c-62lgx.echo resources:84 size:35.8kB empty:0 cached:84/84
2021-01-07T21:21:52.103725Z     info    ads     RDS: PUSH for node:a-v1-6dcbd9c75c-62lgx.echo resources:52 size:54.1kB
2021-01-07T21:21:52.373837Z     info    ads     Push debounce stable[282] 2: 102.770675ms since last change, 102.77432ms since last push, full=true
2021-01-07T21:21:52.374464Z     info    ads     XDS: Pushing:2021-01-07T21:21:52Z/280 Services:16 ConnectedEndpoints:16  Version:2021-01-07T21:21:52Z/280
2021-01-07T21:21:52.469416Z     error   howardjohn: for a-v1-6dcbd9c75c-62lgx.echo, got dest rule tls:<mode:SIMPLE >
2021-01-07T21:21:52.469545Z     error   howardjohn: for a-v1-6dcbd9c75c-62lgx.echo, got dest rule name:"envoy.transport_sockets.tls" typed_config:{[type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext]:{common_tls_context:{validation_context:{}}}}
2021-01-07T21:21:52.477093Z     info    ads     CDS: PUSH for node:a-v1-6dcbd9c75c-62lgx.echo resources:123 size:99.3kB
2021-01-07T21:21:52.504730Z     info    ads     EDS: PUSH for node:a-v1-6dcbd9c75c-62lgx.echo resources:84 size:35.8kB empty:0 cached:84/84
2021-01-07T21:21:52.508165Z     info    ads     RDS: PUSH for node:a-v1-6dcbd9c75c-62lgx.echo resources:52 size:54.1kB
2021-01-07T21:21:52.788352Z     debug   ads     ADS:EDS: REQ sidecar~10.10.0.13~a-v1-6dcbd9c75c-62lgx.echo~echo.svc.cluster.local-6 Expired nonce received 9K0/I6+SE5s=161c435c-50c9-4a62-8d08-355b1b706ad5, sent sQLmdx6oAuc=11ea4af1-17fe-4602-a268-01bee2b560d3
2021-01-07T21:21:53.065072Z     debug   ads     ADS:RDS: REQ sidecar~10.10.0.13~a-v1-6dcbd9c75c-62lgx.echo~echo.svc.cluster.local-6 Expired nonce received 9K0/I6+SE5s=36d32e57-f097-46aa-8ede-5a9d5cda5f36, sent sQLmdx6oAuc=7efbb21c-1b0b-4eb5-a076-3b8fd2adc840
2021-01-07T21:21:53.065454Z     debug   ads     ADS:EDS: ACK sidecar~10.10.0.13~a-v1-6dcbd9c75c-62lgx.echo~echo.svc.cluster.local-6 2021-01-07T21:21:52Z/280 sQLmdx6oAuc=11ea4af1-17fe-4602-a268-01bee2b560d3
2021-01-07T21:21:53.104101Z     debug   ads     ADS:RDS: ACK sidecar~10.10.0.13~a-v1-6dcbd9c75c-62lgx.echo~echo.svc.cluster.local-6 2021-01-07T21:21:52Z/280 sQLmdx6oAuc=7efbb21c-1b0b-4eb5-a076-3b8fd2adc840
2021-01-07T21:21:53.135356Z     debug   ads     ADS:CDS: ACK sidecar~10.10.0.13~a-v1-6dcbd9c75c-62lgx.echo~echo.svc.cluster.local-6 2021-01-07T21:21:52Z/280 sQLmdx6oAuc=7f72114a-fc74-4774-87f5-e9211fae920b

These logs show that we push CDS version 279 at 21:21:52.027106Z without the transport socket, then version 280 at 21:21:52.477093Z with the transport socket.

At 21:21:53.135356Z, we get an ACK for version 280. However, config_dump shows the cluster stuck at version 2021-01-07T21:21:51Z/279: its version_info is stale and the transport_socket is missing.
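
For anyone who wants to run the same check against their own proxy, here is a rough sketch of how the comparison could be scripted. It assumes the JSON from Envoy's admin `/config_dump` endpoint has already been loaded into a dict, and that the clusters section has the usual `ClustersConfigDump` shape (`dynamic_active_clusters` entries carrying `version_info` and `cluster`). The function name and the returned fields are made up for illustration. A mismatch is only a hint: a cluster that did not change between pushes may legitimately keep an older `version_info`, so the `has_transport_socket` flag is what matters for this report.

```python
from typing import Any, Dict, List

CLUSTERS_DUMP_TYPE = "type.googleapis.com/envoy.admin.v3.ClustersConfigDump"


def clusters_not_at_version(config_dump: Dict[str, Any], acked_version: str) -> List[Dict[str, Any]]:
    """List dynamic active clusters whose version_info differs from acked_version.

    Hypothetical helper: config_dump is the parsed JSON of /config_dump.
    """
    mismatched = []
    for section in config_dump.get("configs", []):
        # Only the clusters section is relevant here; skip listeners, routes, etc.
        if section.get("@type") != CLUSTERS_DUMP_TYPE:
            continue
        for entry in section.get("dynamic_active_clusters", []):
            if entry.get("version_info") == acked_version:
                continue
            cluster = entry.get("cluster", {})
            mismatched.append({
                "name": cluster.get("name"),
                "version_info": entry.get("version_info"),
                # True when the cluster config carries a transport_socket block.
                "has_transport_socket": "transport_socket" in cluster,
            })
    return mismatched
```

Called with the ACKed version (`2021-01-07T21:21:52Z/280` in the logs above), it would list the cluster that is still at `/279`.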

Note that our control plane, unlike go-control-plane, does NOT wait for an ACK before sending a new update.
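
To make the "Expired nonce received" lines in the control plane logs easier to read, here is a minimal model of nonce handling when pushes are not gated on ACKs. It is not Istio's code; the class and method names are invented for illustration. It only encodes the xDS rule that a request refers to a specific response via its `response_nonce`, so a request carrying an older nonce than the last one sent is stale rather than an ACK of the latest push.

```python
from typing import Dict, Optional


class NonceTracker:
    """Tracks the most recently sent nonce per xDS type URL (hypothetical helper)."""

    def __init__(self) -> None:
        self._last_nonce: Dict[str, str] = {}
        self._counter = 0

    def push(self, type_url: str) -> str:
        """Record a new response for type_url and return its nonce."""
        self._counter += 1
        nonce = f"nonce-{self._counter}"
        self._last_nonce[type_url] = nonce
        return nonce

    def classify(self, type_url: str, response_nonce: str,
                 error_detail: Optional[str] = None) -> str:
        """Classify an incoming request as EXPIRED, NACK, or ACK."""
        # A nonce other than the last one sent refers to a superseded response.
        if response_nonce != self._last_nonce.get(type_url):
            return "EXPIRED"
        # A matching nonce with an error attached is a rejection of that response.
        return "NACK" if error_detail else "ACK"
```

With two back-to-back pushes, the request answering the first push is classified as `EXPIRED` and only the request answering the second counts as an `ACK`, which matches the EDS and RDS sequence in the logs above.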

Envoy logs

2021-01-07T21:21:48.261810Z     debug   envoy config    Received gRPC message for type.googleapis.com/envoy.config.cluster.v3.Cluster at version 2021-01-07T21:21:48Z/275
2021-01-07T21:21:48.261837Z     debug   envoy config    Pausing discovery requests for type.googleapis.com/envoy.config.cluster.v3.Cluster (previous count 0)
2021-01-07T21:21:48.366361Z     debug   envoy config    Pausing discovery requests for type.googleapis.com/envoy.config.cluster.v3.Cluster (previous count 1)
2021-01-07T21:21:48.440945Z     debug   envoy config    gRPC config for type.googleapis.com/envoy.config.cluster.v3.Cluster accepted with 123 resources with version 2021-01-07T21:21:48Z/275
2021-01-07T21:21:48.444100Z     debug   envoy config    Resuming discovery requests for type.googleapis.com/envoy.config.cluster.v3.Cluster (previous count 2)
2021-01-07T21:21:48.623280Z     debug   envoy config    Resuming discovery requests for type.googleapis.com/envoy.config.cluster.v3.Cluster (previous count 1)
2021-01-07T21:21:49.490006Z     debug   envoy config    Received gRPC message for type.googleapis.com/envoy.config.cluster.v3.Cluster at version 2021-01-07T21:21:49Z/276
2021-01-07T21:21:49.509028Z     debug   envoy config    Pausing discovery requests for type.googleapis.com/envoy.config.cluster.v3.Cluster (previous count 0)
2021-01-07T21:21:49.986808Z     debug   envoy config    Pausing discovery requests for type.googleapis.com/envoy.config.cluster.v3.Cluster (previous count 1)
2021-01-07T21:21:50.090631Z     debug   envoy config    gRPC config for type.googleapis.com/envoy.config.cluster.v3.Cluster accepted with 123 resources with version 2021-01-07T21:21:49Z/276
2021-01-07T21:21:50.094987Z     debug   envoy config    Resuming discovery requests for type.googleapis.com/envoy.config.cluster.v3.Cluster (previous count 2)
2021-01-07T21:21:50.236496Z     debug   envoy config    Received gRPC message for type.googleapis.com/envoy.config.cluster.v3.Cluster at version 2021-01-07T21:21:49Z/277
2021-01-07T21:21:50.236511Z     debug   envoy config    Pausing discovery requests for type.googleapis.com/envoy.config.cluster.v3.Cluster (previous count 1)
2021-01-07T21:21:50.288315Z     debug   envoy config    gRPC config for type.googleapis.com/envoy.config.cluster.v3.Cluster accepted with 123 resources with version 2021-01-07T21:21:49Z/277
2021-01-07T21:21:50.289054Z     debug   envoy config    Resuming discovery requests for type.googleapis.com/envoy.config.cluster.v3.Cluster (previous count 2)
2021-01-07T21:21:50.370331Z     debug   envoy config    Resuming discovery requests for type.googleapis.com/envoy.config.cluster.v3.Cluster (previous count 1)
2021-01-07T21:21:50.713583Z     debug   envoy config    Received gRPC message for type.googleapis.com/envoy.config.cluster.v3.Cluster at version 2021-01-07T21:21:50Z/278
2021-01-07T21:21:50.713608Z     debug   envoy config    Pausing discovery requests for type.googleapis.com/envoy.config.cluster.v3.Cluster (previous count 0)
2021-01-07T21:21:50.944320Z     debug   envoy config    Pausing discovery requests for type.googleapis.com/envoy.config.cluster.v3.Cluster (previous count 1)
2021-01-07T21:21:50.975549Z     debug   envoy config    gRPC config for type.googleapis.com/envoy.config.cluster.v3.Cluster accepted with 123 resources with version 2021-01-07T21:21:50Z/278
2021-01-07T21:21:50.976323Z     debug   envoy config    Resuming discovery requests for type.googleapis.com/envoy.config.cluster.v3.Cluster (previous count 2)
2021-01-07T21:21:51.138577Z     debug   envoy config    Resuming discovery requests for type.googleapis.com/envoy.config.cluster.v3.Cluster (previous count 1)
2021-01-07T21:21:52.062032Z     debug   envoy config    Received gRPC message for type.googleapis.com/envoy.config.cluster.v3.Cluster at version 2021-01-07T21:21:51Z/279
2021-01-07T21:21:52.062062Z     debug   envoy config    Pausing discovery requests for type.googleapis.com/envoy.config.cluster.v3.Cluster (previous count 0)
2021-01-07T21:21:52.402557Z     debug   envoy config    Pausing discovery requests for type.googleapis.com/envoy.config.cluster.v3.Cluster (previous count 1)
2021-01-07T21:21:52.475416Z     debug   envoy config    gRPC config for type.googleapis.com/envoy.config.cluster.v3.Cluster accepted with 123 resources with version 2021-01-07T21:21:51Z/279
2021-01-07T21:21:52.476406Z     debug   envoy config    Resuming discovery requests for type.googleapis.com/envoy.config.cluster.v3.Cluster (previous count 2)
2021-01-07T21:21:52.790452Z     debug   envoy config    Received gRPC message for type.googleapis.com/envoy.config.cluster.v3.Cluster at version 2021-01-07T21:21:52Z/280
2021-01-07T21:21:52.790484Z     debug   envoy config    Pausing discovery requests for type.googleapis.com/envoy.config.cluster.v3.Cluster (previous count 1)
2021-01-07T21:21:53.037956Z     debug   envoy config    gRPC config for type.googleapis.com/envoy.config.cluster.v3.Cluster accepted with 123 resources with version 2021-01-07T21:21:52Z/280
2021-01-07T21:21:53.039770Z     debug   envoy config    Resuming discovery requests for type.googleapis.com/envoy.config.cluster.v3.Cluster (previous count 2)
2021-01-07T21:21:53.132343Z     debug   envoy config    Resuming discovery requests for type.googleapis.com/envoy.config.cluster.v3.Cluster (previous count 1)
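
One pattern that can be pulled out of these logs mechanically is which CDS versions arrived while discovery requests for the type were still paused. The sketch below replays the "Pausing"/"Resuming" lines to reconstruct the pause count and flags any version whose "Received gRPC message" line shows up while that count is non-zero. The function name is made up, and it assumes all lines passed in are for a single type URL, as in the excerpt above. This is an observation aid, not a diagnosis of the bug.

```python
import re
from typing import Iterable, List

RECEIVED = re.compile(r"Received gRPC message for \S+ at version (\S+)")
PAUSE = re.compile(r"(Pausing|Resuming) discovery requests for \S+ \(previous count (\d+)\)")


def versions_received_while_paused(lines: Iterable[str]) -> List[str]:
    """Return versions whose 'Received' line appears while the pause count is > 0."""
    paused = 0  # reconstructed pause count after the most recent pause/resume line
    flagged = []
    for line in lines:
        received = RECEIVED.search(line)
        if received:
            if paused > 0:
                flagged.append(received.group(1))
            continue
        change = PAUSE.search(line)
        if change:
            previous = int(change.group(2))
            # The log prints the count before the change, so apply the change here.
            paused = previous + 1 if change.group(1) == "Pausing" else previous - 1
    return flagged
```

Run over the excerpt above, this flags `2021-01-07T21:21:49Z/277` and `2021-01-07T21:21:52Z/280`: both were received after only one of the two resumes for the previous version had been logged.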

Repro steps:
So far I have only reproduced this in a fairly complex multicluster Istio environment, so it is unfortunately hard for others to replicate. I am happy to extract more info and/or try to replicate it with a simpler setup.

Labels: bug, investigate (Potential bug that needs verification)