sds: cluster not warming while certificates are being fetched; immediately marked active #11120

@howardjohn

Description

When creating clusters that reference SDS certificates, the warming behavior does not seem correct. My expectation is that until the secret is delivered, the cluster will be marked "warming" (for up to the initial_fetch_timeout) and block the rest of initialization from occurring.

What I actually see is that initialization is blocked, but nothing indicates the clusters are warming.

Using this config:
docker run -v $HOME/kube/local:/config -p 15000:15000 envoyproxy/envoy-dev -c /config/envoy-sds-lds.yaml --log-format-prefix-with-location 0 --reject-unknown-dynamic-fields

with envoy version: 49efb9841a58ebdc43a666f55c445911c8e4181c/1.15.0-dev/Clean/RELEASE/BoringSSL

and config files:

cds.yaml:

resources:
- "@type": type.googleapis.com/envoy.config.cluster.v3.Cluster
  name: outbound_cluster_tls
  connect_timeout: 5s
  max_requests_per_connection: 1
  load_assignment:
    cluster_name: xds-grpc
    endpoints:
    - lb_endpoints:
      - endpoint:
          address:
            socket_address:
              address: 127.0.0.1
              port_value: 8080
  type: STATIC
  transport_socket:
    name: envoy.transport_sockets.tls
    typed_config:
      "@type": type.googleapis.com/envoy.api.v2.auth.UpstreamTlsContext
      common_tls_context:
        tls_certificate_sds_secret_configs:
          - name: "default"
            sds_config:
              initial_fetch_timeout: 20s
              api_config_source:
                api_type: GRPC
                grpc_services:
                  - envoy_grpc:
                      cluster_name: "sds-grpc"
                refresh_delay: 60s
        combined_validation_context:
          default_validation_context: {}
          validation_context_sds_secret_config:
            name: ROOTCA
            sds_config:
              initial_fetch_timeout: 20s
              api_config_source:
                api_type: GRPC
                grpc_services:
                - envoy_grpc:
                    cluster_name: sds-grpc    

envoy-sds-lds.yaml:

admin:
  access_log_path: /dev/null
  address:
    socket_address:
      address: 0.0.0.0
      port_value: 15000
node:
  id: id
  cluster: sdstest
dynamic_resources:
  lds_config:
    api_config_source:
      api_type: GRPC
      grpc_services:
        envoy_grpc:
          cluster_name: lds
  cds_config:
    path: /config/cds.yaml
static_resources:
  clusters:    
  - name: sds-grpc
    type: STATIC
    http2_protocol_options: {}
    connect_timeout: 5s
    lb_policy: ROUND_ROBIN
  - name: lds
    type: STATIC
    http2_protocol_options: {}
    connect_timeout: 5s
    lb_policy: ROUND_ROBIN

Basically what should happen here: we get a dynamic CDS cluster with an SDS config. The SDS fetch fails, since the SDS server is not set up. Because initial_fetch_timeout is set to 20s, everything should be warming for those 20s.
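
The expected semantics can be sketched as a tiny state machine (this is illustrative pseudologic, not Envoy code; the class and field names are made up for the sketch): a cluster that references an unfetched SDS secret stays "warming" until either the secret arrives or initial_fetch_timeout (20s here) expires, and only then counts as initialized.

```python
import time

# Illustrative sketch (not Envoy source) of the warming behavior described
# above. INITIAL_FETCH_TIMEOUT mirrors sds_config.initial_fetch_timeout.
INITIAL_FETCH_TIMEOUT = 20.0

class Cluster:
    def __init__(self, name, needs_secret):
        self.name = name
        # A cluster waiting on an SDS secret should start out warming.
        self.state = "warming" if needs_secret else "active"
        self.created = time.monotonic()

    def on_secret_received(self):
        # Secret delivered: the cluster can finish warming immediately.
        self.state = "active"

    def tick(self, now):
        # After the timeout the cluster initializes without the secret,
        # unblocking the rest of startup (e.g. the LDS fetch).
        if self.state == "warming" and now - self.created >= INITIAL_FETCH_TIMEOUT:
            self.state = "active"

c = Cluster("outbound_cluster_tls", needs_secret=True)
assert c.state == "warming"              # should show up in warming stats
c.tick(c.created + INITIAL_FETCH_TIMEOUT)
assert c.state == "active"               # only after the 20s timeout
```

The bug report below is that Envoy appears to implement the timeout half of this (LDS is blocked for 20s) but never reports the cluster as warming in the meantime.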

What we see instead:

  • Stats are not showing warming:
cluster_manager.cds.init_fetch_timeout: 0
cluster_manager.cds.update_attempt: 1
cluster_manager.cds.update_failure: 0
cluster_manager.cds.update_rejected: 0
cluster_manager.cds.update_success: 1
cluster_manager.cds.update_time: 1588972075968
cluster_manager.cds.version: 17241709254077376921
cluster_manager.cluster_added: 3
cluster_manager.cluster_modified: 0
cluster_manager.cluster_removed: 0
cluster_manager.cluster_updated: 0
cluster_manager.cluster_updated_via_merge: 0
cluster_manager.update_merge_cancelled: 0
cluster_manager.update_out_of_merge_window: 0
cluster_manager.warming_clusters: 0

We also see that init_fetch_timeout is 0, and it does not change after 20s.

  • LDS is not requested until 20s later, indicating the initial_fetch_timeout is respected. This can be seen in the logs:
    (note: for simple testing I don't have a real LDS server, but we can see it's not even attempted until 20s in)
[2020-05-08 21:07:55.967][1][info][upstream] cds: add 1 cluster(s), remove 2 cluster(s)
[2020-05-08 21:07:55.968][1][warning][config] StreamSecrets gRPC config stream closed: 14, no healthy upstream
[2020-05-08 21:07:55.968][1][warning][config] Unable to establish new stream
[2020-05-08 21:07:55.968][1][warning][config] StreamSecrets gRPC config stream closed: 14, no healthy upstream
[2020-05-08 21:07:55.968][1][warning][config] Unable to establish new stream
[2020-05-08 21:07:55.968][1][info][upstream] cds: add/update cluster 'outbound_cluster_tls'
[2020-05-08 21:07:55.968][1][info][main] starting main dispatch loop
[2020-05-08 21:07:56.703][1][warning][config] StreamSecrets gRPC config stream closed: 14, no healthy upstream
[2020-05-08 21:07:56.703][1][warning][config] Unable to establish new stream
[2020-05-08 21:07:56.938][1][warning][config] StreamSecrets gRPC config stream closed: 14, no healthy upstream
[2020-05-08 21:07:56.938][1][warning][config] Unable to establish new stream
[2020-05-08 21:07:57.135][1][warning][config] StreamSecrets gRPC config stream closed: 14, no healthy upstream
[2020-05-08 21:07:57.135][1][warning][config] Unable to establish new stream
[2020-05-08 21:07:57.682][1][warning][config] StreamSecrets gRPC config stream closed: 14, no healthy upstream
[2020-05-08 21:07:57.682][1][warning][config] Unable to establish new stream
[2020-05-08 21:07:58.671][1][warning][config] StreamSecrets gRPC config stream closed: 14, no healthy upstream
[2020-05-08 21:07:58.671][1][warning][config] Unable to establish new stream
[2020-05-08 21:08:08.992][1][warning][config] StreamSecrets gRPC config stream closed: 14, no healthy upstream
[2020-05-08 21:08:08.992][1][warning][config] Unable to establish new stream
[2020-05-08 21:08:15.967][1][info][upstream] cm init: all clusters initialized
[2020-05-08 21:08:15.967][1][info][main] all clusters initialized. initializing init manager
[2020-05-08 21:08:15.967][1][warning][config] StreamListeners gRPC config stream closed: 14, no healthy upstream
  • dynamic_active_clusters in the config dump shows the cluster from cds.yaml; I would expect it to appear under the warming clusters instead.
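
To make the last point concrete, the admin /config_dump ClustersConfigDump separates dynamic_active_clusters from dynamic_warming_clusters, so the state can be checked mechanically. A minimal sketch (the sample payload is hand-written to illustrate the observed shape, not captured from a live Envoy):

```python
import json

# Hand-written /config_dump-style ClustersConfigDump fragment mirroring
# what is observed: the cluster is listed as active, and the warming list
# is empty even though its SDS secret has not been delivered.
dump = json.loads("""
{
  "dynamic_active_clusters": [
    {"cluster": {"name": "outbound_cluster_tls"}}
  ],
  "dynamic_warming_clusters": []
}
""")

def cluster_names(entries):
    """Extract cluster names from a list of DynamicCluster entries."""
    return [e["cluster"]["name"] for e in entries]

active = cluster_names(dump.get("dynamic_active_clusters", []))
warming = cluster_names(dump.get("dynamic_warming_clusters", []))
print("active:", active)    # observed: outbound_cluster_tls lands here...
print("warming:", warming)  # ...while it was expected in this list
```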

The example above is a simplified reproduction; I originally saw this with a normal deployment using an ADS gRPC server (Istio), not just files.
