
Panic mode on cluster membership changes? #6653

@akropp-stripe

Description

We are investigating some odd panic-routing metrics that Envoy is emitting. As far as we can tell, nothing is actually wrong with our request pipelines, but we want to track this down because it's causing spurious alerts.

We can clearly correlate cluster membership changes:

(graph of cluster membership changes omitted)

with panic routing:

(graph of panic-routing spikes omitted)

Our stats say that all attempted health checks passed; there are no failures:

(graph of health check stats omitted)

We added health check event logging, and when the cluster membership changes we see:

```
add_healthy_event: {
  first_check: true
}
```

I'm opening this issue to ask whether there is any insight into what could be causing this, or suggestions for how to debug it.
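For context on our suspicion: as far as we understand, Envoy enters panic mode for a priority level when the percentage of healthy hosts falls below `healthy_panic_threshold` (50% by default). One workaround we considered (a sketch only, with a hypothetical 25% value we have not validated) is lowering that threshold on the cluster via `common_lb_config`:

```json
{
 "cluster": {
  "name": "internal_cluster",
  "common_lb_config": {
   "healthy_panic_threshold": {
    "value": 25.0
   }
  }
 }
}
```

This would only mask the symptom, though; we'd still like to understand why membership changes trip panic at all.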

Our relevant cluster configuration looks like:

"dynamic_active_clusters": [
    {
     "version_info": "d63a8ee91ca7f647e623c3c5113a61d62be6fc23e09dbd2b73a7dc85a2e50e37",
     "cluster": {
      "name": "internal_cluster",
      "type": "STRICT_DNS",
      "connect_timeout": "2s",
      "health_checks": [
       {
        "timeout": "3s",
        "interval": "4s",
        "unhealthy_threshold": 2,
        "healthy_threshold": 2,
        "http_health_check": {
         "path": "/healthcheck"
        },
        "no_traffic_interval": "4s",
        "event_log_path": "/var/log/envoy_health_event.log"
        }
       ],
       "http2_protocol_options": {},
      "upstream_connection_options": {
       "tcp_keepalive": {
        "keepalive_time": 120
       }
      },
      "load_assignment": {
       "cluster_name": "apiori",
       "endpoints": [
        {
         "lb_endpoints": [
          {
           "endpoint": {
            "address": {
             "socket_address": {
              "address": "def.dns.entry",
              "port_value": 10652
             }
            }
           },
           "load_balancing_weight": 50
          },
          {
           "endpoint": {
            "address": {
             "socket_address": {
              "address": "abc.dns.entry",
              "port_value": 10652
             }
            }
           },
           "load_balancing_weight": 50
          }
         ]
        },
        {
         "lb_endpoints": [
          {
           "endpoint": {
            "address": {
             "socket_address": {
              "address": "xyz.dns.entry",
              "port_value": 10652
             }
            }
           },
           "load_balancing_weight": 100
          }
         ],
         "priority": 1
        },
        {
         "lb_endpoints": [
          {
           "endpoint": {
            "address": {
             "socket_address": {
              "address": "xyz.dns.entry",
              "port_value": 10652
             }
            }
           },
           "load_balancing_weight": 100
          }
         ],
         "priority": 2
        }
       ],
       "policy": {
        "overprovisioning_factor": 198
       }
      }
     },
     "last_updated": "2019-04-18T21:13:33.619Z"
    },

When we change clusters via the connected CDS server, we update the cluster's LB endpoints to point to different DNS entries, but everything else stays the same.
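To illustrate why such a swap could transiently trip panic even though every attempted check passes: our working theory (a hedged sketch of the default behavior, not Envoy source; the 50% default and the treatment of newly added hosts as unhealthy until their first active check passes are our assumptions from the docs) is that the replacement hosts briefly drag the healthy percentage below the threshold.

```python
# Sketch of the default panic-mode check (our model, not Envoy source).
# Panic mode engages when the healthy-host fraction in a priority level
# drops below the panic threshold (50% by default).

PANIC_THRESHOLD = 0.50  # Envoy's default healthy_panic_threshold


def in_panic(healthy_hosts: int, total_hosts: int) -> bool:
    """Return True if the load balancer would be in panic mode."""
    if total_hosts == 0:
        return True
    return healthy_hosts / total_hosts < PANIC_THRESHOLD


# Steady state: both priority-0 endpoints healthy -> no panic.
print(in_panic(2, 2))   # False

# Right after a CDS swap: the new hosts have not yet passed their first
# active health check, so the host set is momentarily 0/2 healthy -> panic,
# even though no check ever *failed*.
print(in_panic(0, 2))   # True
```

If this model is right, it would explain panic spikes that coincide with membership changes while all health check counters show only successes.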

We are on Envoy version: `envoy 0/1.9.0-dev//RELEASE live 1394162 3549119 1`
Thanks!

Labels: question (Questions that are neither investigations, bugs, nor enhancements)