503s and _No_route_to_host errors due to routing to non-existent Endpoints #4685

@coro

Description:
We have been seeing many 503 errors when connecting to a Service with a lot of Pod churn.

In these cases, the logs showed that the upstream_host Envoy was attempting to connect to belonged to Pods that no longer existed in the cluster. Some of these Pods had been terminated more than 50 minutes earlier.

Repro steps:
Simple setup of Gateway (AWS NLB) -> HTTPRoute -> Service pointing to a Deployment with a lot of Pod churn.

Environment:
Gateway: v1.2.1 (also seen on v1.1.3; not seen on v1.1.1)
Envoy: v1.32.1 (also seen on v1.31.1; not seen on v1.31.0)
EKS cluster v1.29

Logs:

{
  "start_time": "2024-11-08T13:54:35.382Z",
  "method": "POST",
  "x-envoy-origin-path": "/v1/models/custom-model:predict",
  "protocol": "HTTP/1.1",
  "response_code": "503",
  "response_flags": "UF",
  "response_code_details": "upstream_reset_before_response_started{remote_connection_failure|delayed_connect_error:_No_route_to_host}",
  "connection_termination_details": "-",
  "upstream_transport_failure_reason": "delayed_connect_error:_No_route_to_host",
  "bytes_received": "494",
  "bytes_sent": "165",
  "duration": "3060",
  "x-envoy-upstream-service-time": "-",
  "x-forwarded-for": "10.0.99.64",
  "user-agent": "python-requests/2.32.3",
  "x-request-id": "3dbc6bcc-79b1-4ec1-855c-1569b69f5416",
  ":authority": "kserve-detection.prod.signal",
  "upstream_host": "10.0.101.243:8080",
  "upstream_cluster": "httproute/default/detection/rule/0",
  "upstream_local_address": "-",
  "downstream_local_address": "10.0.101.92:10080",
  "downstream_remote_address": "10.0.99.64:41562",
  "requested_server_name": "-",
  "route_name": "httproute/default/detection/rule/0/match/0/kserve-detection_prod_signal"
}
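
For anyone trying to confirm the same symptom, here is a minimal sketch of how entries like the one above could be cross-checked against the Pods that currently exist. It assumes the access log is emitted as one JSON object per line with the same field names as above, and that the set of live Pod IPs has been collected separately (for example from the Service's endpoints). The function name `find_stale_upstream_failures` and the `live_pod_ips` parameter are hypothetical, not part of Envoy or the Gateway.

```python
import json


def find_stale_upstream_failures(log_lines, live_pod_ips):
    """Return (start_time, upstream_host) pairs for 503 responses flagged UF
    whose upstream IP is not in live_pod_ips (a set of current Pod IPs)."""
    stale = []
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        # Only look at 503s caused by an upstream connection failure.
        if entry.get("response_code") != "503":
            continue
        if "UF" not in entry.get("response_flags", "").split(","):
            continue
        host = entry.get("upstream_host", "-")
        ip = host.rsplit(":", 1)[0]  # strip the port, e.g. "10.0.101.243:8080"
        if ip not in live_pod_ips:
            stale.append((entry.get("start_time"), host))
    return stale
```

Called with the log entry above and a set of live Pod IPs that does not contain 10.0.101.243, this returns that entry's start_time and upstream_host, which matches what we observed: the failing requests were sent to an address that no longer belongs to any Pod.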

The generated cluster config for this route was:

   "dynamic_active_clusters": [
    {
     "version_info": "a998a3b0100c28a29ff3ee662cb01dc45463f0260c9a6a50107dcc12a45938fc",
     "cluster": {
      "@type": "type.googleapis.com/envoy.config.cluster.v3.Cluster",
      "name": "httproute/default/detection/rule/0",
      "type": "EDS",
      "eds_cluster_config": {
       "eds_config": {
        "ads": {},
        "resource_api_version": "V3"
       },
       "service_name": "httproute/default/detection/rule/0"
      },
      "connect_timeout": "10s",
      "per_connection_buffer_limit_bytes": 32768,
      "lb_policy": "LEAST_REQUEST",
      "circuit_breakers": {
       "thresholds": [
        {
         "max_retries": 1024
        }
       ]
      },
      "dns_lookup_family": "V4_ONLY",
      "outlier_detection": {},
      "common_lb_config": {
       "locality_weighted_lb_config": {}
      },
      "ignore_health_on_host_removal": true
     },
     "last_updated": "2024-11-08T14:29:21.557Z"
    },
...
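
Since the cluster is of type EDS, the stale address has to come from the endpoints Envoy currently holds for it. A rough sketch of how that could be checked is below. It assumes the output of Envoy's admin `/clusters?format=json` endpoint has already been fetched and parsed into a dict, and that it has the `cluster_statuses` / `host_statuses` / `address.socket_address` layout; please verify that layout against your Envoy version. The function name `stale_endpoints` and the `live_pod_ips` parameter are hypothetical.

```python
def stale_endpoints(clusters_json, cluster_name, live_pod_ips):
    """List "ip:port" strings for hosts in the named cluster whose IP is not
    in live_pod_ips (a set of current Pod IPs gathered separately)."""
    stale = []
    for cluster in clusters_json.get("cluster_statuses", []):
        if cluster.get("name") != cluster_name:
            continue
        for host in cluster.get("host_statuses", []):
            # Assumed layout: {"address": {"socket_address": {"address", "port_value"}}}
            sock = host.get("address", {}).get("socket_address", {})
            ip = sock.get("address")
            if ip and ip not in live_pod_ips:
                stale.append(f"{ip}:{sock.get('port_value')}")
    return stale
```

For the cluster above, the `cluster_name` argument would be "httproute/default/detection/rule/0". Any address returned here that has no matching Pod would point at Envoy holding on to endpoints that were already removed from the cluster.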
