503s and _No_route_to_host errors due to routing to non-existent Endpoints #4685
Closed
Description:
We have been seeing many 503 errors when connecting to a Service whose backing Pods churn frequently.
The access logs show that in these cases the upstream_host Envoy was attempting to connect to belonged to a Pod that no longer existed in the cluster. Some of these Pods had been terminated more than 50 minutes earlier.
Repro steps:
A simple setup: Gateway (AWS NLB) -> HTTPRoute -> Service pointing to a Deployment with a lot of Pod churn.
Environment:
Gateway: v1.2.1 (also seen on v1.1.3) (not seen on v1.1.1)
Envoy: v1.32.1 (also seen on v1.31.1) (not seen on v1.31.0)
EKS cluster v1.29
Logs:
{
  "start_time": "2024-11-08T13:54:35.382Z",
  "method": "POST",
  "x-envoy-origin-path": "/v1/models/custom-model:predict",
  "protocol": "HTTP/1.1",
  "response_code": "503",
  "response_flags": "UF",
  "response_code_details": "upstream_reset_before_response_started{remote_connection_failure|delayed_connect_error:_No_route_to_host}",
  "connection_termination_details": "-",
  "upstream_transport_failure_reason": "delayed_connect_error:_No_route_to_host",
  "bytes_received": "494",
  "bytes_sent": "165",
  "duration": "3060",
  "x-envoy-upstream-service-time": "-",
  "x-forwarded-for": "10.0.99.64",
  "user-agent": "python-requests/2.32.3",
  "x-request-id": "3dbc6bcc-79b1-4ec1-855c-1569b69f5416",
  ":authority": "kserve-detection.prod.signal",
  "upstream_host": "10.0.101.243:8080",
  "upstream_cluster": "httproute/default/detection/rule/0",
  "upstream_local_address": "-",
  "downstream_local_address": "10.0.101.92:10080",
  "downstream_remote_address": "10.0.99.64:41562",
  "requested_server_name": "-",
  "route_name": "httproute/default/detection/rule/0/match/0/kserve-detection_prod_signal"
}
The generated cluster config for this route was:
"dynamic_active_clusters": [
  {
    "version_info": "a998a3b0100c28a29ff3ee662cb01dc45463f0260c9a6a50107dcc12a45938fc",
    "cluster": {
      "@type": "type.googleapis.com/envoy.config.cluster.v3.Cluster",
      "name": "httproute/default/detection/rule/0",
      "type": "EDS",
      "eds_cluster_config": {
        "eds_config": {
          "ads": {},
          "resource_api_version": "V3"
        },
        "service_name": "httproute/default/detection/rule/0"
      },
      "connect_timeout": "10s",
      "per_connection_buffer_limit_bytes": 32768,
      "lb_policy": "LEAST_REQUEST",
      "circuit_breakers": {
        "thresholds": [
          {
            "max_retries": 1024
          }
        ]
      },
      "dns_lookup_family": "V4_ONLY",
      "outlier_detection": {},
      "common_lb_config": {
        "locality_weighted_lb_config": {}
      },
      "ignore_health_on_host_removal": true
    },
    "last_updated": "2024-11-08T14:29:21.557Z"
  },
...
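For anyone trying to confirm the same pattern, here is a rough sketch of how the access logs could be cross-checked against the currently live endpoints. It assumes the logs are available as one JSON object per line with the same field names as the entry above, and that the set of live `ip:port` endpoints has been collected separately (for example from the Service's EndpointSlices). The function name `find_stale_upstream_hits` and the sample live address are hypothetical and only for illustration.

```python
import json


def find_stale_upstream_hits(log_lines, live_endpoints):
    """Return 503/UF log entries whose upstream_host is not in live_endpoints.

    log_lines: iterable of strings, one JSON access-log entry per line (assumed format).
    live_endpoints: set of "ip:port" strings for Pods that currently exist.
    """
    stale = []
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            # Skip anything that is not a JSON access-log line.
            continue
        if entry.get("response_code") != "503":
            continue
        # response_flags may hold several comma-separated flags.
        flags = entry.get("response_flags", "").split(",")
        if "UF" not in flags:
            continue
        host = entry.get("upstream_host", "-")
        if host != "-" and host not in live_endpoints:
            stale.append(entry)
    return stale


if __name__ == "__main__":
    sample = json.dumps({
        "response_code": "503",
        "response_flags": "UF",
        "upstream_host": "10.0.101.243:8080",
    })
    live = {"10.0.101.17:8080"}  # hypothetical live endpoint, not from the issue
    for hit in find_stale_upstream_hits([sample], live):
        print(hit["upstream_host"])
```

Any entry this returns is a request that Envoy sent to an address no longer backing the Service, which is the symptom described in this issue.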