Title: Some 503 conditions not being retried, even with 503 as a retry code
Description:
In an Istio environment, every service now has a default retry policy, which looks like this:
```json
"route": {
  "cluster": "outbound|80||my-service.default.svc.cluster.local",
  "timeout": "0s",
  "retry_policy": {
    "retry_on": "connect-failure,refused-stream,unavailable,cancelled,resource-exhausted,retriable-status-codes",
    "num_retries": 2,
    "retry_host_predicate": [
      {
        "name": "envoy.retry_host_predicates.previous_hosts"
      }
    ],
    "host_selection_retry_max_attempts": "3",
    "retriable_status_codes": [
      503
    ]
  },
  "max_grpc_timeout": "0s"
}
```
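To make the policy above easier to read at a glance, here is a small Python sketch that pulls out which `retry_on` conditions are enabled and how many attempts a request gets in total. It copies only the `retry_policy` fields from the config above; `ROUTE_JSON` and `summarize_retry_policy` are hypothetical names used for illustration, not part of Istio or Envoy.

```python
import json

# Subset of the Istio-generated route shown above (retry_policy fields only).
ROUTE_JSON = """
{
  "retry_policy": {
    "retry_on": "connect-failure,refused-stream,unavailable,cancelled,resource-exhausted,retriable-status-codes",
    "num_retries": 2,
    "retriable_status_codes": [503]
  }
}
"""


def summarize_retry_policy(route: dict) -> dict:
    """Return the enabled retry_on conditions and the total attempt budget."""
    policy = route["retry_policy"]
    # retry_on is a single comma-separated string of condition names.
    conditions = [c.strip() for c in policy["retry_on"].split(",") if c.strip()]
    return {
        "conditions": conditions,
        # num_retries counts retries only, so total attempts = 1 + num_retries.
        "max_attempts": 1 + policy["num_retries"],
        "retriable_status_codes": policy.get("retriable_status_codes", []),
    }


summary = summarize_retry_policy(json.loads(ROUTE_JSON))
print(summary["conditions"])
print(summary["max_attempts"])  # 3
```

Note that neither `5xx` nor `gateway-error` appears among the enabled conditions; only `connect-failure`, `refused-stream`, the three gRPC status conditions, and `retriable-status-codes` do.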
During scale-down events, I see logs like the following on the client side, and the requests do not appear to be retried:
```
[2019-04-25 00:56:30.950][31][debug][connection] [external/envoy/source/common/network/connection_impl.cc:502] [C11540] remote close
[2019-04-25 00:56:30.950][31][debug][connection] [external/envoy/source/common/network/connection_impl.cc:183] [C11540] closing socket: 0
[2019-04-25 00:56:30.950][31][debug][client] [external/envoy/source/common/http/codec_client.cc:82] [C11540] disconnect. resetting 1 pending requests
[2019-04-25 00:56:30.950][31][debug][client] [external/envoy/source/common/http/codec_client.cc:105] [C11540] request reset
[2019-04-25 00:56:30.950][31][debug][router] [external/envoy/source/common/router/router.cc:644] [C11555][S6290924653342831959] upstream reset: reset reason connection termination
[2019-04-25 00:56:30.951][31][debug][filter] [src/envoy/http/mixer/filter.cc:133] Called Mixer::Filter : encodeHeaders 2
[2019-04-25 00:56:30.951][31][debug][http] [external/envoy/source/common/http/conn_manager_impl.cc:1243] [C11555][S6290924653342831959] closing connection due to connection close header
[2019-04-25 00:56:30.951][31][debug][http] [external/envoy/source/common/http/conn_manager_impl.cc:1305] [C11555][S6290924653342831959] encoding headers via codec (end_stream=false):
  ':status', '503'
  'content-length', '95'
  'content-type', 'text/plain'
  'date', 'Thu, 25 Apr 2019 00:56:30 GMT'
  'server', 'envoy'
  'connection', 'close'
[2019-04-25 00:56:30.951][31][debug][filter] [src/envoy/http/mixer/filter.cc:205] Called Mixer::Filter : onDestroy state: 2
[2019-04-25 00:56:30.951][31][debug][connection] [external/envoy/source/common/network/connection_impl.cc:101] [C11555] closing data_to_write=248 type=2
[2019-04-25 00:56:30.951][31][debug][connection] [external/envoy/source/common/network/connection_impl.cc:153] [C11555] setting delayed close timer with timeout 1000 ms
[2019-04-25 00:56:30.951][31][debug][pool] [external/envoy/source/common/http/http1/conn_pool.cc:129] [C11540] client disconnected, failure reason:
[2019-04-25 00:56:30.951][31][debug][filter] [src/envoy/http/mixer/filter.cc:219] Called Mixer::Filter : log
[2019-04-25 00:56:30.951][31][debug][filter] [./src/envoy/http/mixer/report_data.h:132] No dynamic_metadata found for filter envoy.filters.http.rbac
```
If I override the default retry policy with my own that uses `retryOn: gateway-error`, the issue goes away completely. I would have expected `connect-failure` to cover this on its own, and failing that, certainly `retriable-status-codes` with 503.
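The gap between the expected and the observed behavior can be illustrated with a simplified Python model. This is not Envoy's code, and the mapping below is an assumption to verify against the Envoy source for the version in use. It assumes that `connect-failure` only matches a failure to establish the upstream connection, that `5xx` and `gateway-error` treat any upstream reset (other than circuit-breaker overflow) as retriable, and that `retriable-status-codes` is only consulted for status codes in an actual upstream response, not for a 503 that Envoy generates locally after a reset. `ResetReason`, `would_retry_from_reset`, and `ISTIO_DEFAULT_RETRY_ON` are hypothetical names.

```python
from __future__ import annotations

from enum import Enum, auto


class ResetReason(Enum):
    # Hypothetical subset of upstream reset reasons, named after Envoy's log text.
    CONNECTION_FAILURE = auto()      # could not connect to the upstream host
    CONNECTION_TERMINATION = auto()  # connection closed after it was established
    REMOTE_REFUSED_STREAM = auto()   # peer refused the stream (HTTP/2 REFUSED_STREAM)
    OVERFLOW = auto()                # circuit breaker overflow


def would_retry_from_reset(retry_on: set[str], reason: ResetReason) -> bool:
    """Simplified model (an assumption) of how retry_on conditions apply to resets."""
    if reason is ResetReason.OVERFLOW:
        return False
    if retry_on & {"5xx", "gateway-error"}:
        # Assumed: these conditions count any upstream reset as retriable.
        return True
    if "refused-stream" in retry_on and reason is ResetReason.REMOTE_REFUSED_STREAM:
        return True
    if "connect-failure" in retry_on and reason is ResetReason.CONNECTION_FAILURE:
        return True
    # retriable-status-codes is deliberately not checked here: in this model it
    # only applies to status codes returned by the upstream, not to resets.
    return False


# Conditions enabled by the Istio default policy shown earlier.
ISTIO_DEFAULT_RETRY_ON = {
    "connect-failure", "refused-stream", "unavailable",
    "cancelled", "resource-exhausted", "retriable-status-codes",
}

# The log above reports "reset reason connection termination".
print(would_retry_from_reset(ISTIO_DEFAULT_RETRY_ON, ResetReason.CONNECTION_TERMINATION))  # False
print(would_retry_from_reset({"gateway-error"}, ResetReason.CONNECTION_TERMINATION))  # True
```

Under these assumptions, a reset with reason "connection termination" matches none of the default conditions, which would be consistent with `gateway-error` fixing the problem while `connect-failure` and `retriable-status-codes` do not.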
I'm new to debugging these systems, but I'm happy to provide anything else that would help if someone can point me in the right direction.
Relevant Links:
Istio issue that led to opening one here as well: istio/istio#13616
This also seems similar to #5876, but I'm not sure whether it's the same issue.