Description:
I have a fleet of Envoy instances running in Docker, calling into an upstream that is a Google Cloud-hosted service, over h2.
Envoy periodically (very hard to repro) returns a 503, and the upstream_cx_destroy_remote_with_active_rq counter is bumped.
I tried using retriable-status-codes (https://www.envoyproxy.io/docs/envoy/v1.9.0/configuration/http_filters/router_filter#config-http-filters-router-x-envoy-retry-on) and made 503 a retriable status code; however, the retry does not take place.
The retry takes place only if I explicitly make the upstream send the 503 (without upstream_cx_destroy_remote_with_active_rq being bumped).
Is this expected? How can I make Envoy retry in this case?
This only happens once every 24-36 hours under a 200 QPS load, but when it does, hundreds of requests fail at a time, causing significant damage. Turning on debug logs might slow everything down. Should I not be concerned about the I/O and CPU increase, and about the effect of Envoy's overhead on redirecting those requests?
Also, 5xx as a retry policy does not work. I explicitly want 503s to be retried, so in 1.9 I am using the configuration below, but it has no effect for Envoy-generated 503s resulting from upstream_cx_destroy_remote_with_active_rq. How do I proceed to give you meaningful info?
```yaml
prefix: "/"
route:
  timeout: 120s
  cluster: test_cluster
  host_rewrite: "x.y.com"
  retry_policy:
    retry_on: connect-failure,refused-stream,retriable-status-codes
    num_retries: 2
    retriable_status_codes: [503]
```
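To make the observed behavior concrete, here is a minimal model (plain Python, not Envoy source) of the retry decision as described above: `retriable-status-codes` only inspects status codes the upstream actually sent, while a mid-stream remote close is handled as a reset, so the 503 that Envoy synthesizes for upstream_cx_destroy_remote_with_active_rq never reaches the status-code check. The function and reset-reason names here are illustrative assumptions, not Envoy identifiers.

```python
# Hedged model of the retry decision described in this issue.
# Mirrors the retry_policy from the YAML config above.
RETRY_ON = {"connect-failure", "refused-stream", "retriable-status-codes"}
RETRIABLE_STATUS_CODES = {503}

def would_retry(event_kind, status=None, reset_reason=None):
    """event_kind: 'headers' when the upstream actually responded,
    'reset' when the stream/connection was torn down mid-request
    (Envoy then synthesizes a local 503)."""
    if event_kind == "headers":
        # Status-code policies only see codes the upstream really sent.
        return ("retriable-status-codes" in RETRY_ON
                and status in RETRIABLE_STATUS_CODES)
    if event_kind == "reset":
        # Only reset-oriented policies are consulted here; the
        # synthesized 503 bypasses the status-code check entirely.
        if reset_reason == "connect_failure":
            return "connect-failure" in RETRY_ON
        if reset_reason == "refused_stream":
            return "refused-stream" in RETRY_ON
        return False
    return False
```

Under this model, an upstream-sent 503 is retried, but a remote close with an active request is not, which matches what I observe.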
From Harvey Tuch:

> Looking at `RetryStateImpl::wouldRetryFromReset()` and `RetryStateImpl::wouldRetry()` I'm not sure if they're handling the case of a mid-stream reset. Please file a GH issue and we can discuss further there.
Relevant Links:
https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/envoy-users/fV87_K4cMks/LvGkWbSZGgAJ
Is this the same as #5023?