-
Notifications
You must be signed in to change notification settings - Fork 5.3k
Description
Description
I don't know if this is intended or not but it seems that the global timeouts set on the route via timeout [Link] doesn't work when max_stream_duration [Link] is also set.
Documentation on either of these configs doesn't mention that they can't be used together but, looking at the code it seems to be the case.
https://github.com/envoyproxy/envoy/blob/main/source/common/router/router.cc#L137
if (!route.usingNewTimeouts()) {
...
}
What's a little confusing/surprising to us is the fact that setting the global timeouts using the header x-envoy-upstream-rq-timeout-ms still works even when max_stream_duration is supplied.
https://github.com/envoyproxy/envoy/blob/main/source/common/router/router.cc#L167
if (respect_expected_rq_timeout) {
...
}
Is this behavior intended? If this is indeed the intention then we should update the docs and clarify the behavior of x-envoy-upstream-rq-timeout-ms.
Use Case
We have a mix of streaming and non-streaming routes where we are setting both the timeout as well as max_stream_duration Current Configuration: (max_stream_duration = timeout + 240s).
We were expecting that the non-streaming routes won't have any impact from adding max_stream_duration and these would still continue to return 504 status codes once the global timeout get exhausted. Streaming route on the other hand would benefit from setting max_stream_duration as global timeout isn't very useful on these routes.
But after the change of adding a higher max_stream_duration on all the routes, the non-streaming routes started returning a 408 status code, not respecting the global timeouts being set via timeout config. Since it changed the final status code being returned from 504 --> 408, it started breaking some of our existing clients who have a specific post-processing logic for handling/retrying on the 504 status codes.
For now we have mitigated the issue by setting the x-envoy-upstream-rq-timeout-ms header on these routes as this header is still respected irrespective of whether or not the max_stream_duration is set on the routes. We would like to understand the correct/long term fix for this.
Repro Steps
Try to set both the timeout as well as max_stream_duration on any of your route definitions. Even you you set a lower global timeout (say 60s) and a higher stream timeout (say 300s), the global timeout timer will never fire and you'll end up getting a 408 status code after 300s when the stream timeout timer fires.
{
"cluster": "authN",
"timeout": "60s",
"idle_timeout": "3600s",
"max_stream_duration": "{\"max_stream_duration\": \"300s\"}",
...
}