Improve grpc timeout interaction with outlier detection

Envoy currently supports reading the `grpc-timeout` header to set the global timeout for a request, but doing so creates a race between the gRPC client timing out and Envoy timing out: if the client times out first, we get a downstream reset, while if Envoy times out first, we get a `DEADLINE_EXCEEDED` response.

In most cases this is fine (the client sees a timeout either way), but it means that Envoy's outlier detection cannot accurately account for gRPC timeouts when the gRPC client times out first. From our data, the client appears to time out the request before Envoy in most cases.
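To make the race concrete, here is a toy model (not Envoy code). It assumes both sides use the same `grpc-timeout` value, that Envoy's timer starts some offset after the client's, and that a client reset takes some time to reach Envoy. The function name and all parameters are hypothetical.

```python
def first_to_fire(timeout_ms: float,
                  envoy_start_offset_ms: float,
                  reset_latency_ms: float) -> str:
    """Toy model of the timeout race as seen from the proxy.

    All times are measured from the moment the client starts its deadline.
    envoy_start_offset_ms and reset_latency_ms are illustrative assumptions,
    not values Envoy exposes.
    """
    # When the client's reset would arrive at the proxy.
    reset_arrives_at = timeout_ms + reset_latency_ms
    # When the proxy's own global timeout would fire.
    envoy_fires_at = envoy_start_offset_ms + timeout_ms
    if reset_arrives_at < envoy_fires_at:
        # The proxy only sees a downstream reset, which is not counted as a timeout.
        return "downstream_reset"
    return "deadline_exceeded"
```

In this model the client wins whenever Envoy's timer starts later than the time the reset needs to arrive, which would be consistent with the client timing out first in most cases.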

Some options I can think of:
* Synthetically adjust the timeout provided by `grpc-timeout`, for example by decreasing it by 1ms. This would reduce the likelihood that the gRPC client wins the timeout race, although it would not eliminate it.
* Treat downstream resets close to the global timeout as timeouts, for example by treating any downstream reset less than 1ms away from the global timeout as a timeout.
* Begin the global timeout timer earlier: @mpuncel noted that the global timeout starts after the router sees the *entire request*, which might explain why the gRPC client times out more quickly in most cases. This would not fix the issue but might make it less prevalent.
* Do nothing and tell people to set `x-envoy-upstream-rq-timeout-ms` lower than the `grpc-timeout` (or avoid the use of a deadline in the client lib altogether). This isn't great but would require no changes to Envoy.
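A rough sketch of what the first two options could look like, in Python rather than Envoy's C++ and not tied to any existing Envoy code. The header parsing follows the `grpc-timeout` wire format (up to 8 digits followed by one of the units `H`, `M`, `S`, `m`, `u`, `n`); the function names, the 1ms margin, and the 1ms window are assumptions taken from the examples above.

```python
import re
from typing import Optional

# Unit suffixes of the grpc-timeout header, converted to milliseconds.
_UNIT_TO_MS = {
    "H": 3_600_000.0,  # hours
    "M": 60_000.0,     # minutes
    "S": 1_000.0,      # seconds
    "m": 1.0,          # milliseconds
    "u": 0.001,        # microseconds
    "n": 0.000001,     # nanoseconds
}
_GRPC_TIMEOUT_RE = re.compile(r"([0-9]{1,8})([HMSmun])")


def parse_grpc_timeout_ms(value: str) -> Optional[float]:
    """Parse a grpc-timeout header value into milliseconds, or None if malformed."""
    match = _GRPC_TIMEOUT_RE.fullmatch(value)
    if match is None:
        return None
    return int(match.group(1)) * _UNIT_TO_MS[match.group(2)]


def adjusted_timeout_ms(value: str, margin_ms: float = 1.0) -> Optional[float]:
    """Option 1 (hypothetical): shave a small margin off the client's deadline.

    The result is clamped at zero; how a zero timeout should be handled
    is left open here.
    """
    parsed = parse_grpc_timeout_ms(value)
    if parsed is None:
        return None
    return max(parsed - margin_ms, 0.0)


def is_probable_timeout(reset_elapsed_ms: float,
                        timeout_ms: float,
                        window_ms: float = 1.0) -> bool:
    """Option 2 (hypothetical): count a downstream reset as a timeout when it
    lands less than window_ms away from the global timeout."""
    return abs(timeout_ms - reset_elapsed_ms) < window_ms
```

For example, `adjusted_timeout_ms("100m")` yields `99.0`, and a reset observed 99.5ms into a 100ms timeout would be classified as a timeout, while one at 50ms would not. Both thresholds are heuristics, so neither removes the race entirely.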



Improve grpc timeout interaction with outlier detection #6566
