Closed
Labels
design proposal: Needs design doc/proposal before implementation
Description
Envoy currently supports reading the grpc-timeout header to set the global timeout for a request, but doing so creates a race between the gRPC client timing out and Envoy timing out: if the client times out first, we get a downstream reset, while if Envoy times out first, we get a DEADLINE_EXCEEDED response.
In most cases this is fine (client sees a timeout in either case) but it means that Envoy's outlier detection is not able to accurately account for gRPC timeouts when the gRPC client times out first. From our data it seems like the client will time out the request before Envoy in most cases.
Some options I can think of:
- Synthetically adjust the timeout provided by grpc-timeout, for example by decreasing it by 1ms. This would reduce the likelihood that the gRPC client wins the timeout race, although it would not eliminate it.
- Treat downstream resets close to the global timeout as timeouts, for example by treating any downstream reset less than 1ms away from the global timeout as a timeout.
- Begin the global timeout timer earlier: @mpuncel noted that the global timeout starts after the router sees the entire request, which might explain why the gRPC client times out more quickly in most cases. This would not fix this issue but might make the issue less prevalent.
- Do nothing and tell people to set x-envoy-upstream-rq-timeout-ms lower than the grpc-timeout (or avoid the use of a deadline in the client lib altogether). This isn't great but would require no changes to Envoy.
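
To make the first two options concrete, here is a rough Python sketch of the logic, not Envoy code. The header parsing follows the grpc-timeout format from the gRPC-over-HTTP/2 spec (up to 8 ASCII digits followed by one of the units H, M, S, m, u, n). The function names, the 1ms offset, and the 1ms window are all hypothetical and only mirror the examples above. How a result of zero should be handled is left open here.

```python
import re

# Unit suffixes from the gRPC-over-HTTP/2 spec, converted to milliseconds.
_UNIT_TO_MS = {
    "H": 3_600_000.0,  # hours
    "M": 60_000.0,     # minutes
    "S": 1_000.0,      # seconds
    "m": 1.0,          # milliseconds
    "u": 1e-3,         # microseconds
    "n": 1e-6,         # nanoseconds
}
_GRPC_TIMEOUT_RE = re.compile(r"([0-9]{1,8})([HMSmun])")


def parse_grpc_timeout_ms(value):
    """Return the grpc-timeout header value in milliseconds, or None if malformed."""
    match = _GRPC_TIMEOUT_RE.fullmatch(value)
    if match is None:
        return None
    return int(match.group(1)) * _UNIT_TO_MS[match.group(2)]


def adjusted_timeout_ms(header_value, offset_ms=1.0):
    """Option 1 (hypothetical): shrink the client deadline by a small offset.

    The result is clamped at zero; a real implementation would have to decide
    what a zero timeout should mean.
    """
    parsed = parse_grpc_timeout_ms(header_value)
    if parsed is None:
        return None
    return max(parsed - offset_ms, 0.0)


def reset_counts_as_timeout(elapsed_ms, timeout_ms, window_ms=1.0):
    """Option 2 (hypothetical): treat a downstream reset as a timeout when it
    arrives within window_ms of the global timeout."""
    return timeout_ms - elapsed_ms <= window_ms
```

For example, `adjusted_timeout_ms("100m")` gives a 99ms proxy-side timeout, and a reset observed 99.5ms into a 100ms deadline would be counted as a timeout under a 1ms window, while a reset at 50ms would not.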