Improve grpc timeout interaction with outlier detection #6566

@snowp

Envoy currently supports reading the grpc-timeout header to set the global timeout for the request, but doing so creates a race between the gRPC client's timeout and Envoy's: if the client times out first, we get a downstream reset, while if Envoy times out first, we get a DEADLINE_EXCEEDED response.
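
For context, here is a minimal sketch of how a grpc-timeout value maps to a millisecond deadline. It follows the header format in the gRPC-over-HTTP/2 spec (up to 8 digits plus a unit of H, M, S, m, u or n). The function name `parse_grpc_timeout_ms` is hypothetical and this is not Envoy's actual parsing code.

```python
import re

# Unit suffixes from the gRPC-over-HTTP/2 spec, expressed in milliseconds.
_UNIT_TO_MS = {
    "H": 3_600_000.0,  # hours
    "M": 60_000.0,     # minutes
    "S": 1_000.0,      # seconds
    "m": 1.0,          # milliseconds
    "u": 1e-3,         # microseconds
    "n": 1e-6,         # nanoseconds
}

# At most 8 ASCII digits followed by exactly one unit character.
_GRPC_TIMEOUT_RE = re.compile(r"([0-9]{1,8})([HMSmun])")


def parse_grpc_timeout_ms(header_value):
    """Return the timeout in milliseconds, or None if the value is malformed."""
    match = _GRPC_TIMEOUT_RE.fullmatch(header_value)
    if match is None:
        return None
    return int(match.group(1)) * _UNIT_TO_MS[match.group(2)]
```

A client deadline sent as `100m` would give a 100 ms global timeout here, so both sides are counting down to roughly the same instant, which is where the race comes from.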

In most cases this is fine, since the client sees a timeout either way, but it means that Envoy's outlier detection cannot accurately account for gRPC timeouts when the gRPC client times out first. Our data suggests that the client times out the request before Envoy does in most cases.

Some options I can think of:

  • Synthetically adjust the timeout provided by grpc-timeout, for example by decreasing it by 1ms. This would reduce the likelihood that the gRPC client wins the timeout race, although it would not eliminate it.
  • Treat downstream resets close to the global timeout as timeouts, for example by treating any downstream reset less than 1ms away from the global timeout as a timeout.
  • Begin the global timeout timer earlier: @mpuncel noted that the global timeout starts after the router sees the entire request, which might explain why the gRPC client times out more quickly in most cases. This would not fix the issue but might make it less prevalent.
  • Do nothing and tell people to set x-envoy-upstream-rq-timeout-ms lower than the grpc-timeout (or avoid the use of a deadline in the client lib altogether). This isn't great but would require no changes to Envoy.
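
To make the first two options concrete, here is a rough sketch of the logic each one implies. All names (`MARGIN_MS`, `adjusted_global_timeout_ms`, `classify_downstream_reset`) are hypothetical, the 1ms value is just the example figure from the list, and none of this reflects existing Envoy code.

```python
MARGIN_MS = 1.0  # the 1ms example figure from the options above, not a tuned value


def adjusted_global_timeout_ms(grpc_timeout_ms, margin_ms=MARGIN_MS):
    """Option 1: shrink the proxy-side timeout so it tends to fire first."""
    # Assumption for this sketch: a deadline too small to shrink is left as-is
    # rather than being reduced to zero or a negative value.
    if grpc_timeout_ms <= margin_ms:
        return grpc_timeout_ms
    return grpc_timeout_ms - margin_ms


def classify_downstream_reset(elapsed_ms, global_timeout_ms, window_ms=MARGIN_MS):
    """Option 2: count a reset less than window_ms from the deadline as a timeout."""
    if abs(global_timeout_ms - elapsed_ms) < window_ms:
        return "timeout"
    return "downstream_reset"
```

With a 100 ms deadline, option 1 would arm the timer at 99 ms, and option 2 would report a reset at 99.5 ms as a timeout for outlier detection while still treating a reset at 50 ms as an ordinary client cancellation. Either way the margin is a heuristic, so the race is narrowed rather than removed.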

Labels: design proposal (Needs design doc/proposal before implementation)
