
gRPC retries, assuming gRPC status code is in the returned headers #1067

Merged
htuch merged 29 commits into envoyproxy:master from alyssawilk:grpc
Jun 12, 2017

Conversation

@alyssawilk
Contributor

for issue #721

Member

@rshriram rshriram left a comment


This is cool! Could you add the retry policies to docs as well?

@alyssawilk
Contributor Author

alyssawilk commented Jun 8, 2017 via email


class Common {
public:
enum GrpcStatus {
Member

It would be helpful if this could be moved to include/envoy/grpc/status.h, since I would like to use this in interface files for the bidi client I'm working on.

Member

gRPC status codes are identical to protobuf status codes, so re-using those might be enough.

Member

@rshriram rshriram Jun 9, 2017

+1 to what @htuch said. I would like to re-use those for fault injection as well. Do we also have HTTP2 specific error codes in a similar place (in addition to HTTP1)? All I see are HTTP1 codes in include/envoy/http/codes.h

Member

I'm generally a fan of decoupling protobuf from gRPC. As we know from the other stuff going on with grpc-web, there's not necessarily a direct correspondence between gRPC and protobuf.

static const uint32_t RETRY_ON_CONNECT_FAILURE = 0x2;
static const uint32_t RETRY_ON_RETRIABLE_4XX = 0x4;
static const uint32_t RETRY_ON_REFUSED_STREAM = 0x8;
static const uint32_t RETRY_ON_GRPC_CANCELLED = 0x16;
Member

0x10, 0x20, 0x40

Contributor Author

facepalm Done :-P

Member

Don't see this in the updated diff yet.



only supported for gRPC status codes in response headers. gRPC status codes in trailers will not trigger
retry logic.One or more policies can be specified using a ',' delimited list. The supported policies are:

cancelled
Member

Cancelled is typically returned when the call is cancelled by the caller; not sure a proxy should retry.

http://www.grpc.io/docs/guides/concepts.html#cancelling-rpcs

Contributor Author

@alyssawilk alyssawilk Jun 9, 2017

Would that be worth taking up with @twoism over in #721 ? If you two work out a good set I'm happy to update the patch to whatever you agree on.


// We short circuit here and do not bother with an allocation if there is no chance we will retry.
if (request_headers.EnvoyRetryOn() || route_policy.retryOn()) {
if (request_headers.EnvoyRetryOn() || request_headers.EnvoyRetryGrpcOn() || route_policy.retryOn()) {
Member

quick comment (not full review), can we also add inline route support as is done in route_policy.retryOn() ?

Member

Rather than extending this if block with more conditions (gRPC, HTTP2 error codes, etc.), would it make sense to abstract them under EnvoyRetryOn() ?

Contributor Author

I think request_headers.EnvoyRetryOn() returning true if there were a literal header x-envoy-grpc-retry-on would be confusing as all generally header functions map cleanly to the header specified. Most other code points could be covered under the general "retry" header if folks think that's cleaner.

Working on route policy now. I'm inclined to add the gRPC strings to the existing retry_on field rather than have retry_on and grpc_retry_on (and num_retries and num_grpc_retries) in router config. If anyone has concerns about HTTP vs gRPC collisions and prefers splitting them out, let me know!
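The author's proposal, concretely: a single `retry_on` string in the route's retry policy that mixes HTTP policy names (like `connect-failure`) with the new gRPC ones (like `cancelled` and `resource-exhausted`). A hedged sketch, assuming the JSON route-config shape of that era; field names are illustrative, not authoritative:

```json
{
  "route": {
    "cluster": "grpc_backend",
    "retry_policy": {
      "retry_on": "connect-failure,cancelled,resource-exhausted",
      "num_retries": 2
    }
  }
}
```

One `retry_on` field and one `num_retries` field cover both protocols, which avoids the `retry_on`/`grpc_retry_on` duplication the author mentions.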

Contributor Author

And done.

Member

@htuch htuch left a comment

LGTM

Setting this header on egress requests will cause Envoy to attempt to retry failed requests (number of
retries defaults to 1, and is configured as x-envoy-grpc-retry is, above). gRPC retries are currently
only supported for gRPC status codes in response headers. gRPC status codes in trailers will not trigger
retry logic.One or more policies can be specified using a ',' delimited list. The supported policies are:
Member

Nit: logic. One.


HttpTestUtility::addDefaultHeaders(headers);
router_.decodeHeaders(headers, true);

// 5xx response.
Member

5xx?

Http::StreamEncoder* request_encoder;
request_encoder =
&codec_client->startRequest(Http::TestHeaderMapImpl{{":method", "POST"},
{":path", "/test/long/url"},
Member

Nit: gRPC looks more like /service/method

{"x-envoy-retry-grpc-on", "cancelled"}},
*response);
codec_client->sendData(*request_encoder, 1024, false);
codec_client->sendTrailers(*request_encoder, request_trailers);
Member

Nit: gRPC requests don't have trailers.

Contributor Author

done

request->encodeHeaders(Http::TestHeaderMapImpl{{":status", "200"}, {"grpc-status", "1"}}, false);
},
[&]() -> void {
if (fake_upstreams_[0]->httpType() == FakeHttpConnection::Type::HTTP1) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we just exclude this from HTTP1 upstream tests? In general you can't proxy gRPC to arbitrary H1 backends I think (certainly not bidi streaming, also you need to get grpc-status into trailers, etc.)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought it was worth testing because it is possible to translate gRPC to HTTP/1.1, and the retry logic doesn't check the connection type before checking for gRPC status codes. Basically if the code might be exercised for HTTP responses which include gRPC status codes for some bizarre reason, might as well test it.

@mattklein123
Member

@alyssawilk before I forget please sign CLA: https://oss.lyft.com/cla/#/signature (CLA bot has been resourced!)


const std::string Common::GRPC_CONTENT_TYPE{"application/grpc"};

Status::GrpcStatus Common::getGrpcStatus(const Http::HeaderMap& trailers) {
Member

Could have Common::validateTrailers() make use of this to do its gRPC status parsing.

Contributor Author

Sure. It's a minor behavior change for out-of-bounds error codes. If we care we could split returns from getGrpcStatus into "missing" status and "out of bounds" status but either way we'll throw an exception so I assume it's moot.

@htuch htuch mentioned this pull request Jun 9, 2017

// This is a non-GRPC error code, indicating the status code in gRPC headers
// was either invalid or missing.
InvalidCode = 16,
Member

If you decided to copy this rather than reusing protobuf (which is fine), let's not use 16 here, maybe -1?

https://github.com/grpc/grpc/blob/master/include/grpc%2B%2B/impl/codegen/status_code_enum.h

return Status::GrpcStatus::InvalidCode;
}
if (grpc_status_code > Status::GrpcStatus::InvalidCode) {
if (grpc_status_code > Status::GrpcStatus::DataLoss) {
Member

Unauthenticated = 16 is the largest one

Member

@htuch htuch left a comment

LGTM, some minor nits and ready to ship.

)

envoy_cc_library(
name = "status_interface",
Member

Nit: Could just call this "status", since it's not as much an interface as an independent interface-level library (we do something similar for optional).

// Permission was denied for the RPC.
PermissionDenied = 7,
// The RPC does not have required credentials for the RPC to succeed.
Unauthenticated = 16,
Member

Can you reorder this so that this is after DataLoss?


uint64_t grpc_status_code;
if (!grpc_status_header || grpc_status_header->value().empty()) {
return Status::GrpcStatus::MissingStatus;
Member

I'm inclined to make this an optional return type, since having a GrpcStatus that indicates status missing is somewhat self contradictory.
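A sketch of the optional-return idea: an empty optional means "no valid grpc-status header", so no self-contradictory MissingStatus enum value is needed. The helper name is hypothetical, and std::optional stands in for the codebase's own Optional&lt;T&gt; of the time:

```cpp
#include <cstdint>
#include <optional>
#include <string>

enum class GrpcStatus : int { Ok = 0, Canceled = 1, Unauthenticated = 16 };

// Hypothetical helper: parse a grpc-status header value, returning
// an empty optional when the header is absent, empty, non-numeric,
// or outside the canonical 0-16 range.
std::optional<GrpcStatus> parseGrpcStatus(const std::string* header_value) {
  if (header_value == nullptr || header_value->empty()) {
    return std::nullopt; // header missing or empty
  }
  uint64_t code;
  try {
    code = std::stoul(*header_value);
  } catch (...) {
    return std::nullopt; // non-numeric value
  }
  if (code > 16) { // Unauthenticated (16) is the largest canonical code
    return std::nullopt; // out-of-range value
  }
  return static_cast<GrpcStatus>(code);
}
```

Callers then distinguish "no status" from "some status" with `has_value()` instead of comparing against a sentinel enum value.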

return Optional<Status::GrpcStatus>();
}
if (!StringUtil::atoul(grpc_status_header->value().c_str(), grpc_status_code) ||
grpc_status_code > Status::GrpcStatus::DataLoss) {
Member

This still needs updating.

htuch previously approved these changes Jun 12, 2017
Member

@mattklein123 mattklein123 left a comment

Looks great. A few small questions/comments.

x-envoy-grpc-retry-on
^^^^^^^^^^^^^^^^^^^^^
Setting this header on egress requests will cause Envoy to attempt to retry failed requests (number of
retries defaults to 1, and is configured as x-envoy-grpc-retry is, above). gRPC retries are currently
Member

"x-envoy-grpc-retry" I think is a typo? (Not sure what you meant here)

resource-exhausted
Envoy will attempt a retry if the gRPC status code in the response headers is "resource-exhausted" (8)

As with x-envoy-grpc-retry, the number of retries can be controlled via the
Member

same here?

retry_on
*(required, string)* specifies the conditions under which retry takes place. These are the same
conditions documented for :ref:`config_http_filters_router_x-envoy-retry-on`.
conditions documented for :ref:`config_http_filters_router_x-envoy-retry-on` and
Member

@mattklein123 mattklein123 Jun 12, 2017

It seems a bit odd to me that in the route, we stick HTTP/gRPC retry policies into a single field, but we have 2 different headers. I don't really feel strongly about this, but thought I would mention it in case anyone else has any strong opinions.

@htuch
Member

htuch commented Jun 12, 2017

Hopefully we can get a merge on this today if there isn't anything major left, my PR (#1057) needs this before I can move it from non-WIP status for final review.

@mattklein123
Member

LGTM, I still think we might want to think about how things are configured between headers and routes but we can change this in a follow up before 1.4.0 is released if we want.

@htuch htuch merged commit 7828e38 into envoyproxy:master Jun 12, 2017
@alyssawilk alyssawilk deleted the grpc branch June 15, 2017 19:39
jpsim pushed a commit that referenced this pull request Nov 28, 2022
…#1067)

Description: scope destruction should happen ahead of destroying main_common_. Additionally while cross thread destruction is acceptable, this PR also moves destruction to happen in the context of the main thread.
Risk Level: med - resolve crash
Testing: repro in alpha builds of the lyft app.

Signed-off-by: Jose Nino <jnino@lyft.com>
Signed-off-by: JP Simard <jp@jpsim.com>
mathetake added a commit that referenced this pull request Mar 3, 2026
**Description**

Garbage was checked in accidentally via a doc fix PR #1030, so this
commit removes it

Signed-off-by: Takeshi Yoneda <t.y.mathetake@gmail.com>