host: track per-host UpstreamLocalityStats.#1755
Conversation
This PR adds per-host counters to support UpstreamLocalityStats. At load report time, these will be aggregated on a per-locality basis. No tests yet, looking for feedback on whether semantically we're doing the right thing by using response code to drive this. Signed-off-by: Harvey Tuch <htuch@google.com>
|
@mattklein123 @rohitbhoj for feedback on the approach before going too wild with writing tests. |
mattklein123
left a comment
There was a problem hiding this comment.
In general looks good, couple of comments.
source/common/router/router.cc
Outdated
| const uint64_t response_code = Http::Utility::getResponseStatus(info.response_headers_); | ||
| if (dropped) { | ||
| upstream_host->stats().rq_dropped_.inc(); | ||
| } else if (response_code == 200) { |
There was a problem hiding this comment.
If we are going by normal HTTP semantics, this should be !Http::CodeUtility::is5xx IMO
source/common/router/router.cc
Outdated
| if (dropped) { | ||
| upstream_host->stats().rq_dropped_.inc(); | ||
| } else if (response_code == 200) { | ||
| // TODO(htuch): When Envoy has first class support for gRPC on the data plane, we should |
There was a problem hiding this comment.
We are already doing gRPC retry in the router using gRPC status codes, so I think we can just skip straight to doing this if we want.
Signed-off-by: Harvey Tuch <htuch@google.com>
|
@mattklein123 I've added gRPC status code support. Still no tests though. Implementation feedback welcome while I work on some of these. |
Signed-off-by: Harvey Tuch <htuch@google.com>
|
@mattklein123 this is now complete and ready for review. |
mattklein123
left a comment
There was a problem hiding this comment.
Thanks generally looks good. Thanks for the really detailed tests. A few comments.
| } | ||
| } | ||
|
|
||
| uint64_t Common::grpcToHttpStatus(Status::GrpcStatus grpc_status) { |
There was a problem hiding this comment.
This actually comes straight from an internal doc on the canonical gRPC -> HTTP mapping. In terms of public documentation, there are HTTP -> gRPC mappings at https://github.com/grpc/grpc/blob/master/doc/http-grpc-status-mapping.md, but not the other way.
Digging around, it looks like this mapping is reflected in a table at https://cloud.google.com/apis/design/errors#handling_errors.
I will add a comment pointing at that. In terms of test, what are you after? Since this function is basically a table, it seems the unit test would largely resemble the implementation.
There was a problem hiding this comment.
Comment would be great, thanks. A test would probably look like this: https://github.com/envoyproxy/envoy/blob/master/test/common/http/codes_test.cc#L94. I agree that test is dubiously useful, but it does add some value. Up to you.
source/common/router/router.cc
Outdated
| } | ||
|
|
||
| if (upstream_host) { | ||
| const uint64_t response_code = Http::Utility::getResponseStatus(info.response_headers_); |
There was a problem hiding this comment.
Http::Utility::getResponseStatus is actually pretty slow. (I regret that it was implemented this way but I haven't gotten around to figuring out a good fix for this). Small perf nit but if there is a way to call this once instead of multiple times that would be cool. If it's too much of a pain that's fine.
| retry_state_->shouldRetry(nullptr, reset_reason, [this]() -> void { doRetry(); }); | ||
| if (retry_status == RetryStatus::Yes && setupRetry(true)) { | ||
| if (upstream_host) { | ||
| upstream_host->stats().rq_error_.inc(); |
There was a problem hiding this comment.
Retries can happen for non-5xx and gRPC status codes. I think you still need a check here most likely for type of code/event?
There was a problem hiding this comment.
We only get to this code if we haven't started the downstream response, which means we haven't received the upstream response and consequently haven't done any charging of response status code. So, we should bill this as an error, regardless of gRPC.
There was a problem hiding this comment.
Oops yes didn't see this is in reset case.
source/common/router/router.cc
Outdated
| void Filter::onUpstreamData(Buffer::Instance& data, bool end_stream) { | ||
| if (end_stream) { | ||
| // gRPC request termination without trailers is an error. | ||
| if (upstream_request_ != nullptr && upstream_request_->grpc_rq_success_deferred_) { |
There was a problem hiding this comment.
can upstream_request_ be null here?
source/common/router/router.cc
Outdated
| } | ||
|
|
||
| void Filter::onUpstreamTrailers(Http::HeaderMapPtr&& trailers) { | ||
| if (upstream_request_ != nullptr && upstream_request_->grpc_rq_success_deferred_) { |
There was a problem hiding this comment.
Can upstream request be null here?
source/common/router/router.cc
Outdated
|
|
||
| const Http::HeaderEntry* content_type = headers.ContentType(); | ||
| grpc_request_ = content_type != nullptr && | ||
| Http::Headers::get().ContentTypeValues.Grpc == content_type->value().c_str(); |
There was a problem hiding this comment.
There are multiple gRPC content types, so I don't think this is sufficient. I think this might actually need to be a prefix match?
| // We need to defer gRPC success until after we have processed | ||
| // grpc-status in the trailers. | ||
| const uint64_t response_code = Http::Utility::getResponseStatus(*headers); | ||
| if (!Http::CodeUtility::is5xx(response_code)) { |
There was a problem hiding this comment.
nit: is it possible to reduce mega nesting here for readability possible by factoring out to a different function?
source/common/router/router.cc
Outdated
| chargeUpstreamCode(code, upstream_host, | ||
| reset_reason.valid() && | ||
| reset_reason.value() == Http::StreamResetReason::Overflow); | ||
| // If we had non-5xx but still have been reset by backend or timeout, we |
There was a problem hiding this comment.
This is only in the timeout_response_code_ case, right? Can you clarify comment?
There was a problem hiding this comment.
Not sure; can't we be reset by the backend without hitting a timeout, e.g. by having a TCP connection drop?
There was a problem hiding this comment.
Sorry yes, I was talking about looking at code below. I think code can only be non-5xx in the alt-timeout-response case.
There was a problem hiding this comment.
A few lines above, code is set to 503 in the non-timeout case.
Signed-off-by: Harvey Tuch <htuch@google.com>
Signed-off-by: Harvey Tuch <htuch@google.com>
Signed-off-by: Harvey Tuch <htuch@google.com>
Signed-off-by: Harvey Tuch <htuch@google.com>
Signed-off-by: Harvey Tuch <htuch@google.com>
mattklein123
left a comment
There was a problem hiding this comment.
looks great, thanks. Few more small comments.
source/common/grpc/common.cc
Outdated
| } | ||
| // Exact match with application/grpc. This and the above case are likely the | ||
| // two most common encountered. | ||
| if (Http::Headers::get().ContentTypeValues.Grpc == content_type->value().c_str()) { |
There was a problem hiding this comment.
perf nit: if you invert this and compare value() to the constant string you can avoid the string creation.
source/common/grpc/common.cc
Outdated
| return true; | ||
| } | ||
| // Prefix match with application/grpc+. It's not sufficient to rely on the an | ||
| // applicatin/grpc prefix match, since there are related content types such as |
source/common/grpc/common.cc
Outdated
| // Prefix match with application/grpc+. It's not sufficient to rely on the an | ||
| // applicatin/grpc prefix match, since there are related content types such as | ||
| // application/grpc-web. | ||
| if (StringUtil::startsWith(content_type->value().c_str(), |
There was a problem hiding this comment.
perf nit: you can avoid string creation here by just checking length and doing direct lookup for '+'
source/common/router/router.cc
Outdated
| reset_reason.valid() && | ||
| reset_reason.value() == Http::StreamResetReason::Overflow); | ||
| // If we had non-5xx but still have been reset by backend or timeout before | ||
| // starting response, we treat this as an error. |
There was a problem hiding this comment.
I think we are talking past each other here. I was tying to say that the only time you actually need the following if check is in the alt-timeout-response-code case, otherwise it will always be 5xx, right? I was just suggesting making the comment explicitly say this because to the casual reader this is a little confusing as to why it's not always 5xx in this path.
| if (retry_state_) { | ||
| RetryStatus retry_status = retry_state_->shouldRetry( | ||
| headers.get(), Optional<Http::StreamResetReason>(), [this]() -> void { doRetry(); }); | ||
| const auto upstream_host = upstream_request_->upstream_host_; |
There was a problem hiding this comment.
Q: Why capture local var here?
There was a problem hiding this comment.
setupRetry in the next line will clear upstream_request_, but we need it in the if body. Will add a comment.
Signed-off-by: Harvey Tuch <htuch@google.com>
|
Looking into ASAN failure.. |
Signed-off-by: Harvey Tuch <htuch@google.com>
This PR adds per-host counters to support UpstreamLocalityStats. At load report time, these will be aggregated on a per-locality basis. Signed-off-by: Harvey Tuch htuch@google.com
Description: Move perf workflow to Engflow's CI and cut build times by 3x Risk Level: Low Testing: See perf workflow Docs Changes: N/A Release Notes: N/A Signed-off-by: Luis Fernando Pino Duque <luis@engflow.com> Signed-off-by: JP Simard <jp@jpsim.com>
Description: Move perf workflow to Engflow's CI and cut build times by 3x Risk Level: Low Testing: See perf workflow Docs Changes: N/A Release Notes: N/A Signed-off-by: Luis Fernando Pino Duque <luis@engflow.com> Signed-off-by: JP Simard <jp@jpsim.com>
**Description** Envoy Gateway skips the weight zero backendRef when constructing localityLbEndoints slices. This starts handling such cases gracefully rather than taking them as errors. **Related Issues/PRs (if applicable)** Close #1664 Supersedes #1694 --------- Signed-off-by: Takeshi Yoneda <t.y.mathetake@gmail.com>
This PR adds per-host counters to support UpstreamLocalityStats. At load report time, these will be
aggregated on a per-locality basis.
Signed-off-by: Harvey Tuch htuch@google.com