Skip to content

host: track per-host UpstreamLocalityStats.#1755

Merged
htuch merged 13 commits intoenvoyproxy:masterfrom
htuch:host-load-stats
Sep 29, 2017
Merged

host: track per-host UpstreamLocalityStats.#1755
htuch merged 13 commits intoenvoyproxy:masterfrom
htuch:host-load-stats

Conversation

@htuch
Copy link
Copy Markdown
Member

@htuch htuch commented Sep 27, 2017

This PR adds per-host counters to support UpstreamLocalityStats. At load report time, these will be
aggregated on a per-locality basis.

Signed-off-by: Harvey Tuch htuch@google.com

This PR adds per-host counters to support UpstreamLocalityStats. At load report time, these will be
aggregated on a per-locality basis.

No tests yet, looking for feedback on whether semantically we're doing the right thing by using
response code to drive this.

Signed-off-by: Harvey Tuch <htuch@google.com>
@htuch
Copy link
Copy Markdown
Member Author

htuch commented Sep 27, 2017

@mattklein123 @rohitbhoj for feedback on the approach before going too wild with writing tests.

Copy link
Copy Markdown
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general looks good, couple of comments.

const uint64_t response_code = Http::Utility::getResponseStatus(info.response_headers_);
if (dropped) {
upstream_host->stats().rq_dropped_.inc();
} else if (response_code == 200) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are going by normal HTTP semantics, this should be !Http::CodeUtility::is5xx IMO

if (dropped) {
upstream_host->stats().rq_dropped_.inc();
} else if (response_code == 200) {
// TODO(htuch): When Envoy has first class support for gRPC on the data plane, we should
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are already doing gRPC retry in the router using gRPC status codes, so I think we can just skip straight to doing this if we want.

Signed-off-by: Harvey Tuch <htuch@google.com>
@htuch htuch changed the title [RFC] host: track per-host UpstreamLocalityStats. host: track per-host UpstreamLocalityStats. Sep 27, 2017
@htuch
Copy link
Copy Markdown
Member Author

htuch commented Sep 27, 2017

@mattklein123 I've added gRPC status code support. Still no tests though. Implementation feedback welcome while I work on some of these.

@htuch
Copy link
Copy Markdown
Member Author

htuch commented Sep 28, 2017

@mattklein123 this is now complete and ready for review.

Copy link
Copy Markdown
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks generally looks good. Thanks for the really detailed tests. A few comments.

}
}

uint64_t Common::grpcToHttpStatus(Status::GrpcStatus grpc_status) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we get explicit test coverage of this function? Also it would good if @lizan or @fengli79 could review this part.

Copy link
Copy Markdown
Member Author

@htuch htuch Sep 29, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This actually comes straight from an internal doc on the canonical gRPC -> HTTP mapping. In terms of public documentation, there are HTTP -> gRPC mappings at https://github.com/grpc/grpc/blob/master/doc/http-grpc-status-mapping.md, but not the other way.

Digging around, it looks like this mapping is reflected in a table at https://cloud.google.com/apis/design/errors#handling_errors.

I will add a comment pointing at that. In terms of test, what are you after? Since this function is basically a table, it seems the unit test would largely resemble the implementation.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment would be great, thanks. A test would probably look like this: https://github.com/envoyproxy/envoy/blob/master/test/common/http/codes_test.cc#L94. I agree that test is dubiously useful, but it does add some value. Up to you.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mapping LGTM

}

if (upstream_host) {
const uint64_t response_code = Http::Utility::getResponseStatus(info.response_headers_);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Http::Utility::getResponseStatus is actually pretty slow. (I regret that it was implemented this way but I haven't gotten around to figuring out a good fix for this). Small perf nit but if there is a way to call this once instead of multiple times that would be cool. If it's too much of a pain that's fine.

retry_state_->shouldRetry(nullptr, reset_reason, [this]() -> void { doRetry(); });
if (retry_status == RetryStatus::Yes && setupRetry(true)) {
if (upstream_host) {
upstream_host->stats().rq_error_.inc();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Retries can happen for non-5xx and gRPC status codes. I think you still need a check here most likely for type of code/event?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We only get to this code if we haven't started the downstream response, which means we haven't received the upstream response and consequently haven't done any charging of response status code. So, we should bill this as an error, regardless of gRPC.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops yes didn't see this is in reset case.

void Filter::onUpstreamData(Buffer::Instance& data, bool end_stream) {
if (end_stream) {
// gRPC request termination without trailers is an error.
if (upstream_request_ != nullptr && upstream_request_->grpc_rq_success_deferred_) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can upstream_request_ be null here?

}

void Filter::onUpstreamTrailers(Http::HeaderMapPtr&& trailers) {
if (upstream_request_ != nullptr && upstream_request_->grpc_rq_success_deferred_) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can upstream request be null here?


const Http::HeaderEntry* content_type = headers.ContentType();
grpc_request_ = content_type != nullptr &&
Http::Headers::get().ContentTypeValues.Grpc == content_type->value().c_str();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are multiple gRPC content types, so I don't think this is sufficient. I think this might actually need to be a prefix match?

// We need to defer gRPC success until after we have processed
// grpc-status in the trailers.
const uint64_t response_code = Http::Utility::getResponseStatus(*headers);
if (!Http::CodeUtility::is5xx(response_code)) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: is it possible to reduce mega nesting here for readability possible by factoring out to a different function?

chargeUpstreamCode(code, upstream_host,
reset_reason.valid() &&
reset_reason.value() == Http::StreamResetReason::Overflow);
// If we had non-5xx but still have been reset by backend or timeout, we
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is only in the timeout_response_code_ case, right? Can you clarify comment?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure; can't we be reset by the backend without hitting a timeout, e.g. by having a TCP connection drop?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry yes, I was talking about looking at code below. I think code can only be non-5xx in the alt-timeout-response case.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few lines above, code is set to 503 in the non-timeout case.

Signed-off-by: Harvey Tuch <htuch@google.com>
Signed-off-by: Harvey Tuch <htuch@google.com>
Signed-off-by: Harvey Tuch <htuch@google.com>
Signed-off-by: Harvey Tuch <htuch@google.com>
Signed-off-by: Harvey Tuch <htuch@google.com>
Copy link
Copy Markdown
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks great, thanks. Few more small comments.

}
// Exact match with application/grpc. This and the above case are likely the
// two most common encountered.
if (Http::Headers::get().ContentTypeValues.Grpc == content_type->value().c_str()) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perf nit: if you invert this and compare value() to the constant string you can avoid the string creation.

return true;
}
// Prefix match with application/grpc+. It's not sufficient to rely on the an
// applicatin/grpc prefix match, since there are related content types such as
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo applicatin

// Prefix match with application/grpc+. It's not sufficient to rely on the an
// applicatin/grpc prefix match, since there are related content types such as
// application/grpc-web.
if (StringUtil::startsWith(content_type->value().c_str(),
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perf nit: you can avoid string creation here by just checking length and doing direct lookup for '+'

reset_reason.valid() &&
reset_reason.value() == Http::StreamResetReason::Overflow);
// If we had non-5xx but still have been reset by backend or timeout before
// starting response, we treat this as an error.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we are talking past each other here. I was tying to say that the only time you actually need the following if check is in the alt-timeout-response-code case, otherwise it will always be 5xx, right? I was just suggesting making the comment explicitly say this because to the casual reader this is a little confusing as to why it's not always 5xx in this path.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it.

if (retry_state_) {
RetryStatus retry_status = retry_state_->shouldRetry(
headers.get(), Optional<Http::StreamResetReason>(), [this]() -> void { doRetry(); });
const auto upstream_host = upstream_request_->upstream_host_;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q: Why capture local var here?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

setupRetry in the next line will clear upstream_request_, but we need it in the if body. Will add a comment.

Signed-off-by: Harvey Tuch <htuch@google.com>
@htuch
Copy link
Copy Markdown
Member Author

htuch commented Sep 29, 2017

Looking into ASAN failure..

Signed-off-by: Harvey Tuch <htuch@google.com>
@htuch htuch merged commit 95b6cfe into envoyproxy:master Sep 29, 2017
@htuch htuch deleted the host-load-stats branch September 29, 2017 18:19
costinm pushed a commit to costinm/envoy that referenced this pull request Oct 2, 2017
This PR adds per-host counters to support UpstreamLocalityStats. At load report time, these will be
aggregated on a per-locality basis.

Signed-off-by: Harvey Tuch htuch@google.com
jpsim pushed a commit that referenced this pull request Nov 28, 2022
Description: Move perf workflow to Engflow's CI and cut build times by 3x
Risk Level: Low
Testing: See perf workflow
Docs Changes: N/A
Release Notes: N/A

Signed-off-by: Luis Fernando Pino Duque <luis@engflow.com>
Signed-off-by: JP Simard <jp@jpsim.com>
jpsim pushed a commit that referenced this pull request Nov 29, 2022
Description: Move perf workflow to Engflow's CI and cut build times by 3x
Risk Level: Low
Testing: See perf workflow
Docs Changes: N/A
Release Notes: N/A

Signed-off-by: Luis Fernando Pino Duque <luis@engflow.com>
Signed-off-by: JP Simard <jp@jpsim.com>
mathetake added a commit that referenced this pull request Mar 3, 2026
**Description**

Envoy Gateway skips the weight zero backendRef when constructing
localityLbEndoints slices. This starts handling such cases gracefully
rather than taking them as errors.

**Related Issues/PRs (if applicable)**
Close #1664
Supersedes #1694

---------

Signed-off-by: Takeshi Yoneda <t.y.mathetake@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants