host: track per-host UpstreamLocalityStats. by htuch · Pull Request #1755 · envoyproxy/envoy

htuch · 2017-09-27T14:59:41Z

This PR adds per-host counters to support UpstreamLocalityStats. At load report time, these will be
aggregated on a per-locality basis.

Signed-off-by: Harvey Tuch htuch@google.com

No tests yet; looking for feedback on whether we're semantically doing the right thing by using the response code to drive this.
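As a rough illustration of the per-locality aggregation described above, here is a minimal sketch. The `HostStats`, `Host`, and `aggregateByLocality` names are hypothetical and are not Envoy's actual types; the counters loosely mirror `rq_dropped_` and `rq_error_` from the diff, and `rq_success` is an assumed counterpart.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical per-host counters; not Envoy's actual stats types.
struct HostStats {
  uint64_t rq_success{0};
  uint64_t rq_error{0};
  uint64_t rq_dropped{0};
};

// Hypothetical host record: a locality label plus that host's counters.
struct Host {
  std::string locality;
  HostStats stats;
};

// At load report time, sum the per-host counters into one entry per locality.
std::map<std::string, HostStats> aggregateByLocality(const std::vector<Host>& hosts) {
  std::map<std::string, HostStats> totals;
  for (const Host& host : hosts) {
    HostStats& total = totals[host.locality]; // value-initialized on first use
    total.rq_success += host.stats.rq_success;
    total.rq_error += host.stats.rq_error;
    total.rq_dropped += host.stats.rq_dropped;
  }
  return totals;
}
```

Keeping the counters on the host means the request path only touches one host's counters, and the grouping cost is paid once per report.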

htuch · 2017-09-27T15:00:46Z

@mattklein123 @rohitbhoj for feedback on the approach before going too wild with writing tests.

mattklein123

In general looks good, couple of comments.

mattklein123 · 2017-09-27T16:49:53Z

source/common/router/router.cc

+      const uint64_t response_code = Http::Utility::getResponseStatus(info.response_headers_);
+      if (dropped) {
+        upstream_host->stats().rq_dropped_.inc();
+      } else if (response_code == 200) {


If we are going by normal HTTP semantics, this should be !Http::CodeUtility::is5xx IMO
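To make the suggestion concrete, a minimal sketch of the classification might look like the following. `Outcome`, `classify`, and the local `is5xx` helper are hypothetical stand-ins written for this example; the real check would go through `Http::CodeUtility::is5xx`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical outcome buckets for the per-host counters.
enum class Outcome { Dropped, Success, Error };

// Local stand-in for Http::CodeUtility::is5xx (assumed to mean 500..599).
inline bool is5xx(uint64_t code) { return code >= 500 && code < 600; }

// Dropped requests are charged first; otherwise anything that is not a 5xx
// counts as a success, rather than only an exact 200.
inline Outcome classify(bool dropped, uint64_t response_code) {
  if (dropped) {
    return Outcome::Dropped;
  }
  return is5xx(response_code) ? Outcome::Error : Outcome::Success;
}
```

Under this rule a 404 or a 302 is billed as a success for load reporting purposes, since the upstream host answered the request.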

mattklein123 · 2017-09-27T16:50:30Z

source/common/router/router.cc

+      if (dropped) {
+        upstream_host->stats().rq_dropped_.inc();
+      } else if (response_code == 200) {
+        // TODO(htuch): When Envoy has first class support for gRPC on the data plane, we should


We are already doing gRPC retry in the router using gRPC status codes, so I think we can just skip straight to doing this if we want.

htuch · 2017-09-27T22:11:05Z

@mattklein123 I've added gRPC status code support. Still no tests though. Implementation feedback welcome while I work on some of these.

htuch · 2017-09-28T21:43:04Z

@mattklein123 this is now complete and ready for review.

mattklein123

Thanks generally looks good. Thanks for the really detailed tests. A few comments.

mattklein123 · 2017-09-29T02:52:06Z

source/common/grpc/common.cc

  }
 }

+uint64_t Common::grpcToHttpStatus(Status::GrpcStatus grpc_status) {


Can we get explicit test coverage of this function? Also, it would be good if @lizan or @fengli79 could review this part.

This actually comes straight from an internal doc on the canonical gRPC -> HTTP mapping. In terms of public documentation, there are HTTP -> gRPC mappings at https://github.com/grpc/grpc/blob/master/doc/http-grpc-status-mapping.md, but not the other way.

Digging around, it looks like this mapping is reflected in a table at https://cloud.google.com/apis/design/errors#handling_errors.

I will add a comment pointing at that. In terms of test, what are you after? Since this function is basically a table, it seems the unit test would largely resemble the implementation.

Comment would be great, thanks. A test would probably look like this: https://github.com/envoyproxy/envoy/blob/master/test/common/http/codes_test.cc#L94. I agree that test is dubiously useful, but it does add some value. Up to you.

Mapping LGTM
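For readers following along, a table-style sketch of such a mapping is shown below. It follows the table in the Google API design guide linked above and is not necessarily identical to what was merged in this PR; the `GrpcStatus` enum and `grpcToHttpStatusSketch` name are defined here for illustration only, with numeric values taken from the standard gRPC status codes.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative enum using the standard gRPC status code values.
enum class GrpcStatus : int {
  Ok = 0,
  Canceled = 1,
  Unknown = 2,
  InvalidArgument = 3,
  DeadlineExceeded = 4,
  NotFound = 5,
  AlreadyExists = 6,
  PermissionDenied = 7,
  ResourceExhausted = 8,
  FailedPrecondition = 9,
  Aborted = 10,
  OutOfRange = 11,
  Unimplemented = 12,
  Internal = 13,
  Unavailable = 14,
  DataLoss = 15,
  Unauthenticated = 16,
};

// Sketch of a gRPC -> HTTP translation based on the API design guide table.
inline uint64_t grpcToHttpStatusSketch(GrpcStatus status) {
  switch (status) {
  case GrpcStatus::Ok: return 200;
  case GrpcStatus::Canceled: return 499;
  case GrpcStatus::Unknown: return 500;
  case GrpcStatus::InvalidArgument: return 400;
  case GrpcStatus::DeadlineExceeded: return 504;
  case GrpcStatus::NotFound: return 404;
  case GrpcStatus::AlreadyExists: return 409;
  case GrpcStatus::PermissionDenied: return 403;
  case GrpcStatus::ResourceExhausted: return 429;
  case GrpcStatus::FailedPrecondition: return 400;
  case GrpcStatus::Aborted: return 409;
  case GrpcStatus::OutOfRange: return 400;
  case GrpcStatus::Unimplemented: return 501;
  case GrpcStatus::Internal: return 500;
  case GrpcStatus::Unavailable: return 503;
  case GrpcStatus::DataLoss: return 500;
  case GrpcStatus::Unauthenticated: return 401;
  }
  return 500; // Assumption: treat out-of-range values as a server error.
}
```

A unit test for a function like this ends up restating the table, which is the trade-off discussed above; its main value is catching accidental edits.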

mattklein123 · 2017-09-29T02:55:37Z

source/common/router/router.cc

    }
+
+    if (upstream_host) {
+      const uint64_t response_code = Http::Utility::getResponseStatus(info.response_headers_);


Http::Utility::getResponseStatus is actually pretty slow. (I regret that it was implemented this way but I haven't gotten around to figuring out a good fix for this). Small perf nit but if there is a way to call this once instead of multiple times that would be cool. If it's too much of a pain that's fine.

mattklein123 · 2017-09-29T02:58:04Z

source/common/router/router.cc

        retry_state_->shouldRetry(nullptr, reset_reason, [this]() -> void { doRetry(); });
    if (retry_status == RetryStatus::Yes && setupRetry(true)) {
+      if (upstream_host) {
+        upstream_host->stats().rq_error_.inc();


Retries can happen for non-5xx and gRPC status codes. I think you still need a check here most likely for type of code/event?

We only get to this code if we haven't started the downstream response, which means we haven't received the upstream response and consequently haven't done any charging of response status code. So, we should bill this as an error, regardless of gRPC.

Oops yes didn't see this is in reset case.

mattklein123 · 2017-09-29T02:59:55Z

source/common/router/router.cc

 void Filter::onUpstreamData(Buffer::Instance& data, bool end_stream) {
  if (end_stream) {
+    // gRPC request termination without trailers is an error.
+    if (upstream_request_ != nullptr && upstream_request_->grpc_rq_success_deferred_) {


can upstream_request_ be null here?

mattklein123 · 2017-09-29T03:00:06Z

source/common/router/router.cc

 }

 void Filter::onUpstreamTrailers(Http::HeaderMapPtr&& trailers) {
+  if (upstream_request_ != nullptr && upstream_request_->grpc_rq_success_deferred_) {


Can upstream request be null here?

mattklein123 · 2017-09-29T03:02:32Z

source/common/router/router.cc


+  const Http::HeaderEntry* content_type = headers.ContentType();
+  grpc_request_ = content_type != nullptr &&
+                  Http::Headers::get().ContentTypeValues.Grpc == content_type->value().c_str();


There are multiple gRPC content types, so I don't think this is sufficient. I think this might actually need to be a prefix match?

mattklein123 · 2017-09-29T03:03:32Z

source/common/router/router.cc

+  // We need to defer gRPC success until after we have processed
+  // grpc-status in the trailers.
+  const uint64_t response_code = Http::Utility::getResponseStatus(*headers);
+  if (!Http::CodeUtility::is5xx(response_code)) {


nit: is it possible to reduce the mega nesting here for readability, possibly by factoring it out into a different function?
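One way to picture the deferred accounting with the logic factored into small functions is sketched below. `RequestAccounting` and its methods are hypothetical and greatly simplified compared with the router filter: the sketch ignores retries, resets, and responses that carry grpc-status in the headers, and it assumes the grpc-status has already been translated to an HTTP code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of the deferred accounting; not Envoy's router code.
struct RequestAccounting {
  uint64_t rq_success{0};
  uint64_t rq_error{0};
  bool grpc_rq_success_deferred{false};

  static bool is5xx(uint64_t code) { return code >= 500 && code < 600; }

  // Response headers arrived. A non-5xx gRPC response is not charged yet,
  // because the real outcome is only known once grpc-status is seen.
  void onHeaders(uint64_t response_code, bool grpc_request) {
    if (is5xx(response_code)) {
      ++rq_error;
    } else if (grpc_request) {
      grpc_rq_success_deferred = true;
    } else {
      ++rq_success;
    }
  }

  // Trailers arrived; mapped_status is grpc-status translated to an HTTP code.
  void onTrailers(uint64_t mapped_status) {
    if (!grpc_rq_success_deferred) {
      return;
    }
    grpc_rq_success_deferred = false;
    if (is5xx(mapped_status)) {
      ++rq_error;
    } else {
      ++rq_success;
    }
  }

  // The stream ended on a data frame, so no trailers will follow.
  void onEndStreamWithoutTrailers() {
    if (!grpc_rq_success_deferred) {
      return;
    }
    grpc_rq_success_deferred = false;
    ++rq_error; // gRPC termination without trailers is an error.
  }
};
```

Early returns keep each handler flat, which is one way to address the nesting concern without changing what gets charged.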

mattklein123 · 2017-09-29T03:05:58Z

source/common/router/router.cc

+    chargeUpstreamCode(code, upstream_host,
+                       reset_reason.valid() &&
+                           reset_reason.value() == Http::StreamResetReason::Overflow);
+    // If we had non-5xx but still have been reset by backend or timeout, we


This is only in the timeout_response_code_ case, right? Can you clarify comment?

Not sure; can't we be reset by the backend without hitting a timeout, e.g. by having a TCP connection drop?

Sorry yes, I was talking about looking at code below. I think code can only be non-5xx in the alt-timeout-response case.

A few lines above, code is set to 503 in the non-timeout case.

mattklein123

looks great, thanks. Few more small comments.

mattklein123 · 2017-09-29T16:28:06Z

source/common/grpc/common.cc

+  }
+  // Exact match with application/grpc. This and the above case are likely the
+  // two most common encountered.
+  if (Http::Headers::get().ContentTypeValues.Grpc == content_type->value().c_str()) {


perf nit: if you invert this and compare value() to the constant string you can avoid the string creation.

mattklein123 · 2017-09-29T16:28:33Z

source/common/grpc/common.cc

+    return true;
+  }
+  // Prefix match with application/grpc+. It's not sufficient to rely on the an
+  // applicatin/grpc prefix match, since there are related content types such as


typo applicatin

mattklein123 · 2017-09-29T16:29:20Z

source/common/grpc/common.cc

+  // Prefix match with application/grpc+. It's not sufficient to rely on the an
+  // applicatin/grpc prefix match, since there are related content types such as
+  // application/grpc-web.
+  if (StringUtil::startsWith(content_type->value().c_str(),


perf nit: you can avoid string creation here by just checking length and doing direct lookup for '+'
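A minimal sketch of the allocation-free check being suggested might look like this. `isGrpcContentType` here is a standalone illustration over a C string, not the function in `source/common/grpc/common.cc`, and it assumes the value is NUL-terminated.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Returns true for "application/grpc" exactly, or for any value that starts
// with "application/grpc+". Related types such as "application/grpc-web" are
// rejected because the character after the prefix is not '+'.
inline bool isGrpcContentType(const char* content_type) {
  static const char kGrpc[] = "application/grpc";
  const std::size_t prefix_len = sizeof(kGrpc) - 1; // exclude the trailing NUL
  const std::size_t len = std::strlen(content_type);
  if (len < prefix_len || std::memcmp(content_type, kGrpc, prefix_len) != 0) {
    return false;
  }
  // Either an exact match, or the next character must be '+'.
  return len == prefix_len || content_type[prefix_len] == '+';
}
```

Comparing bytes in place and then inspecting a single character avoids building a temporary `std::string` on the request path.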

mattklein123 · 2017-09-29T16:33:12Z

source/common/router/router.cc

+                       reset_reason.valid() &&
+                           reset_reason.value() == Http::StreamResetReason::Overflow);
+    // If we had non-5xx but still have been reset by backend or timeout before
+    // starting response, we treat this as an error.


I think we are talking past each other here. I was trying to say that the only time you actually need the following if check is in the alt-timeout-response-code case; otherwise it will always be 5xx, right? I was just suggesting making the comment say this explicitly, because to the casual reader it is a little confusing why the code is not always 5xx in this path.

mattklein123 · 2017-09-29T16:34:18Z

source/common/router/router.cc

  if (retry_state_) {
    RetryStatus retry_status = retry_state_->shouldRetry(
        headers.get(), Optional<Http::StreamResetReason>(), [this]() -> void { doRetry(); });
+    const auto upstream_host = upstream_request_->upstream_host_;


Q: Why capture local var here?

setupRetry in the next line will clear upstream_request_, but we need it in the if body. Will add a comment.
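The lifetime issue can be sketched in isolation as follows. The `Host`, `UpstreamRequest`, and `Filter` types below are hypothetical miniatures written for this example; they only model the fact that setting up a retry clears the current upstream request.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>

// Hypothetical miniature types; not Envoy's classes.
struct Host {
  uint64_t rq_error{0};
};

struct UpstreamRequest {
  std::shared_ptr<Host> upstream_host;
};

struct Filter {
  std::unique_ptr<UpstreamRequest> upstream_request;

  // Models the relevant side effect: the current upstream request is cleared.
  bool setupRetry() {
    upstream_request.reset();
    return true;
  }

  void onRetryableFailure() {
    // Copy the host pointer first; after setupRetry() the request is gone and
    // dereferencing upstream_request would be undefined behavior.
    const auto upstream_host = upstream_request->upstream_host;
    if (setupRetry()) {
      if (upstream_host) {
        ++upstream_host->rq_error;
      }
    }
  }
};
```

Copying the `shared_ptr` keeps the host alive for the duration of the `if` body even though the request that referenced it has been destroyed.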

htuch · 2017-09-29T17:45:32Z

Looking into ASAN failure..

mattklein123 reviewed Sep 27, 2017

Review feedback and support for gRPC status codes.

c453987

Signed-off-by: Harvey Tuch <htuch@google.com>

htuch changed the title ~~[RFC] host: track per-host UpstreamLocalityStats.~~ host: track per-host UpstreamLocalityStats. Sep 27, 2017

htuch added 3 commits September 28, 2017 15:29

wip

b75fd3d

Signed-off-by: Harvey Tuch <htuch@google.com>

Merge remote-tracking branch 'upstream/master' into host-load-stats

d246f1c

Unit tests and various fixes encountered while writing tests.

2596f30

Signed-off-by: Harvey Tuch <htuch@google.com>

mattklein123 reviewed Sep 29, 2017

htuch added 6 commits September 29, 2017 06:56

Merge remote-tracking branch 'upstream/master' into host-load-stats

aa95a2e

Comment on gRPC -> HTTP canonical status code mapping.

c5f72a7

Signed-off-by: Harvey Tuch <htuch@google.com>

Optimize Http::Utility::getResponseStatus() on fast path.

f794e18

Signed-off-by: Harvey Tuch <htuch@google.com>

Handle application/grpc[+] prefix content-types.

bc200df

Signed-off-by: Harvey Tuch <htuch@google.com>

Misc. review feedback.

03fda54

Signed-off-by: Harvey Tuch <htuch@google.com>

Unit test for gRPC -> HTTP status code translation.

f8c3ecd

Signed-off-by: Harvey Tuch <htuch@google.com>

mattklein123 reviewed Sep 29, 2017

Further review feedback.

f7eff11

Signed-off-by: Harvey Tuch <htuch@google.com>

Remove undefined behavior in //test/common/grpc:common_test.

3d9fae9

Signed-off-by: Harvey Tuch <htuch@google.com>

mattklein123 approved these changes Sep 29, 2017

htuch merged commit 95b6cfe into envoyproxy:master Sep 29, 2017

htuch deleted the host-load-stats branch September 29, 2017 18:19
