
Native gRPC health checker implementation#2316

Merged
mattklein123 merged 8 commits into envoyproxy:master from baranov1ch:grpc-health-checker
Jan 16, 2018
Conversation

@baranov1ch
Contributor

Signed-off-by: Alexey Baranov me@kotiki.cc

title: gRPC health checker

Description:
This PR introduces gRPC-based health checking of cluster hosts that implement grpc.health.v1.Health. A new grpc_health_check option is added to the health_checks section of the CDS v2 Cluster config. The user can also set service_name, which is passed as HealthCheckRequest.service in the Check RPC request payload. The cluster MUST be configured to support HTTP/2 to use gRPC health checking; otherwise an exception is thrown during configuration.
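For illustration, a rough sketch of what this could look like in a cluster config (field layout inferred from the description above; the data-plane-api change linked under Docs Changes is authoritative):

```yaml
# Sketch only: the field names below are inferred from this description,
# not copied from the authoritative schema.
health_checks:
- timeout: 1s
  interval: 5s
  unhealthy_threshold: 2
  healthy_threshold: 2
  grpc_health_check:
    service_name: "my.package.MyService"  # sent as HealthCheckRequest.service
http2_protocol_options: {}  # the cluster must support HTTP/2, or configuration throws
```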

Risk Level: Medium

Testing:
Rather comprehensive (IMO :)) unit testing of configuration and operation. It was also manually tested against a grpc-go based server to catch a lot of graceful-shutdown edge cases (see the long comments in GrpcHealthCheckerImpl::GrpcActiveHealthCheckSession).

Docs Changes:
envoyproxy/data-plane-api#383

Release Notes:
Added a gRPC health check based on the grpc.health.v1.Health service.

Fixes #369

API Changes:
envoyproxy/data-plane-api#383

@baranov1ch
Contributor Author

I'm not sure how the timed-out //test/integration:access_log_integration_test relates to my changes. Plus, it works on my machine (IWOMM).

Member

I think this should be sourced from the gRPC repo as an external dependency. I'm going to be looking into adding the gRPC repos as an external dep for Envoy today, so stay tuned.

Member

See #2318. When this merges, you should be able to do something like @grpc//src/proto/grpc/health/v1:health.proto to reference the proto there.

Contributor Author

@baranov1ch baranov1ch Jan 9, 2018

Trying it like this:

--- a/bazel/repositories.bzl
+++ b/bazel/repositories.bzl
@@ -410,3 +410,18 @@ def _com_github_grpc_grpc():
       name = "grpc",
       actual = "@com_github_grpc_grpc//:grpc++"
     )
+    # For grpc proto compilation to work.
+    native.bind(
+        name = "protocol_compiler",
+        actual = "@com_google_protobuf_cc//:protoc",
+    )
+
+    # For grpc proto compilation to work.
+    native.bind(
+        name = "protobuf_clib",
+        actual = "@com_google_protobuf_cc//:protobuf",
+    )
+
+    native.bind(
+      name = "grpc_health_proto",
+      actual = "@com_github_grpc_grpc//src/proto/grpc/health/v1:health_proto",
+    )


--- a/source/common/upstream/BUILD
+++ b/source/common/upstream/BUILD
@@ -89,9 +89,8 @@ envoy_cc_library(
     name = "health_checker_lib",
     srcs = ["health_checker_impl.cc"],
     hdrs = ["health_checker_impl.h"],
-    external_deps = ["envoy_health_check"],
+    external_deps = ["envoy_health_check", "grpc_health_proto"],

does not work, and fails with some strange errors:

bazel build -c dbg --verbose_failures //source/common/upstream:health_checker_lib
<...>
INFO: Analysed target //source/common/upstream:health_checker_lib (0 packages loaded).
INFO: Found 1 target...
ERROR: /private/var/tmp/_bazel_baranovich/b62ff95190308ee02bfc031330094506/external/com_github_grpc_grpc/src/proto/grpc/health/v1/BUILD:21:1: output 'external/com_github_grpc_grpc/src/proto/grpc/health/v1/grpc/src/proto/grpc/health/v1/health.pb.h' was not created
ERROR: /private/var/tmp/_bazel_baranovich/b62ff95190308ee02bfc031330094506/external/com_github_grpc_grpc/src/proto/grpc/health/v1/BUILD:21:1: output 'external/com_github_grpc_grpc/src/proto/grpc/health/v1/grpc/src/proto/grpc/health/v1/health.pb.cc' was not created
ERROR: /private/var/tmp/_bazel_baranovich/b62ff95190308ee02bfc031330094506/external/com_github_grpc_grpc/src/proto/grpc/health/v1/BUILD:21:1: not all outputs were created or valid
Target //source/common/upstream:health_checker_lib failed to build
INFO: Elapsed time: 0,506s, Critical Path: 0,30s
FAILED: Build did NOT complete successfully

Seems like the grpc proto build rules do something strange with proto output paths...

I wasn't able to build any of the external grpc targets that build protobufs; they all fail with the same kind of errors :(

Member

It's hard to know what's going on with just that error message. The diff you provide looks correct. Can you do bazel build -s and attach the protoc invocation on health.proto? This might give more of a clue about what is going on.

Contributor Author

SUBCOMMAND: # @com_github_grpc_grpc//src/proto/grpc/health/v1:_health_proto_codegen [action 'SkylarkAction external/com_github_grpc_grpc/src/proto/grpc/health/v1/grpc/src/proto/grpc/health/v1/health.pb.h']
(cd /private/var/tmp/_bazel_baranovich/b62ff95190308ee02bfc031330094506/execroot/envoy && \
  exec env - \
  bazel-out/host/bin/external/com_google_protobuf_cc/protoc '--cpp_out=:bazel-out/darwin-dbg/genfiles' '-Iexternal/com_github_grpc_grpc/src/proto/grpc/health/v1/health.proto=external/com_github_grpc_grpc/src/proto/grpc/health/v1/health.proto' external/com_github_grpc_grpc/src/proto/grpc/health/v1/health.proto)

Member

Before going into the review, I'd like to understand some design stuff. How come we're doing all the low level gRPC codec and connection management in this HC rather than leveraging the existing gRPC client support in Envoy (https://github.com/envoyproxy/envoy/blob/master/include/envoy/grpc/async_client.h, https://github.com/envoyproxy/envoy/blob/master/source/common/grpc/async_client_impl.h)?

Contributor Author

@baranov1ch baranov1ch Jan 5, 2018

Grpc::AsyncClientImpl was my first thought. Unfortunately, the async client is tied to the cluster manager, router, retries, AsyncHttpClient, etc. We don't have (and don't need) all those abstractions in the health checker code; we need to connect directly to a specific endpoint.

The other option is to refactor the connection-establishment part out of the current Grpc::AsyncStreamImpl, leaving it as a dumb protocol decoder, but that requires careful thinking and greatly widens the scope of this pretty hermetic PR, which I would like to avoid; I'd rather work on the tech debt later. IMO, the low-level protocol stuff here is pretty simple.

Member

Got it, I think this is fine then, will take a deeper look in a bit.

Member

FWIW the issue of doing it this way vs. AsyncClient came up when @craffert0 took a quick stab at implementing this a couple of months ago. I think it's definitely possible to refactor AsyncClient so it doesn't depend on cluster manager, but it's non-trivial. I'm in favor of doing it this way and adding a TODO to come back and revisit at a later time.

Member

@htuch htuch left a comment

Thanks for the contribution, here's my first pass of the code (not tests yet).

Member

Nit: slight preference for "HTTP/2" over "http2" here.

Member

Nit: const auto

Member

Isn't gRPC always 200? If it's non-200, it seems like we shouldn't be expecting grpc-status at all.

Contributor Author

This code was taken from grpc_async_client_impl.h. I haven't managed to receive such a thing from a grpc-go server, but maybe it can exist.

Contributor Author

This is copied from Grpc::AsyncClientImpl. I wasn't able to reach this codepath with a grpc-go server, but maybe it can happen with more intermediaries en route.

Member

Ah, fair enough. I actually wrote that code, and suspect that it's gratuitous, but let's leave it as is for consistency.
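For background on this thread: per the gRPC spec, a successful response arrives with HTTP status 200, grpc-status 0 in the trailers, and a HealthCheckResponse whose status field is SERVING. A minimal sketch of that decision (evaluate_health and its parameter names are illustrative, not Envoy's API):

```python
# Sketch of the health decision discussed in this thread; names are illustrative.
# grpc.health.v1.HealthCheckResponse.ServingStatus values per the gRPC health spec.
UNKNOWN, SERVING, NOT_SERVING = 0, 1, 2

def evaluate_health(http_status, grpc_status, serving_status):
    """Return True only for a well-formed, successful Check response."""
    # gRPC responses ride on HTTP 200; a non-200 status means no valid
    # gRPC response was produced (e.g. an intermediary failed the request).
    if http_status != 200:
        return False
    # grpc-status 0 (OK) in the trailers is required for a successful RPC.
    if grpc_status != 0:
        return False
    # Finally, the service itself must report SERVING.
    return serving_status == SERVING

print(evaluate_health(200, 0, SERVING))   # healthy
print(evaluate_health(503, 0, SERVING))   # non-200: unhealthy
print(evaluate_health(200, 14, SERVING))  # grpc-status UNAVAILABLE: unhealthy
```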

Member

This is a gRPC protocol violation, so maybe be a bit more explicit about that.

Contributor Author

Not sure what to do here then... there is no more suitable gRPC status. I can add "protocol violation: " to the error message.

Member

Yeah, just add to the error message.

Member

Nit: const bool

Member

Actually, would also rename to expected_reset when snapshotting.

Member

Nit: use braces even for single line if.

Member

How does any of this differ from other cases where we need to handle GOAWAYs and connection closes on Envoy's data or control planes?

Contributor Author

Other HTTP/2 cases all go through the cluster manager and Http2::ConnPool, which handles remote GOAWAYs by implementing Http::ConnectionCallbacks and closing the client. I'd better use that here too; thanks for pointing it out.

Contributor Author

It got much cleaner after using the onGoAway callback.

Member

Use braces on single line if

Member

This logic is a bit confusing to me when comparing with Grpc::AsyncClient. Can we use the same close local/remote as done there?

Contributor Author

@baranov1ch baranov1ch Jan 9, 2018

Here I decided not to use local/remote close; the logic is simpler:

  • 'local' is always closed: the RPC is unary, and everything is sent by onInterval().
  • The end_stream argument is passed to onRpcComplete instead of using a field indicating 'remote close'. Local/remote close logic is not necessary here, since there are no clients trying to send or receive streams of gRPC messages.
  • Besides local/remote close, Grpc::AsyncClient has http_reset_, which is used to determine whether Http::AsyncClient::Stream::reset() should be called to recycle the HTTP stream, or whether the stream was already reset by the HTTP stack. expect_reset_ has nearly the same semantics as http_reset_. But unlike Grpc::AsyncClient, which maps a stream reset to just an internal error, the health checker distinguishes between active healthcheck errors (when the service is in a really bad mood) and network-ish healthcheck errors produced by a stream reset.

I'm not sure introducing all of the Grpc::AsyncClient state would make the code more readable, but I will gladly add more comments about this flag :)

Member

Fair enough. I think it would be helpful to comment here on why we might have seen an RPC completion, without hitting the end-of-stream, and why this might lead to a following onResetStream() callback. Thanks.

Contributor Author

Added some comments. PTAL if they are good enough.

Member

Yeah, these are good, thanks. Maybe also explain how we ever get into the situation where we are in this branch, i.e. who called onRpcComplete with !end_stream as well.

Contributor Author

done

Member

We're doing a bunch of work above that isn't strictly needed when log level is above debug. Is it possible to condition on log level?

Contributor Author

I haven't found how to do that, but I'm quite new to the codebase. Would appreciate any help :)

Member

Just checked with @mrice32, this doesn't exist today unfortunately. I've opened #2330 to track.

Signed-off-by: Alexey Baranov <me@kotiki.cc>
Signed-off-by: Alexey Baranov <me@kotiki.cc>
Catching GOAWAYs this way avoids having to write a lot of nullptr checks.

Signed-off-by: Alexey Baranov <me@kotiki.cc>
@baranov1ch baranov1ch force-pushed the grpc-health-checker branch from 650436c to 6edcca9 Compare January 9, 2018 20:14
@baranov1ch
Contributor Author

baranov1ch commented Jan 9, 2018

Oops, did a rebase :( Promise not to do it anymore.

Signed-off-by: Alexey Baranov <me@kotiki.cc>
@baranov1ch baranov1ch force-pushed the grpc-health-checker branch from 6edcca9 to b40cce6 Compare January 9, 2018 20:28
Member

@mattklein123 mattklein123 left a comment

Thanks, in general this is extremely well done. Great work.

A few small comments and questions about coverage.

Redis::ConnPool::ClientFactoryImpl::instance_);
case envoy::api::v2::HealthCheck::HealthCheckerCase::kGrpcHealthCheck:
if (!(cluster.info()->features() & Upstream::ClusterInfo::Features::HTTP2)) {
throw EnvoyException(fmt::format("{} cluster must support HTTP/2 for gRPC healthchecking",
Member

Please have a test that does EXPECT_THROW_WITH_MESSAGE to catch this case.

Contributor Author

done

struct GrpcActiveHealthCheckSession : public ActiveHealthCheckSession,
public Http::StreamDecoder,
public Http::StreamCallbacks {
GrpcActiveHealthCheckSession(GrpcHealthCheckerImpl& parent, HostSharedPtr host);
Member

nit: const HostSharedPtr& host (Please fix elsewhere in this file in old code if similar).

Contributor Author

done


// HealthCheckerImplBase
ActiveHealthCheckSessionPtr makeSession(HostSharedPtr host) override {
return ActiveHealthCheckSessionPtr{new GrpcActiveHealthCheckSession(*this, host)};
Member

nit: std::make_unique (please fix elsewhere in this file for similar code, since we are using C++14 now).

Contributor Author

done

} else {
grpc_status_message = fmt::format("{}", grpc_status);
}
ENVOY_CONN_LOG(error, "hc grpc_status={} service_status={} health_flags={}", *client_,
Member

I think this should actually be debug level. (Since upstream can error and we don't want to spew logs by default).

Contributor Author

Fixed. I just forgot to change it back after testing.

if (!decoder_.decode(data, decoded_frames)) {
onRpcComplete(Grpc::Status::GrpcStatus::Internal, "gRPC wire protocol decode error", false);
}
for (auto& frame : decoded_frames) {
Member

Do you have a test with the response split across two decodeData calls?

Contributor Author

added the test
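For background on why this case matters: gRPC messages are length-prefixed (a 1-byte compression flag plus a 4-byte big-endian length), so a frame can easily straddle two data callbacks. A minimal sketch of a decoder that tolerates this (FrameDecoder is an illustrative stand-in, not Envoy's Grpc::Decoder):

```python
# Sketch of gRPC length-prefixed message decoding across split data calls.
import struct

class FrameDecoder:
    def __init__(self):
        self.buffer = b""

    def decode(self, data):
        """Append data and return the list of complete message payloads."""
        self.buffer += data
        frames = []
        # Each gRPC frame: 1-byte compression flag + 4-byte big-endian length.
        while len(self.buffer) >= 5:
            _flag, length = struct.unpack(">BI", self.buffer[:5])
            if len(self.buffer) < 5 + length:
                break  # incomplete frame; wait for the next data call
            frames.append(self.buffer[5:5 + length])
            self.buffer = self.buffer[5 + length:]
        return frames

# A serialized HealthCheckResponse{status: SERVING} is 2 bytes: b"\x08\x01".
wire = b"\x00" + struct.pack(">I", 2) + b"\x08\x01"
d = FrameDecoder()
print(d.decode(wire[:3]))  # [] -- only part of the frame has arrived
print(d.decode(wire[3:]))  # [b'\x08\x01'] -- frame completes on the second call
```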

}
}

void GrpcHealthCheckerImpl::GrpcActiveHealthCheckSession::decodeHeaders(
Member

I haven't looked at the coverage results yet in CI, but please check and make sure you have coverage on all error branches in the code you added.

Contributor Author

checked, added more coverage

Signed-off-by: Alexey Baranov <me@kotiki.cc>
Member

@mattklein123 mattklein123 left a comment

Thanks this LGTM other than sorting out proto build issues. @htuch do you have time to check out the branch and see what might be wrong? If not @jmillikin-stripe do you have a little time?

Member

@htuch htuch left a comment

LGTM as well with some small suggestions on how to improve test readability (which are amazing BTW, that's some thorough testing!).

I'd like to avoid duplicating health.proto from the gRPC dep, so we should figure out why this isn't building. I'd have to poke around at the Bazel build dir and better understand the grpc_proto_library rule to do this. I'm fairly busy this week, but could look EOW if someone doesn't beat me to it.

}

void respondResponseSpec(size_t index, ResponseSpec&& spec) {
bool trailers_empty = spec.trailers.size() == 0U;
Member

Nit: can you constify the local tmps where they are immutable everywhere in the tests? Thanks.

Contributor Author

done

EXPECT_TRUE(cluster_->prioritySet().getMockHostSet(0)->hosts_[0]->healthy());
}

TEST_F(GrpcHealthCheckerImplTest, SuccessResponseSplitBetweenChunks) {
Member

Generally prefer a single line // comment above each test explaining in plain language what the test is trying to accomplish. E.g. // Validate that a regular response split across chunks is handled.

Contributor Author

done

EXPECT_CALL(*test_sessions_[0]->timeout_timer_, enableTimer(_));
health_checker_->start();

EXPECT_CALL(*test_sessions_[0]->interval_timer_, enableTimer(_));
Member

Can this be factored out into some expect function for healthy()? Seems very verbose to have this enable/disable timer stuff everywhere, which makes it a bit harder to read.

Contributor Author

done.

expectSessionCreate();
expectStreamCreate(0);
EXPECT_CALL(*test_sessions_[0]->timeout_timer_, enableTimer(_));
health_checker_->start();
Member

I think there is also some common setup code that could be factored out.

Contributor Author

done

- Factor out some helpers to make test bodies more readable.
- Small explanation comments for each test

Signed-off-by: Alexey Baranov <me@kotiki.cc>
@htuch
Member

htuch commented Jan 12, 2018

Thanks for the test cleanups, this is rad. If we can get the health.proto build sorted this LGTM.

@htuch
Member

htuch commented Jan 15, 2018

@baranov1ch I looked into the health.proto build issue, fix is at https://gist.github.com/htuch/ff95a662d3c5684eb98d005f6bbbdb8e.

Signed-off-by: Alexey Baranov <me@kotiki.cc>
@baranov1ch
Contributor Author

@htuch thanks, it worked. I wish I could understand why :)

@mattklein123 mattklein123 merged commit bc67c6e into envoyproxy:master Jan 16, 2018