Native gRPC health checker implementation #2316
mattklein123 merged 8 commits into envoyproxy:master
Conversation
I'm not sure how timed out
source/common/upstream/health.proto
Outdated
I think this should be sourced from the gRPC repo as an external dependency. I'm going to be looking into adding the gRPC repos as an external dep for Envoy today, so stay tuned.
See #2318. When this merges, you should be able to do something like @grpc//src/proto/grpc/health/v1:health.proto to reference the proto there.
Trying it like this:
--- a/bazel/repositories.bzl
+++ b/bazel/repositories.bzl
@@ -410,3 +410,18 @@ def _com_github_grpc_grpc():
name = "grpc",
actual = "@com_github_grpc_grpc//:grpc++"
)
+ # For grpc proto compilation to work.
+ native.bind(
+ name = "protocol_compiler",
+ actual = "@com_google_protobuf_cc//:protoc",
+ )
+
+ # For grpc proto compilation to work.
+ native.bind(
+ name = "protobuf_clib",
+ actual = "@com_google_protobuf_cc//:protobuf",
+ )
+
+ native.bind(
+ name = "grpc_health_proto",
+ actual = "@com_github_grpc_grpc//src/proto/grpc/health/v1:health_proto",
+ )
--- a/source/common/upstream/BUILD
+++ b/source/common/upstream/BUILD
@@ -89,9 +89,8 @@ envoy_cc_library(
name = "health_checker_lib",
srcs = ["health_checker_impl.cc"],
hdrs = ["health_checker_impl.h"],
- external_deps = ["envoy_health_check"],
+ external_deps = ["envoy_health_check", "grpc_health_proto"],
but it does not work, failing with some strange errors:
bazel build -c dbg --verbose_failures //source/common/upstream:health_checker_lib
<...>
INFO: Analysed target //source/common/upstream:health_checker_lib (0 packages loaded).
INFO: Found 1 target...
ERROR: /private/var/tmp/_bazel_baranovich/b62ff95190308ee02bfc031330094506/external/com_github_grpc_grpc/src/proto/grpc/health/v1/BUILD:21:1: output 'external/com_github_grpc_grpc/src/proto/grpc/health/v1/grpc/src/proto/grpc/health/v1/health.pb.h' was not created
ERROR: /private/var/tmp/_bazel_baranovich/b62ff95190308ee02bfc031330094506/external/com_github_grpc_grpc/src/proto/grpc/health/v1/BUILD:21:1: output 'external/com_github_grpc_grpc/src/proto/grpc/health/v1/grpc/src/proto/grpc/health/v1/health.pb.cc' was not created
ERROR: /private/var/tmp/_bazel_baranovich/b62ff95190308ee02bfc031330094506/external/com_github_grpc_grpc/src/proto/grpc/health/v1/BUILD:21:1: not all outputs were created or valid
Target //source/common/upstream:health_checker_lib failed to build
INFO: Elapsed time: 0,506s, Critical Path: 0,30s
FAILED: Build did NOT complete successfully
Seems like the grpc proto build rules do something strange with the proto output paths...
I wasn't able to build any of the external grpc targets that build protobufs; they fail with the same kind of errors :(
It's hard to know what's going on with just that error message. The diff you provide looks correct. Can you do bazel build -s and attach the protoc invocation on health.proto? This might give more of a clue about what is going on.
SUBCOMMAND: # @com_github_grpc_grpc//src/proto/grpc/health/v1:_health_proto_codegen [action 'SkylarkAction external/com_github_grpc_grpc/src/proto/grpc/health/v1/grpc/src/proto/grpc/health/v1/health.pb.h']
(cd /private/var/tmp/_bazel_baranovich/b62ff95190308ee02bfc031330094506/execroot/envoy && \
exec env - \
bazel-out/host/bin/external/com_google_protobuf_cc/protoc '--cpp_out=:bazel-out/darwin-dbg/genfiles' '-Iexternal/com_github_grpc_grpc/src/proto/grpc/health/v1/health.proto=external/com_github_grpc_grpc/src/proto/grpc/health/v1/health.proto' external/com_github_grpc_grpc/src/proto/grpc/health/v1/health.proto)
Before going into the review, I'd like to understand some design stuff. How come we're doing all the low level gRPC codec and connection management in this HC rather than leveraging the existing gRPC client support in Envoy (https://github.com/envoyproxy/envoy/blob/master/include/envoy/grpc/async_client.h, https://github.com/envoyproxy/envoy/blob/master/source/common/grpc/async_client_impl.h)?
Grpc::AsyncClientImpl was my first thought. Unfortunately the async client is tied to the cluster manager, router, retries, AsyncHttpClient, etc. We don't have (and don't need) all those abstractions in the health checker code - we need to connect directly to a specific endpoint.
The other option is to refactor the connection-establishment part out of the current Grpc::AsyncStreamImpl, leaving it as a dumb protocol decoder, but that requires careful thinking and greatly widens the scope of this pretty hermetic PR, which I would like to avoid; I'd rather work on that tech debt later. IMO, the low-level protocol stuff here is pretty simple.
Got it, I think this is fine then, will take a deeper look in a bit.
FWIW the issue of doing it this way vs. AsyncClient came up when @craffert0 took a quick stab at implementing this a couple of months ago. I think it's definitely possible to refactor AsyncClient so it doesn't depend on cluster manager, but it's non-trivial. I'm in favor of doing it this way and adding a TODO to come back and revisit at a later time.
htuch
left a comment
Thanks for the contribution, here's my first pass of the code (not tests yet).
Nit: slight preference for "HTTP/2" over "http2" here.
Isn't gRPC always 200? If it's non-200, it seems like we shouldn't be expecting grpc-status at all.
This code was taken from grpc_async_client_impl.h. I haven't managed to receive such a thing from a grpc-go server, but maybe it can exist.
This is copied from Grpc::AsyncClientImpl. I wasn't able to get to this codepath from a grpc-go server, but maybe it can happen with more intermediaries en route.
Ah, fair enough. I actually wrote that code, and suspect that it's gratuitous, but let's leave it as is for consistency.
This is a gRPC protocol violation, so maybe be a bit more explicit about that.
Not sure what to do here then... there is no more suitable grpc status. I can add "protocol violation: " to error message.
Yeah, just add to the error message.
Actually, would also rename to expected_reset when snapshotting.
Nit: use braces even for single line if.
How does any of this differ from other cases where we need to handle GOAWAYs and connection closes on Envoy's data or control planes?
The other HTTP/2 cases all go through the cluster manager and Http2::ConnPool, which handles remote GOAWAYs by implementing Http::ConnectionCallbacks and closing the client. I'd better use it here too, thanks for pointing that out.
It got much cleaner after using onGoAway callback
This logic is a bit confusing to me when comparing with Grpc::AsyncClient. Can we use the same close local/remote as done there?
Here I decided not to use local/remote close; the logic is simpler:
- 'local' is always closed: the RPC is unary, and everything is sent by onInterval().
- The end_stream argument is passed to onRpcComplete instead of using a field indicating 'remote close'. Local/remote close logic is not necessary here, since there are no clients trying to send/receive streams of gRPC messages.
- Besides local/remote close, there is http_reset_ in Grpc::AsyncClient, which is used to determine whether Http::AsyncClient::Stream::reset() should be called to recycle the HTTP stream, or whether the stream was already reset by the HTTP stack. expect_reset_ has nearly the same semantics as http_reset_. But unlike Grpc::AsyncClient, which maps a stream reset just to an internal error, the health checker distinguishes between active health-check errors (when the service is in a really bad mood) and network-ish health-check errors produced by a stream reset.

I'm not sure if introducing all the Grpc::AsyncClient state would make the code more readable, but I will gladly add more comments about this flag :)
Fair enough. I think it would be helpful to comment here on why we might have seen an RPC completion, without hitting the end-of-stream, and why this might lead to a following onResetStream() callback. Thanks.
added some comments. PTAL if they are good enough
Yeah, these are good, thanks. Maybe also explain how we ever get into the situation where we are in this branch, i.e. who called onRpcComplete with !end_stream as well.
We're doing a bunch of work above that isn't strictly needed when log level is above debug. Is it possible to condition on log level?
I haven't found how to do that, but I'm quite new to the codebase. Would appreciate any help :)
Signed-off-by: Alexey Baranov <me@kotiki.cc>
Signed-off-by: Alexey Baranov <me@kotiki.cc>
Catching GOAWAYs this way allows not writing a lot of nullptr checks.
Signed-off-by: Alexey Baranov <me@kotiki.cc>
650436c to 6edcca9
Oops, did a rebase :( Promise not to do it anymore.
Signed-off-by: Alexey Baranov <me@kotiki.cc>
6edcca9 to b40cce6
mattklein123
left a comment
Thanks, in general this is extremely well done. Great work.
A few small comments and questions about coverage.
Redis::ConnPool::ClientFactoryImpl::instance_);
case envoy::api::v2::HealthCheck::HealthCheckerCase::kGrpcHealthCheck:
  if (!(cluster.info()->features() & Upstream::ClusterInfo::Features::HTTP2)) {
    throw EnvoyException(fmt::format("{} cluster must support HTTP/2 for gRPC healthchecking",
Please have a test that does EXPECT_THROW_WITH_MESSAGE to catch this case.
struct GrpcActiveHealthCheckSession : public ActiveHealthCheckSession,
                                     public Http::StreamDecoder,
                                     public Http::StreamCallbacks {
  GrpcActiveHealthCheckSession(GrpcHealthCheckerImpl& parent, HostSharedPtr host);
nit: const HostSharedPtr& host (Please fix elsewhere in this file in old code if similar).
// HealthCheckerImplBase
ActiveHealthCheckSessionPtr makeSession(HostSharedPtr host) override {
  return ActiveHealthCheckSessionPtr{new GrpcActiveHealthCheckSession(*this, host)};
nit: std::make_unique (please fix elsewhere in this file for similar code, since we are using C++14 now).
} else {
  grpc_status_message = fmt::format("{}", grpc_status);
}
ENVOY_CONN_LOG(error, "hc grpc_status={} service_status={} health_flags={}", *client_,
I think this should actually be debug level. (Since upstream can error and we don't want to spew logs by default).
fixed. Just forgot to return back after testing
if (!decoder_.decode(data, decoded_frames)) {
  onRpcComplete(Grpc::Status::GrpcStatus::Internal, "gRPC wire protocol decode error", false);
}
for (auto& frame : decoded_frames) {
Do you have a test with the response split across two decodeData calls?
}
}

void GrpcHealthCheckerImpl::GrpcActiveHealthCheckSession::decodeHeaders(
I haven't looked at the coverage results yet in CI, but please check and make sure you have coverage on all error branches in the code you added.
checked, added more coverage
Signed-off-by: Alexey Baranov <me@kotiki.cc>
mattklein123
left a comment
Thanks this LGTM other than sorting out proto build issues. @htuch do you have time to check out the branch and see what might be wrong? If not @jmillikin-stripe do you have a little time?
htuch
left a comment
LGTM as well with some small suggestions on how to improve test readability (which are amazing BTW, that's some thorough testing!).
I'd like to avoid duplicating health.proto from the gRPC dep, so we should figure out why this isn't building. I'd have to poke around at the Bazel build dir and better understand the grpc_proto_library rule to do this. I'm fairly busy this week, but could look EOW if someone doesn't beat me to it.
}

void respondResponseSpec(size_t index, ResponseSpec&& spec) {
  bool trailers_empty = spec.trailers.size() == 0U;
Nit: can you constify the local tmps where they are immutable everywhere in the tests? Thanks.
EXPECT_TRUE(cluster_->prioritySet().getMockHostSet(0)->hosts_[0]->healthy());
}

TEST_F(GrpcHealthCheckerImplTest, SuccessResponseSplitBetweenChunks) {
Generally prefer a single line // comment above each test explaining in plain language what the test is trying to accomplish. E.g. // Validate that a regular response split across chunks is handled.
EXPECT_CALL(*test_sessions_[0]->timeout_timer_, enableTimer(_));
health_checker_->start();

EXPECT_CALL(*test_sessions_[0]->interval_timer_, enableTimer(_));
Can this be factored out into some expect function for healthy()? Seems very verbose to have this enable/disable timer stuff everywhere, which makes it a bit harder to read.
expectSessionCreate();
expectStreamCreate(0);
EXPECT_CALL(*test_sessions_[0]->timeout_timer_, enableTimer(_));
health_checker_->start();
I think there is also some common setup code that could be factored out.
- Factor out some helpers to make test bodies more readable.
- Small explanation comments for each test.
Signed-off-by: Alexey Baranov <me@kotiki.cc>
Thanks for the test cleanups, this is rad. If we can get the health.proto build sorted this LGTM.

@baranov1ch I looked into the health.proto build issue, fix is at https://gist.github.com/htuch/ff95a662d3c5684eb98d005f6bbbdb8e.
Signed-off-by: Alexey Baranov <me@kotiki.cc>
@htuch thanks, it worked. I wish I could understand why :)
Signed-off-by: Alexey Baranov me@kotiki.cc
title: gRPC health checker
Description:
This PR introduces gRPC-based health checking of cluster hosts that implement grpc.health.v1.Health. A new grpc_health_check option is added to the health_checks section of the CDS v2.Cluster config. The user can also set service_name, which will be passed as HealthCheckRequest.service in the Check RPC request payload. The cluster MUST be configured to support HTTP/2 to use gRPC health checking, otherwise an exception is thrown during configuration.
Risk Level: Medium
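For illustration, the new option might be wired up roughly like this (a hedged sketch against the v2 API; cluster name, timings, and service_name are made up):

```yaml
clusters:
- name: some_grpc_service
  connect_timeout: 1s
  type: STRICT_DNS
  # gRPC health checking requires HTTP/2 support on the cluster.
  http2_protocol_options: {}
  health_checks:
  - timeout: 1s
    interval: 5s
    unhealthy_threshold: 2
    healthy_threshold: 2
    grpc_health_check:
      service_name: "my.package.MyService"
```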
Testing:
Rather comprehensive (IMO :)) unit testing for configuration and operation. It was also manually tested against a grpc-go based server to catch a lot of graceful-shutdown edge cases (see the long comments in GrpcHealthCheckerImpl::GrpcActiveHealthCheckSession).
Docs Changes:
envoyproxy/data-plane-api#383
Release Notes:
Added gRPC healthcheck based on grpc.health.v1.Health service.
Fixes #369
API Changes:
envoyproxy/data-plane-api#383