Conversation
Signed-off-by: Lilika Markatou <lilika@google.com>
Flaky test Signed-off-by: Lilika Markatou <lilika@google.com>
@htuch @alyssawilk who do you want to take a first pass on this?

@mattklein123 will do; I've already done some review feedback in a private PR, so this is close to final shape, but will give another pass.

@htuch OK let me know when you are done and I can take a final pass.
Signed-off-by: Lilika Markatou <lilika@google.com>
health_check_request_.mutable_node()->MergeFrom(node);
retry_timer_ = dispatcher.createTimer([this]() -> void { establishNewStream(); });
response_timer_ = dispatcher.createTimer([this]() -> void { sendHealthCheckRequest(); });
server_response_timer_ = dispatcher.createTimer([this]() -> void { sendResponse(); });
I would rename these two timers to hds_retry_timer_ and hds_stream_response_timer_ respectively to make it clearer what they do.
envoy::service::discovery::v2::HealthCheckRequestOrEndpointHealthResponse
HdsDelegate::sendResponse() {
  envoy::service::discovery::v2::HealthCheckRequestOrEndpointHealthResponse response;
  for (auto& cluster : hds_clusters_) {
    for (const auto& hosts : cluster->prioritySet().hostSetsPerPriority()) {
      for (const auto& host : hosts->hosts()) {
        auto* endpoint = response.mutable_endpoint_health_response()->add_endpoints_health();
        endpoint->mutable_endpoint()->mutable_address()->mutable_socket_address()->set_address(
You might want to use https://github.com/envoyproxy/envoy/blob/master/source/common/network/utility.h#L208 here.
ENVOY_LOG(debug, "New health check response message {} ", message->DebugString());
ASSERT(message);

for (auto& cluster_health_check : message->health_check()) {
  // Create HdsCluster config
  envoy::api::v2::core::BindConfig bind_config;
info_factory_));

for (auto& health_check : cluster_config.health_checks()) {
  health_checkers_.push_back(
Arguably, HdsCluster could be responsible for owning and maintaining health_checkers_. Any thoughts here?
ENVOY_LOG(debug, "New health check response message {} ", message->DebugString());

// Process the HealthCheckSpecifier message
processMessage(std::move(message));
We definitely should at least reset hds_clusters_ between messages, otherwise it will just keep growing on each update forever.
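A minimal sketch of the reset-before-rebuild pattern this comment asks for. The types and names here (FakeHdsDelegate, FakeHdsCluster) are illustrative stand-ins, not the actual Envoy classes:

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Stand-in for Envoy's HdsCluster; the real class is far richer.
struct FakeHdsCluster {
  explicit FakeHdsCluster(std::string n) : name(std::move(n)) {}
  std::string name;
};

class FakeHdsDelegate {
public:
  // Rebuild the cluster list on every HealthCheckSpecifier message instead
  // of appending, so repeated updates do not grow the list forever.
  void processMessage(const std::vector<std::string>& cluster_names) {
    hds_clusters_.clear(); // reset state left over from the previous message
    for (const auto& name : cluster_names) {
      hds_clusters_.push_back(std::make_unique<FakeHdsCluster>(name));
    }
  }

  size_t clusterCount() const { return hds_clusters_.size(); }

private:
  std::vector<std::unique_ptr<FakeHdsCluster>> hds_clusters_;
};
```

With this shape, a second specifier naming fewer clusters shrinks the list rather than accumulating stale entries.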
I assume no updates in this PR.
ssl_context_manager_, secret_manager_, added_via_api_);

for (const auto& host : cluster.hosts()) {
  initial_hosts_->emplace_back(HostSharedPtr{
Can this just be emplace_back(new HostImpl..?
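For context on this suggestion: `emplace_back` forwards its argument to the element's constructor, so a raw pointer is handed straight to `std::shared_ptr`'s explicit constructor without the wrapper. A sketch with a stand-in type (FakeHost is not the Envoy HostImpl):

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Illustrative stand-in for HostImpl.
struct FakeHost {
  explicit FakeHost(int port) : port_(port) {}
  int port_;
};
using FakeHostSharedPtr = std::shared_ptr<FakeHost>;

size_t demoEmplace() {
  std::vector<FakeHostSharedPtr> hosts;
  // Explicit wrapping, as in the quoted diff:
  hosts.emplace_back(FakeHostSharedPtr{new FakeHost(80)});
  // emplace_back forwards the raw pointer to shared_ptr's explicit
  // constructor, so the wrapper can be dropped:
  hosts.emplace_back(new FakeHost(443));
  return hosts.size();
}
```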
ClusterSharedPtr HdsCluster::create() { NOT_IMPLEMENTED_GCOVR_EXCL_LINE; }

HostVectorConstSharedPtr HdsCluster::createHealthyHostList(const HostVector& hosts) {
  HostVectorSharedPtr healthy_list(new HostVector());
Would it make sense to just inherit from ClusterImplBase to get some of these methods for free?
const uint32_t RETRY_DELAY_MS = 5000;
const uint32_t RetryDelayMilliseconds = 5000;
static constexpr uint32_t ClusterConnectionBufferLimitBytes = 12345;
static constexpr uint32_t ClusterTimeoutSeconds = 1;
Can you add comments for all of these constants to explain what they do/represent?
Signed-off-by: Lilika Markatou <lilika@google.com>
htuch
left a comment
Great, I have some test comments on this pass.
}

void HdsDelegate::setServerResponseTimer() {
  hds_stream_response_timer_->enableTimer(std::chrono::milliseconds(server_response_ms_));
Can you rename the method to reflect the new field names?
 * server with a set of hosts to healthcheck, healthchecking them, and reporting
 * back the results.
 */

// TODO(htuch): Make this configurable or some static.
const uint32_t RETRY_DELAY_MS = 5000;

// How often we retry to establish a stream
const uint32_t RetryDelayMilliseconds = 5000;

// Soft limit on size of the cluster's connections read and write buffers.
static constexpr uint32_t ClusterConnectionBufferLimitBytes = 12345;
12345 is a strange buffer size :) Probably best to make it power-of-two.
static constexpr uint32_t ClusterTimeoutSeconds = 1;

// How often envoy reports the healthcheck results to the server
uint32_t server_response_ms_ = 1000;
Should this default initialize to 0?
test/common/upstream/hds_test.cc

dispatcher_, runtime_, stats_store_, ssl_context_manager_,
secret_manager_, random_, test_factory_, log_manager_));
}
envoy::service::discovery::v2::HealthCheckSpecifier* simpleMessage() {
Prefer to name methods as verbs, since they do something, e.g. createSimpleMessage.
test/common/upstream/hds_test.cc

auto* health_check2 = message->add_health_check();
health_check2->set_cluster_name("voronoi");

auto* address4 = health_check2->add_endpoints()->add_endpoints()->mutable_address();
You could factor out a lot of this endpoint addition boilerplate to a helper function.
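One possible shape for such a helper, sketched with plain structs standing in for the proto messages (FakeAddress, FakeHealthCheck, and addEndpoint are hypothetical names; the real helper would operate on envoy::service::discovery::v2 types):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Simplified stand-ins for the proto messages built in the test.
struct FakeAddress {
  std::string ip;
  int port = 0;
};
struct FakeHealthCheck {
  std::vector<FakeAddress> endpoints;
};

// Hypothetical helper that factors out the repeated endpoint setup,
// so each test line becomes a single call instead of a chain of
// add_endpoints()/mutable_address()/set_* calls.
void addEndpoint(FakeHealthCheck& health_check, const std::string& ip, int port) {
  health_check.endpoints.push_back(FakeAddress{ip, port});
}
```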
EXPECT_EQ(host6->address()->ip()->port(), 8765);
}

TEST_F(HdsTest, TestProcessMessageHealthChecks) {
Can you add a one line description above each TEST_F explaining what the test does?
EXPECT_EQ(1, test_server_->counter("cluster.anna.health_check.failure")->value());
}

TEST_P(HdsIntegrationTest, TwoEndpointsSameLocality) {
Also add an explanatory test summary above each TEST_P here.
host2_stream->waitForEndStream(*dispatcher_);

// Check that the healthcheck messages are correct
EXPECT_STREQ(host_stream->headers().Path()->value().c_str(), "/healthcheck");
Can you refactor these tests to reduce some of the boilerplate?
Signed-off-by: Lilika Markatou <lilika@google.com>
htuch
left a comment
Looks good. I haven't fully gone over all tests, but I think this is now ready for another set of eyes. @mattklein123 want to take a pass? I think the main thing to discuss design wise is to what extent HdsCluster should reuse ClusterImplBase machinery, given it needs to only do a very limited subset of regular cluster behavior, vs. its own simplistic re-implementation of the cluster interface.
typedef std::unique_ptr<HdsDelegate> HdsDelegatePtr;

// Friend class of HdsDelegate, making it easier to access private fields
class HdsDelegateFriend {
I think you can just forward declare class HdsDelegateFriend and only fill in its contents in the test module.
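This works because a friend declaration only needs the class name, not a complete type. A sketch with illustrative names (DelegateSketch/DelegateFriendSketch are not the Envoy classes):

```cpp
#include <cassert>

// In the header, a forward declaration of the friend is enough;
// the full definition can live in the test module.
class DelegateFriendSketch;

class DelegateSketch {
public:
  explicit DelegateSketch(int response_ms) : server_response_ms_(response_ms) {}

private:
  friend class DelegateFriendSketch; // grants access without a definition here
  int server_response_ms_;
};

// Test-module-only definition, with access to the private field.
class DelegateFriendSketch {
public:
  static int responseMs(const DelegateSketch& d) { return d.server_response_ms_; }
};
```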
hds_stream_->sendGrpcMessage(server_health_check_specifier);
// Wait until the request has been received by Envoy.
test_server_->waitForCounterGe("hds_delegate.requests", ++hds_requests_);
if (cluster2 != "None") {
Maybe just use empty string instead of "None", that should also be an invalid cluster name and is more idiomatic C++ for an optional.
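A sketch of the suggested convention; the helper name is hypothetical:

```cpp
#include <cassert>
#include <string>

// An empty string is already an invalid cluster name, so it can serve
// as the "no second cluster" sentinel instead of the magic value "None".
bool hasSecondCluster(const std::string& cluster2) { return !cluster2.empty(); }
```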
Signed-off-by: Lilika Markatou <lilika@google.com>
@mattklein123 friendly ping.

Sorry, I'm behind on reviews; will do today.
mattklein123
left a comment
At a high level looks great to me. Some comments for TODOs for future work if you have time before your internship ends. Nice work!
/**
 * Factory for creating ClusterInfo
 */
nit: kill newline between comment and thing commented on. Same elsewhere.
// How often we retry to establish a stream to the management server
const uint32_t RetryDelayMilliseconds = 5000;

// Soft limit on size of the cluster's connections read and write buffers.
FWIW I think it's odd that this setting and timeout are hard coded, and IMO this is a deficiency in the API that we should fix. Can you add a TODO around settings that are currently hard coded that we probably want to eventually add API knobs for?
 * server with a set of hosts to healthcheck, healthchecking them, and reporting
 * back the results.
 */
class HdsDelegate
Can we add a TODO around adding /config_dump support for this? I think it would be extremely useful to be able to see what hosts we are health checking like we do for the other xDS endpoints. Also, might consider adding a TODO around whether we want to add health check clusters to the /clusters endpoint to get detailed stats about each HC host. Also I think would be extremely useful for debugging.
Signed-off-by: Lilika Markatou <lilika@google.com>
Thanks! I added the TODOs.
htuch
left a comment
LGTM modulo a few last nits, ready to ship when fixed.
test/common/upstream/hds_test.cc

Stats::IsolatedStoreImpl stats_store_;
MockClusterInfoFactory test_factory_;

std::shared_ptr<Upstream::HdsDelegate> hds_delegate_;
Does this really need to be a shared_ptr? Could it just be unique? Shared pointers should only be used when non-trivial ownership semantics are required.
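The distinction in a minimal sketch (DelegateStub and createDelegate are illustrative names only):

```cpp
#include <cassert>
#include <memory>

// unique_ptr states that the test fixture is the sole owner of the
// delegate; shared_ptr is only warranted when ownership is genuinely
// shared across objects with independent lifetimes.
struct DelegateStub {
  int requests = 0;
};

std::unique_ptr<DelegateStub> createDelegate() {
  return std::make_unique<DelegateStub>();
}
```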
test/common/upstream/hds_test.cc

// Creates a HealthCheckSpecifier message that contains one endpoint and one
// healthcheck
envoy::service::discovery::v2::HealthCheckSpecifier* createSimpleMessage() {
test/common/upstream/hds_test.cc

}

// Test if processMessage processes healthchecks from a HealthCheckSpecifier
// message correctly
test/common/upstream/hds_test.cc

}
}

// Test if processMessage processes healthchecks from a HealthCheckSpecifier
Nit: s/healthchecks/health checks/
Signed-off-by: Lilika Markatou <lilika@google.com>
This reverts commit f3b0f85. Signed-off-by: Stephan Zuercher <stephan@turbinelabs.io>
…proxy#4063)" This reverts commit 4e52589. Signed-off-by: Lilika Markatou <lilika@google.com>
This pull request contains a basic implementation of HDS: a management server can request HTTP health checks for any number of endpoints, and the HdsDelegate health-checks them and reports the results back. The code also includes TODOs to help identify the work that needs to be done next, such as supporting updates to the set of endpoints that require health checking.
Risk Level: Low
Testing:
There are integration tests in test/integration/hds_integration_test.cc
and unit tests in test/common/upstream/hds_test.cc.
This work is for #1310.