add x-envoy-immediate-health-check-fail header support #1570
mattklein123 merged 3 commits into master
Conversation
This feature adds the ability for data plane processing to cause a host to be considered active health check failed. This is currently only used by the router filter and the health check filter, but could be extended to other protocols later. Fixes #1423
@htuch will trade you for the other review tomorrow. cc @alyssawilk also.
htuch
left a comment
Overall design looks good. I'm wondering if there is a way to simplify the ownership graph.
upstream host has failed :ref:`active health checking <arch_overview_health_checking>` (if the
cluster has been :ref:`configured <config_cluster_manager_cluster_hc>` for active health checking).
This can be used to fast fail an upstream host via standard data plane processing without waiting
for the next health check interval. See the :ref:`health checking overview
Can you comment on how the host can become considered healthy again (i.e. next health check)?
Alternately, for load shedding we have a hard-coded duration for it to expire. It'd be nice to have either a set expiry time or an optional TTL as the header value.
Yeah, for load shedding, IMO, I would treat that as more of an outlier ejection event vs. an active health check event. I think that would be a useful feature but I think is out of scope for this PR.
Fair enough. I was mainly wondering if, for backward compatibility, we should make the value extensible. As I understand Envoy versioning, we could always change the requirements for the value as a breaking change gated on an Envoy version update - is this about right?
The current code doesn't look at the value of the header at all, so in the future we can do anything we want. But yes, in general we could have a deprecation window, etc. I would recommend in this case, though, that if we want to start using the header value we think about forward compatibility. I will open a follow-up issue to think about this more.
Note that the filter will automatically set the :ref:`x-envoy-immediate-health-check-fail
<config_http_filters_router_x-envoy-immediate-health-check-fail>` header if the
:ref:`/healthcheck/fail <operations_admin_interface_healthcheck_fail>` admin endpoint has been
called.
Is there a way to unset this state? I.e. call into the admin endpoint to become healthy again? I'm thinking specifically of using this feature for load shedding.
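For reference, the admin interface does expose a companion endpoint to clear the state. A minimal example, assuming a locally running Envoy with the admin interface on port 9901 (the port is an assumption about your deployment):

```shell
# Force the health check filter to report failure; the router will also start
# appending x-envoy-immediate-health-check-fail to health check responses.
curl -X POST http://localhost:9901/healthcheck/fail

# Unset the state so the host can be considered healthy again.
curl -X POST http://localhost:9901/healthcheck/ok
```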
 * special HTTP header is received, the data plane may decide to fast fail a host to avoid waiting
 * for the full HC interval to elapse before determining the host is active HC failed.
 */
class HealthCheckerSink {
I'm wondering if there is a clearer term than "sink" here, maybe HealthCheckMonitor or something on the "watching" theme. My mental model is that we're looking for events from upstream that indicate that it's time to go unhealthy.
Sure, that's fine. FWIW I just replicated the naming from the outlier detector stuff, which basically works the same way with the same naming. How about HealthCheckHostMonitor and DetectorHostMonitor?
void HealthCheckerImplBase::setUnhealthyCrossThread(const HostSharedPtr& host) {
  // The threading here is complex. The cluster owns the only strong reference to the health
  // checker. It might go away when we post to the main thread. We capture a weak reference and
  // make sure it is still valid when we get to the other thread. Additionally, the host/session
Would be helpful to be clearer here on which thread is which when referring to "other thread". I.e. we have a worker thread (presumably monitoring the passive header check) communicating with the main thread?
  // may also be gone by then so we check that also.
  std::weak_ptr<HealthCheckerImplBase> weak_this = shared_from_this();
  dispatcher_.post([weak_this, host]() -> void {
    std::shared_ptr<HealthCheckerImplBase> shared_this = weak_this.lock();
This logic seems complicated. Some of this is inherent due to ownership structure, but I wonder if it could be simplified by having the main thread (in the post body) do the cluster -> health checker resolution, and only passing in the cluster? Or host+cluster -> health checker resolution?
The weak pointers seem valid, but it becomes fairly hard to reason about the combination of shared_ptr and then multiple weak_ptrs in the sink.
Again FWIW this is basically identical to logic we do for outlier detection: https://github.com/lyft/envoy/blob/master/source/common/upstream/outlier_detection_impl.cc#L220
I agree the logic is complicated, but I'm not quite sure how to make it simpler. Even if we pass a cluster, we need to pass a weak_ptr, because the code relies on clusters going away on the main thread immediately. I figured it was better to have all the shared_ptr/weak_ptr logic internal to this thing like we did in outlier detector? Let me try adding some more comments.
@htuch @alyssawilk PR updated per comments.
 * a new outlier detector must be installed before the host is used across threads. Thus,
 * Set the host's health checker monitor. Monitors are assumed to be thread safe, however
 * a new monitor must be installed before the host is used across threads. Thus,
 * this routine should only be called on the main thread before the host is used across threads.
Maybe add an ASSERT verifying we're on the main thread in the implementation?
There is no great way to do this because of how we use hosts in tests, etc. I think I'm going to skip this for now. I will make a note to look at it in a follow up.
};

typedef std::unique_ptr<DetectorHostSink> DetectorHostSinkPtr;
typedef std::unique_ptr<DetectorHostMonitor> DetectorHostMonitorPtr;
Yeah, I think this nomenclature is easier to parse for me.
// 1) We capture a weak reference to the health checker and post it from the worker thread to the
//    main thread.
// 2) On the main thread, we make sure it is still valid (as the cluster may have been destroyed).
// 3) Additionally, the host/session may also be gone by then so we check that also.
Was broken by interaction of envoyproxy#1521 and envoyproxy#1570.
Signed-off-by: Jose Nino <jnino@lyft.com> Signed-off-by: JP Simard <jp@jpsim.com>
This feature adds the ability for data plane processing to cause a host
to be considered active health check failed. This is currently only used
by the router filter and the health check filter, but could be extended
to other protocols later.
Fixes #1423