add x-envoy-immediate-health-check-fail header support #1570
mattklein123 merged 3 commits into master
Conversation
This feature adds the ability for data plane processing to cause a host to be considered active health check failed. This is currently only used by the router filter and the health check filter, but could be extended to other protocols later. Fixes #1423
@htuch will trade you for the other review tomorrow. cc @alyssawilk also.
htuch
left a comment
Overall design looks good. I'm wondering if there is a way to simplify the ownership graph.
upstream host has failed :ref:`active health checking <arch_overview_health_checking>` (if the
cluster has been :ref:`configured <config_cluster_manager_cluster_hc>` for active health checking).
This can be used to fast fail an upstream host via standard data plane processing without waiting
for the next health check interval. See the :ref:`health checking overview
Can you comment on how the host can become considered healthy again (i.e. next health check)?
Alternately, for load shedding we have a hard-coded duration for it to expire. It'd be nice to have either a set expiry time or an optional TTL as the header value.
Yeah, for load shedding, IMO, I would treat that as more of an outlier ejection event vs. an active health check event. I think that would be a useful feature but I think is out of scope for this PR.
Fair enough. I was mainly wondering if, for backward compatibility, we should make the value extensible. As I understand Envoy versioning, we could always change the requirements for the value as a breaking change gated on an Envoy version update - is this about right?
The current code doesn't look at the value of the header at all, so in the future we can do anything we want. But yes, in general we could have a deprecation window, etc. I would recommend in this case, though, that if we want to start using the header value we think about forward compatibility. I will open a follow-up issue to think about this more.
Note that the filter will automatically set the :ref:`x-envoy-immediate-health-check-fail
<config_http_filters_router_x-envoy-immediate-health-check-fail>` header if the
:ref:`/healthcheck/fail <operations_admin_interface_healthcheck_fail>` admin endpoint has been
called.
Is there a way to unset this state? I.e. call into the admin endpoint to become healthy again? I'm thinking specifically of using this feature for load shedding.
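For reference, the admin interface does expose a companion endpoint to clear the state. A minimal example, assuming a locally running Envoy with the admin interface on port 9901 (the port is an assumption about your deployment):

```shell
# Force the health check filter to report failure; the router will also start
# appending x-envoy-immediate-health-check-fail to health check responses.
curl -X POST http://localhost:9901/healthcheck/fail

# Unset the state so the host can be considered healthy again.
curl -X POST http://localhost:9901/healthcheck/ok
```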
 * special HTTP header is received, the data plane may decide to fast fail a host to avoid waiting
 * for the full HC interval to elapse before determining the host is active HC failed.
 */
class HealthCheckerSink {
I'm wondering if there is a clearer term than "sink" here, maybe HealthCheckMonitor or something on the "watching" theme. My mental model is that we're looking for events from upstream that indicate that it's time to go unhealthy.
Sure, that's fine. FWIW I just replicated the naming from the outlier detector stuff, which basically works the same way with the same naming. How about HealthCheckHostMonitor and DetectorHostMonitor?
void HealthCheckerImplBase::setUnhealthyCrossThread(const HostSharedPtr& host) {
  // The threading here is complex. The cluster owns the only strong reference to the health
  // checker. It might go away when we post to the main thread. We capture a weak reference and
  // make sure it is still valid when we get to the other thread. Additionally, the host/session
Would be helpful to be clearer here on which thread is which when referring to "other thread". I.e. we have a worker thread (presumably monitoring the passive header check) communicating with the main thread?
  // may also be gone by then so we check that also.
  std::weak_ptr<HealthCheckerImplBase> weak_this = shared_from_this();
  dispatcher_.post([weak_this, host]() -> void {
    std::shared_ptr<HealthCheckerImplBase> shared_this = weak_this.lock();
This logic seems complicated. Some of this is inherent due to ownership structure, but I wonder if it could be simplified by having the main thread (in the post body) do the cluster -> health checker resolution, and only passing in the cluster? Or host+cluster -> health checker resolution?
The weak pointers seem valid, but it becomes fairly hard to reason about the combination of shared_ptr and then multiple weak_ptrs in the sink.
Again FWIW this is basically identical to logic we do for outlier detection: https://github.com/lyft/envoy/blob/master/source/common/upstream/outlier_detection_impl.cc#L220
I agree the logic is complicated, but I'm not quite sure how to make it simpler. Even if we pass a cluster, we need to pass a weak_ptr, because the code relies on clusters going away on the main thread immediately. I figured it was better to have all the shared_ptr/weak_ptr logic internal to this thing like we did in outlier detector? Let me try adding some more comments.
@htuch @alyssawilk PR updated per comments.
 * a new outlier detector must be installed before the host is used across threads. Thus,
 * Set the host's health checker monitor. Monitors are assumed to be thread safe, however
 * a new monitor must be installed before the host is used across threads. Thus,
 * this routine should only be called on the main thread before the host is used across threads.
Maybe add an ASSERT verifying we're on the main thread in the implementation?
There is no great way to do this because of how we use hosts in tests, etc. I think I'm going to skip this for now. I will make a note to look at it in a follow up.
};

typedef std::unique_ptr<DetectorHostSink> DetectorHostSinkPtr;
typedef std::unique_ptr<DetectorHostMonitor> DetectorHostMonitorPtr;
Yeah, I think this nomenclature is easier to parse for me.
// 1) We capture a weak reference to the health checker and post it from the worker thread to the
//    main thread.
// 2) On the main thread, we make sure it is still valid (as the cluster may have been destroyed).
// 3) Additionally, the host/session may also be gone by then so we check that also.
Was broken by interaction of envoyproxy#1521 and envoyproxy#1570.
Signed-off-by: Jose Nino <jnino@lyft.com> Signed-off-by: JP Simard <jp@jpsim.com>
This feature adds the ability for data plane processing to cause a host
to be considered active health check failed. This is currently only used
by the router filter and the health check filter, but could be extended
to other protocols later.
Fixes #1423