Implemented MultiWatchdog.#13015
Conversation
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
|
/assign @antoniovicente |
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
antoniovicente
left a comment
There was a problem hiding this comment.
Thanks for splitting the watchdogs for main and worker threads. Should help reduce the rate of spurious misses and increase the quality of signal we get from profiles on megamiss and other custom actions.
| // we'll use default values for that system. | ||
| message MultiWatchdog { | ||
| // Watchdog for the main thread. | ||
| Watchdog aux_watchdog = 1; |
There was a problem hiding this comment.
You may want to name this: main_thread_watchdog, let's see what the API shepherds say.
There was a problem hiding this comment.
Yeah, prefer main, we don't really talk about "aux" elsewhere.
There was a problem hiding this comment.
Done. Going with main_thread_watchdog as suggested.
| // This is used for specifying different watchdogs for the different subsystems. | ||
| // If both *multi_watchdog* and *watchdog* are specified, we'll use | ||
| // *multi_watchdog*. | ||
| MultiWatchdog multi_watchdog = 26; |
There was a problem hiding this comment.
I would have named this: WatchdogConfigs watchdog_configs
Let's see what the API shepherds say.
There was a problem hiding this comment.
Generally we prefer singular unless it's a repeated, but we have precedence for doing what you suggest earlier in the file, e.g. StaticResources. How about just Watchdogs watchdogs = ....
There was a problem hiding this comment.
Going with Watchdogs watchdogs as suggested. Thanks!
source/server/configuration_impl.cc
Outdated
| Instance& server) { | ||
| // TODO(kbaichoo): modify this to handle additional watchdogs | ||
| watchdog_ = std::make_unique<WatchdogImpl>(bootstrap.watchdog(), server); | ||
| if (bootstrap.has_multi_watchdog()) { |
There was a problem hiding this comment.
If neither set of parameters is passed in, should we default to separate guarddogs or single guarddog?
There was a problem hiding this comment.
For backwards compatibility, defaulting to the current behavior (single guard dog even if no params are set), makes sense, which is what this would do.
There was a problem hiding this comment.
A possible issue I see is that we'ld see a default behavior change when removing the old config parameters in the future. I think either option is fine.
There was a problem hiding this comment.
Good point. We can go with separate guard dogs with default configs that will be similar to the single guard dog case.
Once we've found good parameters that work for the different subsystems, we can do the behavior change then (and the change will be smaller)
| } else { | ||
| multi_watchdog_ = false; | ||
| watchdog_ = std::make_unique<WatchdogImpl>(bootstrap.watchdog(), server); | ||
| } |
There was a problem hiding this comment.
It seems like a mistake to configure both multi_watchdog and watchdog_config, we may want to reject config in that case.
There was a problem hiding this comment.
Good point, we should inform the end-user that both of those fields being set doesn't make sense.
It seems we can go about this in one of two ways:
- Warn here, and have a preference of one of the configs if both are set.
- Do validation such as in: and explicitly throw and error (there's no cleaner way of propagating the information AFAIK).
Line 435 in 82e611e
Thoughts?
There was a problem hiding this comment.
throw an exception if config is invalid.
KBaichoo
left a comment
There was a problem hiding this comment.
Thanks for the review!
source/server/configuration_impl.cc
Outdated
| Instance& server) { | ||
| // TODO(kbaichoo): modify this to handle additional watchdogs | ||
| watchdog_ = std::make_unique<WatchdogImpl>(bootstrap.watchdog(), server); | ||
| if (bootstrap.has_multi_watchdog()) { |
There was a problem hiding this comment.
For backwards compatibility, defaulting to the current behavior (single guard dog even if no params are set), makes sense, which is what this would do.
| } else { | ||
| multi_watchdog_ = false; | ||
| watchdog_ = std::make_unique<WatchdogImpl>(bootstrap.watchdog(), server); | ||
| } |
There was a problem hiding this comment.
Good point, we should inform the end-user that both of those fields being set doesn't make sense.
It seems we can go about this in one of two ways:
- Warn here, and have a preference of one of the configs if both are set.
- Do validation such as in: and explicitly throw and error (there's no cleaner way of propagating the information AFAIK).
Line 435 in 82e611e
Thoughts?
|
PTAL @envoyproxy/api-shepherds ; @antoniovicente has pointed out places where we would like input. Thanks! |
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
include/envoy/server/configuration.h
Outdated
| * @return const Watchdog& the configuration of the watchdog. | ||
| * @return WatchdogOptConstRef the configuration of the watchdog. | ||
| * If multiWatchdog() returns true this will be empty. | ||
| */ | ||
| virtual WatchdogOptConstRef watchdogConfig() const PURE; | ||
|
|
||
| /** | ||
| * @return WatchdogOptConstRef the configuration of the auxiliary watchdog. | ||
| * This will be empty unless we're multiWatchdog() returns true. | ||
| */ | ||
| virtual WatchdogOptConstRef auxWatchdogConfig() const PURE; | ||
|
|
||
| /** | ||
| * @return WatchdogOptConstRef the configuration of the worker watchdog. | ||
| * This will be empty unless we're multiWatchdog() returns true. | ||
| */ | ||
| virtual WatchdogOptConstRef workerWatchdogConfig() const PURE; | ||
|
|
||
| /** | ||
| * @return Whether we're using multiWatchdog. | ||
| */ | ||
| virtual const Watchdog& watchdogConfig() const PURE; | ||
| virtual bool multiWatchdog() const PURE; |
There was a problem hiding this comment.
Requiring callers to interpret the results of calling a method by consulting a different method seems like an API smell. Would it be possible to get rid of watchdogConfig() and require callers to always treat auxWatchdogConfig() and workerWatchdogConfig() as distinct, even if they actually return the same config?
There was a problem hiding this comment.
The multiWatchdog() was meant more as a lighter weight method to figure out whether we're using the single watchdog or the multiwatchdog system. I could remove it if you'd like, but then I'd have the 2/3 places with something like:
if(watchdogConfig.empty()) : do multi watchdog config... else: do single watchdog
which is a bit confusing since I'm using either the negation of the other system being present or making multiple function calls looking for the presence of two optionals having values.
The main awkwardness with just having auxWatchdogConfig() and workerWatchdogConfig() is in the single guarddog for the system case there's going to be some function call like:
guarddog = make_unique<..>(config())
and using either auxWatchdogConfig() or workerWatchdogConfig() as the config() call would be confusing (even though they would be the same); alternatively, I could try to multiplex one of these calls but that would add more complexity to a single function.
There was a problem hiding this comment.
The problem is that there's no way to perfectly emulate the single watchdog case with multiple watchdogs, right?
There was a problem hiding this comment.
Yep. Shouldn't be an issue now though with always having multiple guarddogs in the system.
source/server/configuration_impl.h
Outdated
| std::unique_ptr<Watchdog> watchdog_; | ||
| std::unique_ptr<Watchdog> aux_watchdog_; | ||
| std::unique_ptr<Watchdog> worker_watchdog_; | ||
| bool multi_watchdog_; |
There was a problem hiding this comment.
Since the values of these fields are so tied together, consider using a type that enforces the invariants you want. Here, for instance, multi_watchdog_ tells you which of watchdog_ and (aux_watchdog_, worker_watchdog_) is valid. Consider using absl::variant as a type-safe union that will subsume these (and save some space!).
There was a problem hiding this comment.
I thought about this for a while...
Using an absl::variant but it seems like they're be a good amount of boilerplate since I'd have to wrap the aux_watchdog_ and worker_watchdog_ together somehow.
I could alternatively jump to using two watchdogs systems (vs the single watchdog system), while still supporting the single watchdog field. It'd be much easier to "emulate" most of the functionality of the single watchdog system using the two watchdog system.
This would make things a lot cleaner, but would break any cross-watchdog policies (such as multikill triggering under a combination of main_thread_watchdog and worker_watchdog), but that functionality doesn't seem too helpful.
I'm leaning towards the latter.
There was a problem hiding this comment.
I think the only boilerplate would be an additional struct MultiWatchDog { std::unique_ptr<Watchdog> aux_watchdog, worker_watchdog; }; that you'd template the variant with, plus maybe a constructor definition to make emplacing easier. You could also use a std::pair but that seems error-prone since the types are the same. I guess that the call sites would be a little messier with absl::variant since absl::visit isn't super nice to use.
I agree with your leaning. If you can simplify here by emulating the single watchdog case, that seems preferable.
There was a problem hiding this comment.
I've made the system always have two guard dogs which removes this issue.
If the user specifies the watchdog field (the old system), then we'll just have the two guarddogs have the same configuration.
source/server/guarddog_impl.cc
Outdated
|
|
||
| GuardDogImpl::GuardDogImpl(Stats::Scope& stats_scope, const Server::Configuration::Watchdog& config, | ||
| Api::Api& api, std::unique_ptr<TestInterlockHook>&& test_interlock) | ||
| Api::Api& api, const std::string& name, |
There was a problem hiding this comment.
nit: unless you need a std::string, use absl::string_view
| // Optional watchdog configuration. | ||
| // This is for a single watchdog configuration for the entire system. | ||
| // For finer granularity watchdogs use *multi_watchdog*. | ||
| Watchdog watchdog = 8; |
There was a problem hiding this comment.
Should this be deprecated?
There was a problem hiding this comment.
Yeah, I'd be keen to not overlap mechanism long-term, so let's deprecate.
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
…e time options (deprecated warnings). Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
| EXPECT_NO_THROW(initialize("test/server/test_data/server/empty_bootstrap.yaml")); | ||
| EXPECT_NE(nullptr, TestUtility::findCounter(stats_store_, "server.watchdog_miss")); | ||
| EXPECT_NE(nullptr, TestUtility::findCounter(stats_store_, "main_thread.watchdog_miss")); | ||
| EXPECT_NE(nullptr, TestUtility::findCounter(stats_store_, "workers.watchdog_miss")); |
There was a problem hiding this comment.
Existing monitoring for "server.watchdog_miss" will break for users that are configured using the current single watchdog parameter.
Is this acceptable? Worth mentioning i release notes?
There was a problem hiding this comment.
I've added it to the release notes.
From a lot of the discussion with @akonradi, trying to maintain full backwards compatibility would end up with significant complexity within the system. Furthermore, as you pointed out in #13015 (comment), there will have to be some breakage at some point (either in the future or now).
Given those, I went for simplicity at the cost of minor behavior changes now.
There was a problem hiding this comment.
Yeah we don't have a stats deprecation policy right now, so it's ad hoc and "best judgement" depending on how many people we think we will break. I think this is OK.
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
|
/retest |
|
Retrying Azure Pipelines, to retry CircleCI checks, use |
| * router: added transport failure reason to response body when upstream reset happens. After this change, the response body will be of the form `upstream connect error or disconnect/reset before headers. reset reason:{}, transport failure reason:{}`.This behavior may be reverted by setting runtime feature `envoy.reloadable_features.http_transport_failure_reason_in_body` to false. | ||
| * router: now consumes all retry related headers to prevent them from being propagated to the upstream. This behavior may be reverted by setting runtime feature `envoy.reloadable_features.consume_all_retry_headers` to false. | ||
| * thrift_proxy: special characters {'\0', '\r', '\n'} will be stripped from thrift headers. | ||
| * watchdog: replaced single watchdog with separate watchdog configuration for worker threads and for the main thread :ref:`Watchdogs<envoy_v3_api_field_config.bootstrap.v3.Bootstrap.watchdogs>`. It works with :ref:`watchdog<envoy_v3_api_field_config.bootstrap.v3.Bootstrap.watchdog>` by having the worker thread and main thread watchdogs have same config. The two guarddogs will have the following prefixes: `main_thread` and `workers` for their stats and dispatcher names. Anything monitoring these will need to be updated. |
There was a problem hiding this comment.
Not sure what section the monitoring changes should be added to. Consider adding to "Incompatible Behavior Changes" instead
|
|
||
| const Watchdog& mainThreadWatchdogConfig() const override { return *main_thread_watchdog_; } | ||
|
|
||
| const Watchdog& workerWatchdogConfig() const override { return *worker_watchdog_; } |
There was a problem hiding this comment.
nit: The newlines between the accessor methods is not required, consider removing them for consistency with other accessors above.
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
|
PTAL @envoyproxy/api-shepherds . Thanks. |
|
/lgtm api |
mattklein123
left a comment
There was a problem hiding this comment.
Nice! Just a few small comments. Thank you.
/wait
| EXPECT_NO_THROW(initialize("test/server/test_data/server/empty_bootstrap.yaml")); | ||
| EXPECT_NE(nullptr, TestUtility::findCounter(stats_store_, "server.watchdog_miss")); | ||
| EXPECT_NE(nullptr, TestUtility::findCounter(stats_store_, "main_thread.watchdog_miss")); | ||
| EXPECT_NE(nullptr, TestUtility::findCounter(stats_store_, "workers.watchdog_miss")); |
There was a problem hiding this comment.
Yeah we don't have a stats deprecation policy right now, so it's ad hoc and "best judgement" depending on how many people we think we will break. I think this is OK.
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Signed-off-by: Kevin Baichoo kbaichoo@google.com
Commit Message: Added the ability to specific different watchdog configurations for worker threads and the main thread.
Additional Description:
Risk Level: medium
Testing: Unit tests
Docs Changes: Included.
Release Notes: Included.
Fixes #12927