Watchdog Extension: Profile Action#12636
Conversation
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
|
/assign @akonradi PTAL Alex |
|
neither of PTAL, Alex can be assigned to this issue. |
|
/assign @akonradi PTAL Alex |
|
/retest |
|
Retrying Azure Pipelines, to retry CircleCI checks, use |
|
/retest-circle |
|
🔨 rebuilding |
api/envoy/extensions/watchdog/profile_action/v3alpha/profile_action.proto
Outdated
Show resolved
Hide resolved
| string profile_path = 2 [(validate.rules).string = {min_bytes: 1}]; | ||
|
|
||
| // Limits the max number of profiles that will be written to disk by a given | ||
| // thread to avoid filling disks. If not set (i.e. it's 0), a default of 10 |
There was a problem hiding this comment.
Can you elaborate on this? E.g. is this a thread lifetime maximum, or does it apply each time it is triggered?
There was a problem hiding this comment.
It's a thread lifetime maximum. I updated the comment to clarify the intent.
| // Limits the max number of profiles that will be written to disk by a given | ||
| // thread to avoid filling disks. If not set (i.e. it's 0), a default of 10 | ||
| // will be used. | ||
| uint64 max_profiles_per_thread = 3; |
There was a problem hiding this comment.
Possibly you might want to use UInt64Value, but only if there will ever be a possible sensible setting for 0 here.
There was a problem hiding this comment.
I think since this is an optional extension, if you're including it that is at least some intent to use it.
Otherwise it'd just be an unnecessary no-op.
There was a problem hiding this comment.
I think the intent here was max_profiles to collect per process, not max per stuck thread :-/
akonradi
left a comment
There was a problem hiding this comment.
Flushing comments before moving on to tests.
| envoy::extensions::watchdog::profile_action::v3alpha::ProfileActionConfig& config, | ||
| Server::Configuration::GuardDogActionFactoryContext& context) | ||
| : path_(config.profile_path()), running_profile_(false), | ||
| max_profiles_per_tid_(config.max_profiles_per_thread() ? config.max_profiles_per_thread() |
There was a problem hiding this comment.
nit: the implicit conversion from uint64 to bool is correct, but can be error prone. Prefer to check config.max_profiles_per_thread() == 0 and let the compiler optimize.
| Server::Configuration::GuardDogActionFactoryContext& context); | ||
|
|
||
| void run(envoy::config::bootstrap::v3::Watchdog::WatchdogAction::WatchdogEvent event, | ||
| std::vector<std::pair<Thread::ThreadId, MonotonicTime>> thread_ltt_pairs, |
There was a problem hiding this comment.
I see this method is from the GuardDogAction interface, but asking here since you authored that too 😄 : why is this std::vector passed by copy instead of by const reference?
There was a problem hiding this comment.
Good catch, I do think const ref might be a better way to go about this.
|
|
||
| auto& fs = context_.api_.fileSystem(); | ||
| if (!fs.directoryExists(path_)) { | ||
| ENVOY_LOG_MISC(error, "Directory path {} doesn't exists.", path_); |
| tid_to_profile_count_[*trigger_tid] += 1; | ||
|
|
||
| // Schedule callback to stop | ||
| timer_cb_ = context_.dispatcher_.createTimer([this, profile_filename] { |
There was a problem hiding this comment.
Creating timers is expensive. Can we create the timer once and change the profile_filename that it references via a member variable?
| if (Profiler::Cpu::profilerEnabled()) { | ||
| Profiler::Cpu::stopProfiler(); | ||
| running_profile_ = false; | ||
| timer_cb_->disableTimer(); |
There was a problem hiding this comment.
I don't understand this call. It looks like the timer is only enabled here, and timers are always one-shot, so the timer should be disabled after this callback runs.
| absl::optional<Thread::ThreadId> tid_to_profile; | ||
|
|
||
| // Find a TID not over the max_profiles threshold | ||
| for (const auto& tid_time_pair : thread_ltt_pairs) { |
There was a problem hiding this comment.
Envoy is using C++17 now, so we can use tuple destructuring here :)
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
akonradi
left a comment
There was a problem hiding this comment.
Thanks for the changes! Just a few more nits on my end
| TestUtility::loadFromJson(absl::Substitute(R"EOF({ | ||
| "profile_duration": "1s", | ||
| "profile_path": "$0", | ||
| } | ||
| )EOF", | ||
| test_path_), | ||
| config); |
There was a problem hiding this comment.
This can probably be more succinctly spelled out by just calling the setters on the config object.
| tid_to_profile = tid; | ||
| break; |
There was a problem hiding this comment.
Can't this just be return tid; and return absl::nullopt; at the bottom? That lets you get rid of the extra variable.
| // Check we can do multiple profiles | ||
| dispatcher_->post([&tid_ltt_pairs, &now, this]() -> void { | ||
| action_->run(envoy::config::bootstrap::v3::Watchdog::WatchdogAction::MISS, tid_ltt_pairs, now); | ||
| absl::MutexLock lock(&mutex_); | ||
| outstanding_notifies_ += 1; | ||
| }); | ||
|
|
||
| waitForOutstandingNotify(); | ||
| time_system_->advanceTimeWait(std::chrono::seconds(2)); | ||
|
|
||
| dispatcher_->exit(); | ||
| thread_->join(); | ||
|
|
||
| #ifdef PROFILER_AVAILABLE | ||
| EXPECT_EQ(countNumberOfProfileInPath(test_path_), 2); | ||
| #else | ||
| // Profiler won't run in this case, so there should be no files generated. | ||
| EXPECT_EQ(countNumberOfProfileInPath(test_path_), 0); | ||
| #endif | ||
| } |
There was a problem hiding this comment.
This seems like a separate behavior than the single-profile case. Could/should this be a separate test? If not, should the EXPECT_EQ above be asserts?
There was a problem hiding this comment.
Will split out the single profile case from this.
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
…on-profiler Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
|
ping @akonradi |
| if (GetParam() == TimeSystemType::Real) { | ||
| return std::make_unique<Event::GlobalTimeSystem>(); | ||
| } | ||
| ASSERT(GetParam() == TimeSystemType::Simulated); | ||
| return std::make_unique<Event::SimulatedTimeSystem>(); |
There was a problem hiding this comment.
Using a switch statement will get you the same safety properties but at compile time instead of run time.
| } | ||
|
|
||
| void | ||
| setupAction(envoy::extensions::watchdog::profile_action::v3alpha::ProfileActionConfig& config) { |
There was a problem hiding this comment.
Consider inlining this. It will make your tests a little more verbose but since they often access action_ directly, it's helpful to see that being set instead of hiding it behind a method.
There was a problem hiding this comment.
Done. It more the test body is more complete this way, and this wasn't really hiding much.
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
|
@lizan since you're the on-call maintainer, can you dispatch this who the right person is to review this PR? Thanks! |
…to the constructor. Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
…efactoring the timer creation to the ctor. Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
|
PTAL @envoyproxy/api-shepherds |
|
Need a master merge @KBaichoo |
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
|
PTAL @envoyproxy/senior-maintainers |
zuercher
left a comment
There was a problem hiding this comment.
Thanks. Generally looks good. I had one question for the other maintainers about the well-known names class, but otherwise small stuff.
| } | ||
|
|
||
| /** | ||
| * Static registration for the fixed heap resource monitor factory. @see RegistryFactory. |
There was a problem hiding this comment.
nit: comment refers to the wrong kind of extension.
| // Check if there's a tid that justifies profiling | ||
| auto trigger_tid = getTidTriggeringProfile(thread_ltt_pairs); | ||
| if (!trigger_tid.has_value()) { | ||
| ENVOY_LOG_MISC(warn, "None of the provide tids justify profiling"); |
There was a problem hiding this comment.
nit: provided
Also, since you're using the misc logger, perhaps this string should contain "profile action" to help someone reading the logs understand what's being warned.
There was a problem hiding this comment.
Prefixed the log with "Profile Action:" to help identify it.
|
|
||
| auto& fs = context_.api_.fileSystem(); | ||
| if (!fs.directoryExists(path_)) { | ||
| ENVOY_LOG_MISC(error, "Directory path {} doesn't exist.", path_); |
There was a problem hiding this comment.
Similarly, identify "profile action".
| * Well-known watchdog action names. | ||
| * NOTE: New filters should use the well known name: envoy.watchdog.name | ||
| */ | ||
| class WatchdogActionNameValues { |
There was a problem hiding this comment.
I was under the impression we were doing away with these well-known names classes for extensions?
cc @envoyproxy/maintainers
There was a problem hiding this comment.
Yeah please remove this and just inline the strings. We are trying to make extensions completely self contained.
There was a problem hiding this comment.
Good to know for future extensions. Done.
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
| } | ||
|
|
||
| /** | ||
| * Static registration for the fixed heap resource monitor factory. @see RegistryFactory. |
| // Check if there's a tid that justifies profiling | ||
| auto trigger_tid = getTidTriggeringProfile(thread_ltt_pairs); | ||
| if (!trigger_tid.has_value()) { | ||
| ENVOY_LOG_MISC(warn, "None of the provide tids justify profiling"); |
There was a problem hiding this comment.
Prefixed the log with "Profile Action:" to help identify it.
|
|
||
| auto& fs = context_.api_.fileSystem(); | ||
| if (!fs.directoryExists(path_)) { | ||
| ENVOY_LOG_MISC(error, "Directory path {} doesn't exist.", path_); |
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
| // We want to make sure the action is tested with both simulated time and real | ||
| // time, to ensure that it works in production, and that it works in the context | ||
| // of integration tests which are much easier to control with simulated time. | ||
| enum class TimeSystemType { Real, Simulated }; |
There was a problem hiding this comment.
Can you provide more details on why this is necessary? I think generally we want tests to use SimulatedTimeSystem to make them deterministic. It seems like this extension creates timers and uses the a TimeSystem to get timestamps for filenames, so it seems to me that using the simulated time should be sufficient for testing.
Furthermore, the docs in test/README.md and test/test_common/test_time.cc, state that GlobalTimeSystem (as used in makeTimeSystem when TimeSystemType == Real) is slated to switch to always creating a SimulatedTimeSystem. At that point, this parameterization will be no longer change the test's behavior.
There was a problem hiding this comment.
Great point, and good to know where moving towards just testing with just SimulatedTimeSystem.
The only benefit likely that we'd get from RealTimeSystem would be to have the profiler to run for a real duration (vs start then more or less stop it) but that shouldn't be the focus of these tests.
Given that, I've removed RealTimeSystem and parameterization.
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
| name = "config", | ||
| srcs = ["config.cc"], | ||
| hdrs = ["config.h"], | ||
| security_posture = "robust_to_untrusted_downstream_and_upstream", |
There was a problem hiding this comment.
Technically this should be "data_plane_agnostic", this extension does not process data plane input.
Added a watchdog extension that triggers profiling. Risk Level: Medium (new extension that is optional) Testing: Unit tests Docs Changes: Included (added a reference to the generated extension proto.rst) Release Notes: Included Fixes envoyproxy#11388 Signed-off-by: Kevin Baichoo <kbaichoo@google.com> Signed-off-by: Clara Andrew-Wani <candrewwani@gmail.com>
Commit Message: Added a watchdog extension that triggers profiling.
Additional Description:
Risk Level: Medium (new extension that is optional)
Testing: Unit tests
Docs Changes: Included (added a reference to the generated extension
proto.rst)Release Notes: Included
Fixes #11388