Skip to content

Watchdog Extension: Profile Action#12636

Merged
htuch merged 15 commits intoenvoyproxy:masterfrom
KBaichoo:backup-wd-extension-profiler
Aug 27, 2020
Merged

Watchdog Extension: Profile Action#12636
htuch merged 15 commits intoenvoyproxy:masterfrom
KBaichoo:backup-wd-extension-profiler

Conversation

@KBaichoo
Copy link
Copy Markdown
Contributor

@KBaichoo KBaichoo commented Aug 13, 2020

Commit Message: Added a watchdog extension that triggers profiling.
Additional Description:
Risk Level: Medium (new extension that is optional)
Testing: Unit tests
Docs Changes: Included (added a reference to the generated extension proto.rst)
Release Notes: Included
Fixes #11388

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
@repokitteh-read-only
Copy link
Copy Markdown

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to api/envoy/.
CC @envoyproxy/api-watchers: FYI only for changes made to api/envoy/.

🐱

Caused by: #12636 was opened by KBaichoo.

see: more, trace.

@KBaichoo
Copy link
Copy Markdown
Contributor Author

/assign @akonradi PTAL Alex

@repokitteh-read-only
Copy link
Copy Markdown

neither of PTAL, Alex can be assigned to this issue.

🐱

Caused by: a #12636 (comment) was created by @KBaichoo.

see: more, trace.

@KBaichoo
Copy link
Copy Markdown
Contributor Author

/assign @akonradi

PTAL Alex

@KBaichoo
Copy link
Copy Markdown
Contributor Author

/retest

@repokitteh-read-only
Copy link
Copy Markdown

Retrying Azure Pipelines, to retry CircleCI checks, use /retest-circle.
Cannot retry non-completed check: envoy-presubmit (Linux-x64 compile_time_options), please wait.

🐱

Caused by: a #12636 (comment) was created by @KBaichoo.

see: more, trace.

@KBaichoo
Copy link
Copy Markdown
Contributor Author

/retest-circle

@repokitteh-read-only
Copy link
Copy Markdown

🔨 rebuilding ci/circleci: docs (failed build)

🐱

Caused by: a #12636 (comment) was created by @KBaichoo.

see: more, trace.

string profile_path = 2 [(validate.rules).string = {min_bytes: 1}];

// Limits the max number of profiles that will be written to disk by a given
// thread to avoid filling disks. If not set (i.e. it's 0), a default of 10
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you elaborate on this? E.g. is this a thread lifetime maximum, or does it apply each time it is triggered?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a thread lifetime maximum. I updated the comment to clarify the intent.

// Limits the max number of profiles that will be written to disk by a given
// thread to avoid filling disks. If not set (i.e. it's 0), a default of 10
// will be used.
uint64 max_profiles_per_thread = 3;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possibly you might want to use UInt64Value, but only if there will ever be a possible sensible setting for 0 here.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think since this is an optional extension, if you're including it that is at least some intent to use it.

Otherwise it'd just be an unnecessary no-op.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the intent here was max_profiles to collect per process, not max per stuck thread :-/

Copy link
Copy Markdown
Contributor

@akonradi akonradi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Flushing comments before moving on to tests.

envoy::extensions::watchdog::profile_action::v3alpha::ProfileActionConfig& config,
Server::Configuration::GuardDogActionFactoryContext& context)
: path_(config.profile_path()), running_profile_(false),
max_profiles_per_tid_(config.max_profiles_per_thread() ? config.max_profiles_per_thread()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: the implicit conversion from uint64 to bool is correct, but can be error prone. Prefer to check config.max_profiles_per_thread() == 0 and let the compiler optimize.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Server::Configuration::GuardDogActionFactoryContext& context);

void run(envoy::config::bootstrap::v3::Watchdog::WatchdogAction::WatchdogEvent event,
std::vector<std::pair<Thread::ThreadId, MonotonicTime>> thread_ltt_pairs,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see this method is from the GuardDogAction interface, but asking here since you authored that too 😄 : why is this std::vector passed by copy instead of by const reference?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, I do think const ref might be a better way to go about this.


auto& fs = context_.api_.fileSystem();
if (!fs.directoryExists(path_)) {
ENVOY_LOG_MISC(error, "Directory path {} doesn't exists.", path_);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: s/exists/exist

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

tid_to_profile_count_[*trigger_tid] += 1;

// Schedule callback to stop
timer_cb_ = context_.dispatcher_.createTimer([this, profile_filename] {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Creating timers is expensive. Can we create the timer once and change the profile_filename that it references via a member variable?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good idea. Done.

if (Profiler::Cpu::profilerEnabled()) {
Profiler::Cpu::stopProfiler();
running_profile_ = false;
timer_cb_->disableTimer();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this call. It looks like the timer is only enabled here, and timers are always one-shot, so the timer should be disabled after this callback runs.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch.

absl::optional<Thread::ThreadId> tid_to_profile;

// Find a TID not over the max_profiles threshold
for (const auto& tid_time_pair : thread_ltt_pairs) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Envoy is using C++17 now, so we can use tuple destructuring here :)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Copy link
Copy Markdown
Contributor

@akonradi akonradi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes! Just a few more nits on my end

Comment on lines +119 to +125
TestUtility::loadFromJson(absl::Substitute(R"EOF({
"profile_duration": "1s",
"profile_path": "$0",
}
)EOF",
test_path_),
config);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can probably be more succinctly spelled out by just calling the setters on the config object.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines +105 to +106
tid_to_profile = tid;
break;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't this just be return tid; and return absl::nullopt; at the bottom? That lets you get rid of the extra variable.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines +151 to +170
// Check we can do multiple profiles
dispatcher_->post([&tid_ltt_pairs, &now, this]() -> void {
action_->run(envoy::config::bootstrap::v3::Watchdog::WatchdogAction::MISS, tid_ltt_pairs, now);
absl::MutexLock lock(&mutex_);
outstanding_notifies_ += 1;
});

waitForOutstandingNotify();
time_system_->advanceTimeWait(std::chrono::seconds(2));

dispatcher_->exit();
thread_->join();

#ifdef PROFILER_AVAILABLE
EXPECT_EQ(countNumberOfProfileInPath(test_path_), 2);
#else
// Profiler won't run in this case, so there should be no files generated.
EXPECT_EQ(countNumberOfProfileInPath(test_path_), 0);
#endif
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a separate behavior than the single-profile case. Could/should this be a separate test? If not, should the EXPECT_EQ above be asserts?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will split out the single profile case from this.

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
…on-profiler

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
@KBaichoo KBaichoo requested a review from yanavlasov August 18, 2020 14:43
@KBaichoo
Copy link
Copy Markdown
Contributor Author

cc @antoniovicente

@KBaichoo
Copy link
Copy Markdown
Contributor Author

ping @akonradi

Copy link
Copy Markdown
Contributor

@akonradi akonradi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just nits now

Comment on lines +81 to +85
if (GetParam() == TimeSystemType::Real) {
return std::make_unique<Event::GlobalTimeSystem>();
}
ASSERT(GetParam() == TimeSystemType::Simulated);
return std::make_unique<Event::SimulatedTimeSystem>();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using a switch statement will get you the same safety properties but at compile time instead of run time.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

}

void
setupAction(envoy::extensions::watchdog::profile_action::v3alpha::ProfileActionConfig& config) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider inlining this. It will make your tests a little more verbose but since they often access action_ directly, it's helpful to see that being set instead of hiding it behind a method.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. It more the test body is more complete this way, and this wasn't really hiding much.

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
@KBaichoo
Copy link
Copy Markdown
Contributor Author

@lizan since you're the on-call maintainer, can you dispatch this who the right person is to review this PR? Thanks!

…to the constructor.

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
…efactoring the timer creation to the ctor.

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
yanavlasov
yanavlasov previously approved these changes Aug 22, 2020
@KBaichoo
Copy link
Copy Markdown
Contributor Author

PTAL @envoyproxy/api-shepherds

Copy link
Copy Markdown
Member

@htuch htuch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm api

@htuch
Copy link
Copy Markdown
Member

htuch commented Aug 24, 2020

Need a master merge @KBaichoo

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
@KBaichoo
Copy link
Copy Markdown
Contributor Author

PTAL @envoyproxy/senior-maintainers

Copy link
Copy Markdown
Member

@zuercher zuercher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Generally looks good. I had one question for the other maintainers about the well-known names class, but otherwise small stuff.

}

/**
* Static registration for the fixed heap resource monitor factory. @see RegistryFactory.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: comment refers to the wrong kind of extension.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Done.

// Check if there's a tid that justifies profiling
auto trigger_tid = getTidTriggeringProfile(thread_ltt_pairs);
if (!trigger_tid.has_value()) {
ENVOY_LOG_MISC(warn, "None of the provide tids justify profiling");
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: provided

Also, since you're using the misc logger, perhaps this string should contain "profile action" to help someone reading the logs understand what's being warned.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prefixed the log with "Profile Action:" to help identify it.


auto& fs = context_.api_.fileSystem();
if (!fs.directoryExists(path_)) {
ENVOY_LOG_MISC(error, "Directory path {} doesn't exist.", path_);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, identify "profile action".

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

* Well-known watchdog action names.
* NOTE: New filters should use the well known name: envoy.watchdog.name
*/
class WatchdogActionNameValues {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was under the impression we were doing away with these well-known names classes for extensions?

cc @envoyproxy/maintainers

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah please remove this and just inline the strings. We are trying to make extensions completely self contained.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to know for future extensions. Done.

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Copy link
Copy Markdown
Contributor Author

@KBaichoo KBaichoo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review @zuercher.

}

/**
* Static registration for the fixed heap resource monitor factory. @see RegistryFactory.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Done.

// Check if there's a tid that justifies profiling
auto trigger_tid = getTidTriggeringProfile(thread_ltt_pairs);
if (!trigger_tid.has_value()) {
ENVOY_LOG_MISC(warn, "None of the provide tids justify profiling");
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prefixed the log with "Profile Action:" to help identify it.


auto& fs = context_.api_.fileSystem();
if (!fs.directoryExists(path_)) {
ENVOY_LOG_MISC(error, "Directory path {} doesn't exist.", path_);
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
// We want to make sure the action is tested with both simulated time and real
// time, to ensure that it works in production, and that it works in the context
// of integration tests which are much easier to control with simulated time.
enum class TimeSystemType { Real, Simulated };
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you provide more details on why this is necessary? I think generally we want tests to use SimulatedTimeSystem to make them deterministic. It seems like this extension creates timers and uses the a TimeSystem to get timestamps for filenames, so it seems to me that using the simulated time should be sufficient for testing.

Furthermore, the docs in test/README.md and test/test_common/test_time.cc, state that GlobalTimeSystem (as used in makeTimeSystem when TimeSystemType == Real) is slated to switch to always creating a SimulatedTimeSystem. At that point, this parameterization will be no longer change the test's behavior.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great point, and good to know where moving towards just testing with just SimulatedTimeSystem.

The only benefit likely that we'd get from RealTimeSystem would be to have the profiler to run for a real duration (vs start then more or less stop it) but that shouldn't be the focus of these tests.

Given that, I've removed RealTimeSystem and parameterization.

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Copy link
Copy Markdown
Member

@zuercher zuercher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@htuch htuch merged commit c88515f into envoyproxy:master Aug 27, 2020
name = "config",
srcs = ["config.cc"],
hdrs = ["config.h"],
security_posture = "robust_to_untrusted_downstream_and_upstream",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically this should be "data_plane_agnostic", this extension does not process data plane input.

clarakosi pushed a commit to clarakosi/envoy that referenced this pull request Sep 3, 2020
Added a watchdog extension that triggers profiling.

Risk Level: Medium (new extension that is optional)
Testing: Unit tests
Docs Changes: Included (added a reference to the generated extension proto.rst)
Release Notes: Included

Fixes envoyproxy#11388

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Signed-off-by: Clara Andrew-Wani <candrewwani@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[watchdog] Provide additional watchdog actions and/or extension points

7 participants