Skip to content

watchdog: Touch Watchdog before executing most event loop callbacks to avoid spurious misses when there are many queued callbacks.#13104

Merged
antoniovicente merged 13 commits intoenvoyproxy:masterfrom
antoniovicente:pet_watchdog
Nov 4, 2020
Merged

Conversation

@antoniovicente
Copy link
Copy Markdown
Contributor

@antoniovicente antoniovicente commented Sep 15, 2020

Commit Message: watchdog: Touch Watchdog before executing most event loop callbacks to avoid spurious misses when there are many queued callbacks.
Additional Description:
This change assumes that the primary purposes of the watchdog is to terminate the proxy if worker threads are deadlocked, and aid in the debugging of long-running operations that end up scheduled in worker threads through monitoring for miss/megamiss events and auxiliary watchdog actions like the recently added watchdog profile action.
Risk Level: medium. more frequent calls to the watchdog could have visible performance implications; mitigated by #13103
Testing: unit
Docs Changes: n/a
Release Notes: added
Runtime guard: n/a
Fixes: #11391

… avoid spurious misses when there are many queued callbacks.

Signed-off-by: Antonio Vicente <avd@google.com>
Signed-off-by: Antonio Vicente <avd@google.com>
… problems when execution of the original callback ends up deleting the wrapped callback. The original implementation triggered segfaults when compiling with gcc-10.

Signed-off-by: Antonio Vicente <avd@google.com>
@antoniovicente antoniovicente changed the title watchdog: Touch Watchdog after executing most event loop callbacks to avoid spurious misses when there are many queued callbacks. watchdog: Touch Watchdog before executing most event loop callbacks to avoid spurious misses when there are many queued callbacks. Sep 15, 2020
Signed-off-by: Antonio Vicente <avd@google.com>
Signed-off-by: Antonio Vicente <avd@google.com>
Copy link
Copy Markdown
Contributor

@KBaichoo KBaichoo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, thanks for implementing this! It will make the watchdog system more useful.

Curious if have any data on the performance impact

@antoniovicente
Copy link
Copy Markdown
Contributor Author

Nice, thanks for implementing this! It will make the watchdog system more useful.

Curious if have any data on the performance impact

I'm not fully up to speed on available OSS performance tests, so no - I don't know about the full performance implications of this change with or without #13103

Copy link
Copy Markdown
Contributor

@KBaichoo KBaichoo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm besides a few nits

I'll defer to senior maintainers regarding any performance tests we have / if we want to measure the impact of this

Signed-off-by: Antonio Vicente <avd@google.com>
Signed-off-by: Antonio Vicente <avd@google.com>
@stale
Copy link
Copy Markdown

stale bot commented Oct 4, 2020

This pull request has been automatically marked as stale because it has not had activity in the last 7 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@stale stale bot added the stale stalebot believes this issue/PR has not been touched recently label Oct 4, 2020
@antoniovicente
Copy link
Copy Markdown
Contributor Author

Back from vacation, waiting until #13103 is merged in.

@stale stale bot removed the stale stalebot believes this issue/PR has not been touched recently label Oct 12, 2020
Signed-off-by: Antonio Vicente <avd@google.com>
Signed-off-by: Antonio Vicente <avd@google.com>
Signed-off-by: Antonio Vicente <avd@google.com>
KBaichoo
KBaichoo previously approved these changes Oct 26, 2020
Copy link
Copy Markdown
Contributor

@KBaichoo KBaichoo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Signed-off-by: Antonio Vicente <avd@google.com>
@KBaichoo
Copy link
Copy Markdown
Contributor

/retest

@repokitteh-read-only
Copy link
Copy Markdown

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #13104 (comment) was created by @KBaichoo.

see: more, trace.

KBaichoo
KBaichoo previously approved these changes Oct 28, 2020
Signed-off-by: Antonio Vicente <avd@google.com>
Signed-off-by: Antonio Vicente <avd@google.com>
@antoniovicente
Copy link
Copy Markdown
Contributor Author

@envoyproxy/senior-maintainers for final review.

Copy link
Copy Markdown
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this.

qq: did we consider just doing the touch as part of the prepare callback that runs before each event loop iteration? Would that be good enough as opposed to touching before every individual event? I know you made touching faster but there is still cost so just wondering if we can get most of the benefit by just doing it in the prepare callback (or maybe both pre-post callbacks for each iteration)?

/wait-any

@antoniovicente
Copy link
Copy Markdown
Contributor Author

Thanks for working on this.

qq: did we consider just doing the touch as part of the prepare callback that runs before each event loop iteration? Would that be good enough as opposed to touching before every individual event? I know you made touching faster but there is still cost so just wondering if we can get most of the benefit by just doing it in the prepare callback (or maybe both pre-post callbacks for each iteration)?

/wait-any

I don't think touching on prepare is enough.

@mattklein123
Copy link
Copy Markdown
Member

I don't think touching on prepare is enough.

OK fair enough!

@antoniovicente antoniovicente merged commit 7be58c8 into envoyproxy:master Nov 4, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[watchdog] Increase frequency of thread still-alive reports

3 participants