[watchdog] Increase frequency of thread still-alive reports

The current watchdog implementation is based on a periodic timer that can run at most once per libevent loop iteration. It is effective at detecting infinite loops and problematic long-running computations, but it can produce spurious kill events under CPU overload if the kill threshold is configured too aggressively. A longer kill threshold is an effective workaround, but it increases time to recovery after infinite loops and reduces the watchdog's ability to detect undesirable long pauses in request processing.

More frequent petting, together with mechanisms to dump stack on miss/megamiss (https://github.com/envoyproxy/envoy/issues/11388), would improve our ability to detect smaller undesirable pauses and safely reduce miss/megamiss/kill times, while making sure that the proxy keeps running in cases where it is able to make progress through the list of outstanding events and do significant useful work.

Changes needed to implement:
- Reduce watchdog pet cost by replacing the use of current time in WatchDogImpl::lastTouchTime() with an atomic counter.
- Update the watchdog last alive time at the guarddog if the touch counter has changed since it was last checked.
- Touch the guarddog before each fd callback and timer callback, and from the pre/post libevent loop callbacks.
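
The first two changes above could look roughly like the sketch below. It is only an illustration: `SketchWatchDog`, `SketchGuardDogEntry`, `touchCount()` and `check()` are hypothetical names invented for this example, not Envoy's actual classes or signatures. The idea is that the worker thread only bumps an atomic counter when it pets the watchdog, and the guarddog thread converts "counter changed since last check" into a last-alive timestamp on its own schedule.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>

using MonotonicTime = std::chrono::steady_clock::time_point;

// Hypothetical stand-in for the per-thread watchdog (not Envoy's real class).
class SketchWatchDog {
public:
  // Called on the watched thread. A relaxed increment is all that is needed
  // here, so a pet never has to read the clock.
  void touch() { touch_count_.fetch_add(1, std::memory_order_relaxed); }

  // Read by the guarddog thread.
  uint64_t touchCount() const { return touch_count_.load(std::memory_order_relaxed); }

private:
  std::atomic<uint64_t> touch_count_{0};
};

// Hypothetical guarddog-side bookkeeping for one watched thread.
class SketchGuardDogEntry {
public:
  SketchGuardDogEntry(const SketchWatchDog& dog, MonotonicTime now)
      : dog_(dog), last_seen_count_(dog.touchCount()), last_alive_(now) {}

  // Called periodically on the guarddog thread with the current time.
  // If the counter moved since the previous check, the thread is considered
  // alive as of `now`. Returns how long ago the thread was last seen alive.
  std::chrono::milliseconds check(MonotonicTime now) {
    assert(now >= last_alive_); // monotonic time must not go backwards
    const uint64_t count = dog_.touchCount();
    if (count != last_seen_count_) {
      last_seen_count_ = count;
      last_alive_ = now;
    }
    return std::chrono::duration_cast<std::chrono::milliseconds>(now - last_alive_);
  }

private:
  const SketchWatchDog& dog_;
  uint64_t last_seen_count_;
  MonotonicTime last_alive_;
};
```

The value returned by `check()` would then be compared against the miss/megamiss/kill thresholds. One consequence of this design is that the last-alive time is only as precise as the guarddog's check interval, since the timestamp is taken when the change is observed rather than when the pet happened.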

[watchdog] Increase frequency of thread still-alive reports #11391