[watchdog] Increase frequency of thread still-alive reports #11391

@antoniovicente

Description

The current watchdog implementation, based on a periodic timer that can run at most once per libevent loop iteration, is effective at detecting infinite loops and problematic long-running computations, but it can produce spurious kill events under CPU overload if the kill threshold is configured too aggressively. Using a longer kill threshold is an effective workaround, but it increases time to recovery after infinite loops and reduces the watchdog's ability to detect undesirable long pauses in request processing.

More frequent petting, together with mechanisms to dump the stack on miss/megamiss (#11388), would improve our ability to detect smaller undesirable pauses and let us safely reduce miss/megamiss/kill times, while making sure that the proxy keeps running in cases where it is able to make progress through the list of outstanding events and do significant useful work.

Changes needed to implement:

  • Reduce watchdog pet cost by replacing the use of the current time in WatchDogImpl::lastTouchTime() with an atomic counter.
  • Update the watchdog's last-alive time in the guarddog if the touch counter has changed since it was last checked.
  • Touch the guarddog before each fd callback and timer callback, and from the pre/post libevent loop callbacks.
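
The first two items could look roughly like the sketch below. It is only an illustration of the counter idea, not the actual Envoy code: `SketchWatchDog`, `SketchGuardDog`, `touchCount()` and `check()` are hypothetical names, and the real classes have more state and locking than shown here. The point is that the worker thread only pays for a relaxed atomic increment, while the guarddog thread is the only one that reads the clock.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>

// Hypothetical stand-in for the per-thread watchdog. touch() is meant to be
// cheap enough to call before every fd/timer callback: it does not read the
// clock, it only bumps a counter.
class SketchWatchDog {
public:
  void touch() { touch_count_.fetch_add(1, std::memory_order_relaxed); }
  uint64_t touchCount() const { return touch_count_.load(std::memory_order_relaxed); }

private:
  std::atomic<uint64_t> touch_count_{0};
};

// Hypothetical stand-in for the guarddog's bookkeeping for one watched thread.
class SketchGuardDog {
public:
  using TimePoint = std::chrono::steady_clock::time_point;

  SketchGuardDog(const SketchWatchDog& dog, TimePoint now)
      : dog_(dog), last_seen_count_(dog.touchCount()), last_alive_(now) {}

  // Called periodically on the guarddog thread. If the counter moved since
  // the previous check, the thread is considered alive as of `now`. Returns
  // how long it has been since the thread was last observed alive, which the
  // caller would compare against the miss/megamiss/kill thresholds.
  std::chrono::steady_clock::duration check(TimePoint now) {
    const uint64_t count = dog_.touchCount();
    if (count != last_seen_count_) {
      last_seen_count_ = count;
      last_alive_ = now;
    }
    return now - last_alive_;
  }

private:
  const SketchWatchDog& dog_;
  uint64_t last_seen_count_;
  TimePoint last_alive_;
};
```

One consequence of this design is that the last-alive time is only as precise as the guarddog's check interval, since a touch is timestamped when the guarddog notices it rather than when it happened.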
