[watchdog] Provide additional watchdog actions and/or extension points #11388

@antoniovicente

The thread watchdog is already an important mechanism for detecting and recovering from coding errors that result in infinite loops, blocking API calls, and very long computations in worker threads. A few simple improvements would make the watchdog even more awesome:

  • Option to capture a 5- to 10-second CPU profile after a series of watchdog misses or mega misses, and either write it to disk or make it available via the admin interface. If writing to disk, provide a parameter for the maximum number of profiles to generate, to avoid filling up the disk.
  • Option to capture and log the current stack of the watched thread, or all thread stacks, on a mega miss.
  • Option to terminate the process by sending SIGABRT to the stuck thread instead of calling PANIC on the guarddog thread.
  • Registration mechanism for additional callbacks to invoke on a watchdog miss or mega miss, which could be used to implement some of the prior ideas and/or integrate with third-party systems. Callback arguments may include the list of threads that have experienced recent mega miss events and information about when they were last reported alive.
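
To make the last bullet more concrete, here is a rough sketch of what such a registration mechanism might look like. It is written in Python only for brevity and is not the project's code. Every name in it (`WatchdogEvent`, `StuckThread`, `WatchdogActionRegistry`, `CappedAction`) is hypothetical. It assumes callbacks receive the event type plus a list of stuck threads with their last-alive timestamps, and it shows one possible way to cap how often an action runs, in the spirit of the "max number of profiles" parameter.

```python
import logging
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List


class WatchdogEvent(Enum):
    """Hypothetical event types an action can be registered for."""
    MISS = "miss"
    MEGA_MISS = "mega_miss"


@dataclass(frozen=True)
class StuckThread:
    """Hypothetical callback argument: one thread that missed its deadline."""
    thread_id: int
    last_alive: float  # timestamp of the thread's last liveness report


WatchdogCallback = Callable[[WatchdogEvent, List[StuckThread]], None]


class WatchdogActionRegistry:
    """Holds callbacks per event type and invokes them when the event fires.

    Not thread-safe; a real watchdog would need to guard registration and firing.
    """

    def __init__(self) -> None:
        self._callbacks: Dict[WatchdogEvent, List[WatchdogCallback]] = {
            event: [] for event in WatchdogEvent
        }

    def register(self, event: WatchdogEvent, callback: WatchdogCallback) -> None:
        self._callbacks[event].append(callback)

    def fire(self, event: WatchdogEvent, threads: List[StuckThread]) -> None:
        for callback in self._callbacks[event]:
            try:
                callback(event, threads)
            except Exception:
                # One failing action should not prevent the others from running.
                logging.exception("watchdog action failed")


class CappedAction:
    """Wraps an action so it runs at most max_runs times.

    This mirrors the idea of bounding the number of profiles written to disk.
    """

    def __init__(self, action: WatchdogCallback, max_runs: int) -> None:
        self._action = action
        self._remaining = max_runs

    def __call__(self, event: WatchdogEvent, threads: List[StuckThread]) -> None:
        if self._remaining <= 0:
            return  # cap reached; skip silently
        self._remaining -= 1
        self._action(event, threads)
```

A profile-capturing or stack-logging action would then be registered with something like `registry.register(WatchdogEvent.MEGA_MISS, CappedAction(capture_profile, max_runs=5))`, where `capture_profile` is a placeholder for whatever the action actually does.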
