[watchdog] Provide additional watchdog actions and/or extension points

The thread watchdog is already an important mechanism for detecting and recovering from coding errors that result in infinite loops, blocking API calls, and very long computations in worker threads. A few simple improvements would make the watchdog even more awesome:
- Option to capture a 5 to 10 second CPU profile after a series of watchdog misses or mega misses, and either write it to disk or make it available via the admin interface. If writing to disk, provide a parameter for the maximum number of profiles to generate, to avoid filling up the disk.
- Option to capture and log the current stack of the watched thread or all thread stacks on mega miss.
- Option to terminate the process by sending SIGABRT to the stuck thread instead of calling PANIC on the guarddog thread.
- Registration mechanism for additional callbacks to invoke on watchdog miss or mega miss, which could be used to implement some of the prior ideas and/or integrate with third-party systems. Callback arguments may include the list of threads that have experienced recent mega miss events and info about when they were last reported alive.
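
To make the last item more concrete, here is a rough sketch of what a callback registration mechanism could look like. Every name in it (`WatchdogEvent`, `StuckThreadInfo`, `WatchdogCallbackRegistry`, `registerCallback`, `notify`) is hypothetical and not an existing API; it also assumes that registration and notification both happen on a single thread, so no locking is shown.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical names throughout; this is an illustration, not an existing API.
enum class WatchdogEvent { Miss, MegaMiss };

// Information handed to callbacks about each thread that missed its deadline.
struct StuckThreadInfo {
  std::string thread_name;
  std::chrono::steady_clock::time_point last_alive;
};

using WatchdogCallback =
    std::function<void(WatchdogEvent, const std::vector<StuckThreadInfo>&)>;

class WatchdogCallbackRegistry {
public:
  // Registers a callback for one event type. Assumed to be called from the
  // same thread that later calls notify(), so there is no synchronization.
  void registerCallback(WatchdogEvent event, WatchdogCallback cb) {
    assert(cb != nullptr);
    callbacks_.emplace_back(event, std::move(cb));
  }

  // Invokes every callback registered for `event`, in registration order,
  // and returns how many callbacks were invoked.
  std::size_t notify(WatchdogEvent event,
                     const std::vector<StuckThreadInfo>& threads) const {
    std::size_t invoked = 0;
    for (const auto& entry : callbacks_) {
      if (entry.first == event) {
        entry.second(event, threads);
        ++invoked;
      }
    }
    return invoked;
  }

private:
  std::vector<std::pair<WatchdogEvent, WatchdogCallback>> callbacks_;
};
```

With something along these lines, the profile capture, stack logging, and SIGABRT options could each be written as a callback registered for `WatchdogEvent::MegaMiss`, and third-party integrations could register their own without changes to the watchdog itself.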

[watchdog] Provide additional watchdog actions and/or extension points #11388

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[watchdog] Provide additional watchdog actions and/or extension points #11388

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions