Relates to #98902 (comment)
We can see health metrics, but not necessarily when we need to see them (i.e., when a problem occurs). This is especially true when the issue is intermittent and goes unnoticed at first.
To combat this, we have a couple of options:
1. Persist health metrics over time so we can query for metrics over specific time periods
This option involves persisting, at a regular interval, the results of the task manager health API to an index, which can then be queried with a range filter to determine the metrics at the time a problem occurred. After some initial thinking, two solutions seem the most obvious:
- Create/manage our own index and persist the data there (as we do for the event log)
- Integrate with the Stack Monitoring / Kibana monitoring indices
The first option is ideal as it gives us complete control over the index, including how often we index.
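A rough sketch of what option 1 could look like. Everything here is illustrative, not a decided design: the `HealthStats` shape is a simplification of the real health API response, and the timestamp field and query shape just follow common Elasticsearch conventions.

```typescript
// Sketch of option 1: snapshot the health API output into a timestamped
// document on a regular interval, then pull snapshots back with a range
// filter when investigating a past problem.

interface HealthStats {
  status: string;
  stats: Record<string, unknown>;
}

// Wrap one health API response in a timestamped document for indexing.
function toHealthDocument(stats: HealthStats, now: Date = new Date()) {
  return { '@timestamp': now.toISOString(), ...stats };
}

// Build the range-filtered query used to retrieve the metrics around the
// time a problem was reported.
function healthRangeQuery(from: string, to: string) {
  return {
    query: {
      bool: {
        filter: [{ range: { '@timestamp': { gte: from, lte: to } } }],
      },
    },
    sort: [{ '@timestamp': 'asc' }],
  };
}

const doc = toHealthDocument(
  { status: 'OK', stats: { runtime: {} } },
  new Date('2021-06-01T12:00:00Z')
);
console.log(doc['@timestamp']); // 2021-06-01T12:00:00.000Z

const query = healthRangeQuery('2021-06-01T00:00:00Z', '2021-06-02T00:00:00Z');
console.log(JSON.stringify(query));
```

Owning the index (the first bullet) means we also own the mapping for `@timestamp` and the snapshot cadence, which is what makes the range query above reliable.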
2. Log health metrics when we detect a problem (the current task manager health API returns buckets of data, and each bucket includes a "status" field) so users can go back and see what was logged when they experienced the issue
This option involves the task manager detecting that it is in a problem state and writing to the event log or the Kibana server log. This gives us the necessary insight, but it relies on the task manager correctly self-reporting its status so that the right logs are produced.
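Option 2 could look roughly like the following. The bucket layout and the `status` values are assumptions based on the per-bucket "status" field the health API exposes; the real response is richer than this.

```typescript
// Sketch of option 2: scan the health API output for buckets whose
// "status" is not OK, so the caller can log them. The bucket shape is an
// assumed simplification of the real health API response.

interface HealthBucket {
  status: 'OK' | 'warn' | 'error';
  [key: string]: unknown;
}

type HealthBuckets = Record<string, HealthBucket>;

// Return the names of buckets that self-report a problem, so the caller
// can write them to the event log or the Kibana server log.
function findProblemBuckets(buckets: HealthBuckets): string[] {
  return Object.entries(buckets)
    .filter(([, bucket]) => bucket.status !== 'OK')
    .map(([name]) => name);
}

const problems = findProblemBuckets({
  configuration: { status: 'OK' },
  runtime: { status: 'warn', drift: { p99: 12000 } },
  workload: { status: 'error' },
});
console.log(problems); // ['runtime', 'workload']
```

The weak point called out above shows up directly here: if a bucket misreports `status: 'OK'` while actually unhealthy, nothing gets logged.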