[Heartbeat][Spike] Stateful Errors #32163

@andrewvc

Description

This takes over from elastic/kibana#135138 (comment). It feels like we're not in a great position WRT using alerts for errors, for all the reasons @dominiqueclarke mentioned in that comment. After a discussion we agreed the following approach probably has the best chance of success. It continues the work I started in #30632 and adapts it.

Overview

At a high level, this approach works by adding fields to heartbeat documents that carry additional state data. A state is defined as a contiguous sequence of events sharing the same up or down status. Contiguous erroring checks with the same error.code value share an error.id value, which identifies a unique error.

Documents would, in addition to their standard fields, contain the following fields at a minimum:

{
  // State ID corresponds to a distinct run of up/down statuses
  "state": {
    "id": "chouerceouhcruoeh",
    "started_at": "2022-01-01T00:00:00Z",
    "duration_ms": 300000, // 5m
    // Total number of pings that succeeded / failed within this state ID
    "up": 0,
    "down": 4
  },
  // Error ID stays constant so long as an error is present and "error.code" doesn't change
  "error": {"id": "8h98rcoehurcouehrc", "code": "COULD_NOT_CONNECT"}
}
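The ID assignment rules above can be sketched as a small in-memory tracker per monitor. This is a minimal illustration, not Heartbeat's actual implementation; the class name and the use of `uuid4` for IDs are assumptions:

```python
import uuid


class StateTracker:
    """Tracks contiguous up/down states and error IDs for one monitor."""

    def __init__(self):
        self.state_id = None
        self.status = None  # "up" or "down"
        self.error_id = None
        self.error_code = None

    def annotate(self, status, error_code=None):
        """Return the state/error fields to attach to a check document."""
        # A new state starts whenever the up/down status flips.
        if status != self.status:
            self.status = status
            self.state_id = uuid.uuid4().hex
        # error.id stays constant so long as error.code doesn't change.
        if error_code != self.error_code:
            self.error_code = error_code
            self.error_id = uuid.uuid4().hex if error_code else None
        fields = {"state": {"id": self.state_id}}
        if error_code:
            fields["error"] = {"id": self.error_id, "code": error_code}
        return fields
```

Two consecutive down checks with the same error.code share both IDs; a recovery starts a fresh state.id and drops the error fields.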

Querying

Querying these fields requires aggregations that find the chronologically latest document for a given state, which carries that state's final ping counts and duration. Something like the following (note the pipeline aggregation) shows an advanced use case:

// compute the total duration of all down states
{
  "size": 0,
  "query": {
    "match": {"state.up": 0} // don't match up states
  },
  "aggs": {
    "state_id": {
      "terms": {"field": "state.id", "size": 10000}, // max bucket count; limits us to 10k states
      "aggs": {
        "duration": {"max": {"field": "state.duration_ms"}} // the max comes from the most recent doc for the state ID
      }
    },
    "total_duration": {
      "sum_bucket": {
        "buckets_path": "state_id>duration"
      }
    }
  }
}
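To make the pipeline aggregation concrete: per state.id bucket the `max` of state.duration_ms comes from the latest document of that state, and `sum_bucket` adds those maxima up. A client-side sketch of the same computation over hypothetical documents:

```python
def total_down_duration(docs):
    """Mirror the terms + max + sum_bucket aggregation client-side."""
    longest = {}  # state.id -> max duration_ms seen so far
    for d in docs:
        if d["state"]["up"] != 0:
            continue  # skip up states, like the match query does
        sid = d["state"]["id"]
        dur = d["state"]["duration_ms"]
        longest[sid] = max(longest.get(sid, 0), dur)
    return sum(longest.values())


# Hypothetical documents: two pings of state s1, one up state s2.
docs = [
    {"state": {"id": "s1", "up": 0, "duration_ms": 60000}},
    {"state": {"id": "s1", "up": 0, "duration_ms": 120000}},  # latest doc wins
    {"state": {"id": "s2", "up": 1, "duration_ms": 30000}},   # up state, excluded
]
```

Here the result is 120000 ms: only s1's latest duration counts, and s2 is filtered out.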
// find all errors and some key fields
{
  "size": 0,
  "query": {
    "match": {"state.up": 0} // don't match up states
  },
  "aggs": {
    "error_id": {
      "terms": {"field": "error.id", "size": 10000}, // max bucket count; limits us to 10k errors
      "aggs": {
        "error": {
          "top_metrics": {
            "metrics": [
              {"field": "error.code"},
              {"field": "error.message"},
              {"field": "error.type"},
              {"field": "state.duration_ms"},
              {"field": "state.down"} // number of down pings
            ],
            "sort": {"@timestamp": "desc"}
          }
        }
      }
    }
  }
}

Requirement for persistent storage

This approach requires heartbeat to always know the last state of a monitor. For long-running heartbeat processes this state can simply be kept in memory. For heartbeat processes running in ephemeral environments it must be kept elsewhere. For the cloud service we could store it in a blob store or database (TBD). We could also re-query elasticsearch to get the last known state, but it would be ideal to avoid that, since it requires additional permissions and has security implications.

We could initially build a version of this that works w/o persistent storage, then break persistence out into a separate issue. I move that it not be part of the spike.
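Whatever backing store we pick (blob store, database, TBD), the shape of the persistence layer is simple: dump the per-monitor state on change, reload it on startup, and fall back to empty state when nothing was persisted. A minimal sketch using a local JSON file as a stand-in for the real store; the file name and helper names are assumptions:

```python
import json
import os

STATE_FILE = "monitor_states.json"  # stand-in for a blob store or database (TBD)


def save_states(states):
    """Persist the per-monitor state map, e.g. on every state transition."""
    # Write to a temp file and rename so a crash mid-write can't corrupt it.
    tmp = STATE_FILE + ".tmp"
    with open(tmp, "w") as f:
        json.dump(states, f)
    os.replace(tmp, STATE_FILE)


def load_states():
    """Recover the last known states on startup."""
    # On a fresh (or ephemeral) host there is nothing persisted yet: start
    # empty, meaning every monitor begins a new state on its first check.
    try:
        with open(STATE_FILE) as f:
            return json.load(f)
    except FileNotFoundError:
        return {}
```

The atomic-rename trick matters here: a half-written state file is worse than none, since it would crash recovery rather than just restarting states.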

Edge triggered fields with persistent storage

If we can get / guarantee reliable persistence across process restarts, we can add fields that are edge triggered on state transitions to ease queries. I've listed those below. With these fields many queries become simple, since state.ending essentially pre-aggregates all values for each closed state.

{
  "state": {
    "id": "chouerceouhcruoeh",
    "started_at": "2022-01-01T00:00:00Z",
    "duration_ms": 300000, // 5m
    // Total number of pings that succeeded / failed within this state ID
    "up": 0,
    "down": 4,
    "ending": {
      "id": "id1", // unique UUID for this state, probably time based to be sortable
      "started_at": "1d ago", // when the old state started
      "ended_at": "now", // on transition it ends at the same time as @timestamp
      "duration_ms": 12345,
      "checks": 4, // number of checks in total comprising that state
      "up": 4, // number of up / down checks; normally one is zero, except in a flapping state
      "down": 0
    },
    "starting": {
      "id": "id2",
      "started_at": "now",
      "up": 0,
      "down": 1
    }
  }
}
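Assembling these edge-triggered fields on the first check of a new state could look roughly like the following. This is a sketch under assumptions: epoch-millisecond timestamps instead of the date strings above, and a hypothetical helper name:

```python
def transition_fields(ending_state, new_state_id, now_ms, new_status):
    """Build the state.ending / state.starting fields emitted on the
    first document of a new state.

    ending_state holds the closed state's accumulated data, kept in
    memory or persistent storage: {"id", "started_at_ms", "up", "down"}.
    """
    return {
        "ending": {
            "id": ending_state["id"],
            "started_at": ending_state["started_at_ms"],
            # On transition the old state ends at the same time as @timestamp.
            "ended_at": now_ms,
            "duration_ms": now_ms - ending_state["started_at_ms"],
            "checks": ending_state["up"] + ending_state["down"],
            "up": ending_state["up"],
            "down": ending_state["down"],
        },
        "starting": {
            "id": new_state_id,
            "started_at": now_ms,
            # The check that triggered the transition is the new state's first.
            "up": 1 if new_status == "up" else 0,
            "down": 1 if new_status == "down" else 0,
        },
    }
```

Since state.ending carries the closed state's totals, a query for "duration of all down states" collapses to a plain sum over state.ending.duration_ms with no pipeline aggregation.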

Spike scope

The initial spike should focus on generating the right documents while storing state in memory. A follow-up should figure out the best way to persist / recover state.
