[Alerting] formalize alert status and add status fields to alert saved object by pmuellr · Pull Request #75553 · elastic/kibana

pmuellr · 2020-08-20T13:28:56Z

resolves #51099

Summary

This formalizes the concept of "alert status", in terms of its execution, with
some new fields in the alert saved object and types used with the alert client
and http APIs.

These fields are read-only from the client point-of-view; they are provided in
the alert structures, but are only updated by the alerting framework itself.
The values will be updated after each run of the alert type executor.
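
As a rough illustration of "updated after each run of the alert type executor", the sketch below derives a status from an executor outcome. The `AlertExecutionStatus` shape follows the one given in the merge commit message later in this thread; the `ExecutorOutcome` type, the `statusFromOutcome` name, and the rule that "active" means at least one active instance are assumptions made for the example, not the framework's actual code.

```ts
type AlertExecutionStatuses = 'ok' | 'active' | 'error' | 'pending' | 'unknown';
type ErrorReason = 'read' | 'decrypt' | 'execute' | 'unknown';

interface AlertExecutionStatus {
  status: AlertExecutionStatuses;
  lastExecutionDate: Date;
  error?: { reason: ErrorReason; message: string };
}

// Hypothetical summary of one executor run (not a Kibana type).
type ExecutorOutcome =
  | { ok: true; activeInstanceCount: number }
  | { ok: false; reason: ErrorReason; message: string };

// Assumed mapping: failures become 'error'; otherwise 'active' when any
// instance is active and 'ok' when none are.
function statusFromOutcome(outcome: ExecutorOutcome, now: Date = new Date()): AlertExecutionStatus {
  if (!outcome.ok) {
    return {
      status: 'error',
      lastExecutionDate: now,
      error: { reason: outcome.reason, message: outcome.message },
    };
  }
  return {
    status: outcome.activeInstanceCount > 0 ? 'active' : 'ok',
    lastExecutionDate: now,
  };
}
```

Clients would only ever read the resulting object; in this model nothing outside the framework calls `statusFromOutcome`.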

Checklist

Delete any items that are not applicable to this PR.

Any text added follows EUI's writing guidelines, uses sentence case text and includes i18n support
Documentation was added for features that require explanation or tutorials
Unit or functional tests were updated or added to match the most common scenarios
- test to ensure we haven't broken AAD when the status gets updated in the alert
- test to ensure that when the privileges of an alert's owner change so they can no longer write to alerts, any alert they created/updated still has its status updated (assuming there are cases where the alert would still be running "successfully")

For maintainers

This was checked for breaking API changes and was labeled appropriately

TODO

jest tests for all status values and status error reason values
FT for all status values and status error reason values we can test
AAD tests

pmuellr · 2020-09-08T13:46:54Z

I'm splitting out the alertClients occ/version changes into a separate PR, since it's not directly related to this code, and something that should be looked at independently - #76830 - this PR is now blocked on that PR

During development of elastic#75553, some issues came up with the optimistic concurrency control (OCC) we were using internally within the alertsClient, via the `version` option/property of the saved object. The referenced PR updates new fields in the alert from the taskManager task after the alertType executor runs. In some alertsClient methods, which are invoked via user requests, OCC is used to update the alert. And so in some cases, version conflict errors were coming up when the alert was updated by task manager, in the middle of one of these methods. Note: the SIEM function test cases stress test this REALLY well. In this PR, we remove OCC from methods that were currently using it, namely `update()`, `updateApiKey()`, `enable()`, `disable()`, and the `[un]mute[All,Instance]()` methods. Of these methods, OCC is really only _practically_ needed by `update()`, but even for that we don't support OCC in the API, yet; see issue elastic#74381. For cases where we know only attributes not contributing to AAD are being updated, a new function is provided that does a partial update on just those attributes, making partial updates for those attributes a bit safer. That will be used by PR elastic#75553.

During development of elastic#75553, some issues came up with the optimistic concurrency control (OCC) we were using internally within the alertsClient, via the `version` option/property of the saved object. The referenced PR updates new fields in the alert from the taskManager task after the alertType executor runs. In some alertsClient methods, which are invoked via user requests, OCC is used to update the alert. And so in some cases, version conflict errors were coming up when the alert was updated by task manager, in the middle of one of these methods. Note: the SIEM function test cases stress test this REALLY well. In this PR, we wrap all the methods using OCC with a function that will retry them, a small number of times, with a short delay in between. If the original method STILL has a conflict error, it will get thrown after the retry limit. In practice, this eliminated the version conflict errors that were occurring with the SIEM tests, once we started updating the saved object in the executor. For cases where we know only attributes not contributing to AAD are being updated, a new function is provided that does a partial update on just those attributes, making partial updates for those attributes a bit safer. That will also be used by PR elastic#75553.

interim commits:
- add jest tests for partially_update_alert
- add jest tests for retry_if_conflicts
- in-progress on jest tests for alertsClient with conflicts
- get tests for alerts_client_conflict_retries running
- add tests for retry logger messages
- resolve some PR comments
- fix alertsClient test to use version with muteAll/unmuteAll
- change test to reflect new use of create api
- fix alertsClient meta updates, add comments
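
The retry wrapper described above could look roughly like the sketch below. The function name echoes the `retry_if_conflicts` tests mentioned in the commit list, but the signature, the retry count, the delay, and the assumption that a conflict surfaces as an error with `statusCode` 409 are all illustrative guesses rather than the code in the referenced PR.

```ts
// Assumed limits; the real values are not stated in this thread.
const RETRY_ATTEMPTS = 2;
const RETRY_DELAY_MS = 250;

// Assumption: a version conflict is an error object carrying statusCode 409.
function isConflictError(err: unknown): boolean {
  return (
    typeof err === 'object' &&
    err !== null &&
    (err as { statusCode?: number }).statusCode === 409
  );
}

function delay(millis: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, millis));
}

// Runs `operation`, retrying only on conflict errors, up to `retries` times.
async function retryIfConflicts<T>(
  operation: () => Promise<T>,
  retries: number = RETRY_ATTEMPTS,
  delayMs: number = RETRY_DELAY_MS
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await operation();
    } catch (err) {
      // Non-conflict errors, and conflicts past the retry limit, are rethrown.
      if (!isConflictError(err) || attempt >= retries) throw err;
      await delay(delayMs);
    }
  }
}
```

A caller would wrap the whole read-modify-write sequence, so each retry re-reads the saved object and picks up the new version.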

dhurley14

I have a branch based on this PR and local testing has looked great. Thanks for this!!! LGTM

pmuellr · 2020-09-30T14:58:39Z

update: na, not flaky - I had pulled in some code that was going to be a merge conflict, but didn't pull enough in; hopefully fixed now in c366d4b

original comment:

This smells a little flaky, though it's odd that it's in actions, which shouldn't be affected by this change.

Guessing it's failing here, for one of the ci errors:

kibana/x-pack/test/alerting_api_integration/spaces_only/tests/actions/execute.ts

Lines 61 to 73 in f993d2d

    
      const reference = `actions-execute-1:${Spaces.space1.id}`;
      const response = await supertest
        .post(`${getUrlPrefix(Spaces.space1.id)}/api/actions/action/${createdAction.id}/_execute`)
        .set('kbn-xsrf', 'foo')
        .send({
          params: {
            reference,
            index: ES_TEST_INDEX_NAME,
            message: 'Testing 123',
          },
        });
      expect(response.status).to.eql(200);

This is a test that uses an indexing action and executes it, and then tries to read the doc that should have gotten created - it seems like maybe the task got delayed, and we should wait a bit before checking?
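
If a delayed task had been the cause, one alternative to a fixed wait would be to poll for the document until it shows up or a timeout expires. The helper below is a generic sketch; `waitFor`, its parameters, and its defaults are hypothetical and not part of the Kibana test framework.

```ts
// Polls `probe` until it returns a value other than undefined, or throws
// once `timeoutMs` has elapsed.
async function waitFor<T>(
  probe: () => Promise<T | undefined>,
  timeoutMs: number = 5000,
  intervalMs: number = 100
): Promise<T> {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    const value = await probe();
    if (value !== undefined) return value;
    if (Date.now() >= deadline) {
      throw new Error(`condition not met within ${timeoutMs}ms`);
    }
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

In a test like the one quoted, the probe would search the index for the expected `reference` and return `undefined` while no hit exists.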

pmuellr · 2020-09-30T14:59:04Z

jenkins retest this please

YulNaumenko

LGTM

gmmorris

This seems like as solid a solution as we can achieve, given the limitation we have imposed by SOs, so good work on that front. 👍

I've played around with it, tried to fail it, and it seems to be working as expected.

I'm a little concerned that we're not displaying anything in the UI though.
Shouldn't this PR express, at the very least, what Status the Alert is in on the List/Details pages?

gmmorris · 2020-10-01T08:54:17Z

x-pack/plugins/alerts/common/alert.ts

+
+export interface AlertExecutionStatus {
+  status: AlertExecutionStatuses;
+  date: Date;


Could we rename date to something that is explicit about what date it is?
lastExecution? lastUpdate? etc.

it does feel kind of "context free", viewed here, but it is a property of the executionStatus object, so I thought using date by itself would be fine; ie, you'll always be referencing this piece of data as executionStatus.date. Adding additional context on date itself seemed like overkill and overly wordy. 🤔

well, it isn't obvious to me what this date actually is 🤷
I won't block on this, but my feeling is that we're adding to the cognitive load by not being clear what this date actually means.
I'm guessing it means "lastUpdate" of the status, but I'm still not 100% sure and to me that's a reason to clarify.

But you know what the status is? Should we change it to lastExecutionStatus as well? :-) I think that was my thinking in trying to not add more context here to the prop names, when it feels like it's implied by its containing property &| type.

Contextually, how people would end up accessing this, would look like the following for the two variants:

alert.executionStatus.date

alert.executionStatus.lastExecutionDate

I don't have really strong feels, just trying to cut down verbosity / ceremony / overkill where not actually needed.

But I just thought of a decent reason to do this - if for some reason we add some other date to this structure later, THEN it will certainly be confusing what the un-prefixed date would be.

So, I think lastExecutionDate prolly works best for me, since that exactly describes it.

Current branch is using lastExecutionDate - thanks for prodding on this Gidi!

gmmorris · 2020-10-01T08:56:32Z

x-pack/plugins/alerts/server/alerts_client.test.ts

        "enabled": true,
+        "executionStatus": Object {
+          "date": "2019-02-12T21:01:22.479Z",
+          "error": null,


Am I right that we set error to null because of the partial update issue?
I just want to double check I understand, as my instinct would have been to just omit it if there is no error... but obviously, being aware of the partial update issues, I'm assuming it's that. is it worth adding a comment in the code for future devs who might not be aware? 🤔

yup, it's a partial update issue; we need to make sure we remove a previous error if we're doing an update and there is no error this time. So it's typed in the raw alert as "null-able".

I'll add a comment in the raw alert definition for this ...
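
A small model of the partial update issue: when an update only overwrites the fields it supplies, omitting `error` leaves the previous error in place, while writing `null` clears it. The `applyPartialUpdate` helper below is a simplified stand-in for the saved object update, not the real client, and the field names follow the shape in the merge commit message.

```ts
// Simplified raw execution status; `error` is nullable so that an update
// can explicitly clear a previous error.
interface RawExecutionStatus {
  status: string;
  lastExecutionDate: string;
  error: { reason: string; message: string } | null;
}

// Stand-in for a partial update: supplied fields overwrite, omitted fields are kept.
function applyPartialUpdate(
  stored: RawExecutionStatus,
  changes: Partial<RawExecutionStatus>
): RawExecutionStatus {
  return { ...stored, ...changes };
}

const failedRun: RawExecutionStatus = {
  status: 'error',
  lastExecutionDate: '2020-10-01T09:00:00.000Z',
  error: { reason: 'execute', message: 'executor threw' },
};

// Omitting `error` on the next, successful run keeps the stale error around.
const staleError = applyPartialUpdate(failedRun, {
  status: 'ok',
  lastExecutionDate: '2020-10-01T09:01:00.000Z',
});

// Writing `error: null` removes it.
const clearedError = applyPartialUpdate(failedRun, {
  status: 'ok',
  lastExecutionDate: '2020-10-01T09:01:00.000Z',
  error: null,
});
```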

gmmorris · 2020-10-01T08:58:13Z

x-pack/plugins/alerts/server/alerts_client.ts

    | 'muteAll'
    | 'mutedInstanceIds'
    | 'actions'
+    | 'executionStatus'


nit.
This list has grown a lot, should we switch it to the inverse (using Pick instead of Omit)?
It would mean, by default, new fields are not part of create... 🤷

You're not wrong, but I hate to make changes like this, in a PR like this. It's more of a general clean up thing, and this particular item seems hardly worth an issue by itself. Do we have some general "simple tech debt items" issue we could add this to?
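
For reference, the difference between the two approaches can be shown on a cut-down type. The `SketchAlert` interface and both aliases below are invented for the example and list far fewer fields than the real `Alert`.

```ts
// Cut-down, hypothetical alert type.
interface SketchAlert {
  id: string;
  name: string;
  enabled: boolean;
  muteAll: boolean;
  executionStatus: { status: string };
}

// Omit: every field is accepted on create unless it is listed here, so a
// newly added field becomes part of create by default.
type CreateViaOmit = Omit<SketchAlert, 'id' | 'muteAll' | 'executionStatus'>;

// Pick: only listed fields are accepted on create, so a newly added field
// stays out until someone opts it in.
type CreateViaPick = Pick<SketchAlert, 'name' | 'enabled'>;

const viaOmit: CreateViaOmit = { name: 'cpu check', enabled: true };
const viaPick: CreateViaPick = { name: 'cpu check', enabled: true };
```

Both aliases describe the same shape today; they diverge only when `SketchAlert` gains a field.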

gmmorris · 2020-10-01T08:59:17Z

x-pack/plugins/alerts/server/alerts_client.ts

      muteAll: false,
      mutedInstanceIds: [],
+      executionStatus: {
+        status: 'unknown',


nit. It feels like we should be using enums for these rather than hard coded strings, no?
I get that the wrong string won't pass type checking, but we've used enums for equivalent fields so it feels like we should align.

I kinda look at the "set of these strings" type as being the ceremony-free version of string enums. Is there an explicit value in using enums instead? The upside to not using them is not having to maintain the duplication of the enum keys / values, and not having to have access to that enum type in code that needs it. ie, less ceremony :-)

Would the enum type be exposed in the Alert objects themselves, or just an internal detail? I think we could type it in the Alert and RawAlert as enum safely, but not quite sure. Another reason I generally avoid enums, because I always have to go read the chapter on them in the TS handbook :-)

gmmorris · 2020-10-01T09:06:07Z

x-pack/plugins/alerts/server/alerts_client.ts

+    const rawAlertWithoutExecutionStatus: Partial<Omit<RawAlert, 'executionStatus'>> = {
+      ...rawAlert,
+    };
+    delete rawAlertWithoutExecutionStatus.executionStatus;
+    const executionStatus = alertExecutionStatusFromRaw(rawAlert.executionStatus);


nit.
Could we just extract executionStatus on line 966 like we do with createdAt, meta and scheduledTaskId?
That avoids the need for the delete and the creation of rawAlertWithoutExecutionStatus.
We could also rename rawAlert to be explicit about it not containing the Execution Status if you feel that's important. 🤷

It just feels more in line with the rest of the code there

this was sooooo hard - I tried multiple approaches to this, including doing what you suggested. Not happy with that delete, and this seemed like the best alternative.

The main problem is the types of the props in the RawAlert and Alert pretty much match up, so that ...rawAlertWithoutExecutionStatus fills in most of the bits without any typing errors. However, the types of the executionStatus in RawAlert and Alert are different. And because we're dealing with partials coming in and going out, it might not be there. So when doing something like what you suggest, you'll end up with an error: TS thinks the type of the returned executionStatus is RawAlertExecutionStatus | AlertExecutionStatus, which clearly isn't kosher.

It's worth a comment I think, will help the next person looking at this save a few minutes when they try to fix it. heh

gmmorris · 2020-10-01T09:17:22Z

x-pack/plugins/alerts/server/lib/is_alert_not_found_error.ts

 export function isAlertSavedObjectNotFoundError(err: Error, alertId: string) {
+  // if this is an error with a reason, the actual error needs to be extracted
+  if (isErrorWithReason(err)) {
+    err = err.error;


nit.
This might just be me, but I always find it hard to track the reassignment of function arguments and try to avoid it.
Could we use an intermediary variable? 😬
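
A version with an intermediary variable might look like the sketch below. Only the function name and the `isErrorWithReason` check come from the diff; the `ErrorWithReason` shape, the `isNotFound` helper, and the final message check are assumptions made so the example is self-contained.

```ts
// Assumed shape: an error that wraps the underlying error plus a reason.
interface ErrorWithReason extends Error {
  reason: string;
  error: Error;
}

function isErrorWithReason(err: Error): err is ErrorWithReason {
  const candidate = err as Partial<ErrorWithReason>;
  return typeof candidate.reason === 'string' && candidate.error instanceof Error;
}

// Hypothetical stand-in for the saved objects "not found" check.
function isNotFound(err: Error): boolean {
  return (err as Error & { statusCode?: number }).statusCode === 404;
}

function isAlertSavedObjectNotFoundError(err: Error, alertId: string): boolean {
  // If this is an error with a reason, the actual error needs to be extracted;
  // `err` itself is never reassigned.
  const actualError = isErrorWithReason(err) ? err.error : err;
  return isNotFound(actualError) && actualError.message.includes(alertId);
}
```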

gmmorris · 2020-10-01T09:32:58Z

x-pack/test/alerting_api_integration/spaces_only/tests/alerting/execution_status.ts

+
+    after(async () => await objectRemover.removeAll());
+
+    it('should be "unknown" for newly created alert', async () => {


I'm not sure about unknown as the status for newly created alerts.
perhaps something that expresses that we know why it doesn't have data yet? The word unknown suggests we don't know why there is no status... that could cause confusion and concern.
This would also help distinguish between a status of unknown and a reason of unknown which are likely to be confused in the future.

Would scheduled or pending work better? Something like that?

long internal discussion on this one :-). Looks like we'll make it something more concrete than unknown, eg pending or similar.

I'm "finalizing" on pending for now, we could still change this before FF next week if needed.

gmmorris · 2020-10-01T09:58:21Z

x-pack/test/alerting_api_integration/spaces_only/tests/alerting/execution_status.ts

+function trues(length: number): boolean[] {
+  return booleans(length, true);
+}
+
+function booleans(length: number, value: boolean): boolean[] {
+  return ''
+    .padStart(length)
+    .split('')
+    .map((e) => value);
+}


Haha, @pmuellr, you know I love how old school you are :)
We now have fill ;)

Suggested change

-function trues(length: number): boolean[] {
-  return booleans(length, true);
-}
-
-function booleans(length: number, value: boolean): boolean[] {
-  return ''
-    .padStart(length)
-    .split('')
-    .map((e) => value);
-}
+function trues(length: number): boolean[] {
+  return new Array(length).fill(true);
+}

dang, I thought there was something like that now, and couldn't find it!!! I hate writing stuff like this, though it is certainly kinda fun to abuse padStart() and split() this way :-) (I used to do a lot of REXX programming where the only primitive data type was a string (but you could do math on it), so we found all kinds of "interesting" ways to do things with string functions).

gmmorris · 2020-10-01T09:59:31Z

x-pack/test/alerting_api_integration/spaces_only/tests/alerting/execution_status.ts

+async function delay(millis: number): Promise<void> {
+  await new Promise((resolve) => setTimeout(resolve, millis));
+}


I wish this was just a built in function, we redefine it in every test suite we have 😂

gmmorris · 2020-10-01T10:01:16Z

x-pack/test/alerting_api_integration/spaces_only/tests/index.ts

+
+export async function buildUp(getService: FtrProviderContext['getService']) {
+  const spacesService = getService('spaces');
+  for (const space of Object.values(Spaces)) {
+    if (space.id === 'default') continue;
+
+    const { id, name, disabledFeatures } = space;
+    await spacesService.create({ id, name, disabledFeatures });
+  }
+}
+
+export async function tearDown(getService: FtrProviderContext['getService']) {
+  const esArchiver = getService('esArchiver');
+  await esArchiver.unload('empty_kibana');
+}


I'm confused... why is this here?
Didn't we merge this into Main already? 😮
We might need a fresh merge from Main into the PR... 🤔

oh dang, was going to comment in the PR on this.

There was a merge conflict w/master in this area, and rather than merge master, I just copied the code in here. I was still rebasing at the time, didn't want to have to merge master, but forgot to add a note about it.

I did actually have to merge master this morning and guess what! It's gone!

pmuellr · 2020-10-01T14:01:35Z

@elasticmachine merge upstream

pmuellr · 2020-10-01T14:04:44Z

I'm a little concerned that we're not displaying anything in the UI though.

I was thinking adding something here, but given the current size, and nearness to FF, seemed best not to.

gmmorris

Approving as @pmuellr has confirmed that adding UI is out of scope for this PR and we've discussed renaming unknown on initial creation.

Those were the only blockers for me. 👍

pmuellr · 2020-10-01T18:52:33Z

commit 8775610 has changes from the PR comments, except for the status unknown change, which is still waiting (I did an early stand-alone commit for that before this last one, to see if it caught all the occurrences). Will likely make one more commit to the name we decide on, then merge ...

pmuellr · 2020-10-01T19:53:21Z

@elasticmachine merge upstream

kibanamachine · 2020-10-01T21:59:01Z

💚 Build Succeeded

continuous-integration/kibana-ci/pull-request
Commit: 3d68a1e

Metrics [docs]

distributable file count

id	before	after	diff
`default`	45823	45825	+2

page load bundle size

id	before	after	diff
`alerts`	89.3KB	89.8KB	+469.0B

Saved Objects .kibana field count

id	before	after	diff
`alert`	24	30	+6

History

💔 Build #78887 failed 8775610
💚 Build #78755 succeeded 599a72e
💚 Build #78749 succeeded 860647f
💚 Build #78390 succeeded c366d4b
💔 Build #78323 failed 95fcc5a

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

…d object (elastic#75553)

resolves elastic#51099

This formalizes the concept of "alert status", in terms of its execution, with some new fields in the alert saved object and types used with the alert client and http APIs.

These fields are read-only from the client point-of-view; they are provided in the alert structures, but are only updated by the alerting framework itself. The values will be updated after each run of the alert type executor.

The data is added to the alert as the `executionStatus` field, with the following shape:

```ts
interface AlertExecutionStatus {
  status: 'ok' | 'active' | 'error' | 'pending' | 'unknown';
  lastExecutionDate: Date;
  error?: {
    reason: 'read' | 'decrypt' | 'execute' | 'unknown';
    message: string;
  };
}
```

…d object (#75553) (#79227)

resolves #51099

This formalizes the concept of "alert status", in terms of its execution, with some new fields in the alert saved object and types used with the alert client and http APIs.

These fields are read-only from the client point-of-view; they are provided in the alert structures, but are only updated by the alerting framework itself. The values will be updated after each run of the alert type executor.

The data is added to the alert as the `executionStatus` field, with the following shape:

```ts
interface AlertExecutionStatus {
  status: 'ok' | 'active' | 'error' | 'pending' | 'unknown';
  lastExecutionDate: Date;
  error?: {
    reason: 'read' | 'decrypt' | 'execute' | 'unknown';
    message: string;
  };
}
```

I remembered some additional functional tests that should have been in PR elastic#75553.

One is to ensure the error field gets cleared from the saved object, after the error status is updated with a non-error status. I always worry a bit about partial updates.

The other is to do some negative find tests with the new fields. The existing tests are all positive, but would return the same results if for some reason the filters were ignored (presumably a bug). The negative tests ensure the filters actually filter things out as well.

Also a bit of refactoring / cleanup of the tests.

pmuellr added the Feature:Alerting, v8.0.0, release_note:skip, Team:ResponseOps, and v7.10.0 labels Aug 20, 2020

spong requested a review from dhurley14 August 21, 2020 15:06

pmuellr force-pushed the alerting/status-in-so branch 3 times, most recently from ce43e9a to 079d27c Compare August 27, 2020 01:14

pmuellr force-pushed the alerting/status-in-so branch 3 times, most recently from 206a76b to a429e44 Compare September 2, 2020 20:31

sorenlouv mentioned this pull request Sep 2, 2020

[APM] Investigate alerting event log queries #62711

Closed

pmuellr force-pushed the alerting/status-in-so branch from 037d3c6 to dfba7d5 Compare September 4, 2020 20:32

pmuellr mentioned this pull request Sep 4, 2020

[Alerting] remove internal OCC issues with alertsClient #76830

Closed

1 task

gmmorris mentioned this pull request Sep 7, 2020

[Alerting] Updating Webhook with removed headers doesn't work as expected #71995

Closed

pmuellr force-pushed the alerting/status-in-so branch from dfba7d5 to 17495ad Compare September 9, 2020 13:45

pmuellr mentioned this pull request Sep 17, 2020

[Alerting] retry internal OCC calls within alertsClient #77838

Merged

2 tasks

pmuellr force-pushed the alerting/status-in-so branch from 51008d7 to 51e20a0 Compare September 19, 2020 02:58

dhurley14 approved these changes Sep 30, 2020

YulNaumenko approved these changes Sep 30, 2020

fix an issue with a prior rebase

c366d4b

pmuellr mentioned this pull request Sep 30, 2020

API to get all active instances from Observability consumers #70169

Closed

gmmorris suggested changes Oct 1, 2020

change initial/migration status from unknown to waiting

860647f

Merge branch 'master' into alerting/status-in-so

599a72e

gmmorris approved these changes Oct 1, 2020

changes from PR review comments

8775610

elasticmachine and others added 2 commits October 1, 2020 15:53

Merge branch 'master' into alerting/status-in-so

bf71148

change unknown status from waiting to pending, more lastExecutionDate fixes

3d68a1e

pmuellr merged commit 117b577 into elastic:master Oct 1, 2020

pmuellr mentioned this pull request Oct 1, 2020

[7.x] [Alerting] formalize alert status and add status fields to alert saved object (#75553) #79227

Merged

pmuellr added the backported label Oct 2, 2020

pmuellr mentioned this pull request Oct 2, 2020

[Alerting] add more execution status functional tests #79278

Draft

dhurley14 mentioned this pull request Oct 2, 2020

[Security Solution] [Detections] Write failing status when executionStatus is in error #79311

Merged

7 tasks

dhurley14 mentioned this pull request Oct 6, 2020

[Alerting] Write executionStatus property to kibana event log #79785

Closed

pmuellr mentioned this pull request Dec 10, 2020

[Actions] Notify only on action group change #82969

Merged

7 tasks

stefnestor mentioned this pull request Jun 7, 2021

Document Kibana Alert Statuses #101521

Closed

