[Alerting] formalize alert status and add status fields to alert saved object#75553
[Alerting] formalize alert status and add status fields to alert saved object#75553pmuellr merged 9 commits intoelastic:masterfrom
Conversation
ce43e9a to
079d27c
Compare
206a76b to
a429e44
Compare
037d3c6 to
dfba7d5
Compare
|
I'm splitting out the alertClients occ/version changes into a separate PR, since it's not directly related to this code, and something that should be looked at independently - #76830 - this PR is now blocked on that PR |
During development of elastic#75553, some issues came up with the optimistic concurrency control (OCC) we were using internally within the alertsClient, via the `version` option/property of the saved object. The referenced PR updates new fields in the alert from the taskManager task after the alertType executor runs. In some alertsClient methods, OCC is used to update the alert which are requested via user requests. And so in some cases, version conflict errors were coming up when the alert was updated by task manager, in the middle of one of these methods. Note: the SIEM function test cases stress test this REALLY well. In this PR, we remove OCC from methods that were currently using it, namely `update()`, `updateApiKey()`, `enable()`, `disable()`, and the `[un]mute[All,Instance]()` methods. Of these methods, OCC is really only _practically_ needed by `update()`, but even for that we don't support OCC in the API, yet; see: issue elastic#74381 . For cases where we know only attributes not contributing to AAD are being updated, a new function is provided that does a partial update on just those attributes, making partial updates for those attributes a bit safer. That will be used by PR elastic#75553.
dfba7d5 to
17495ad
Compare
51008d7 to
51e20a0
Compare
During development of elastic#75553, some issues came up with the optimistic concurrency control (OCC) we were using internally within the alertsClient, via the `version` option/property of the saved object. The referenced PR updates new fields in the alert from the taskManager task after the alertType executor runs. In some alertsClient methods, OCC is used to update the alert which are requested via user requests. And so in some cases, version conflict errors were coming up when the alert was updated by task manager, in the middle of one of these methods. Note: the SIEM function test cases stress test this REALLY well. In this PR, we wrap all the methods using OCC with a function that will retry them, a short number of times, with a short delay in between. If the original method STILL has a conflict error, it will get thrown after the retry limit. In practice, this eliminated the version conflict calls that were occuring with the SIEM tests, once we started updating the saved object in the executor. For cases where we know only attributes not contributing to AAD are being updated, a new function is provided that does a partial update on just those attributes, making partial updates for those attributes a bit safer. That will be also used by PR elastic#75553. interim commits: - add jest tests for partially_update_alert - add jest tests for retry_if_conflicts - in-progress on jest tests for alertsClient with conflicts - get tests for alerts_client_conflict_retries running - add tests for retry logger messages - resolve some PR comments - fix alertsClient test to use version with muteAll/unmuteAll - change test to reflect new use of create api - fix alertsClient meta updates, add comments
dhurley14
left a comment
There was a problem hiding this comment.
I have a branch based on this PR and local testing has looked great. Thanks for this!!! LGTM
|
update: na, not flaky - I had pulled in some code that was going to be a merge conflict, but didn't pull enough in; hopefully fixed now in c366d4b original comment: This smells a little flaky, though it's odd it in actions which shouldn't be affected by this change. Guessing it's failing here, for one of the ci errors: This is a test that uses an indexing action and executes it, and then tries to read the doc that should have gotten created - it seems like maybe the task got delayed, and we should wait a bit before checking? |
|
jenkins retest this please |
gmmorris
left a comment
There was a problem hiding this comment.
This seems like as solid a solution as we can achieve, given the limitation we have imposed by SOs, so good work on that front. 👍
I've played around with it, tried to fail it, and it seems to be working as expected.
I'm a little concerned that we're not displaying anything in the UI though.
Shouldn't this PR express, at the very least, what Status the Alert is in on the List/Details pages?
|
|
||
| export interface AlertExecutionStatus { | ||
| status: AlertExecutionStatuses; | ||
| date: Date; |
There was a problem hiding this comment.
Could we rename date to something that is explicit about what date it is?
lastExecution? lastUpdate? etc.
There was a problem hiding this comment.
it does feel kind of "context free", viewed here, but it is a property of the executionStatus object, so I thought using date by itself would be fine; ie, you're always be referencing this piece of data as executionStatus.date. Adding additional context on date itself seemed like overkill and overly wordy. 🤔
There was a problem hiding this comment.
well, it isn't obvious to me what this date actually is 🤷
I won't block on this, but my feeling is that we're adding to the cognitive load by not being clear what his date actually means.
I'm guessing it means "lastUpdate" of the status, but I'm still not 100% sure and to me that's a reason to clarify.
There was a problem hiding this comment.
But you know what the status is? Should we change it to lastExecutionStatus as well? :-) I think that was my thinking in trying to not add more context here to the prop names, when it feels like it's implied by it's containing property &| type.
Contextually, how people would end up accessing this, would look like the following for the two variants:
alert.executionStatus.datealert.executionStatus.lastExecutionDate
I don't have really strong feels, just trying to cut down verbosity / ceremony / overkill where not actually needed.
But I just thought of a decent reason to do this - if for some reason we add some other date to this structure later, THEN it will certainly be confusing what the un-prefixed date would be.
So, I think lastExecutionDate prolly works best for me, since that exactly describes it.
There was a problem hiding this comment.
Current branch is using lastExecutionDate - thanks for prodding on this Gidi!
| "enabled": true, | ||
| "executionStatus": Object { | ||
| "date": "2019-02-12T21:01:22.479Z", | ||
| "error": null, |
There was a problem hiding this comment.
Am I right that we set error to null because of the partial update issue?
I just want to double check I understand, as my instinct would have been to just omit it if there is no error... but obviously, being aware of the partial update issues, I'm assuming it's that. is it worth adding a comment in the code for future devs who might not be aware? 🤔
There was a problem hiding this comment.
yup, it's a partial update issue; we need to make sure we remove a previous error if we're doing an update and there is no error this time. So it's typed in the raw alert as "null-able".
There was a problem hiding this comment.
I'll add a comment in the raw alert definition for this ...
| | 'muteAll' | ||
| | 'mutedInstanceIds' | ||
| | 'actions' | ||
| | 'executionStatus' |
There was a problem hiding this comment.
nit.
This list has grown a lot, should we switch it to the inverse (using Pick instead of Omit)?
It would mean, by default, new fields are not part of create... 🤷
There was a problem hiding this comment.
You're not wrong, but I hate to make changes like this, in a PR like this. It's more of a general clean up thing, and this particular item seems hardly worth an issue by itself. Do we have some general "simple tech debt items" issue we could add this to?
| muteAll: false, | ||
| mutedInstanceIds: [], | ||
| executionStatus: { | ||
| status: 'unknown', |
There was a problem hiding this comment.
nit. It feels like we should be using enums for these rather than hard coded strings, no?
I get that the wrong string won't pass type checking, but we've used enums for equivalent fields so it feels like we should align.
There was a problem hiding this comment.
I kinda look at the "set of these strings" type as being the ceremony-free version of string enums. Is there an explicit value in using enums instead? The upside to not using them is not having to maintain the duplication of the enum keys / values, and not having to have access to that enum type in code that needs it. ie, less ceremony :-)
Would the enum type be exposed in the Alert objects themselves, or just an internal detail? I think we could type it in the Alert and RawAlert as enum safely, but not quite sure. Another reason I generally avoid enums, because I always have to go read the chapter on them in the TS handbook :-)
| const rawAlertWithoutExecutionStatus: Partial<Omit<RawAlert, 'executionStatus'>> = { | ||
| ...rawAlert, | ||
| }; | ||
| delete rawAlertWithoutExecutionStatus.executionStatus; | ||
| const executionStatus = alertExecutionStatusFromRaw(rawAlert.executionStatus); |
There was a problem hiding this comment.
nit.
Could we just extract executionStatus on line 966 like we do with createdAt, meta and scheduledTaskId?
That avoids the need for the delete and the creation of rawAlertWithoutExecutionStatus.
We could also rename rawAlert to be explicit about it not containing the Execution Status if you feel that's important. 🤷
It just feels more in line with the rest of the code there
There was a problem hiding this comment.
this was sooooo hard - I tried multiple approaches to this, including doing what you suggested. Not happy with that delete, and I this seemed like the best alternative.
The main problem is the types of the props in the RawAlert and Alert pretty much match up, so that ...rawAlertWithoutExecutionStatus fills in most of the bits without any typing errors. However, the types of the executionStatus in RawAlert and Alert are different. And because we're dealing with partials coming in and going out, it might not be there. So when doing something like what you suggest, you'll end up with an error TS thinks the type of the returned executionStatus is RawAlertExecutionStatus | AlertExecutionStatus, which clearly isn't kosher.
It's worth a comment I think, will help the next person looking at this save a few minutes when they try to fix it. heh
| export function isAlertSavedObjectNotFoundError(err: Error, alertId: string) { | ||
| // if this is an error with a reason, the actual error needs to be extracted | ||
| if (isErrorWithReason(err)) { | ||
| err = err.error; |
There was a problem hiding this comment.
nit.
This might just be me, but I always find it hard to track the reassignment of function arguments and try to avoid it.
Could we use an intermediary variable? 😬
|
|
||
| after(async () => await objectRemover.removeAll()); | ||
|
|
||
| it('should be "unknown" for newly created alert', async () => { |
There was a problem hiding this comment.
I'm not sure about unknown as the status for newly created alerts.
perhaps something that expresses that we know why it doesn't have data yet? The word unknown suggests we don't know why there is no status... that could cause confusion and concern.
This would also help distinguish between a status of unknown and a reason of unknown which are likely to be confused in the future.
Would scheduled or pending work better? Something like that?
There was a problem hiding this comment.
long internal discussion on this one :-). Looks like we'll make it something more concrete than unknown, eg pending or similar.
There was a problem hiding this comment.
I'm "finalizing" on pending for now, we could still change this before FF next week if needed.
| function trues(length: number): boolean[] { | ||
| return booleans(length, true); | ||
| } | ||
|
|
||
| function booleans(length: number, value: boolean): boolean[] { | ||
| return '' | ||
| .padStart(length) | ||
| .split('') | ||
| .map((e) => value); | ||
| } |
There was a problem hiding this comment.
Haha, @pmuellr, you know I love how old school you are :)
We now have fill ;)
| function trues(length: number): boolean[] { | |
| return booleans(length, true); | |
| } | |
| function booleans(length: number, value: boolean): boolean[] { | |
| return '' | |
| .padStart(length) | |
| .split('') | |
| .map((e) => value); | |
| } | |
| function trues(length: number): boolean[] { | |
| return new Array(number).fill(true); | |
| } |
There was a problem hiding this comment.
dang, I thought there was something like that now, and couldn't find it!!! I hate writing stuf like this, though it is certainly kinda fun to abuse padStart() and split() this way :-) (I used to do a lot of REXX programming where the only primitive data type was a string (but you could math on it), so we found all kind of "interesting" ways to do things with string functions).
| async function delay(millis: number): Promise<void> { | ||
| await new Promise((resolve) => setTimeout(resolve, millis)); | ||
| } |
There was a problem hiding this comment.
I wish this was just a built in function, we redefine it in every test suite we have 😂
|
|
||
| export async function buildUp(getService: FtrProviderContext['getService']) { | ||
| const spacesService = getService('spaces'); | ||
| for (const space of Object.values(Spaces)) { | ||
| if (space.id === 'default') continue; | ||
|
|
||
| const { id, name, disabledFeatures } = space; | ||
| await spacesService.create({ id, name, disabledFeatures }); | ||
| } | ||
| } | ||
|
|
||
| export async function tearDown(getService: FtrProviderContext['getService']) { | ||
| const esArchiver = getService('esArchiver'); | ||
| await esArchiver.unload('empty_kibana'); | ||
| } |
There was a problem hiding this comment.
I'm confused... why is this here?
Didn't we merge this into Main already? 😮
We might need a fresh merge from Main into the PR... 🤔
There was a problem hiding this comment.
oh dang, was going to comment in the PR on this.
There was a merge conflict w/master in this area, and rather than merge master, I just copied the code in here. I was still rebasing at the time, didn't want to have to merge master, but forgot to add a note about it.
I did actually have to merge master this morning and guess what! It's gone!
|
@elasticmachine merge upstream |
I was thinking adding something here, but given the current size, and nearness to FF, seemed best not to. |
|
commit 8775610 has changes from the PR comments, except for the status |
|
@elasticmachine merge upstream |
💚 Build SucceededMetrics [docs]distributable file count
page load bundle size
Saved Objects .kibana field count
History
To update your PR or re-run it, just comment with: |
…d object (elastic#75553) resolves elastic#51099 This formalizes the concept of "alert status", in terms of it's execution, with some new fields in the alert saved object and types used with the alert client and http APIs. These fields are read-only from the client point-of-view; they are provided in the alert structures, but are only updated by the alerting framework itself. The values will be updated after each run of the alert type executor. The data is added to the alert as the `executionStatus` field, with the following shape: ```ts interface AlertExecutionStatus { status: 'ok' | 'active' | 'error' | 'pending' | 'unknown'; lastExecutionDate: Date; error?: { reason: 'read' | 'decrypt' | 'execute' | 'unknown'; message: string; }; } ```
…d object (#75553) (#79227) resolves #51099 This formalizes the concept of "alert status", in terms of it's execution, with some new fields in the alert saved object and types used with the alert client and http APIs. These fields are read-only from the client point-of-view; they are provided in the alert structures, but are only updated by the alerting framework itself. The values will be updated after each run of the alert type executor. The data is added to the alert as the `executionStatus` field, with the following shape: ```ts interface AlertExecutionStatus { status: 'ok' | 'active' | 'error' | 'pending' | 'unknown'; lastExecutionDate: Date; error?: { reason: 'read' | 'decrypt' | 'execute' | 'unknown'; message: string; }; } ```
I remembered some additional functional tests that should have been in PR elastic#75553 One is to ensure the error field gets cleared from the saved object, after the error status is updated with a non-error status. I always worry a bit about partial updates. The other is to do some negative find tests with the new fields. The existing tests are all positive, but would return the same results if for some reason the filters were ignored (presumably a bug). The negative tests ensure the filters actually filter things out as well. Also a bit of refactoring / cleanup of the tests.
resolves #51099
Summary
This formalizes the concept of "alert status", in terms of it's execution, with
some new fields in the alert saved object and types used with the alert client
and http APIs.
These fields are read-only from the client point-of-view; they are provided in
the alert structures, but are only updated by the alerting framework itself.
The values will be updated after each run of the alert type executor.
Checklist
Delete any items that are not applicable to this PR.
For maintainers
TODO