[Alerting] remove internal OCC issues with alertsClient#76830
[Alerting] remove internal OCC issues with alertsClient#76830pmuellr wants to merge 1 commit intoelastic:masterfrom
Conversation
207c886 to
4062d23
Compare
During development of elastic#75553, some issues came up with the optimistic concurrency control (OCC) we were using internally within the alertsClient, via the `version` option/property of the saved object. The referenced PR updates new fields in the alert from the taskManager task after the alertType executor runs. In some alertsClient methods, OCC is used to update the alert which are requested via user requests. And so in some cases, version conflict errors were coming up when the alert was updated by task manager, in the middle of one of these methods. Note: the SIEM function test cases stress test this REALLY well. In this PR, we remove OCC from methods that were currently using it, namely `update()`, `updateApiKey()`, `enable()`, `disable()`, and the `[un]mute[All,Instance]()` methods. Of these methods, OCC is really only _practically_ needed by `update()`, but even for that we don't support OCC in the API, yet; see: issue elastic#74381 . For cases where we know only attributes not contributing to AAD are being updated, a new function is provided that does a partial update on just those attributes, making partial updates for those attributes a bit safer. That will be used by PR elastic#75553.
4062d23 to
384d0b6
Compare
|
Pinging @elastic/kibana-alerting-services (Team:Alerting Services) |
|
This is ready for review. I also tried an alternative, of wrapping all the AAD-mutating, and thus "full updates required" calls, like this: public async enable({ id }: { id: string }): Promise<void> {
return await retryForConflicts(
this.logger,
`enable(${id})`,
async () => await this.enableBasic({ id })
);
}
private async enableBasic({ id }: { id: string }): Promise<void> {
// ... body of original enable() method here
}where That felt a little gross - we'd have to make sure the "retryable" function didn't leak anything between calls. In practice, it seemed to fix the issues seen in the SIEM tests, once the alert executor started updating the new executionStatus fields. If we feel uncomfortable with the overwriting that's done in the PR at this point, we could switch one or more of the methods to use that technique instead. The |
💚 Build SucceededBuild metrics
History
To update your PR or re-run it, just comment with: |
gmmorris
left a comment
There was a problem hiding this comment.
LGTM
Left a note about one tweak I think is worth doing to make sure our ESO and client stay in sync, but other than that all good 👍
| type PartiallyUpdateableAlertAttributes = Partial< | ||
| Pick<RawAlert, AlertAttributesExcludedFromAADType> | ||
| >; |
| export const AlertAttributesExcludedFromAAD = [ | ||
| 'scheduledTaskId', | ||
| 'muteAll', | ||
| 'mutedInstanceIds', | ||
| 'updatedBy', | ||
| ]; |
There was a problem hiding this comment.
It would be worth linking this to the values we pass into ESO here:
That way they can't go out of sync
There was a problem hiding this comment.
I already did - easier to see if you look at the complete file instead of the diff; in particular, line 42 (of course!) - or maybe I'm misunderstanding.
At one point I had all this stuff separated across the two files, decided it would be better to have it all in one place, hopefully the type and attr names array will be easy to keep in sync.
There was a problem hiding this comment.
oh, weird, not sure how I missed that 😳
Thanks Patrick
There was a problem hiding this comment.
TIL something like this should work:
export const AlertAttributesExcludedFromAAD = [
'scheduledTaskId',
'muteAll',
'mutedInstanceIds',
'updatedBy',
];
export type AlertAttributesExcludedFromAADType = typeof AlertAttributesExcludedFromAAD[number];
mikecote
left a comment
There was a problem hiding this comment.
I've been thinking on these changes after having an offline conversation with @pmuellr. I think the approach in this PR will work. It gets a bit tricky in the mute / unmute instance APIs where I think we may have to keep OCC there for now. There's a few other places we could omit values further to keep the partial updates limited to what is requested to change.
I think in the future as we add version support to our APIs, it will enhance the default behaviour introduced in this PR for those who want to make sure no other updates happen concurrently (with the caveat that updating alert status happens on each execution).
| updatedBy: username, | ||
| }, | ||
| { | ||
| version, |
There was a problem hiding this comment.
The argument above could omit partially updatable values since they don't change in this API.
| await this.unsecuredSavedObjectsClient.update('alert', id, { | ||
| ...attributes, | ||
| enabled: true, | ||
| ...this.apiKeyAsAlertAttributes( | ||
| await this.createAPIKey(this.generateAPIKeyName(attributes.alertTypeId, attributes.name)), | ||
| username | ||
| ), | ||
| updatedBy: username, | ||
| }); |
There was a problem hiding this comment.
Could omit other partially updatable attributes (minus enable since we need it here).
| await this.unsecuredSavedObjectsClient.update('alert', id, { | ||
| ...attributes, | ||
| enabled: false, | ||
| scheduledTaskId: null, | ||
| apiKey: null, | ||
| apiKeyOwner: null, | ||
| updatedBy: await this.getUserName(), | ||
| }); |
There was a problem hiding this comment.
Same comment here:
Could omit other partially updatable attributes (minus enable since we need it here).
| await partiallyUpdateAlertSavedObject(this.unsecuredSavedObjectsClient, alertId, { | ||
| mutedInstanceIds, | ||
| updatedBy: await this.getUserName(), | ||
| }); |
There was a problem hiding this comment.
This could be conflicting UX when dropping version. If you call the API to mute two instances, there may only be one instance that ends up getting muted (last update call wins). I think we can keep version for this scenario.
| await partiallyUpdateAlertSavedObject(this.unsecuredSavedObjectsClient, alertId, { | ||
| updatedBy: await this.getUserName(), | ||
| mutedInstanceIds: mutedInstanceIds.filter((id: string) => id !== alertInstanceId), | ||
| }); |
There was a problem hiding this comment.
Same comment here:
This could be conflicting when dropping
version. If you call the API to mute two instances, there may only be one instance that ends up getting muted (last update call wins). I think we can keepversionfor this scenario.
There was a problem hiding this comment.
Ah, I didn't realize the API really only let's you change a single instance in the [un]muteInstance() APIs. So I think you're right, we may get a race condition with customers clicking on mute/unmute of instances.
In this case, I think we should probably go with a story where we keep the existing OCC calls, but wrapper the relevant alertsClient methods so they do a small, fixed number of retries (eg, 5) with a slight delay (eg, 250ms) in the case that they get 409 conflict errors. As noted in this comment #76830 (comment) - I tried out this technique as well and it managed to get the SIEM function tests working, which is where I noticed the problem.
The reason I want to do it that way is that it should avoid having the customers see 409 conflict results because of our updates to the upcoming execution_status fields. Customers would end up having to have their own 409 conflict retry logic around EVERY alert API.
Once we have "version" support for our APIs (or just for update()?), we can bypass the retries.
There was a problem hiding this comment.
++ I like such approach; keeping OCC for now with retry logic. This will be a better experience 👍
The delay could also be a random number between 0 and 250ms in case the API is called at the same time for multiple instances (ex: in a for loop per instance).
|
closing in lieu of #77838, based on some comments here and elsewhere |
Summary
During development of PR #75553, some issues came up with the
optimistic concurrency control (OCC) we were using internally within
the alertsClient, via the
versionoption/property of the savedobject. The referenced PR updates new fields in the alert from the
taskManager task after the alertType executor runs. In some
alertsClient methods, OCC is used to update the alert which are
requested via user requests. And so in some cases, version conflict
errors were coming up when the alert was updated by the task manager
task, in the middle of one of these methods. Note: the SIEM function
test cases stress test this REALLY well.
In this PR, we remove OCC from methods that were currently using it,
namely
update(),updateApiKey(),enable(),disable(), and the[un]mute[All,Instance]()methods. Of these methods, OCC is reallyonly practically needed by
update(), but even for that we don'tsupport OCC in the API itself, yet; see: issue #74381
For cases where we know only attributes not contributing to AAD are
being updated, a new function is provided that does a partial update
on just those attributes, making partial updates for those attributes
a bit safer. That is used by the
*mute*()methods and will be usedby PR #75553.
Note that the remaining methods which won't have OCC -
updateApiKey(),enable()anddisable()(update()will eventually have OCC in theAPI) are doing full updates because they update fields contributing to
AAD. That means it's possible for data to be overwritten without any
conflict notification, if these methods are called at the same time as
other methods.
Checklist
Delete any items that are not applicable to this PR.