[Cases] Case action: Error handling and retries#173012
cnasikas merged 83 commits into elastic:case_action
This file contains the same logic as before. The only changes concern how errors are thrown: specifically, all the `this.handleAndThrowErrors(restOfErrors);` lines plus the `handleAndThrowErrors` method. Nothing else changed.
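A minimal sketch of what a `handleAndThrowErrors`-style helper could look like. The error shape (`statusCode`, `error`, `message`) and the `"Conflict: ..."` message format are taken from the test fixtures and snapshots in this PR. The class body, the standalone function form, and the choice to throw for the first error only are assumptions for illustration, not the actual implementation.

```typescript
// Shape of an error returned by a bulk operation, modelled on the test fixtures in this PR.
interface BulkError {
  statusCode: number;
  error: string;
  message: string;
}

// Illustrative stand-in for the CasesConnectorError introduced by the PR (body is assumed).
class CasesConnectorError extends Error {
  readonly statusCode: number;

  constructor(message: string, statusCode: number) {
    super(message);
    this.name = 'CasesConnectorError';
    this.statusCode = statusCode;
  }
}

// Assumption: throw for the first error in the list and do nothing when the list is empty.
function handleAndThrowErrors(errors: BulkError[]): void {
  const first = errors[0];
  if (first === undefined) {
    return;
  }
  // Produces messages such as "Conflict: updating records: mockBulkUpdateRecord error".
  throw new CasesConnectorError(`${first.error}: ${first.message}`, first.statusCode);
}
```

Throwing a single typed error gives the retry layer one place to inspect the status code and decide whether to retry.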
Same tests as before, plus some new tests for the retry logic.
| }); | ||
| }); | ||
|
|
||
| describe('Retries', () => { |
These are the new tests.
| } | ||
|
|
||
| public async run(params: CasesConnectorRunParams) { | ||
| const { alerts, groupingBy } = params; |
Moved the logic to the CasesConnectorExecutor class.
| } | ||
|
|
||
| return counterLastUpdatedAtAsDate < parsedDate.toDate(); | ||
| await this.retryService.retryWithBackoff(() => this._run(params)); |
The CasesConnector class is responsible only for the retry logic.
| message: 'An error', | ||
| }; | ||
|
|
||
| export const alerts = [ |
Moved from the test files.
This test covers only how the executor is called and the retry logic. All tests of the executor logic moved to `cases_connector_executor.test.ts`.
All logic of the executor moved to cases_connector_executor.ts.
| `"Conflict: getting records: mockBulkGetRecords error"` | ||
| ); | ||
|
|
||
| resetCounters(); |
The generation of the IDs is mocked. It uses counters to get an incremental ID each time an ID is requested. We need to reset the counters before we retry the execution.
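A sketch of the idea, assuming a hypothetical counter-based ID generator. Only the name `resetCounters` appears in the diff; `nextId`, the per-prefix counters, and the ID format are made up for illustration.

```typescript
// One counter per ID prefix, so each kind of object gets its own incremental sequence.
const counters = new Map<string, number>();

// Hypothetical mocked ID generator: returns "<prefix>-1", "<prefix>-2", ...
function nextId(prefix: string): string {
  const next = (counters.get(prefix) ?? 0) + 1;
  counters.set(prefix, next);
  return `${prefix}-${next}`;
}

// Resetting makes a retried execution generate the same IDs as the first attempt.
function resetCounters(): void {
  counters.clear();
}
```

Without the reset, the second execution would ask for IDs that continue from where the first one stopped, and assertions on specific IDs would no longer match.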
Pinging @elastic/response-ops (Team:ResponseOps)
Pinging @elastic/response-ops-cases (Feature:Cases)
| import type { BackoffStrategy, BackoffFactory } from './types'; | ||
|
|
||
| export class CaseConnectorRetryService { | ||
| private maxAttempts: number = 10; |
nit: does this need to be initialized here if the constructor sets it?
Good point, probably not 🙂.
| } | ||
|
|
||
| private stop(): void { | ||
| if (this.timer !== null) { |
nit: I think this check is unnecessary; `clearTimeout` will just do nothing if there is no active timer.
I did not know about it, thanks!
Weird, when I tried it locally it didn't show anything. Never mind then!
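A small sketch of the point made in this thread: calling `clearTimeout` with an unset or already-cleared handle is a silent no-op, so `stop()` does not need a guard. The `Timer` class is hypothetical and only stands in for the retry service's timer handling; it uses `undefined` for the unset handle, whereas the code under review compares against `null`.

```typescript
class Timer {
  // Unset until start() is called.
  private handle: ReturnType<typeof setTimeout> | undefined;
  fired = false;

  start(ms: number): void {
    this.handle = setTimeout(() => {
      this.fired = true;
    }, ms);
  }

  stop(): void {
    // No guard needed: clearTimeout ignores an unset or already-cleared handle.
    clearTimeout(this.handle);
    this.handle = undefined;
  }
}
```

Calling `stop()` before `start()`, or calling it twice in a row, is therefore safe.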
| } | ||
| ); | ||
|
|
||
| it('should succeed if cb does not throws', async () => { |
| it('should succeed if cb does not throws', async () => { | |
| it('should succeed if cb does not throw', async () => { |
| `"My transient error"` | ||
| ); | ||
|
|
||
| expect(cb).toBeCalledTimes(maxAttempts + 1); |
The execution goes as follows:
- The `cb` is called for the first time. `cb` throws an error. The retry service retries the `cb`.
- The `cb` is called for a second time. First retry. `cb` throws an error. The retry service retries the `cb`.
- The `cb` is called for a third time. Second retry. `cb` throws an error. The retry service retries the `cb`.
- The `cb` is called for the fourth time. Third retry. `cb` throws an error. The retry service does not retry and throws an error.

Basically, the first execution of `cb` does not count as a retry. In this walkthrough with three retries, `cb` is called four times in total, which is why the test expects `maxAttempts + 1` calls.
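The counting above can be sketched with a simplified retry helper. This is not the `CaseConnectorRetryService` from this PR: the name `retryWithBackoff` mirrors the method used in the diff, but the signature, the injectable `sleep` parameter, and the exponential delays are assumptions made for the example.

```typescript
type Sleep = (ms: number) => Promise<void>;

const defaultSleep: Sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: runs cb once, then retries it up to maxAttempts more times.
async function retryWithBackoff<T>(
  cb: () => Promise<T>,
  maxAttempts: number,
  sleep: Sleep = defaultSleep
): Promise<T> {
  let retries = 0;
  for (;;) {
    try {
      return await cb();
    } catch (error) {
      // The first execution is not a retry, so cb runs at most maxAttempts + 1 times.
      if (retries >= maxAttempts) {
        throw error;
      }
      retries += 1;
      // Assumed exponential backoff: 100 ms, 200 ms, 400 ms, ...
      await sleep(100 * 2 ** (retries - 1));
    }
  }
}
```

With `maxAttempts = 3`, a callback that always throws is invoked four times before its last error is rethrown.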
| return [validRecords, errors]; | ||
| }; | ||
|
|
||
| export const partitionByNonFoundErrors = <T extends Array<{ statusCode: number }>>( |
No, this is a new function that separates 404 errors from all other errors.
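A minimal sketch of such a partition, under stated assumptions: the real `partitionByNonFoundErrors` is generic over the array type, while this version is generic over the element type for simplicity, and the order of the returned tuple (404 errors first) is a guess.

```typescript
// Splits errors into [404 errors, all other errors]. The tuple order is an assumption.
function partitionByNonFoundErrors<E extends { statusCode: number }>(errors: E[]): [E[], E[]] {
  const nonFoundErrors: E[] = [];
  const restOfErrors: E[] = [];

  for (const error of errors) {
    if (error.statusCode === 404) {
      nonFoundErrors.push(error);
    } else {
      restOfErrors.push(error);
    }
  }

  return [nonFoundErrors, restOfErrors];
}
```

Separating the two groups lets the caller treat "not found" as an expected state while handing everything else to the error-throwing path.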
| expectCasesToHaveTheCorrectAlertsAttachedWithGrouping(casesClientMock); | ||
| }); | ||
|
|
||
| it('attaches the alerts correctly while creating a record and another node has already created it', async () => { |
Isn't this fundamentally the same as 'attaches the alerts correctly when bulkCreateRecord fails'?
| // conflict error. Another node had updated the record. | ||
| mockBulkUpdateRecord.mockResolvedValueOnce([ | ||
| { | ||
| id: groupedAlertsWithOracleKey[0].oracleKey, | ||
| type: CASE_ORACLE_SAVED_OBJECT, | ||
| message: 'updating records: mockBulkUpdateRecord error', | ||
| statusCode: 409, | ||
| error: 'Conflict', | ||
| }, | ||
| ]); | ||
|
|
||
| await expect(() => | ||
| connectorExecutor.execute({ | ||
| alerts, | ||
| groupingBy, | ||
| owner, | ||
| rule, | ||
| timeWindow, | ||
| reopenClosedCases, | ||
| }) | ||
| ).rejects.toThrowErrorMatchingInlineSnapshot( | ||
| `"Conflict: updating records: mockBulkUpdateRecord error"` | ||
| ); |
I get that, since we are testing retries, we want to actually call the same thing (`connectorExecutor.execute`) twice.
How relevant is that, though?
We always have the same chunks in these tests: we mock some error for `mockBulk*Record` and expect some snapshot for `connectorExecutor.execute`. Are we really ensuring something, or is this just overhead?
I don't know, food for thought. Maybe some integration tests would be more useful.
The execution of the connector is designed to be as idempotent as possible. Retrying should not affect the proper execution of the connector; alerts should always be attached to the correct cases. In the RFC you can see how the connector is modeled as a state machine where each error and retry leads to the correct state each time, at least in theory. The tests try to simulate multiple nodes making changes to cases at the same time (race conditions), which unfortunately cannot be tested with integration tests. When the system actions are ready we will write a lot of integration tests, but only for the logic of one node executing the connector.
Now to the tests. The mock functions we are interested in have a chain of `mockResolvedValueOnce(...).mockResolvedValueOnce(...)`. The first `mockResolvedValueOnce` is used on the first execution and the second one on the second. This way we try to simulate different responses based on different actions on different nodes. For example, if the first `mockResolvedValueOnce` returns a failure (conflict), we retry, and the second one returns different results. This simulates that on the first try another node made a change (increased the counter, for example, or created the case for us) and that on the second try we get the new state. The new state should not affect the execution of the connector, which should still attach the alerts to the correct case. The snapshot check ensures that the correct function threw the error and not another one. After checking the snapshot, we check which case the alerts got attached to.
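The pattern can be sketched without Jest. `queuedMock` below is a hand-rolled stand-in for a chain of `mockResolvedValueOnce` calls, and `execute` is a hypothetical, heavily simplified executor; only the conflict fixture fields and the error message format come from the tests in this PR.

```typescript
interface RecordResult {
  id: string;
  counter?: number;
  statusCode?: number;
  error?: string;
  message?: string;
}

// Stand-in for jest.fn().mockResolvedValueOnce(a).mockResolvedValueOnce(b):
// each call resolves with the next queued response.
function queuedMock<T>(...responses: T[]): () => Promise<T> {
  const queue = [...responses];
  return async () => {
    const next = queue.shift();
    if (next === undefined) {
      throw new Error('No more queued responses');
    }
    return next;
  };
}

const mockBulkUpdateRecord = queuedMock<RecordResult[]>(
  // First execution: conflict, another node updated the record in the meantime.
  [{ id: 'oracle-key', statusCode: 409, error: 'Conflict', message: 'updating records: mockBulkUpdateRecord error' }],
  // Second execution (the retry): the new state written by the other node.
  [{ id: 'oracle-key', counter: 2 }]
);

// Hypothetical executor step: fails on a conflict, otherwise returns the counter.
async function execute(): Promise<number> {
  const results = await mockBulkUpdateRecord();
  const result = results[0];
  if (result === undefined) {
    throw new Error('No record returned');
  }
  if (result.statusCode === 409) {
    throw new Error(`${result.error}: ${result.message}`);
  }
  return result.counter ?? 0;
}
```

The first call to `execute()` rejects with the conflict message and the second resolves with the state left by the other node, which is the two-step sequence the retry tests assert on.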
💔 Build Failed
## Summary

Depends on: #166267, #170326, #169484, #173740, #173763, #178068, #178307, #178600, #180437

PRs:
- #168370
- #169229
- #171754
- #172709
- #173012
- #175107
- #175452
- #175505
- #177033
- #178277
- #177139
- #179796

Fixes: #153837

## Testing

Run Kibana with `--run-examples` if you want to use the "Always firing" rule. Create a rule with a case action in observability and the stack. The security solution is not supported. You should not be able to assign a case action in a security solution rule.

1. Test the "Reopen closed cases" configuration.
2. Test the "Grouping by" configuration. Only one field is allowed. Not all fields are persisted in alerts. If you select a field not part of the alert the case action will create a case where the grouping value is set to `unknow`.
3. Test the "Time window" feature. You can comment out the validation to test for shorter times.
4. Verify that the case action is experimental.
5. Verify that based on the rule type the case is created in the correct solution.
6. Verify that you cannot create a rule with the case action on the basic license.
7. Verify that the execution of the case action fails if you do not have permission for cases. Pending work on the system actions framework level to not allow users to create rules with system actions where they do not have permission.
8. Stress test the case action by creating multiple rules.

### Checklist

Delete any items that are not applicable to this PR.

- [x] [Documentation](https://www.elastic.co/guide/en/kibana/master/development-documentation.html) was added for features that require explanation or tutorials
- [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios

### For maintainers

- [x] This was checked for breaking API changes and was [labeled appropriately](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)

## Release notes

Automatically create cases when an alert is triggered.

---------

Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>
Co-authored-by: adcoelho <antonio.coelho@elastic.co>
Co-authored-by: Janki Salvi <117571355+js-jankisalvi@users.noreply.github.com>

Summary
This PR:
- `CasesConnectorError` error
- `CasesConnectorExecutor`
- `CasesConnector` class handle only the retry logic of the connector

Depends on: #172709