[Alerting v2] Dispatcher grouping modes, throttle strategies, and matcher autosuggestion#260249
Conversation
## Summary
> [!NOTE]
> Dear reviewers. This PR is getting merged into a feature branch. Only
the ResponseOps review is needed it at the moment. We will request for
your review when we open the feature branch PR to be merged on `main`.
This PR implements the Director component of the alerting v2 core
engine. The Director is an asynchronous state engine responsible for
deriving alert state transitions (e.g., Pending → Active) from the
immutable stream of raw alert events.
## Architecture
### Strategy Pattern
To ensure the Director remains agnostic of specific business logic, we
implemented the Strategy Pattern. The Director facilitates the data
flow, while an `ITransitionStrategy` defines the actual state machine
logic. This allows us to support different transition behaviors, based
on rule configuration or alert event type, without modifying the core
service. It may seem overengineering at the moment, but I think it will
help us in the long run. At the moment, only one strategy is supported,
the `BasicTransitionStrategy`, which moves the states like `inactive ->
pending -> active -> recovering -> inactive` based on a) the status of
the alert event and the latest episode status if exist.
### Episode Lifecycle Management
The state is calculated as:
- **Inactive + Breached → `Pending`**: A new alert has started, but must
wait in Pending before becoming Active.
- **Pending + Recoverde → `Inactive`**: The condition cleared before it
could become Active.
- **Active + Recovered → `Recovering`**: An active alert has stopped
breaching and enters the recovery phase.
- **Recovering + Breached → `Active`**: An alert that was recovering has
breached again.
The episode ID is preserved across pending, active, and recovering
states. A new episode ID is generated only when transitioning from
inactive to a non-inactive state (a new episode starts).
> [!IMPORTANT]
> Calculating the states based on counts or timeframes will be
implemented on the next PR to avoid growing the size of the PR and make
reviewing the fundamentals of the director easier. Same for streaming
the ESQL results to the director and to the datastream.
```mermaid
flowchart LR
subgraph Lifecycle["Episode Lifecycle"]
direction LR
INACTIVE((INACTIVE))
PENDING((PENDING))
ACTIVE((ACTIVE))
RECOVERING((RECOVERING))
INACTIVE -->|"breached<br/>New Episode ID"| PENDING
PENDING -->|breached| ACTIVE
ACTIVE -->|recovered| RECOVERING
RECOVERING -->|recovered| INACTIVE
RECOVERING -->|"breached<br/>"| ACTIVE
PENDING -->|"recovered<br/>Episode Ends"| INACTIVE
end
style INACTIVE fill:#9e9e9e,color:#fff
style PENDING fill:#ffc107,color:#000
style ACTIVE fill:#f44336,color:#fff
style RECOVERING fill:#ff9800,color:#000
```
### Example
**Alert events**
Row | @timestamp | Status | Episode status | Episode ID
-- | -- | -- | -- | --
1 | 10:00 | breached | pending | uuid-1
2 | 10:05 | breached | active | uuid-1
3 | 10:10 | recovered | recovering | uuid-1
4 | 10:15 | recovered | recovered | uuid-1
5 | 10:20 | breached | pending | uuid-2
### Out of scope
- Changed state transitions based on counts or timeframes.
- Streaming of ES|QL results
## Testing
1. Create a rule that fires breach events.
2. Maturation:
- Verify that the alert event documents have the correct episode status,
alert event status, and the episode ID on each run.
Recovering is not possible to be tested atm as we need the rule executor
to produce these alert events.
### Checklist
Check the PR satisfies following conditions.
Reviewers should verify this PR satisfies this list as well.
- [x] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios
---------
Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>
elastic#251529) resolves elastic/rna-program#106 ## Summary Use QueryService in Dispatcher, and update query to use latest alert-event mapping (episode.id, episode.status). Updated integration tests accordingly.
## Summary
Refactors task registration to use a declarative binding pattern similar
to route registration, making it easier to add new task types.
**Changes:**
- Add `TaskDefinition` token and `AlertingTaskDefinition` interface for
task definitions
- Replace `registerRuleExecutorTaskDefinition()` function with
`RuleExecutorTaskDefinition` constant object
- Create `bind_tasks.ts` with `onActivation` hook for automatic Task
Manager registration
- Add `requiresFakeRequest` option to `TaskRunnerFactory` to support
tasks that don't need API key authentication
### Before
```typescript
// Manual registration in bind_on_setup.ts
registerRuleExecutorTaskDefinition({
taskManager: container.get(PluginSetup('taskManager')),
taskRunnerFactory: container.get(TaskRunnerFactoryToken),
});
```
### After
```typescript
// Declarative binding in bind_tasks.ts
bind(TaskDefinition).toConstantValue(RuleExecutorTaskDefinition);
// Registration happens automatically via onActivation
```
### Adding new task types
```typescript
// 1. Define the task
export const MyTaskDefinition: AlertingTaskDefinition<MyTaskRunner> = {
taskType: 'alerting_v2:my_task',
title: 'My Task',
timeout: '5m',
paramsSchema: schema.object({ /* ... */ }),
taskRunnerClass: MyTaskRunner,
requiresFakeRequest: false, // optional, defaults to true
};
// 2. Bind it
bind(TaskDefinition).toConstantValue(MyTaskDefinition);
```
## Summary This PR skips running the director if the rule is not alertable. ### Checklist Check the PR satisfies following conditions. Reviewers should verify this PR satisfies this list as well. - [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios
## Summary Closes elastic/rna-program#124 This PR creates a `UserService` class to abstract the internals of the security plugin when accessing user information. Usages of `username` were replaced by `profile_uid`. --------- Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>
…tion_tests/ci_checks
…ic#252073) ## Summary **I am opening this for review, but I am still running some manual tests.** Closes elastic/rna-program#121 This PR changes the resource initializer in `alerting_v2` to use the `@kbn/data-streams` package. --------- Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>
Resolves elastic/rna-program#84 ## Summary > [!NOTE] > I've followed the same approach as the rules API / Rules Client / Rules Saved Object This PR introduces the Notification Policy saved object, the CRUD APIs and a client to use from the API and other services. Added Integration tests on top of uni tests ## Manual testing ```bash curl --request POST \ --url http://localhost:5601/internal/alerting/v2/notification_policies \ --header 'Authorization: Basic <token>' \ --header 'Content-Type: application/json' \ --header 'kbn-xsrf: true' \ --header 'x-elastic-internal-origin: kibana' \ --data '{ "name": "policy-1", "description": "desc", "workflow_id": "my-workflow-id" }' # note <<ID>> curl --request GET \ --url http://localhost:5601/internal/alerting/v2/notification_policies/<ID> \ --header 'Authorization: Basic <token>' \ --header 'Content-Type: application/json' \ --header 'kbn-xsrf: true' \ --header 'x-elastic-internal-origin: kibana' curl --request PUT \ --url http://localhost:5601/internal/alerting/v2/notification_policies/<ID> \ --header 'Authorization: Basic <token>' \ --header 'Content-Type: application/json' \ --header 'kbn-xsrf: true' \ --header 'x-elastic-internal-origin: kibana' \ --data '{ "name": "policy-1-updated", "description": "desc", "workflow_id": "my-workflow-id-2", "version": "doc_version" }' curl --request DELETE \ --url http://localhost:5601/internal/alerting/v2/notification_policies/<ID> \ --header 'Authorization: Basic <token>' \ --header 'kbn-xsrf: true' \ --header 'x-elastic-internal-origin: kibana' ``` --------- Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>
mbondyra
left a comment
There was a problem hiding this comment.
Didn't test but presentation code LGTM 👌🏼
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
jasonrhodes
left a comment
There was a problem hiding this comment.
This is looking really good. Having manually tested this a lot, my main concerns are:
-
It's still pretty hard to understand what's available in the payload. data.* is tricky, sometimes it's not there (recovery notifications are quite bad at the moment). Rule metadata is missing, only the rule.id is available, which makes it hard to know which rule produced the alert, what the grouping.fields are, etc. without having a lookup on rules in the workflow.
-
It's hard to know what I should expect when it comes to the following things overlaid with each other:
- user-driven suppression (ack, snooze)
- dispatcher de-duplication (don't re-process the same exact rule event twice)
- time-based throttling (at most once every 5 min)
- dispatcher grouping (e.g. digest mode)
I'm not sure we can really fix either of these, or that they are dealbreakers for this PR, so I think I'm okay with it as-is.
I'll hold off on approving just to talk about the circuit-breaker options, those are the most concerning part of merging this now and then doing those in follow up in case the follow ups don't make it into 9.4.
| }); | ||
| } | ||
|
|
||
| groupMap.get(notificationGroupId)!.episodes.push(episode); |
There was a problem hiding this comment.
Is it worth introducing a "maxEpisodesPerGroup" config somewhere to limit how many can go here? Or any other guards around unbounded arrays, unbounded for loops, etc?
Introduces the `observability:alerting:dispatcherEnabled` Kibana Advanced Setting to control whether the dispatcher task processes alert episodes. When disabled, the task still runs on schedule but returns early doing no work, preserving its state so no episodes are missed when re-enabled. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
@jasonrhodes Added circuit breaker, default is enabled when alerting feature flag is enabled. |
jasonrhodes
left a comment
There was a problem hiding this comment.
We should make the new setting global and not per-space. When I made these changes, everything was working great.
Note: there is a weird caching bug you can hit when you switch the setting from space to global... https://gist.github.com/jasonrhodes/9e465348625dfe3142fed964d1bbca46
I'll create an issue for that later, for the management team. Not a bug for this PR.
| alertingVTwo: {}, | ||
| })); | ||
|
|
||
| container.get(CoreSetup('uiSettings')).register(dispatcherUiSettings); |
There was a problem hiding this comment.
| container.get(CoreSetup('uiSettings')).register(dispatcherUiSettings); | |
| container.get(CoreSetup('uiSettings')).registerGlobal(dispatcherUiSettings); |
Currently it says it is per-space but that's not even true:
With this change it should show up here:
Note: there is also an associated read-side change I'll add in a sec.
| const uiSettings = get(CoreStart('uiSettings')); | ||
| return async () => { | ||
| const soClient = savedObjects.createInternalRepository(); | ||
| const client = uiSettings.asScopedToClient(soClient); |
There was a problem hiding this comment.
| const client = uiSettings.asScopedToClient(soClient); | |
| const client = uiSettings.globalAsScopedToClient(soClient); |
Read-side change for making this a global advanced setting rather than a space-specific one.
ca55082 to
aa10b40
Compare
| .check((payload) => { | ||
| if (payload.value.throttle === null || payload.value.throttle === undefined) return; | ||
| if (payload.value.groupingMode === undefined) return; | ||
| validateGroupingModeAndStrategy(payload); | ||
| }); |
There was a problem hiding this comment.
🟠 High src/notification_policy_data_schema.ts:165
In updateNotificationPolicyDataSchema, when groupingMode is undefined but throttle.strategy is 'per_status_interval' or 'time_interval', the check returns early on line 167 and skips validateGroupingModeAndStrategy. This allows payloads like { throttle: { strategy: 'per_status_interval' } } to pass validation without the required interval, violating the same rules enforced for createNotificationPolicyDataSchema.
.check((payload) => {
- if (payload.value.throttle === null || payload.value.throttle === undefined) return;
- if (payload.value.groupingMode === undefined) return;
- validateGroupingModeAndStrategy(payload);
+ if (payload.value.throttle === null || payload.value.throttle === undefined) return;
+ validateGroupingModeAndStrategy(payload);
});🤖 Copy this AI Prompt to have your agent fix this:
In file x-pack/platform/packages/shared/response-ops/alerting-v2-schemas/src/notification_policy_data_schema.ts around lines 165-169:
In `updateNotificationPolicyDataSchema`, when `groupingMode` is `undefined` but `throttle.strategy` is `'per_status_interval'` or `'time_interval'`, the check returns early on line 167 and skips `validateGroupingModeAndStrategy`. This allows payloads like `{ throttle: { strategy: 'per_status_interval' } }` to pass validation without the required `interval`, violating the same rules enforced for `createNotificationPolicyDataSchema`.
Evidence trail:
x-pack/platform/packages/shared/response-ops/alerting-v2-schemas/src/notification_policy_data_schema.ts lines 162-168 (update schema check with early return at line 164), lines 47-69 (validateGroupingModeAndStrategy with interval check at lines 62-67), line 145 (create schema unconditionally calls validateGroupingModeAndStrategy), line 42 (STRATEGIES_REQUIRING_INTERVAL = ['per_status_interval', 'time_interval'])
| const handleValueChange = (e: React.ChangeEvent<HTMLInputElement>) => { | ||
| const raw = e.target.value; | ||
| if (raw === '') { | ||
| setDurationValue(''); | ||
| onChange(''); | ||
| return; | ||
| } | ||
|
|
||
| const num = parseInt(raw, 10); | ||
| if (!Number.isNaN(num)) { | ||
| setDurationValue(num); | ||
| if (num > 0) { | ||
| onChange(`${num}${durationUnit}`); | ||
| } | ||
| } | ||
| }; |
There was a problem hiding this comment.
🟢 Low duration_input/duration_input.tsx:77
When the user enters 0, setDurationValue(0) updates the UI to display 0, but onChange is not invoked because the num > 0 check on line 88 fails. This creates a state inconsistency: the parent component retains the previous value while the UI shows 0. Subsequently, if the user changes the duration unit while 0 is displayed, handleUnitChange also skips calling onChange (line 97 checks durationValue > 0), so the unit change is silently discarded. Consider invoking onChange('') for zero values to keep parent and child state consistent, or prevent 0 from being entered in the first place via the min={1} prop.
const handleValueChange = (e: React.ChangeEvent<HTMLInputElement>) => {
const raw = e.target.value;
if (raw === '') {
setDurationValue('');
onChange('');
return;
}
const num = parseInt(raw, 10);
if (!Number.isNaN(num)) {
setDurationValue(num);
- if (num > 0) {
+ if (num > 0) {
onChange(`${num}${durationUnit}`);
+ } else {
+ onChange('');
}
}
};🤖 Copy this AI Prompt to have your agent fix this:
In file x-pack/platform/plugins/shared/alerting_v2/public/components/notification_policy/form/components/duration_input/duration_input.tsx around lines 77-92:
When the user enters `0`, `setDurationValue(0)` updates the UI to display `0`, but `onChange` is not invoked because the `num > 0` check on line 88 fails. This creates a state inconsistency: the parent component retains the previous value while the UI shows `0`. Subsequently, if the user changes the duration unit while `0` is displayed, `handleUnitChange` also skips calling `onChange` (line 97 checks `durationValue > 0`), so the unit change is silently discarded. Consider invoking `onChange('')` for zero values to keep parent and child state consistent, or prevent `0` from being entered in the first place via the `min={1}` prop.
Evidence trail:
x-pack/platform/plugins/shared/alerting_v2/public/components/notification_policy/form/components/duration_input/duration_input.tsx at REVIEWED_COMMIT:
- Lines 77-89: handleValueChange function shows setDurationValue(num) on line 85 is called for any valid number, but onChange is only called when num > 0 (line 86-88)
- Lines 92-98: handleUnitChange function shows setDurationUnit(newUnit) on line 94 is always called, but onChange is only called when durationValue > 0 (line 95-97)
- Line 117: min={1} prop on EuiFieldNumber doesn't prevent manual typing of 0
jasonrhodes
left a comment
There was a problem hiding this comment.
Changes look good, I've confirmed it works as expected and that the circuit-breaker off switch in global settings works too. I've also confirmed that when the main feature flag is disabled, the global setting is not present, the APIs are not reachable, and the dispatcher task is not running.
LGTM, let's keep evolving this! 🥳
💔 Build Failed
Failed CI StepsTest Failures
Metrics [docs]Module Count
Public APIs missing comments
Any counts in public APIs
Async chunks
Page load bundle
Unknown metric groupsAPI count
References to deprecated APIs
Unreferenced deprecated APIs
History
cc @kdelemme |
…heck * commit '6f040b29a5220ce12886a9731f656613e50aff06': (34 commits) [Entity Analytics] Add entity resolution UI to service flyout (elastic#260504) [Dashboard] Fix setState in embeddables (elastic#260082) [EDR Workflows] Unskip FTR tests that failed due to transient Fleet service unavailability (elastic#260519) [Observability:Streams] Fix query streams error handling test (elastic#260777) [Alerting v2] Dispatcher grouping modes, throttle strategies, and matcher autosuggestion (elastic#260249) [Dashboard] State extraction as a consistent override (elastic#259839) [Alerting v2] [Rule authoring] Fix rule name validation and error visibility in create/edit flow (elastic#260337) [Fix] re-introduce sln breadcrumbs to unified rules (elastic#260289) [Security Solution][Endpoint] Updated kibana docs to include `xpack.securitySolution.maxEndpointScriptFileSize` as configurable in cloud (elastic#260568) [Alerting v2] updated the alerting-v2-constants package with artifacts constants, fix to the runbook max characters (elastic#260342) [Automatic Import V2] Provide user tooltips (elastic#260725) [One Workflow] Deduplicate step types by base type in workflow list (elastic#260763) [Security Solution] Execution results UI: Enable the feature flag (elastic#260711) [Metrics][Discover] internal/search/esql_async returns 200 but METRICS_INFO responds with error (elastic#260746) Collapse redundant anyOf/oneOf array unions in OAS query params (elastic#260585) [Unified rules] Hide stack rules from global search (elastic#260088) [Agent Builder] Sidebar navigation updates (elastic#260728) [* As Code] Use PUT for upserts (elastic#260318) Update EUI to v114.0.0 (elastic#259497) [Entity Resolution] Add contextual-security-apps as co-owner of resolution paths (elastic#260659) ... # Conflicts: # src/platform/plugins/shared/dashboard/public/index.ts
…cher autosuggestion (elastic#260249)

Summary
Redesigns the dispatcher pipeline to separate throttling from grouping, adds matcher field autosuggestion for the notification policy form, and fixes several runtime issues.
Dispatcher redesign: grouping modes & throttle strategies
per_episode(default),all,per_fieldon_status_change,per_status_interval,time_interval,every_timeLAST(episode_status)instead ofBY episode_status)BuildGroupsStepswitches onpolicy.groupingModefor group key computationApplyThrottlingSteprewritten with status-aware throttling forper_episodemodeStoreActionsStepalways emits notified records with realgroup_hashand conditionalepisode_statusruleId— episodes from different rules merge when matched by the same policyruleIdremoved fromNotificationGroupandNotificationPolicyWorkflowPayload(available viaepisodes[*].rule_id)Matcher autosuggestion
MatcherSuggestionsServicefetches field/value suggestions from rules saved objects, alert events ES, and static episode status values/suggestions/data_fieldsand POST/suggestions/valuesAPI routes'notification_policies'abstraction typeDefault throttle interval
5mwhen it becomes visible (on grouping-mode switch, strategy switch, or initial mount)Bug fixes
getLastNotifiedTimestampsQuerywas missingepisode_statusfrom STATS/KEEP, causingon_status_changeandper_status_intervalstrategies to always dispatchgetRuleSoFieldSuggestions— special characters like*were interpreted as wildcards instead of literalsgroupingModenot in payload — now skips mode/strategy compatibility check whengroupingModeis absentDispatchStep— a failing workflow dispatch no longer skips remaining destinations for the same groupepisode_statusnot found on data streams)groupingModesaved object schema to acceptnullvaluesgroupingModefield in test mocks forNotificationPolicyResponsez.string().datetime()→z.iso.datetime()and.superRefine()→.check()Test plan
per_episode+on_status_changebehaviorallgrouping mode — verify episodes from multiple rules merge into one groupper_fieldgrouping mode — verify field-based grouping worksevery_timestrategy dispatches on every tickper_status_intervalstrategy dispatches on status change AND on interval expiryper_status_intervalstrategy — verify repeat interval pre-fills with 5mnode scripts/jest x-pack/platform/plugins/shared/alerting_v2/— all tests pass🤖 Generated with Claude Code