Skip to content

[ResponseOps][Alerting] Alerting v2#247464

Merged
darnautov merged 258 commits intomainfrom
alerting_v2
Mar 27, 2026
Merged

[ResponseOps][Alerting] Alerting v2#247464
darnautov merged 258 commits intomainfrom
alerting_v2

Conversation

@cnasikas
Copy link
Copy Markdown
Member

@cnasikas cnasikas commented Dec 24, 2025

Summary

Key capabilities

  • ES|QL-native rule evaluation — Rules are defined as ES|QL queries with optional WHERE clause conditions, evaluated on a configurable schedule
  • Alert lifecycle management — Full episode tracking with pending → active → recovering → inactive state transitions, including configurable alert delay (consecutive breaches / duration)
  • Event-driven architecture — Alert events and actions are stored in dedicated data streams (.alerting-events, .alerting-actions) with ES|QL views for querying
  • Notification dispatch pipeline — A multi-step dispatcher that matches alert episodes to notification policies, handles throttling/suppression, and triggers Kibana Workflows using encrypted API keys
  • Notification policies — CRUD APIs and UI for creating notification policies with KQL-based rule matching, workflow integration, and API key management
  • Rule authoring UI — A shared rule form package (@kbn/alerting-v2-rule-form) usable standalone or embedded in Discover, with ES|QL editor, WHERE clause condition editing, recovery configuration, and live query preview
  • Rule management UI — Full rule list with pagination, enable/disable, clone, edit, and delete operations
  • APM instrumentation — Middleware and decorators for tracing rule execution and client operations

Architecture highlights

  • InversifyJS DI — All services use constructor injection with typed tokens, scoped per-request or singleton as appropriate
  • Pipeline pattern — Rule executor and dispatcher use composable step-based pipelines
  • Saved Objects — Rules stored as hidden saved objects; notification policies stored as encrypted saved objects (for API key protection)
  • Feature privileges — Dedicated Kibana feature with read/all privileges for RBAC

Contained PRs

Core Engine & Plugin Init (12 PRs)
Rule Execution Pipeline (12 PRs)
Alert Suppression & Episodes (3 PRs)
Dispatcher & Notification Engine (6 PRs)
Notification Policies (Server) (4 PRs)
Notification Policies UI (1 PR)
Rule Authoring UI (13 PRs)
API Documentation & Schema (2 PRs)
Observability & Monitoring (3 PRs)
CI & Maintenance (2 PRs)

@cnasikas cnasikas self-assigned this Dec 24, 2025
@cnasikas cnasikas added release_note:skip Skip the PR/issue when compiling release notes backport:skip This PR does not require backporting Team:ResponseOps Platform ResponseOps team (formerly the Cases and Alerting teams) t// v9.4.0 labels Dec 24, 2025
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Dec 24, 2025

🔍 Preview links for changed docs

## Summary

Adds the following API endpoints:
- GET rule by id
- GET rules (paginated)
- DELETE rule


### Checklist

- [x] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios
cnasikas and others added 2 commits January 12, 2026 12:13
## Summary

This PR creates the `Logger`, `ESQL`, and `Storage` services to abstract
various functionality that is needed by the main components of the new
alerting engine.

### Out of scope
- Retries
- Error handling

### Checklist

Check the PR satisfies following conditions. 

Reviewers should verify this PR satisfies this list as well.

- [x] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios

---------

Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>
## Summary

> [!NOTE]
> Dear reviewers, this PR is getting merged into a feature branch. Only
the ResponseOps review is needed it at the moment. We will request for
your review when we open the feature branch PR to be merged on `main`.

This PR initializes all resources needed for the alerting new engine.

## Test
Ensure that the `.alert-events`, `.alert-transitions`, and
`.alert-actions` datastreams are created correctly, along with their ILM
policies. Also, ensure that restarting Kibana after the creation of the
datastreams do not produce errors.

Fixes: elastic/rna-program#72

### Checklist

Check the PR satisfies following conditions. 

Reviewers should verify this PR satisfies this list as well.

- [x] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios

---------

Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>
darnautov and others added 5 commits March 25, 2026 18:53
Keep upstream/main formatting and merge logCancelEvent addition
from the feature branch into the task cancel handler.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you need to export all of this things?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I unfortunately can't see what this comment is about. I'll reach out in Slack, maybe we can review live over Zoom.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've removed the exports in https://github.com/elastic/kibana/pull/259649/commits, which is now merged into this PR.

Copy link
Copy Markdown
Contributor

@justinkambic justinkambic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

obs-exploration changes LGTM!

yiannisnikolopoulos and others added 7 commits March 26, 2026 17:02
…ule form duration fields (#258526)

## Summary

Adds maximum duration constraints to all duration fields in the alerting
v2 rule form, closing a validation gap where a direct API call could
submit arbitrarily large values that would pass server-side validation
undetected. Also enforces a minimum schedule interval and warns when the
lookback window is shorter than the execution interval.

**Problem:** Duration fields (`schedule.every`, `schedule.lookback`,
`state_transition.pending_timeframe`,
`state_transition.recovering_timeframe`) were only validated for format
correctness (e.g. `"5m"`, `"1h"`). There was no upper or lower bound, so
a malformed or malicious API payload could submit values like `"9999d"`
or `"1ms"` and they would be accepted by both the client and server.

**Solution:**

*Server-side (`@kbn/alerting-v2-schemas`):*
- Introduce `MAX_DURATION = '365d'` and `MIN_SCHEDULE_INTERVAL = '1m'`
constants in `constants.ts`
- Add `validateMaxDuration(value, max)`, `validateMinDuration(value,
min)`, and `parseDurationToMs(value)` to `validation.ts`, converting
duration strings to milliseconds for comparison
- Update `durationSchema` in `common.ts` to enforce `MAX_DURATION` on
every duration field automatically
- Add `scheduleEverySchema` in `rule_data_schema.ts` to additionally
enforce `MIN_SCHEDULE_INTERVAL` on `schedule.every`

*Client-side (`alerting-v2-rule-form`):*
- `ScheduleField` validates both min (`1m`) and max (`365d`) on
`schedule.every`
- `LookbackWindowField` validates max (`365d`) on `schedule.lookback`
and shows a plain-text help warning when the lookback is shorter than
the schedule interval
- `StateTransitionTimeframeField` validates max (`365d`) on
`pending_timeframe` and `recovering_timeframe`
- All client validation reuses the same functions from
`@kbn/alerting-v2-schemas`, keeping client and server in sync

The value `365d` was chosen to be intentionally generous: no legitimate
rule would need a schedule or timeframe exceeding one year. It is easier
to relax a constraint later than to introduce one after the fact (which
would be a breaking change for rules already storing larger values).

The decision for these changes has been captured in
[this](https://github.com/elastic/rna-program/blob/main/docs/decisions/2026-03-19-duration-field-validation-limits.md)
doc.

## Test plan

- [x] Unit tests pass for all modified files (128 tests across schema
and validation suites)
- [x] Dedicated `validation.test.ts` suite covering `parseDurationToMs`,
`validateMaxDuration`, and `validateMinDuration` including cross-unit
cases (`55w` exceeds `365d`, `500m` is within limit)
- [x] `schedule.every` rejects values below `1m` (e.g. `30s`, `59s`) and
above `365d` in both `createRuleDataSchema` and `updateRuleDataSchema`
- [x] `schedule.lookback`, `pending_timeframe`, and
`recovering_timeframe` reject values exceeding `365d`
- [ ] Manually verify: enter `30s` in Schedule field and confirm min
error appears
- [ ] Manually verify: enter `9999d` in Schedule / Lookback / Pending
timeframe / Recovering timeframe fields and confirm max error appears
- [ ] Manually verify: set Lookback window shorter than Schedule
interval and confirm warning text appears below the field

---------

Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>
…nu (#259649)

## Summary

Refactors how the v2 "Rules" menu in Discover handles legacy (v1) rule
types, replacing the `legacyRules` registry indirection with a simpler
merge-and-delete approach.

**Before:** A new `AppMenuActionId.legacyRules` was introduced as a
hidden container. Profile extensions had to dual-register items to both
`alerts` and `legacyRules`. The `legacyRules` container caused an empty
menu item bug in overflow popovers.

**After:** Profile extensions register to `alerts` as they always have.
When v2 is enabled, items from `alerts` are merged into the `createRule`
menu's `legacy-rules` submenu, then `alerts` is removed. No new registry
IDs, no dual registration, no empty menu items.

### Changes

- **`app_menu_registry.ts`**: Added `deleteItem()` method to remove menu
items by ID
- **`types.ts`**: Removed `AppMenuActionId.legacyRules` enum value
- **`get_create_rule.tsx`**: Removed hardcoded legacy items from the v2
menu (empty `legacy-rules` submenu is populated by merge). Updated
labels to "Create v2 ES|QL rule" and "Create v1 rules"
- **`use_top_nav_links.tsx`**: Removed `legacyRules` container
registration. Changed merge source from `legacyRules` to `alerts`. Added
`deleteItem` call to remove `alerts` after merge. Always register
`alerts` menu (not gated by `showLegacyAlerts`)
- **`get_app_menu.tsx` (observability profile)**: Removed duplicate
`registerPopoverItem` calls to `legacyRules`, keeping only `alerts`
registration

### Net effect: -88 lines, +18 lines

## Test plan

- [ ] Enable alerting v2 flag, go to Discover → ES|QL mode
- [ ] Verify "Rules" menu shows "Create v2 ES|QL rule" and "Create v1
rules" submenu
- [ ] Verify "Create v1 rules" submenu contains items from profile
extensions (e.g. search threshold, custom threshold for o11y)
- [ ] Verify no empty menu items appear in the overflow popover
- [ ] Disable alerting v2 flag, verify "Alerts" menu appears as before
with no regressions
- [ ] Test in Observability profile to confirm custom threshold + SLO
actions appear correctly


Made with [Cursor](https://cursor.com)

---------

Co-authored-by: Dominique Belcher <dominique.clarke@elastic.co>
## Summary

Fixes a regression introduced by #258526, which added a
`MIN_SCHEDULE_INTERVAL = '1m'` server-side validation on
`schedule.every`. This broke the Scout `episode_lifecycle` tests that
create rules with a `5s` schedule interval — the API now rejects those
requests, causing the "should track multiple groups independently" test
(and others in the suite) to fail.

**Fix:** Lower `MIN_SCHEDULE_INTERVAL` from `'1m'` to `'5s'` in
`constants.ts`. Update corresponding unit tests to reflect the new
boundary.

**Test removal:** Removes one `createRuleDataSchema` test case ("rejects
schedule.every below 1m via cross-unit (59s)") that was meant to
validate cross-unit comparison but wasn't doing that for 1m so was an
unnecessary test before and after these PR changes. (if the min value
had been 2m, you could test failures at 1m and at 95s, and that's
cross-unit testing ... but when the value is 1m and the lowest unit we
allow is seconds, there's no cross-unit test possible).

A follow-up issue will be created to discuss whether `5s` is the right
long-term minimum or if this should be configurable per environment.

## Test plan

- [x] `rule_data_schema.test.ts` — 95 tests pass
- [x] `validation.test.ts` — 48 tests pass

Made with [Cursor](https://cursor.com)
@darnautov
Copy link
Copy Markdown
Contributor

hi @AlexGPlay, could you please take another look?

Reverts the notification policy mapping/schema/test changes from 49f1ce6
while preserving the alerting_rule mapping fixes that align with the
model version create schema (recovery_policy, no_data, artifacts).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@jughosta jughosta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Data Discovery changes LGTM as they are behind the feature flag still.

Copy link
Copy Markdown
Contributor

@kdelemme kdelemme left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@davismcphee
Copy link
Copy Markdown
Contributor

Hi all, leaving this note here for future reference. We approved on the Data Discovery side because the changes do not actively cause any regressions with the feature flag tuned off. However, we do not approve of the current UX and are not ok with the feature flag being turned on by default until it's addressed. Some quick points of feedback from our end:

  • There should be no new top-level "Rules" app menu item, it must remain within the overflow menu as the existing one is. This was an explicit design decision we made after a lot of deliberation about the app menu, and any changes to it will need to go through the proper channels.
  • We don't like the split between new and legacy menu items. We should figure out a way to clean this up instead of exposing these technical details to end users. Someone on our end put together a very quick idea of how this might be able to work for inspiration:

Along with these changes, we'll probably ask that a portion of the code changes in our area are also reverted to their previous state. For example, without the duplicate app menu items, the changes to the app menu registry amount to tech debt for us, and make one our public APIs messy for consumers.

I appreciate there's some urgency to get this merged since it's so large, but it's not ideal from our end, and I just want to make sure we're all on the same page about expectations post-merge.

@elasticmachine
Copy link
Copy Markdown
Contributor

elasticmachine commented Mar 27, 2026

💔 Build Failed

Failed CI Steps

Test Failures

  • [job] [logs] Jest Tests #7 / getColumns rule column renders rule name as a PreviewLink when hidePreviewLink is false
  • [job] [logs] Jest Tests #7 / getColumns rule column renders rule name as a PreviewLink when hidePreviewLink is false
  • [job] [logs] Jest Tests #7 / getColumns rule column renders rule name as plain text when user cannot read rules
  • [job] [logs] Jest Tests #7 / getColumns rule column renders rule name as plain text when user cannot read rules
  • [job] [logs] FTR Configs #35 / Reporting Generate CSV from SearchSource validation Searches large amount of data, stops at Max Size Reached

Metrics [docs]

Module Count

Fewer modules leads to a faster build time

id before after diff
alertingVTwo - 502 +502
apm 2066 2104 +38
canvas 1401 1447 +46
cloudSecurityPosture 620 768 +148
datasetQuality 1134 1135 +1
dataVisualizer 876 877 +1
discover 2010 2011 +1
esql 974 975 +1
esqlDataGrid 289 437 +148
eventAnnotationListing 702 703 +1
fleet 1527 1674 +147
infra 1833 1834 +1
lens 1719 1720 +1
logsShared 365 513 +148
ml 4161 4162 +1
observability 1788 1919 +131
observabilityLogsExplorer 106 254 +148
osquery 605 752 +147
searchPlayground 445 593 +148
securitySolution 9299 9300 +1
slo 1296 1297 +1
streamsApp 1797 1798 +1
triggersActionsUi 1330 1331 +1
unifiedDocViewer 942 943 +1
workflowsManagement 1574 1721 +147
total +1911

Public APIs missing comments

Total count of every public API that lacks a comment. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats comments for more detailed information.

id before after diff
@kbn/alerting-v2-rule-form - 60 +60
@kbn/alerting-v2-schemas - 69 +69
@kbn/discover-utils 372 378 +6
@kbn/yaml-rule-editor - 12 +12
alertingVTwo - 9 +9
discover 149 150 +1
observability 651 652 +1
taskManager 74 80 +6
total +164

Any counts in public APIs

Total count of every any typed public API. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats any for more detailed information.

id before after diff
@kbn/alerting-v2-rule-form - 2 +2
@kbn/alerting-v2-schemas - 22 +22
total +24

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

id before after diff
alertingVTwo - 334.6KB ⚠️ +334.6KB
discover 1.6MB 1.6MB +3.9KB
observability 2.0MB 2.2MB +166.4KB
total +504.9KB

Public APIs missing exports

Total count of every type that is part of your API that should be exported but is not. This will cause broken links in the API documentation system. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats exports for more detailed information.

id before after diff
@kbn/alerting-v2-rule-form - 7 +7
@kbn/alerting-v2-schemas - 1 +1
taskManager 10 11 +1
total +9

Page load bundle

Size of the bundles that are downloaded on every page load. Target size is below 100kb

id before after diff
alertingVTwo - 140.0KB +140.0KB
discover 26.7KB 26.7KB +5.0B
observability 97.9KB 98.6KB +769.0B
observabilityShared 69.6KB 69.8KB +148.0B
serverlessObservability 17.1KB 17.4KB +325.0B
total +141.3KB
Unknown metric groups

API count

id before after diff
@kbn/alerting-v2-rule-form - 102 +102
@kbn/alerting-v2-schemas - 84 +84
@kbn/discover-utils 444 458 +14
@kbn/yaml-rule-editor - 22 +22
alertingVTwo - 9 +9
discover 203 204 +1
observability 659 660 +1
taskManager 122 128 +6
total +239

async chunk count

id before after diff
alertingVTwo - 9 +9
observability 25 30 +5
total +14

ESLint disabled in files

id before after diff
@kbn/alerting-v2-episodes-ui - 1 +1

ESLint disabled line counts

id before after diff
@kbn/alerting-v2-rule-form - 1 +1
alertingVTwo - 2 +2
total +3

References to deprecated APIs

id before after diff
@kbn/discover-utils 0 1 +1
alertingVTwo - 9 +9
total +10

Total ESLint disabled count

id before after diff
@kbn/alerting-v2-episodes-ui - 1 +1
@kbn/alerting-v2-rule-form - 1 +1
alertingVTwo - 2 +2
total +4

Unreferenced deprecated APIs

id before after diff
@kbn/discover-utils 0 1 +1
alertingVTwo - 9 +9
total +10

History

cc @kdelemme @adcoelho @darnautov @cnasikas @dominiqueclarke

@PhilippeOberti
Copy link
Copy Markdown
Contributor

@darnautov the test failing on your branch was actually failing in main and was fixed by this PR. If you rebase/merged against latest main you should be fine 🤞

@darnautov
Copy link
Copy Markdown
Contributor

@elasticmachine merge upstream

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport:skip This PR does not require backporting closes:rna PR closes an issue on the RNA Program Board release_note:skip Skip the PR/issue when compiling release notes Team:ResponseOps Platform ResponseOps team (formerly the Cases and Alerting teams) t// v9.4.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.