[ResponseOps][Alerting] Decouple feature IDs from consumers#183756
[ResponseOps][Alerting] Decouple feature IDs from consumers#183756cnasikas merged 213 commits intoelastic:mainfrom
Conversation
|
/ci |
06d8499 to
3b7c89f
Compare
f65ca2c to
a2b73f9
Compare
xcrzx
left a comment
There was a problem hiding this comment.
I performed manual testing locally. On the main branch, I installed prebuilt rules, enabled them, ingested some data, and let the rules generate alerts. Then I switched to this PR branch and verified the following:
- Rules and their generated alerts are accessible and editable.
- Rule execution logs, exceptions, lists, ML jobs, and other related functionality are working as expected.
No issues were found.
I also tested a downgrade scenario, where rules, alerts, and other entities created on the PR branch were verified to function correctly after switching back to the main branch. Additionally, I tested functionality in a non-default space, and no issues were encountered.
Note: My focus was on possible regressions. I didn’t test the specific case this PR addresses, when the feature ID doesn’t match the consumer, due to the significantly more complex setup required for that test.
|
/ci |
| AlertConsumers.ALERTS, | ||
| ]; | ||
|
|
||
| export const apmAlertingRuleTypeIds: string[] = [...OBSERVABILITY_RULE_TYPE_IDS]; |
There was a problem hiding this comment.
| export const apmAlertingRuleTypeIds: string[] = [...OBSERVABILITY_RULE_TYPE_IDS]; | |
| export const apmAlertingRuleTypeIds: string[] = OBSERVABILITY_RULE_TYPE_IDS; |
There was a problem hiding this comment.
The OBSERVABILITY_RULE_TYPE_IDS is used in a lot of places and I wanted to avoid someone mutating the original array. const protect us with assignments but it does not with push.
...lity_solution/infra/public/pages/metrics/hosts/components/tabs/alerts/alerts_tab_content.tsx
Outdated
Show resolved
Hide resolved
...observability_solution/infra/public/pages/metrics/hosts/components/tabs/alerts_tab_badge.tsx
Outdated
Show resolved
Hide resolved
ashokaditya
left a comment
There was a problem hiding this comment.
Did code review only and LGTM. Thanks for taking on the massive changeset. 🙇🏼
💚 Build Succeeded
Metrics [docs]Module Count
Public APIs missing comments
Async chunks
Public APIs missing exports
Page load bundle
Unknown metric groupsAPI count
ESLint disabled line counts
Total ESLint disabled count
History
cc @cnasikas |
|
Starting backport for target branches: 8.x |
💔 All backports failed
Manual backportTo create the backport manually run: Questions ?Please refer to the Backport tool documentation |
💚 All backports created successfully
Note: Successful backport PRs will be merged automatically after passing CI. Questions ?Please refer to the Backport tool documentation |
Summary
This PR aims to decouple the feature IDs from the
consumerattribute of rules and alerts.Towards: #187202
Fixes: #181559
Fixes: #182435
Note
Unfortunately, I could not break the PR into smaller pieces. The APIs could not work anymore with feature IDs and had to convert them to use rule type IDs. Also, I took the chance and refactored crucial parts of the authorization class that in turn affected a lot of files. Most of the changes in the files are minimal and easy to review. The crucial changes are in the authorization class and some alerting APIs.
Architecture
Alerting RBAC model
The Kibana security uses Elasticsearch's application privileges. This way Kibana can represent and store its privilege models within Elasticsearch roles. To do that, Kibana security creates actions that are granted by a specific privilege. Alerting uses its own RBAC model and is built on top of the existing Kibana security model. The Alerting RBAC uses the
rule_type_idandconsumerattributes to define who owns the rule and the alerts procured by the rule. To connect therule_type_idandconsumerwith the Kibana security actions the Alerting RBAC registers its custom actions. They are constructed asalerting:<rule-type-id>/<feature-id>/<alerting-entity>/<operation>. Because to authorizate a resource an action has to be generated and because the action needs a valid feature ID the value of theconsumershould be a valid feature ID. For example, thealerting:siem.esqlRule/siem/rule/getaction, means that a user with a role that grants this action can get a rule of typesiem.esqlRulewith consumersiem.Problem statement
At the moment the
consumerattribute should be a valid feature ID. Though this approach worked well so far it has its limitation. Specifically:consumerattribute.Proposed solution
This PR aims to decouple the feature IDs from consumers. It achieves that a) by changing the way solutions configure the alerting privileges when registering a feature and b) by changing the alerting actions. The schema changes as:
The new actions are constructed as
alerting:<rule-type-id>/<consumer>/<alerting-entity>/<operation>. For examplealerting:rule-type-id/my-consumer/rule/get. The new action means that a user with a role that grants this action can get a rule of typerule-typewith consumermy-consumer. Changing the action strings is not considered a breaking change as long as the user's permission works as before. In our case, this is true because the consumer will be the same as before (feature ID), and the alerting security actions will be the same. For example:Old formatting
Schema:
Generated action:
New formatting
Schema:
Generated action:
In both formating the actions are the same thus breaking changes are avoided.
Alerting authorization class
The alerting plugin uses and exports the alerting authorization class (
AlertingAuthorization). The class is responsible for handling all authorization actions related to rules and alerts. The class changed to handle the new actions as described in the above sections. A lot of methods were renamed, removed, and cleaned up, all method arguments converted to be an object, and the response signature of some methods changed. These changes affected various pieces of the code. The changes in this class are the most important in this PR especially the_getAuthorizedRuleTypesWithAuthorizedConsumersmethod which is the cornerstone of the alerting RBAC. Please review carefully.Instantiation of the alerting authorization class
The
AlertingAuthorizationClientFactoryis used to create instances of theAlertingAuthorizationclass. TheAlertingAuthorizationclass needs to perform async operations upon instantiation. Because JS, at the moment, does not support async instantiation of classes theAlertingAuthorizationclass was assigningPromiseobjects to variables that could be resolved later in other phases of the lifecycle of the class. To improve readability and make the lifecycle of the class clearer, I separated the construction of the class (initialization) from the bootstrap process. As a result, getting theAlertingAuthorizationclass or any client that depends on it (getRulesClientfor example) is an async operation.Filtering
A lot of routes use the authorization class to get the authorization filter (
getFindAuthorizationFilter), a filter that, if applied, returns only the rule types and consumers the user is authorized to. The method that returns the filter was built in a way to also support filtering on top of the authorization filter thus coupling the authorized filter with router filtering. I believe these two operations should be decoupled and the filter method should return a filter that gives you all the authorized rule types. It is the responsibility of the consumer, router in our case, to apply extra filters on top of the authorization filter. For that reason, I made all the necessary changes to decouple them.Legacy consumers & producer
A lot of rules and alerts have been created and are still being created from observability with the
alertsconsumer. When the Alerting RBAC encounters a rule or alert withalertsas a consumer it falls back to theproducerof the rule type ID to construct the actions. For example if a rule withruleTypeId: .es-queryandconsumer: alertsthe alerting action will be constructed asalerting:.es-query/stackAlerts/rule/getwherestackRulesis the producer of the.es-queryrule type. Theproduceris used to be used in alerting authorization but due to its complexity, it was deprecated and only used as a fallback for thealertsconsumer. To avoid breaking changes all feature privileges that specify access to rule types add thealertsconsumer when configuring their alerting privileges. By moving thealertsconsumer to the registration of the feature we can stop relying on theproducer. Theproduceris not used anymore in the authorization class. In the next PRs theproducerwill removed entirely.Routes
The following changes were introduced to the alerting routes:
ruleTypeIdsand theconsumersparameters for filtering. In all routes, the filters are constructed asruleTypeIds: ['foo'] AND consumers: ['bar'] AND authorizationFilter. Filtering by consumers is important. In o11y for example, we do not want to show ES rule types with thestackAlertsconsumer even if the user has access to them./internal/rac/alerts/_feature_idsroute got deleted as it was not used anywhere in the codebase and it was internal.All the changes in the routes are related to internal routes and no breaking changes are introduced.
Constants
I moved the o11y and stack rule type IDs to
kbn-rule-data-utilsand exported all security solution rule type IDs fromkbn-securitysolution-rules. I am not a fan of having a centralized place for the rule type IDs. Ideally, consumers of the framework should specify keywords likeobservablility(category or subcategory) or evenapm.*and the framework should know which rule type IDs to pick up. I think it is out of the scope of the PR, and at the moment it seems the most straightforward way to move forward. I will try to clean up as much as possible in further iterations. If you are interested in the upcoming work follow this issue #187202.Other notable code changes
isSiemRuleType: This is a temporary helper function that is needed in places where we handle edge cases related to security solution rule types. Ideally, the framework should be agnostic to the rule types or consumers. The plan is to be removed entirely in further iterations.PluginSetupContractandPluginStartContracttoAlertingServerSetupandAlertingServerStart. This made me touch a lot of files but I could not resist.filter_consumerswas mistakenly exposed to a public API. It was undocumented.listmethod of theRuleTypeRegistryfromSet<RegistryRuleType>toMap<string, RegistryRuleType>.KueryNodein tests changed to an assertion of KQL usingtoKqlExpression.useRuleAADFieldsas it is not used anywhere.Testing
Caution
It is very important to test all the areas of the application where rules or alerts are being used directly or indirectly. Scenarios to consider:
Solutions
Please test and verify that:
ResponseOps
The most important changes are in the alerting authorization class, the search strategy, and the routes. Please test:
Risks
Warning
The risks involved in this PR are related to privileges. Specifically:
An excessive list of integration tests is in place to ensure that the above scenarios will not occur. In the case of a bug, we could a) release an energy release for serverless and b) backport the fix in ESS. Given that this PR is intended to be merged in 8.17 we have plenty of time to test and to minimize the chances of risks.
FQA
filterparameter where we can pass an arbitrary KQL filter. Why we do not use this to filter by the rule type IDs and the consumers and instead we introduce new dedicated parameters?The
filterparameter should not be exposed in the first place. It assumes that the consumer of the API knows the underlying structure and implementation details of the persisted storage API (SavedObject client API). For example, a valid filter would bealerting.attributes.rule_type_id. In this filter the consumer should know a) the name of the SO b) the keywordattributes(storage implementation detail) and c) the name of the attribute as it is persisted in ES (snake case instead of camel case as it is returned by the APIs). As there is no abstraction layer between the SO and the API, it makes it very difficult to make changes in the persistent schema or the APIs. For all the above I decided to introduce new query parameters where the alerting framework has total control over it.This PR is a step forward making the framework as agnostic as possible. I had to keep the scope of the PR as contained as possible. We will get there. It needs time :).
siem. Should not remove the hacks?This PR is a step forward making the framework as agnostic as possible. I had to keep the scope of the PR as contained as possible. We will get there. It needs time :).
I also do not like it. The goal is to remove it. Follow #189997.