Skip to content

[ResponseOps][Alerting] Decouple feature IDs from consumers#183756

Merged
cnasikas merged 213 commits intoelastic:mainfrom
cnasikas:poc_decouple_consumers_feature_ids
Dec 3, 2024
Merged

[ResponseOps][Alerting] Decouple feature IDs from consumers#183756
cnasikas merged 213 commits intoelastic:mainfrom
cnasikas:poc_decouple_consumers_feature_ids

Conversation

@cnasikas
Copy link
Copy Markdown
Member

@cnasikas cnasikas commented May 17, 2024

Summary

This PR aims to decouple the feature IDs from the consumer attribute of rules and alerts.

Towards: #187202
Fixes: #181559
Fixes: #182435

Note

Unfortunately, I could not break the PR into smaller pieces. The APIs could not work anymore with feature IDs and had to convert them to use rule type IDs. Also, I took the chance and refactored crucial parts of the authorization class that in turn affected a lot of files. Most of the changes in the files are minimal and easy to review. The crucial changes are in the authorization class and some alerting APIs.

Architecture

Alerting RBAC model

The Kibana security uses Elasticsearch's application privileges. This way Kibana can represent and store its privilege models within Elasticsearch roles. To do that, Kibana security creates actions that are granted by a specific privilege. Alerting uses its own RBAC model and is built on top of the existing Kibana security model. The Alerting RBAC uses the rule_type_id and consumer attributes to define who owns the rule and the alerts procured by the rule. To connect the rule_type_id and consumer with the Kibana security actions the Alerting RBAC registers its custom actions. They are constructed as alerting:<rule-type-id>/<feature-id>/<alerting-entity>/<operation>. Because to authorizate a resource an action has to be generated and because the action needs a valid feature ID the value of the consumer should be a valid feature ID. For example, the alerting:siem.esqlRule/siem/rule/get action, means that a user with a role that grants this action can get a rule of type siem.esqlRule with consumer siem.

Problem statement

At the moment the consumer attribute should be a valid feature ID. Though this approach worked well so far it has its limitation. Specifically:

  • Rule types cannot support more than one consumer.
  • To associate old rules with a new feature ID required a migration on the rule's SOs and the alerts documents.
  • The API calls are feature ID-oriented and not rule-type-oriented.
  • The framework has to be aware of the values of the consumer attribute.
  • Feature IDs are tightly coupled with the alerting indices leading to bugs.
  • Legacy consumers that are not a valid feature anymore can cause bugs.
  • The framework has to be aware of legacy consumers to handle edge cases.
  • The framework has to be aware of specific consumers to handle edge cases.

Proposed solution

This PR aims to decouple the feature IDs from consumers. It achieves that a) by changing the way solutions configure the alerting privileges when registering a feature and b) by changing the alerting actions. The schema changes as:

// Old formatting
id: 'siem', <--- feature ID
alerting:['siem.queryRule']

// New formatting
id: 'siem', <--- feature ID
alerting: [{ ruleTypeId: 'siem.queryRule', consumers: ['siem'] }] <-- consumer same as the feature ID in the old formatting

The new actions are constructed as alerting:<rule-type-id>/<consumer>/<alerting-entity>/<operation>. For example alerting:rule-type-id/my-consumer/rule/get. The new action means that a user with a role that grants this action can get a rule of type rule-type with consumer my-consumer. Changing the action strings is not considered a breaking change as long as the user's permission works as before. In our case, this is true because the consumer will be the same as before (feature ID), and the alerting security actions will be the same. For example:

Old formatting

Schema:

id: 'logs', <--- feature ID
alerting:['.es-query'] <-- rule type ID

Generated action:

alerting:.es-query/logs/rule/get

New formatting

Schema:

id: 'siem', <--- feature ID
alerting: [{ ruleTypeId: '.es-query', consumers: ['logs'] }] <-- consumer same as the feature ID in the old formatting

Generated action:

alerting:.es-query/logs/rule/get <--- consumer is set as logs and the action is the same as before

In both formating the actions are the same thus breaking changes are avoided.

Alerting authorization class

The alerting plugin uses and exports the alerting authorization class (AlertingAuthorization). The class is responsible for handling all authorization actions related to rules and alerts. The class changed to handle the new actions as described in the above sections. A lot of methods were renamed, removed, and cleaned up, all method arguments converted to be an object, and the response signature of some methods changed. These changes affected various pieces of the code. The changes in this class are the most important in this PR especially the _getAuthorizedRuleTypesWithAuthorizedConsumers method which is the cornerstone of the alerting RBAC. Please review carefully.

Instantiation of the alerting authorization class

The AlertingAuthorizationClientFactory is used to create instances of the AlertingAuthorization class. The AlertingAuthorization class needs to perform async operations upon instantiation. Because JS, at the moment, does not support async instantiation of classes the AlertingAuthorization class was assigning Promise objects to variables that could be resolved later in other phases of the lifecycle of the class. To improve readability and make the lifecycle of the class clearer, I separated the construction of the class (initialization) from the bootstrap process. As a result, getting the AlertingAuthorization class or any client that depends on it (getRulesClient for example) is an async operation.

Filtering

A lot of routes use the authorization class to get the authorization filter (getFindAuthorizationFilter), a filter that, if applied, returns only the rule types and consumers the user is authorized to. The method that returns the filter was built in a way to also support filtering on top of the authorization filter thus coupling the authorized filter with router filtering. I believe these two operations should be decoupled and the filter method should return a filter that gives you all the authorized rule types. It is the responsibility of the consumer, router in our case, to apply extra filters on top of the authorization filter. For that reason, I made all the necessary changes to decouple them.

Legacy consumers & producer

A lot of rules and alerts have been created and are still being created from observability with the alerts consumer. When the Alerting RBAC encounters a rule or alert with alerts as a consumer it falls back to the producer of the rule type ID to construct the actions. For example if a rule with ruleTypeId: .es-query and consumer: alerts the alerting action will be constructed as alerting:.es-query/stackAlerts/rule/get where stackRules is the producer of the .es-query rule type. The producer is used to be used in alerting authorization but due to its complexity, it was deprecated and only used as a fallback for the alerts consumer. To avoid breaking changes all feature privileges that specify access to rule types add the alerts consumer when configuring their alerting privileges. By moving the alerts consumer to the registration of the feature we can stop relying on the producer. The producer is not used anymore in the authorization class. In the next PRs the producer will removed entirely.

Routes

The following changes were introduced to the alerting routes:

  • All related routes changed to be rule-type oriented and not feature ID oriented.
  • All related routes support the ruleTypeIds and the consumers parameters for filtering. In all routes, the filters are constructed as ruleTypeIds: ['foo'] AND consumers: ['bar'] AND authorizationFilter. Filtering by consumers is important. In o11y for example, we do not want to show ES rule types with the stackAlerts consumer even if the user has access to them.
  • The /internal/rac/alerts/_feature_ids route got deleted as it was not used anywhere in the codebase and it was internal.

All the changes in the routes are related to internal routes and no breaking changes are introduced.

Constants

I moved the o11y and stack rule type IDs to kbn-rule-data-utils and exported all security solution rule type IDs from kbn-securitysolution-rules. I am not a fan of having a centralized place for the rule type IDs. Ideally, consumers of the framework should specify keywords like observablility (category or subcategory) or even apm.* and the framework should know which rule type IDs to pick up. I think it is out of the scope of the PR, and at the moment it seems the most straightforward way to move forward. I will try to clean up as much as possible in further iterations. If you are interested in the upcoming work follow this issue #187202.

Other notable code changes

  • Change all instances of feature IDs to rule type IDs.
  • isSiemRuleType: This is a temporary helper function that is needed in places where we handle edge cases related to security solution rule types. Ideally, the framework should be agnostic to the rule types or consumers. The plan is to be removed entirely in further iterations.
  • Rename alerting PluginSetupContract and PluginStartContract to AlertingServerSetup and AlertingServerStart. This made me touch a lot of files but I could not resist.
  • filter_consumers was mistakenly exposed to a public API. It was undocumented.
  • Files or functions that were not used anywhere in the codebase got deleted.
  • Change the returned type of the list method of the RuleTypeRegistry from Set<RegistryRuleType> to Map<string, RegistryRuleType>.
  • Assertion of KueryNode in tests changed to an assertion of KQL using toKqlExpression.
  • Removal of useRuleAADFields as it is not used anywhere.

Testing

Caution

It is very important to test all the areas of the application where rules or alerts are being used directly or indirectly. Scenarios to consider:

  • The correct rules, alerts, and aggregations on top of them are being shown as expected as a superuser.
  • The correct rules, alerts, and aggregations on top of them are being shown as expected by a user with limited access to certain features.
  • The changes in this PR are backward compatible with the previous users' permissions.

Solutions

Please test and verify that:

  • All the rule types you own with all possible combinations of permissions both in ESS and in Serverless.
  • The consumers and rule types make sense when registering the features.
  • The consumers and rule types that are passed to the components are the intended ones.

ResponseOps

The most important changes are in the alerting authorization class, the search strategy, and the routes. Please test:

  • The rules we own with all possible combinations of permissions.
  • The stack alerts page and its solution filtering.
  • The categories filtering in the maintenance window UI.

Risks

Warning

The risks involved in this PR are related to privileges. Specifically:

  • Users with no privileges can access rules and alerts they do not have access to.
  • Users with privileges cannot access rules and alerts they have access to.

An excessive list of integration tests is in place to ensure that the above scenarios will not occur. In the case of a bug, we could a) release an energy release for serverless and b) backport the fix in ESS. Given that this PR is intended to be merged in 8.17 we have plenty of time to test and to minimize the chances of risks.

FQA

  • I noticed that a lot of routes support the filter parameter where we can pass an arbitrary KQL filter. Why we do not use this to filter by the rule type IDs and the consumers and instead we introduce new dedicated parameters?

The filter parameter should not be exposed in the first place. It assumes that the consumer of the API knows the underlying structure and implementation details of the persisted storage API (SavedObject client API). For example, a valid filter would be alerting.attributes.rule_type_id. In this filter the consumer should know a) the name of the SO b) the keyword attributes (storage implementation detail) and c) the name of the attribute as it is persisted in ES (snake case instead of camel case as it is returned by the APIs). As there is no abstraction layer between the SO and the API, it makes it very difficult to make changes in the persistent schema or the APIs. For all the above I decided to introduce new query parameters where the alerting framework has total control over it.

  • I noticed in the code a lot of instances where the consumer is used. Should not remove any logic around consumers?

This PR is a step forward making the framework as agnostic as possible. I had to keep the scope of the PR as contained as possible. We will get there. It needs time :).

  • I noticed a lot of hacks like checking if the rule type is siem. Should not remove the hacks?

This PR is a step forward making the framework as agnostic as possible. I had to keep the scope of the PR as contained as possible. We will get there. It needs time :).

  • I hate the "Role visibility" dropdown. Can we remove it?

I also do not like it. The goal is to remove it. Follow #189997.

@cnasikas cnasikas added the release_note:skip Skip the PR/issue when compiling release notes label May 17, 2024
@cnasikas cnasikas self-assigned this May 17, 2024
@cnasikas
Copy link
Copy Markdown
Member Author

/ci

@cnasikas cnasikas force-pushed the poc_decouple_consumers_feature_ids branch from 06d8499 to 3b7c89f Compare May 23, 2024 15:47
@cnasikas cnasikas changed the title [POC] Decouple feature IDs from consumres [POC] Decouple feature IDs from consumers May 28, 2024
@cnasikas cnasikas force-pushed the poc_decouple_consumers_feature_ids branch from f65ca2c to a2b73f9 Compare July 14, 2024 15:08
Copy link
Copy Markdown
Contributor

@xcrzx xcrzx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I performed manual testing locally. On the main branch, I installed prebuilt rules, enabled them, ingested some data, and let the rules generate alerts. Then I switched to this PR branch and verified the following:

  • Rules and their generated alerts are accessible and editable.
  • Rule execution logs, exceptions, lists, ML jobs, and other related functionality are working as expected.

No issues were found.

I also tested a downgrade scenario, where rules, alerts, and other entities created on the PR branch were verified to function correctly after switching back to the main branch. Additionally, I tested functionality in a non-default space, and no issues were encountered.

Note: My focus was on possible regressions. I didn’t test the specific case this PR addresses, when the feature ID doesn’t match the consumer, due to the significantly more complex setup required for that test.

xcrzx

This comment was marked as duplicate.

@pborgonovi
Copy link
Copy Markdown
Contributor

/ci

Copy link
Copy Markdown
Contributor

@consulthys consulthys left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGT Stack Monitoring

AlertConsumers.ALERTS,
];

export const apmAlertingRuleTypeIds: string[] = [...OBSERVABILITY_RULE_TYPE_IDS];
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
export const apmAlertingRuleTypeIds: string[] = [...OBSERVABILITY_RULE_TYPE_IDS];
export const apmAlertingRuleTypeIds: string[] = OBSERVABILITY_RULE_TYPE_IDS;

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The OBSERVABILITY_RULE_TYPE_IDS is used in a lot of places and I wanted to avoid someone mutating the original array. const protect us with assignments but it does not with push.

Copy link
Copy Markdown
Contributor

@cauemarcondes cauemarcondes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Copy Markdown
Member

@ashokaditya ashokaditya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did code review only and LGTM. Thanks for taking on the massive changeset. 🙇🏼

@elasticmachine
Copy link
Copy Markdown
Contributor

elasticmachine commented Dec 3, 2024

💚 Build Succeeded

Metrics [docs]

Module Count

Fewer modules leads to a faster build time

id before after diff
alerting 248 247 -1
cases 823 827 +4
observability 1074 1073 -1
securitySolution 6310 6308 -2
triggersActionsUi 862 860 -2
total -2

Public APIs missing comments

Total count of every public API that lacks a comment. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats comments for more detailed information.

id before after diff
@kbn/alerts-ui-shared 305 295 -10
@kbn/rule-data-utils 135 140 +5
@kbn/securitysolution-rules 25 26 +1
alerting 848 839 -9
ml 63 64 +1
ruleRegistry 226 222 -4
uptime 1 0 -1
total -17

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

id before after diff
alerting 94.1KB 91.4KB -2.6KB
apm 3.4MB 3.4MB +84.0B
cases 492.7KB 492.8KB +80.0B
discover 812.9KB 813.3KB +400.0B
infra 1.8MB 1.8MB -358.0B
ml 4.7MB 4.7MB -217.0B
observability 472.8KB 472.0KB -830.0B
observabilityLogsExplorer 147.3KB 147.7KB +358.0B
securitySolution 14.6MB 14.6MB -68.0B
slo 848.0KB 848.3KB +275.0B
synthetics 1.1MB 1.1MB +404.0B
triggersActionsUi 1.7MB 1.7MB -359.0B
uptime 467.5KB 467.5KB +4.0B
total -2.8KB

Public APIs missing exports

Total count of every type that is part of your API that should be exported but is not. This will cause broken links in the API documentation system. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats exports for more detailed information.

id before after diff
features 2 3 +1

Page load bundle

Size of the bundles that are downloaded on every page load. Target size is below 100kb

id before after diff
apm 38.1KB 38.6KB +531.0B
cases 159.5KB 159.6KB +146.0B
infra 56.9KB 58.1KB +1.2KB
ml 75.8KB 76.5KB +681.0B
observability 103.8KB 104.5KB +766.0B
observabilityShared 93.7KB 93.7KB +36.0B
slo 28.8KB 29.2KB +394.0B
synthetics 37.6KB 37.8KB +150.0B
triggersActionsUi 128.9KB 128.5KB -363.0B
uptime 22.9KB 22.9KB +1.0B
total +3.5KB
Unknown metric groups

API count

id before after diff
@kbn/alerts-grouping 31 32 +1
@kbn/alerts-ui-shared 321 311 -10
@kbn/rule-data-utils 138 152 +14
@kbn/securitysolution-rules 28 29 +1
alerting 880 872 -8
ml 148 149 +1
ruleRegistry 263 259 -4
total -5

ESLint disabled line counts

id before after diff
@kbn/test-suites-xpack 722 721 -1

Total ESLint disabled count

id before after diff
@kbn/test-suites-xpack 747 746 -1

History

cc @cnasikas

@kibanamachine
Copy link
Copy Markdown
Contributor

Starting backport for target branches: 8.x

https://github.com/elastic/kibana/actions/runs/12137618585

@kibanamachine
Copy link
Copy Markdown
Contributor

💔 All backports failed

Status Branch Result
8.x Backport failed because of merge conflicts

Manual backport

To create the backport manually run:

node scripts/backport --pr 183756

Questions ?

Please refer to the Backport tool documentation

@cnasikas
Copy link
Copy Markdown
Member Author

cnasikas commented Dec 3, 2024

💚 All backports created successfully

Status Branch Result
8.x

Note: Successful backport PRs will be merged automatically after passing CI.

Questions ?

Please refer to the Backport tool documentation

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

apm:review ci:build-serverless-image ci:cloud-deploy Create or update a Cloud deployment ci:cloud-persist-deployment Persist cloud deployment indefinitely ci:project-deploy-observability Create an Observability project Feature:Alerting release_note:skip Skip the PR/issue when compiling release notes Team:actionable-obs Formerly "obs-ux-management", responsible for SLO, o11y alerting, significant events, & synthetics. Team:Obs AI Assistant Observability AI Assistant Team:obs-ux-infra_services - DEPRECATED DEPRECATED - Use Team:obs-presentation. Team:ResponseOps Platform ResponseOps team (formerly the Cases and Alerting teams) t// v8.18.0 v9.0.0

Projects

None yet