
[Logs UI] Handle log threshold alert grouping fields with large cardinalities more robustly #98010

@weltenwort

Description

Problem description

When the user creates a log threshold alert with a "group by" field of large cardinality, the alert executor will paginate through a large number of composite aggregation pages. This can consume, and possibly exhaust, the resources available in Elasticsearch and Kibana, thereby degrading the availability of the service. Additionally, the alert execution might time out and miss alerts that should have fired.
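The pagination loop at the heart of the problem can be sketched as follows. This is an illustration of how composite aggregation paging behaves in general, not Kibana's actual executor code; all names (`CompositePage`, `fetchPage`, `collectAllGroups`) are hypothetical:

```typescript
// Minimal sketch of composite-aggregation paging: each page returns up to
// `size` group buckets plus an `after_key` cursor, and the caller must keep
// requesting pages until `after_key` is absent. With a high-cardinality
// "group by" field this loop can run for thousands of iterations, holding
// resources in both Elasticsearch and Kibana for the whole time.

interface CompositeBucket {
  key: Record<string, string>;
  doc_count: number;
}

interface CompositePage {
  buckets: CompositeBucket[];
  after_key?: Record<string, string>;
}

// Hypothetical page-fetching callback, standing in for an ES search request.
type FetchPage = (after?: Record<string, string>) => Promise<CompositePage>;

async function collectAllGroups(fetchPage: FetchPage): Promise<CompositeBucket[]> {
  const buckets: CompositeBucket[] = [];
  let after: Record<string, string> | undefined;
  do {
    const page = await fetchPage(after);
    buckets.push(...page.buckets);
    after = page.after_key; // undefined once the last page is reached
  } while (after !== undefined);
  return buckets;
}
```

The number of iterations is proportional to the field's cardinality divided by the page size, which is why nothing in this loop itself bounds the total work.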

On top of that, the query used to check the condition prioritizes correctness over performance by filtering out non-matching groups as late as possible. This makes it possible to check for zero-count thresholds, but prevents Elasticsearch from optimizing the query more aggressively.
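The trade-off can be made concrete with two request shapes. These are illustrative sketches only; the field names, sizes, and structure are assumptions, not the executor's actual queries:

```typescript
// Late filtering: every group is enumerated by the composite aggregation, and
// the condition filter is applied inside each bucket. Groups with zero
// matching documents still show up (as buckets where `matches.doc_count` is
// 0), which is what makes zero-count threshold checks possible -- but
// Elasticsearch must page over every group in the index.
const lateFilterQuery = {
  aggregations: {
    groups: {
      composite: {
        size: 1000,
        sources: [{ host: { terms: { field: "host.name" } } }],
      },
      aggregations: {
        matches: { filter: { term: { "log.level": "error" } } },
      },
    },
  },
};

// Early filtering: the condition filter moves into the top-level bool query,
// so non-matching documents are pruned before aggregation and groups with no
// matches never produce a bucket at all. This is far cheaper, but it cannot
// detect zero-count groups, so it only suits conditions other than
// "alert when less than".
const earlyFilterQuery = {
  query: { bool: { filter: [{ term: { "log.level": "error" } }] } },
  aggregations: {
    groups: {
      composite: {
        size: 1000,
        sources: [{ host: { terms: { field: "host.name" } } }],
      },
    },
  },
};
```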

Possible solutions

  • Check and warn about high cardinality of the grouping field when creating the job.
  • Offer a setting on job creation to set an acceptable cardinality limit (as in "group by host.name up to 10000 groups").
  • Check the cardinality on execution and fail early and loudly when the configured limit is exceeded.
  • Special case costly grouped "alert when less than" conditions and use more efficient queries for all other cases. (By moving the filter out of the composite agg to the global bool filter.)
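The "fail early and loudly" option from the list above could look roughly like the following. This is a sketch under assumed names (`assertCardinalityWithinLimit` is not an existing Kibana function); the idea is to run a cheap cardinality aggregation first and abort before any composite paging starts:

```typescript
// Guard that aborts alert execution when the estimated number of groups
// exceeds the user-configured limit. `estimatedGroupCount` would come from a
// cardinality aggregation on the grouping field; the limit would come from
// the alert's configuration. Both names are illustrative.
function assertCardinalityWithinLimit(
  estimatedGroupCount: number,
  limit: number
): void {
  if (estimatedGroupCount > limit) {
    throw new Error(
      `Grouping field cardinality (${estimatedGroupCount}) exceeds the ` +
        `configured limit (${limit}); aborting alert execution to protect ` +
        "Elasticsearch and Kibana."
    );
  }
}
```

Failing with an explicit error, rather than silently truncating the group set, makes the resource problem visible to the user and surfaces it in the alert's execution status instead of producing incomplete results.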
