Problem description
When the user creates a log threshold alert with a "group by" field of large cardinality, the alert executor paginates through a large number of composite aggregation pages. This can consume, and possibly exhaust, the resources available in Elasticsearch and Kibana, and thereby negatively impact the availability of the service. Additionally, the alert execution might time out and miss alerts that should have fired.
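To make the pagination cost concrete, here is a minimal sketch of such a loop (the names and shapes are illustrative, not Kibana's actual implementation): each page of a composite aggregation returns up to a fixed number of buckets plus an `after_key`, and the caller must repeat the request with that key until none is returned.

```typescript
// Sketch of a composite-aggregation pagination loop. `SearchFn` stands in
// for a real Elasticsearch client call; the types are illustrative.

interface CompositeBucket {
  key: Record<string, string>;
  doc_count: number;
}

interface CompositePage {
  buckets: CompositeBucket[];
  after_key?: Record<string, string>; // absent on the last page
}

type SearchFn = (after?: Record<string, string>) => Promise<CompositePage>;

async function collectAllGroups(search: SearchFn): Promise<CompositeBucket[]> {
  const all: CompositeBucket[] = [];
  let after: Record<string, string> | undefined;
  do {
    // Each iteration is a full search round trip to Elasticsearch.
    const page = await search(after);
    all.push(...page.buckets);
    after = page.after_key;
  } while (after !== undefined);
  return all;
}
```

With N distinct groups and a page size of 1000, this issues ceil(N / 1000) sequential round trips and accumulates every bucket in Kibana memory, which is the failure mode described above.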
On top of that, the query used for checking the condition prioritizes correctness over performance by filtering out non-matching groups as late as possible. This makes it possible to check for zero-count thresholds, but prevents Elasticsearch from optimizing the query more aggressively.
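Illustratively (the field names `host.name` and `log.level` are assumptions, not taken from the actual rule), the "late filtering" shape applies the condition as a filter sub-aggregation, so every group is still enumerated and zero-count groups remain visible:

```json
{
  "size": 0,
  "aggs": {
    "groups": {
      "composite": {
        "sources": [{ "group": { "terms": { "field": "host.name" } } }]
      },
      "aggs": {
        "matching_docs": { "filter": { "term": { "log.level": "error" } } }
      }
    }
  }
}
```

Moving the same condition into the top-level query filter lets Elasticsearch discard non-matching documents before grouping, but groups with no matching documents then never appear in the response at all, which is why this faster shape cannot detect zero-count groups:

```json
{
  "size": 0,
  "query": { "bool": { "filter": [{ "term": { "log.level": "error" } }] } },
  "aggs": {
    "groups": {
      "composite": {
        "sources": [{ "group": { "terms": { "field": "host.name" } } }]
      }
    }
  }
}
```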
Possible solutions
- Check and warn about high cardinality of the grouping field when creating the job.
- Offer a setting on job creation to set an acceptable cardinality limit (as in "group by host.name up to 10000 groups").
- Check the cardinality on execution and fail early and loudly when the configured limit is exceeded.
- Special-case costly grouped "alert when less than" conditions and use more efficient queries for all other cases. (By moving the filter out of the composite aggregation into the global bool filter.)
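The execution-time cardinality check could be sketched as follows (the function names and error message are hypothetical; in practice the count would come from a `cardinality` aggregation, which is approximate):

```typescript
// Hypothetical guard run before the expensive grouped query: ask
// Elasticsearch for the approximate number of distinct values of the
// group-by field and fail loudly if it exceeds the configured limit.

type CardinalityFn = (field: string) => Promise<number>;

async function assertGroupCardinality(
  getCardinality: CardinalityFn, // would wrap a `cardinality` aggregation
  field: string,
  limit: number
): Promise<void> {
  const cardinality = await getCardinality(field);
  if (cardinality > limit) {
    throw new Error(
      `Cardinality of "${field}" (~${cardinality}) exceeds the configured ` +
        `limit of ${limit}; skipping alert evaluation.`
    );
  }
}
```

Because the cardinality aggregation is approximate, the limit check should be treated as a safety valve rather than an exact count, and the error surfaced to the user so the failure is loud rather than silent.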