Hiya SIEM team. Over at Elasticsearch we've been looking into a few performance-related items, and some of the aggs that the SIEM dashboard uses caught our eye.
## Benchmarks?
Do we benchmark any of the dashboards? The Elasticsearch team uses Rally extensively; perhaps we could find a way to translate the dashboard requests into some kind of Rally track? It'd help both of us keep an eye on performance, make changes easier to reason about, and make collaboration easier since we'd have a shared dataset to look at.
## Usage of `filter` aggs
There seems to be widespread use of `filter` aggs, which is non-ideal. Filter aggs are relatively expensive, especially compared to filtering in the `query` component of a search request. Each individual filter agg needs to load the bitset of docs that contain that value and check each doc against it one-by-one (as opposed to `query` filters, which can use a leap-frog mechanism to minimize checks).
So the first thing would be trying to move `filter` aggs up into the `query` where possible, if they are being used to exclude documents.
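As an illustrative sketch (the field names and request shape here are assumptions for illustration, not taken from an actual dashboard query), a `filter` agg that is only being used to exclude documents:

```json
{
  "size": 0,
  "aggs": {
    "only_failures": {
      "filter": { "term": { "event.outcome": "failure" } },
      "aggs": {
        "by_host": { "terms": { "field": "host.name" } }
      }
    }
  }
}
```

can usually be rewritten with the filter moved up into the `query` section, where it benefits from the cheaper query-time filtering:

```json
{
  "size": 0,
  "query": { "term": { "event.outcome": "failure" } },
  "aggs": {
    "by_host": { "terms": { "field": "host.name" } }
  }
}
```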
If they are being used for counts (like here), there are some options:
- Try to rewrite some of those to operate as `terms` aggs. E.g. if multiple filters share the same field (`event.module` or something), a `terms` agg will give you doc counts for all the different event modules. `terms` is pretty aggressively optimized because it is so widely used. It's hard to say for sure if it would help, but from some informal testing (see the Rally test at the end) it tends to be noticeably faster.
- For fields that are non-overlapping and sparse, a `value_count` agg can be useful. E.g. if only a subset of docs have a certain field and you want to know how many there are, a `value_count` on that field will return the count without having to bucket them. A relatively niche usage here, but handy if applicable.
- Rewrite into an `msearch` and skip aggregating altogether. Each msearch clause will be a single search request filtering for exactly the criteria needed. With `size: 0` you don't incur fetch overhead, and with `track_total_hits: true` you can still get the total count.
  - If you don't need exact counts, setting `track_total_hits` to `false` (or to a numeric threshold) enables the new Block-Max WAND optimization and returns results very fast. With a threshold you can control when it stops counting, so you can report "> 100,000 results", etc.
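A hedged sketch of what the msearch variant might look like, sent to `POST /_msearch` as NDJSON (the index pattern `logs-*` and the `event.outcome` field are assumptions for illustration):

```json
{"index": "logs-*"}
{"size": 0, "track_total_hits": true, "query": {"term": {"event.outcome": "success"}}}
{"index": "logs-*"}
{"size": 0, "track_total_hits": true, "query": {"term": {"event.outcome": "failure"}}}
```

Each entry in the response's `responses` array carries its count in `hits.total.value`, so there's no agg output to parse at all.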
I ran a simple test comparing msearch (`"count"`), `filter`, `filters`, `terms`, and `value_count`. msearch was fastest by a large margin, followed by `terms` and `value_count`; `filter`/`filters` were generally slower.

## `terms` instead of `filter` for partitioning
Related to the `terms` suggestion above, if there is a scenario where you wish to partition the same field into multiple buckets, a `terms` agg will be faster (and a simpler query) than a series of `filter` aggs. For example, this request uses two `filter` aggs to create "success" and "failure" buckets.

Instead, a single `terms` agg on the field will produce both buckets and do it more cheaply. In addition, the child `filter` agg on `event.outcome: success` is unnecessary: by the nature of the parent bucket, all docs in that bucket are already success (or failure), so you can just grab the count from the bucket's `doc_count`.

If there are unrelated values in the field and you only want "success"/"failure", you can use the `include`/`exclude` functionality of the `terms` agg to restrict it to the terms you care about.
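Putting that together, a minimal sketch (the `event.outcome` field name is an assumption for illustration):

```json
{
  "size": 0,
  "aggs": {
    "outcomes": {
      "terms": {
        "field": "event.outcome",
        "include": ["success", "failure"]
      }
    }
  }
}
```

Each returned bucket's `doc_count` is the count for that outcome directly; no child `filter` aggs are needed.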
## `auto_date_histogram` and `minimum_interval`
There's some optimization work underway in ES (coming in 7.8/7.9) that will noticeably improve auto-date-histo speed. But in the meantime, specifying a minimum interval will help avoid extra work. E.g. auto-date-histo starts with second-level intervals and rounds up from there; when querying a 12h time range it almost never makes sense to look at second intervals, so that part of the rounding is wasted effort.
This does remove some of the "fire and forget" convenience of auto-date-histo, but it can translate into notable performance improvements. I'm not sure of the best option here, but if there's a way to intelligently set the minimum interval, it'd probably help.
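A minimal sketch of what that could look like (the `@timestamp` field, time range, and bucket target are assumptions for illustration); note the agg parameter is spelled `minimum_interval`:

```json
{
  "size": 0,
  "query": {
    "range": { "@timestamp": { "gte": "now-12h" } }
  },
  "aggs": {
    "timeline": {
      "auto_date_histogram": {
        "field": "@timestamp",
        "buckets": 30,
        "minimum_interval": "minute"
      }
    }
  }
}
```

With `minimum_interval: "minute"`, the agg skips the second-level rounding attempts entirely and starts at minute granularity.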
## Closing
Sorry for the long ticket! I decided to file this as a ticket instead of email/Slack/Google Doc/etc. because it seemed easier to work through on GitHub. Feel free to ping me if you have questions; happy to help out! It's hard to say for sure whether any of these suggestions will actually help (although the msearch case is very compelling given how it works), which is why I led with the question about benchmarks. Setting those up might be a good first step so we can quantitatively tweak the queries/aggs.