Speed up date_histogram without children (backport of #63643) #64823

Merged

nik9000 merged 1 commit into elastic:7.x on Nov 9, 2020
Conversation
This speeds up `date_histogram` aggregations without a parent or children. This is quite common - it's the aggregation that Kibana's Discover uses all over the place. We also hope to be able to use the same mechanism to speed up aggs with children one day, but that day isn't today. The kind of speedup we're seeing is fairly substantial in many cases:

```
|                              |                                            |  before |   after |    |
| 90th percentile service time | date_histogram_calendar_interval           | 9266.07 | 1376.13 | ms |
| 90th percentile service time | date_histogram_calendar_interval_with_tz   | 9217.21 | 1372.67 | ms |
| 90th percentile service time | date_histogram_fixed_interval              | 8817.36 | 1312.67 | ms |
| 90th percentile service time | date_histogram_fixed_interval_with_tz      | 8801.71 | 1311.69 | ms |  <-- discover's agg
| 90th percentile service time | date_histogram_fixed_interval_with_metrics | 44660.2 | 43789.5 | ms |
```

This uses the work we did in elastic#61467 to precompute the rounding points for a `date_histogram`. Now, when we know the rounding points, we execute the `date_histogram` as a `range` aggregation. This is nice for three reasons:
1. We can further rewrite the `range` aggregation (see below).
2. We don't need to allocate a hash to convert rounding points to ordinals.
3. We can send precise cardinality estimates to sub-aggs.

Points 2 and 3 above are nice, but most of the speed difference comes from point 1. Specifically, we now look into executing `range` aggregations as a `filters` aggregation. Normally the `filters` aggregation is quite slow, but when it doesn't have a parent or any children we can execute it "filter by filter", which is significantly faster. So fast, in fact, that it is faster than the original `date_histogram`. The `range` aggregation is *fairly* careful in how it rewrites, giving up on the `filters` aggregation if it won't collect "filter by filter" and falling back to its original execution mechanism.
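The rewrite from precomputed rounding points to `range` buckets can be illustrated with a small sketch. This is not the actual Elasticsearch Java implementation, just a minimal Python illustration of the idea: each consecutive pair of rounding points becomes one bucket, and the last point opens an unbounded bucket, mirroring the rewritten request shown below.

```python
def rounding_points_to_ranges(points):
    """Turn sorted rounding points into range-agg style buckets.

    Each pair of consecutive points becomes {"from": p[i], "to": p[i+1]};
    the final point yields a bucket that is unbounded above.
    """
    ranges = []
    for i, p in enumerate(points):
        if i + 1 < len(points):
            ranges.append({"from": p, "to": points[i + 1]})
        else:
            ranges.append({"from": p})
    return ranges

# The first few rounding points from the example request below:
points = [1415250000000, 1420434000000, 1425618000000]
print(rounding_points_to_ranges(points))
# [{'from': 1415250000000, 'to': 1420434000000},
#  {'from': 1420434000000, 'to': 1425618000000},
#  {'from': 1425618000000}]
```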
So an aggregation like this:

```
POST _search
{
  "size": 0,
  "query": {
    "range": {
      "dropoff_datetime": {
        "gte": "2015-01-01 00:00:00",
        "lt": "2016-01-01 00:00:00"
      }
    }
  },
  "aggs": {
    "dropoffs_over_time": {
      "date_histogram": {
        "field": "dropoff_datetime",
        "fixed_interval": "60d",
        "time_zone": "America/New_York"
      }
    }
  }
}
```

is executed like:

```
POST _search
{
  "size": 0,
  "query": {
    "range": {
      "dropoff_datetime": {
        "gte": "2015-01-01 00:00:00",
        "lt": "2016-01-01 00:00:00"
      }
    }
  },
  "aggs": {
    "dropoffs_over_time": {
      "range": {
        "field": "dropoff_datetime",
        "ranges": [
          {"from": 1415250000000, "to": 1420434000000},
          {"from": 1420434000000, "to": 1425618000000},
          {"from": 1425618000000, "to": 1430798400000},
          {"from": 1430798400000, "to": 1435982400000},
          {"from": 1435982400000, "to": 1441166400000},
          {"from": 1441166400000, "to": 1446350400000},
          {"from": 1446350400000, "to": 1451538000000},
          {"from": 1451538000000}
        ]
      }
    }
  }
}
```

Which in turn is executed like this:

```
POST _search
{
  "size": 0,
  "query": {
    "range": {
      "dropoff_datetime": {
        "gte": "2015-01-01 00:00:00",
        "lt": "2016-01-01 00:00:00"
      }
    }
  },
  "aggs": {
    "dropoffs_over_time": {
      "filters": {
        "filters": {
          "1": {"range": {"dropoff_datetime": {"gte": "2014-12-30 00:00:00", "lt": "2015-01-05 05:00:00"}}},
          "2": {"range": {"dropoff_datetime": {"gte": "2015-01-05 05:00:00", "lt": "2015-03-06 05:00:00"}}},
          "3": {"range": {"dropoff_datetime": {"gte": "2015-03-06 00:00:00", "lt": "2015-05-05 00:00:00"}}},
          "4": {"range": {"dropoff_datetime": {"gte": "2015-05-05 00:00:00", "lt": "2015-07-04 00:00:00"}}},
          "5": {"range": {"dropoff_datetime": {"gte": "2015-07-04 00:00:00", "lt": "2015-09-02 00:00:00"}}},
          "6": {"range": {"dropoff_datetime": {"gte": "2015-09-02 00:00:00", "lt": "2015-11-01 00:00:00"}}},
          "7": {"range": {"dropoff_datetime": {"gte": "2015-11-01 00:00:00", "lt": "2015-12-31 00:00:00"}}},
          "8": {"range": {"dropoff_datetime": {"gte": "2015-12-31 00:00:00"}}}
        }
      }
    }
  }
}
```

And *that* is faster because we can execute it "filter by filter".

Finally, notice the `range` query filtering the data. That is required for the data set that I'm using for testing. The "filter by filter" collection mechanism for the `filters` agg needs special case handling when the query is a `range` query, the filter is a `range` query, and they are both on the same field. That special case handling "merges" the range query. Without it, "filter by filter" collection is substantially slower. It's still quite a bit quicker than the standard `filter` collection, but not nearly as fast as it could be.
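The "merge" special case amounts to intersecting the top-level query's range with each filter's range so every "filter by filter" pass scans the narrower window. Here is a hypothetical sketch of that intersection; the function name and `(gte, lt)` tuple representation are invented for illustration and are not Elasticsearch's actual code.

```python
def merge_ranges(query, flt):
    """Intersect two half-open ranges given as (gte, lt) tuples.

    None means "unbounded" on that side. The merged range keeps the
    tighter bound from each side: max of the lower bounds, min of the
    upper bounds. (Illustrative sketch only, not Elasticsearch code.)
    """
    q_gte, q_lt = query
    f_gte, f_lt = flt
    gte = f_gte if q_gte is None else (q_gte if f_gte is None else max(q_gte, f_gte))
    lt = f_lt if q_lt is None else (q_lt if f_lt is None else min(q_lt, f_lt))
    return (gte, lt)

# A filter unbounded above is clipped by the query's upper bound:
print(merge_ranges((100, 200), (150, None)))  # (150, 200)
```

After this merge, each filter in the rewritten `filters` aggregation is already confined to the query's window, which is why skipping the merge makes "filter by filter" collection substantially slower.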