Optimize date_histograms across daylight savings time#55559
Merged
nik9000 merged 46 commits intoelastic:masterfrom May 7, 2020
Merged
Optimize date_histograms across daylight savings time#55559nik9000 merged 46 commits intoelastic:masterfrom
nik9000 merged 46 commits intoelastic:masterfrom
Conversation
Collaborator
|
Pinging @elastic/es-analytics-geo (:Analytics/Aggregations) |
Member
Author
|
Thanks for the reviews @danielmitterdorfer and @not-napoleon! With @DaveCTurner's faith in the tests I'm going to consider this one approved. The tests for this stuff really are quite broad. I will make the final changes to this to prepare it for merging and inclusion into 7.9.0. |
Member
Author
Branch cut day is always rough. |
nik9000
added a commit
to nik9000/elasticsearch
that referenced
this pull request
May 7, 2020
…astic#55559) Rounding dates on a shard that contains a daylight savings time transition is currently something like 1400% slower than when a shard contains dates only on one side of the DST transition. And it makes a ton of short lived garbage. This replaces that implementation with one that benchmarks to having around 30% overhead instead of the 1400%. And it doesn't generate any garbage per search hit. Some background: There are two ways to round in ES: * Round to the nearest time unit (Day/Hour/Week/Month/etc) * Round to the nearest time *interval* (3 days/2 weeks/etc) I'm only optimizing the first one in this change and plan to do the second in a follow up. It turns out that rounding to the nearest unit really *is* two problems: when the unit rounds to midnight (day/week/month/year) and when it doesn't (hour/minute/second). Rounding to midnight is consistently about 25% faster and rounding to individual hour or minutes. This optimization relies on being able to *usually* figure out what the minimum and maximum dates are on the shard. This is similar to an existing optimization where we rewrite time zones that aren't fixed (think America/New_York and its daylight savings time transitions) into fixed time zones so long as there isn't a daylight savings time transition on the shard (UTC-5 or UTC-4 for America/New_York). Once I implement time interval rounding the time zone rewriting optimization *should* no longer be needed. This optimization doesn't come into play for `composite` or `auto_date_histogram` aggs because neither have been migrated to the new `DATE` `ValuesSourceType` which is where that range lookup happens. When they are they will be able to pick up the optimization without much work. I expect this to be substantial for `auto_date_histogram` but less so for `composite` because it deals with fewer values. Note: My 30% overhead figure comes from small numbers of daylight savings time transitions. That overhead gets higher when there are more transitions in logarithmic fashion. When there are two thousand years worth of transitions my algorithm ends up being 250% slower than rounding without a time zone, but java time is 47000% slower at that point, allocating memory as fast as it possibly can.
nik9000
added a commit
that referenced
this pull request
May 7, 2020
…5559) (#56334) Rounding dates on a shard that contains a daylight savings time transition is currently something like 1400% slower than when a shard contains dates only on one side of the DST transition. And it makes a ton of short lived garbage. This replaces that implementation with one that benchmarks to having around 30% overhead instead of the 1400%. And it doesn't generate any garbage per search hit. Some background: There are two ways to round in ES: * Round to the nearest time unit (Day/Hour/Week/Month/etc) * Round to the nearest time *interval* (3 days/2 weeks/etc) I'm only optimizing the first one in this change and plan to do the second in a follow up. It turns out that rounding to the nearest unit really *is* two problems: when the unit rounds to midnight (day/week/month/year) and when it doesn't (hour/minute/second). Rounding to midnight is consistently about 25% faster and rounding to individual hour or minutes. This optimization relies on being able to *usually* figure out what the minimum and maximum dates are on the shard. This is similar to an existing optimization where we rewrite time zones that aren't fixed (think America/New_York and its daylight savings time transitions) into fixed time zones so long as there isn't a daylight savings time transition on the shard (UTC-5 or UTC-4 for America/New_York). Once I implement time interval rounding the time zone rewriting optimization *should* no longer be needed. This optimization doesn't come into play for `composite` or `auto_date_histogram` aggs because neither have been migrated to the new `DATE` `ValuesSourceType` which is where that range lookup happens. When they are they will be able to pick up the optimization without much work. I expect this to be substantial for `auto_date_histogram` but less so for `composite` because it deals with fewer values. Note: My 30% overhead figure comes from small numbers of daylight savings time transitions. That overhead gets higher when there are more transitions in logarithmic fashion. When there are two thousand years worth of transitions my algorithm ends up being 250% slower than rounding without a time zone, but java time is 47000% slower at that point, allocating memory as fast as it possibly can.
nik9000
added a commit
to nik9000/elasticsearch
that referenced
this pull request
May 7, 2020
When an index spans a daylight savings time transition we can't use our optimization that rewrites the requested time zone to a fixed time zone and instead we used to fall back to a java.util.time based rounding implementation. In elastic#55559 we optimized "time unit" rounding. This optimizes "time interval" rounding. The java.util.time based implementation is about 1650% slower than the rounding implementation for a fixed time zone. This replaces it with a similar optimization that is only about 30% slower than the fixed time zone. The java.util.time implementation allocates a ton of short lived objects but the optimized implementation doesn't. So it *might* end up being faster than the microbenchmarks imply.
nik9000
added a commit
to nik9000/elasticsearch
that referenced
this pull request
May 7, 2020
This wires `auto_date_histogram` into the rounding optimization that I built in elastic#55559. This is should significantly speed up any `auto_date_histogram`s with `time_zone`s on them.
nik9000
added a commit
that referenced
this pull request
May 7, 2020
When an index spans a daylight savings time transition we can't use our optimization that rewrites the requested time zone to a fixed time zone and instead we used to fall back to a java.util.time based rounding implementation. In #55559 we optimized "time unit" rounding. This optimizes "time interval" rounding. The java.util.time based implementation is about 1650% slower than the rounding implementation for a fixed time zone. This replaces it with a similar optimization that is only about 30% slower than the fixed time zone. The java.util.time implementation allocates a ton of short lived objects but the optimized implementation doesn't. So it *might* end up being faster than the microbenchmarks imply.
nik9000
added a commit
to nik9000/elasticsearch
that referenced
this pull request
May 7, 2020
When an index spans a daylight savings time transition we can't use our optimization that rewrites the requested time zone to a fixed time zone and instead we used to fall back to a java.util.time based rounding implementation. In elastic#55559 we optimized "time unit" rounding. This optimizes "time interval" rounding. The java.util.time based implementation is about 1650% slower than the rounding implementation for a fixed time zone. This replaces it with a similar optimization that is only about 30% slower than the fixed time zone. The java.util.time implementation allocates a ton of short lived objects but the optimized implementation doesn't. So it *might* end up being faster than the microbenchmarks imply.
nik9000
added a commit
that referenced
this pull request
May 8, 2020
When an index spans a daylight savings time transition we can't use our optimization that rewrites the requested time zone to a fixed time zone and instead we used to fall back to a java.util.time based rounding implementation. In #55559 we optimized "time unit" rounding. This optimizes "time interval" rounding. The java.util.time based implementation is about 1650% slower than the rounding implementation for a fixed time zone. This replaces it with a similar optimization that is only about 30% slower than the fixed time zone. The java.util.time implementation allocates a ton of short lived objects but the optimized implementation doesn't. So it *might* end up being faster than the microbenchmarks imply.
nik9000
added a commit
that referenced
this pull request
May 9, 2020
This wires `auto_date_histogram` into the rounding optimization that I built in #55559. This is should significantly speed up any `auto_date_histogram`s with `time_zone`s on them.
nik9000
added a commit
to nik9000/elasticsearch
that referenced
this pull request
May 9, 2020
This wires `auto_date_histogram` into the rounding optimization that I built in elastic#55559. This is should significantly speed up any `auto_date_histogram`s with `time_zone`s on them.
nik9000
added a commit
that referenced
this pull request
May 9, 2020
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Rounding dates on a shard that contains a daylight savings time transition
is currently something like 1400% slower than when a shard contains dates
only on one side of the DST transition. And it makes a ton of short lived
garbage. This replaces that implementation with one that benchmarks to
having around 30% overhead instead of the 1400%. And it doesn't generate
any garbage per search hit.
Some background:
There are two ways to round in ES:
I'm only optimizing the first one in this change and plan to do the second
in a follow up. It turns out that rounding to the nearest unit really is
two problems: when the unit rounds to midnight (day/week/month/year) and
when it doesn't (hour/minute/second). Rounding to midnight is consistently
about 25% faster and rounding to individual hour or minutes.
This optimization relies on being able to usually figure out what the
minimum and maximum dates are on the shard. This is similar to an existing
optimization where we rewrite time zones that aren't fixed
(think America/New_York and its daylight savings time transitions) into
fixed time zones so long as there isn't a daylight savings time transition
on the shard (UTC-5 or UTC-4 for America/New_York). Once I implement
time interval rounding the time zone rewriting optimization should no
longer be needed.
This optimization doesn't come into play for
compositeorauto_date_histogramaggs because neither have been migrated to the newDATEValuesSourceTypewhich is where that range lookup happens. Whenthey are they will be able to pick up the optimization without much work.
I expect this to be substantial for
auto_date_histogrambut less so forcompositebecause it deals with fewer values.Note: My 30% overhead figure comes from small numbers of daylight savings
time transitions. That overhead gets higher when there are more
transitions in logarithmic fashion. When there are two thousand years
worth of transitions my algorithm ends up being 250% slower than rounding
without a time zone, but java time is 47000% slower at that point,
allocating memory as fast as it possibly can.