Remove single shard optimization when suggesting shard_size #37041
Merged
javanna merged 4 commits into elastic:master from Jan 2, 2019
When executing terms aggregations, we set shard_size, the number of buckets to collect on each shard, to a value higher than the number of requested buckets, to guarantee a basic level of precision. We have an optimization in place that leaves shard_size set to size whenever we are searching against a single shard, in which case maximum precision is guaranteed by definition. This optimization requires access to the total number of shards that the search is executing against. In the context of cross-cluster search (CCS), once we introduce multiple reduction steps (one per cluster), each cluster will only know the number of its local shards, which is problematic because we can only optimize if we are searching against a single shard in a single cluster. While discussing how to address the CCS scenario, we decided that we do not want to introduce further complexity for the sake of this single shard optimization, which benefits only a minority of cases and whose benefits are not large. This commit removes the single shard optimization, meaning that the heuristic that determines how many buckets to collect on each shard is always applied, even when searching against a single shard.
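A minimal sketch of the behaviour described above, before and after the change. The class and method names are hypothetical and are not taken from the Elasticsearch code base. The formula `size * 1.5 + 10` is assumed from the documented default for the terms aggregation `shard_size`; the actual implementation may differ in detail.

```java
final class ShardSizeSketch {

    /**
     * Heuristic shard-side bucket count (assumed formula): 50% more buckets
     * than requested, plus a small constant that helps for small sizes.
     */
    static int suggestShardSize(int size) {
        if (size < 1) {
            throw new IllegalArgumentException("size must be positive, got " + size);
        }
        long suggested = (long) (size * 1.5 + 10);
        // Cap at Integer.MAX_VALUE so very large sizes do not overflow.
        return (int) Math.min(Integer.MAX_VALUE, suggested);
    }

    /**
     * Behaviour before the change (hypothetical reconstruction): skip the
     * heuristic when the search targets exactly one shard.
     */
    static int suggestShardSizeBefore(int size, boolean singleShard) {
        if (size < 1) {
            throw new IllegalArgumentException("size must be positive, got " + size);
        }
        return singleShard ? size : suggestShardSize(size);
    }

    public static void main(String[] args) {
        int size = 10;
        System.out.println("before, single shard:    " + suggestShardSizeBefore(size, true));
        System.out.println("before, multiple shards: " + suggestShardSizeBefore(size, false));
        System.out.println("after, any shard count:  " + suggestShardSize(size));
    }
}
```

With the single shard branch gone, the suggestion no longer depends on the shard count, so a node that only sees its local shards computes the same value as any other.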
Collaborator
Pinging @elastic/es-analytics-geo
jpountz
approved these changes
Dec 31, 2018
Contributor
jpountz
left a comment
It looks good. Should we remove SearchContext#numberOfShards entirely to avoid adding back something like that in the future?
Contributor
Author
Would be nice, but it is still used in |
Contributor
Author
retest this please
Contributor
Author
run gradle build tests 2
Contributor
Author
javanna
added a commit
that referenced
this pull request
Jan 7, 2019
When executing terms aggregations, we set shard_size, the number of buckets to collect on each shard, to a value higher than the number of requested buckets, to guarantee a basic level of precision. We have an optimization in place that leaves shard_size set to size whenever we are searching against a single shard, in which case maximum precision is guaranteed by definition. This optimization requires access to the total number of shards that the search is executing against. In the context of cross-cluster search (CCS), once we introduce multiple reduction steps (one per cluster), each cluster will only know the number of its local shards, which is problematic because we should only optimize if we are searching against a single shard in a single cluster. We could be searching against one shard per cluster, in which case the current code would apply the optimization in every cluster and reduce the number of terms collected, causing a loss of precision. While discussing how to address the CCS scenario, we decided that we do not want to introduce further complexity for the sake of this single shard optimization, as it benefits only a minority of cases and its benefits are not large. This commit removes the single shard optimization, meaning that the heuristic that determines how many buckets to collect on each shard is always applied, even when searching against a single shard. As a result, more buckets will be collected than before when searching against a single shard. If that becomes a problem for some users, they can work around it by setting shard_size equal to size. Relates to #32125
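The loss of precision in the one-shard-per-cluster case can be illustrated with a small simulation. This is a sketch under stated assumptions, not Elasticsearch code: the class, the method names, and the term counts are all made up for illustration, and the reduce step is a plain sum of per-term counts. It requires Java 9 or later for `Map.of` and `List.of`.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class CrossClusterTermsSketch {

    /** Returns the n buckets with the highest counts, ties broken by term. */
    static List<Map.Entry<String, Integer>> top(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return entries.subList(0, Math.min(n, entries.size()));
    }

    /**
     * Collects shardSize buckets from every shard, sums the counts per term,
     * and returns the terms of the final top `size` buckets.
     */
    static List<String> reduce(List<Map<String, Integer>> shards, int size, int shardSize) {
        Map<String, Integer> merged = new HashMap<>();
        for (Map<String, Integer> shard : shards) {
            for (Map.Entry<String, Integer> bucket : top(shard, shardSize)) {
                merged.merge(bucket.getKey(), bucket.getValue(), Integer::sum);
            }
        }
        List<String> terms = new ArrayList<>();
        for (Map.Entry<String, Integer> bucket : top(merged, size)) {
            terms.add(bucket.getKey());
        }
        return terms;
    }

    public static void main(String[] args) {
        // Two clusters with one shard each (made-up counts). Term "c" has the
        // highest total (16) but is not the top term on either shard.
        List<Map<String, Integer>> shards = List.of(
                Map.of("a", 10, "c", 8),
                Map.of("d", 9, "c", 8));
        int size = 1;
        // Each cluster sees a single local shard and uses shard_size == size.
        System.out.println("shard_size = size:      " + reduce(shards, size, size));
        // Heuristic always applied (assumed formula size * 1.5 + 10).
        int heuristic = (int) (size * 1.5 + 10);
        System.out.println("shard_size = heuristic: " + reduce(shards, size, heuristic));
    }
}
```

Passing `shardSize` equal to `size` corresponds to the workaround mentioned above, which is only exact when the whole search really targets a single shard.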