Shortcut query phase using the results of other shards#51852
Merged
jimczi merged 53 commits intoelastic:masterfrom Mar 17, 2020
Merged
Shortcut query phase using the results of other shards#51852jimczi merged 53 commits intoelastic:masterfrom
jimczi merged 53 commits intoelastic:masterfrom
Conversation
This change ensures that the rewrite of the shard request is executed in the network thread or in the refresh listener when waiting for an active shard. This allows queries that rewrite to match_no_docs to bypass the search thread pool entirely even if the can_match phase was skipped (pre_filter_shard_size > number of shards). Coordinating nodes don't have the ability to create empty responses so this change also ensures that at least one shard creates a full empty response while the other can return null ones. This is needed since creating true empty responses on shards require to create concrete aggregators which would be too costly to build on a network thread. We should move this functionality to aggregation builders in a follow up but that would be a much bigger change. This change is also important for elastic#49601 since we want to add the ability to use the result of other shards to rewrite the request of subsequent ones. For instance if the first M shards have their top N computed, the top worst document in the global queue can be pass to subsequent shards that can then rewrite to match_no_docs if they can guarantee that they don't have any document better than the provided one.
Collaborator
|
Pinging @elastic/es-distributed (:Distributed/Distributed) |
Collaborator
|
Pinging @elastic/es-search (:Search/Search) |
mayya-sharipova
approved these changes
Feb 10, 2020
Contributor
mayya-sharipova
left a comment
There was a problem hiding this comment.
Overall, LGTM, I just left a couple of comments.
| return new FetchSearchPhase(results, searchPhaseController, context); | ||
| } | ||
|
|
||
| ShardSearchRequest rewriteShardRequest(ShardSearchRequest request) { |
Contributor
There was a problem hiding this comment.
Not relevant to this PR, but In future, do we want to rewrite also requests without sort ( e.g. a keyword search) that can be shortcut?
Contributor
Author
There was a problem hiding this comment.
I am not sure I follow. Are you talking of handling queries sorted by _score ? We can probably propagate the global min competitive score up to the query collector so that wouldn't require any rewrite.
server/src/main/java/org/elasticsearch/action/search/SearchQueryThenFetchAsyncAction.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/action/search/SearchPhaseController.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/action/search/SearchQueryThenFetchAsyncAction.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/search/internal/ShardSearchRequest.java
Outdated
Show resolved
Hide resolved
jimczi
added a commit
that referenced
this pull request
Mar 18, 2020
jimczi
added a commit
that referenced
this pull request
Mar 18, 2020
This commit, built on top of #51708, allows to modify shard search requests based on informations collected on other shards. It is intended to speed up sorted queries on time-based indices. For queries that are only interested in the top documents. This change will rewrite the shard queries to match none if the bottom sort value computed in prior shards is better than all values in the shard. For queries that mix top documents and aggregations this change will reset the size of the top documents to 0 instead of rewriting to match none. This means that we don't need to keep a search context open for this shard since we know in advance that it doesn't contain any competitive hit.
jimczi
added a commit
to jimczi/elasticsearch
that referenced
this pull request
Mar 18, 2020
This commit disables the sort optimization added in elastic#51852 for scroll requests. Scroll queries keep a state per shard so we cannot modify the request on the first round (submit). This bug was introduced in non-released versions which is why this pr is marked as a non-issue.
jimczi
added a commit
that referenced
this pull request
Mar 19, 2020
This commit disables the sort optimization added in #51852 for scroll requests. Scroll queries keep a state per shard so we cannot modify the request on the first round (submit). This bug was introduced in non-released versions which is why this pr is marked as a non-issue.
jimczi
added a commit
that referenced
this pull request
Mar 19, 2020
This commit disables the sort optimization added in #51852 for scroll requests. Scroll queries keep a state per shard so we cannot modify the request on the first round (submit). This bug was introduced in non-released versions which is why this pr is marked as a non-issue.
jimczi
added a commit
to jimczi/elasticsearch
that referenced
this pull request
Aug 6, 2020
Collapse search queries that sort by a field can throw an ArrayStoreException due to a bug in the [sort optimization](elastic#51852) introduced in 7.7.0. Search collapsing were not supposed to be eligible for this sort optimization so this change explicitly filters them from this new feature.
jimczi
added a commit
that referenced
this pull request
Aug 6, 2020
Collapse search queries that sort by a field can throw an ArrayStoreException due to a bug in the [sort optimization](#51852) introduced in 7.7.0. Search collapsing were not supposed to be eligible for this sort optimization so this change explicitly filters them from this new feature.
jimczi
added a commit
that referenced
this pull request
Aug 6, 2020
Collapse search queries that sort by a field can throw an ArrayStoreException due to a bug in the [sort optimization](#51852) introduced in 7.7.0. Search collapsing were not supposed to be eligible for this sort optimization so this change explicitly filters them from this new feature.
jimczi
added a commit
that referenced
this pull request
Aug 6, 2020
Collapse search queries that sort by a field can throw an ArrayStoreException due to a bug in the [sort optimization](#51852) introduced in 7.7.0. Search collapsing were not supposed to be eligible for this sort optimization so this change explicitly filters them from this new feature.
javanna
added a commit
to javanna/elasticsearch
that referenced
this pull request
Dec 1, 2022
Whenever sorting on a date, numeric or keyword field (as primary sort), the can_match phase retrieves min and max for the field and sorts the shards (asc or desc depending on the sort order) so that they are going to be queried following that order. This allows incremental results to be exposed in that same order when using async search, as well as optimizations built on top of such behaviour (elastic#51852). For fields with points we call `getMinPackedValue` and `getMaxPackedValue`, while for keyword fields we call `Terms#getMin` and `Terms#getMax`. Elasticsearch uses `FilterTerms` implementations to cancel queries as well as to track field usage. Such filter implementations should delegate their `getMin` and `getMax` calls to the wrapped `Terms` instance, which will leverage info from the block tree that caches min and max, otherwise they are always going to be retrieved from the index, which does I/O and slows the can_match phase down.
javanna
added a commit
to javanna/rally-tracks
that referenced
this pull request
Dec 19, 2022
… and geonames tracks We recently found a regression that affected searches sorted by keyword field (elastic/elasticsearch#92026). Given that we had no benchmarks for sorting by keyword, this commit adds the relevant operations to the http-logs and geonames tracks. Geonames is a good base but it's good to also make the new challenges part of the many-shards benchmarks as differences can be appreciated with a high amount of shards involved in a query. This commit adds also specific challenges to verify the effect of elastic/elasticsearch#51852 when a search is sorted by numeric or timestamp.
javanna
added a commit
that referenced
this pull request
Jan 11, 2023
Whenever sorting on a date, numeric or keyword field (as primary sort), the can_match phase retrieves min and max for the field and sorts the shards (asc or desc depending on the sort order) so that they are going to be queried following that order. This allows incremental results to be exposed in that same order when using async search, as well as optimizations built on top of such behaviour (#51852). For fields with points we call `getMinPackedValue` and `getMaxPackedValue`, while for keyword fields we call `Terms#getMin` and `Terms#getMax`. Elasticsearch uses `FilterTerms` implementations to cancel queries as well as to track field usage. Such filter implementations should delegate their `getMin` and `getMax` calls to the wrapped `Terms` instance, which will leverage info from the block tree that caches min and max, otherwise they are always going to be retrieved from the index, which does I/O and slows the can_match phase down.
javanna
added a commit
to javanna/elasticsearch
that referenced
this pull request
Jan 11, 2023
…#92026) Whenever sorting on a date, numeric or keyword field (as primary sort), the can_match phase retrieves min and max for the field and sorts the shards (asc or desc depending on the sort order) so that they are going to be queried following that order. This allows incremental results to be exposed in that same order when using async search, as well as optimizations built on top of such behaviour (elastic#51852). For fields with points we call `getMinPackedValue` and `getMaxPackedValue`, while for keyword fields we call `Terms#getMin` and `Terms#getMax`. Elasticsearch uses `FilterTerms` implementations to cancel queries as well as to track field usage. Such filter implementations should delegate their `getMin` and `getMax` calls to the wrapped `Terms` instance, which will leverage info from the block tree that caches min and max, otherwise they are always going to be retrieved from the index, which does I/O and slows the can_match phase down.
javanna
added a commit
that referenced
this pull request
Jan 12, 2023
…#92854) Whenever sorting on a date, numeric or keyword field (as primary sort), the can_match phase retrieves min and max for the field and sorts the shards (asc or desc depending on the sort order) so that they are going to be queried following that order. This allows incremental results to be exposed in that same order when using async search, as well as optimizations built on top of such behaviour (#51852). For fields with points we call `getMinPackedValue` and `getMaxPackedValue`, while for keyword fields we call `Terms#getMin` and `Terms#getMax`. Elasticsearch uses `FilterTerms` implementations to cancel queries as well as to track field usage. Such filter implementations should delegate their `getMin` and `getMax` calls to the wrapped `Terms` instance, which will leverage info from the block tree that caches min and max, otherwise they are always going to be retrieved from the index, which does I/O and slows the can_match phase down.
danielmitterdorfer
pushed a commit
to danielmitterdorfer/elasticsearch
that referenced
this pull request
Jan 12, 2023
…#92026) Whenever sorting on a date, numeric or keyword field (as primary sort), the can_match phase retrieves min and max for the field and sorts the shards (asc or desc depending on the sort order) so that they are going to be queried following that order. This allows incremental results to be exposed in that same order when using async search, as well as optimizations built on top of such behaviour (elastic#51852). For fields with points we call `getMinPackedValue` and `getMaxPackedValue`, while for keyword fields we call `Terms#getMin` and `Terms#getMax`. Elasticsearch uses `FilterTerms` implementations to cancel queries as well as to track field usage. Such filter implementations should delegate their `getMin` and `getMax` calls to the wrapped `Terms` instance, which will leverage info from the block tree that caches min and max, otherwise they are always going to be retrieved from the index, which does I/O and slows the can_match phase down.
javanna
added a commit
to javanna/elasticsearch
that referenced
this pull request
Jan 12, 2023
…#92026) Whenever sorting on a date, numeric or keyword field (as primary sort), the can_match phase retrieves min and max for the field and sorts the shards (asc or desc depending on the sort order) so that they are going to be queried following that order. This allows incremental results to be exposed in that same order when using async search, as well as optimizations built on top of such behaviour (elastic#51852). For fields with points we call `getMinPackedValue` and `getMaxPackedValue`, while for keyword fields we call `Terms#getMin` and `Terms#getMax`. Elasticsearch uses `FilterTerms` implementations to cancel queries as well as to track field usage. Such filter implementations should delegate their `getMin` and `getMax` calls to the wrapped `Terms` instance, which will leverage info from the block tree that caches min and max, otherwise they are always going to be retrieved from the index, which does I/O and slows the can_match phase down.
javanna
added a commit
that referenced
this pull request
Jan 12, 2023
Whenever sorting on a date, numeric or keyword field (as primary sort), the can_match phase retrieves min and max for the field and sorts the shards (asc or desc depending on the sort order) so that they are going to be queried following that order. This allows incremental results to be exposed in that same order when using async search, as well as optimizations built on top of such behaviour (#51852). For fields with points we call `getMinPackedValue` and `getMaxPackedValue`, while for keyword fields we call `Terms#getMin` and `Terms#getMax`. Elasticsearch uses `FilterTerms` implementations to cancel queries as well as to track field usage. Such filter implementations should delegate their `getMin` and `getMax` calls to the wrapped `Terms` instance, which will leverage info from the block tree that caches min and max, otherwise they are always going to be retrieved from the index, which does I/O and slows the can_match phase down.
javanna
added a commit
to elastic/rally-tracks
that referenced
this pull request
Jan 23, 2023
… and geonames tracks (#357) We recently found a regression that affected searches sorted by a keyword field (elastic/elasticsearch#92026). Given that we had no benchmarks for sorting by keyword, this commit adds the relevant operations to the http-logs and geonames tracks. We will want to also add similar challenges to the many-shards benchmarks, as the regressions we found can be seen with more than a couple of shards. This commit adds also specific challenges to verify the effect of elastic/elasticsearch#51852 when a search is sorted by numeric or timestamp.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This commit, built on top of #51708, allows to modify shard search requests based on informations collected on other shards. It is intended to speed up sorted queries on time-based indices. For queries that are only interested in the top documents.
This change will rewrite the shard queries to match none if the bottom sort value computed in prior shards is better than all values in the shard.
For queries that mix top documents and aggregations this change will reset the size of the top documents to 0 instead of rewriting to match none.
This means that we don't need to keep a search context open for this shard since we know in advance that it doesn't contain any competitive hit.
Closes #49601