Optimize composite aggregation based on index sorting#48399
Conversation
…sort is applicable. Add index sort uts.
|
Pinging @elastic/es-analytics-geo (:Analytics/Aggregations) |
|
@howardhuanghua I couldn't push directly to your branch so I opened a new pr that retains your commit. I hope you don't mind if we continue to work here since we both have the right to push modifications. I merged the two approaches that we discussed in the original pr so I think this pr is ready for a review. We still need to update the documentation but I wanted to get your opinion first. |
...rc/main/java/org/elasticsearch/search/aggregations/bucket/composite/CompositeAggregator.java
Show resolved
Hide resolved
|
Hi @jimczi , I am ok and thanks for your merging. It's my pleasure that I could contribute on this optimization with you together. I left a comment about the merged code, please help to check. Thank you. |
...rc/main/java/org/elasticsearch/search/aggregations/bucket/composite/CompositeAggregator.java
Outdated
Show resolved
Hide resolved
...st/java/org/elasticsearch/search/aggregations/bucket/composite/CompositeAggregatorTests.java
Outdated
Show resolved
Hide resolved
...rc/main/java/org/elasticsearch/search/aggregations/bucket/composite/CompositeAggregator.java
Show resolved
Hide resolved
...rc/main/java/org/elasticsearch/search/aggregations/bucket/composite/CompositeAggregator.java
Show resolved
Hide resolved
...rc/main/java/org/elasticsearch/search/aggregations/bucket/composite/CompositeAggregator.java
Outdated
Show resolved
Hide resolved
...rc/main/java/org/elasticsearch/search/aggregations/bucket/composite/CompositeAggregator.java
Outdated
Show resolved
Hide resolved
|
Thanks for looking @jpountz and @howardhuanghua . I pushed some commits to address your comments and added a section in the docs to explain the early termination. Could you take another look ? |
| for (Query query : queries) { | ||
| executeTestCase(false, query, dataset, create, verify); | ||
| executeTestCase(true, query, dataset, create, verify); | ||
| executeTestCase(false, false, query, dataset, create, verify); |
There was a problem hiding this comment.
Since these two TestCase set useIndexSort as false, then it could not be early terminated. This would make testEarlyTermination() failed.
There was a problem hiding this comment.
testEarlyTermination uses executeTestCase directly and sets useIndexSort to true so I think we're good here. However I found a test bug when the AssertingCodec is used, this codec wraps docvalues and that breaks DocValues#unwrapSingleton contract. So I pushed 1fa763b to suppress this code in the early termination tests.
There was a problem hiding this comment.
Even suppressed codecs, it seems still got the chance to use
keyword -> {AssertingDocValuesFormat$AssertingDocValuesProducer@4968}. And testEarlyTermination would failed occasionally.
There was a problem hiding this comment.
Good catch thanks, I pushed b52d5c7 to force the suppression of the Asserting codec but that means that the test is ignored when the randomization choose this codec. I need to dig a bit more to understand how we can ensure that the test always runs.
There was a problem hiding this comment.
Hi @jimczi , I have checked that system properties like tests.codec, tests.docvaluesformat and tests.postingsformat default value is random in LuceneTestCase.java, AssertingCodec would be selected.
It seems we could set default Lucene80 codec directly in executeTestCase method as follow to solve the problem:
if (indexSort != null) {
config.setCodec(TestUtil.getDefaultCodec()); // set codec as default to avoid AssertingCodec
config.setIndexSort(indexSort);
}
|
|
||
| The `composite` aggregation can early terminate the collection under some circumstances: | ||
|
|
||
| * If the primary source extracts values from a numeric or a keyword field and the query matches all documents (`match_all` query). |
There was a problem hiding this comment.
Early terminate optimization could also be used for non match_all query?
There was a problem hiding this comment.
Only if the index is sorted, yes. I also wanted to mention that we early terminate if the index is not sorted but I see how it's confusing now. I pushed 1fa763b to reword the paragraph to emphasize on index sorting first.
...st/java/org/elasticsearch/search/aggregations/bucket/composite/CompositeAggregatorTests.java
Show resolved
Hide resolved
|
I pushed another commit to address the remaining TODOs and the reviews. There is still one TODO left regarding how we could handle |
...java/org/elasticsearch/search/aggregations/bucket/composite/CompositeValuesSourceConfig.java
Outdated
Show resolved
Hide resolved
...java/org/elasticsearch/search/aggregations/bucket/composite/CompositeValuesSourceConfig.java
Show resolved
Hide resolved
|
Hi @jimczi , LGTM. Thanks for your great efforts to fix these issues. |
|
Hi @jimczi, would you please help to merge this PR? Or we need further review? Thank you. |
iverase
left a comment
There was a problem hiding this comment.
I did a first pass and left some questions. I will be doing a second pass and focus on the tests.
...rc/main/java/org/elasticsearch/search/aggregations/bucket/composite/CompositeAggregator.java
Show resolved
Hide resolved
...st/java/org/elasticsearch/search/aggregations/bucket/composite/CompositeAggregatorTests.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/search/searchafter/SearchAfterBuilder.java
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/search/internal/ContextIndexSearcher.java
Outdated
Show resolved
Hide resolved
iverase
left a comment
There was a problem hiding this comment.
The logic makes sense to me. lgtm
|
Thanks @howardhuanghua and sorry for the delay. I merged in master and added you as co-author. I'll now backport to 7.x. |
Co-authored-by: Daniel Huang <danielhuang@tencent.com> This is a spinoff of #48130 that generalizes the proposal to allow early termination with the composite aggregation when leading sources match a prefix or the entire index sort specification. In such case the composite aggregation can use the index sort natural order to early terminate the collection when it reaches a composite key that is greater than the bottom of the queue. The optimization is also applicable when a query other than match_all is provided. However the optimization is deactivated for sources that match the index sort in the following cases: * Multi-valued source, in such case early termination is not possible. * missing_bucket is set to true
|
@jimczi It's my pleasure to co-work with you on this optimization. Thanks for for your help. |
Co-authored-by: Daniel Huang <danielhuang@tencent.com> This is a spinoff of #48130 that generalizes the proposal to allow early termination with the composite aggregation when leading sources match a prefix or the entire index sort specification. In such case the composite aggregation can use the index sort natural order to early terminate the collection when it reaches a composite key that is greater than the bottom of the queue. The optimization is also applicable when a query other than match_all is provided. However the optimization is deactivated for sources that match the index sort in the following cases: * Multi-valued source, in such case early termination is not possible. * missing_bucket is set to true
Co-authored-by: Daniel Huang <danielhuang@tencent.com> This is a spinoff of elastic#48130 that generalizes the proposal to allow early termination with the composite aggregation when leading sources match a prefix or the entire index sort specification. In such case the composite aggregation can use the index sort natural order to early terminate the collection when it reaches a composite key that is greater than the bottom of the queue. The optimization is also applicable when a query other than match_all is provided. However the optimization is deactivated for sources that match the index sort in the following cases: * Multi-valued source, in such case early termination is not possible. * missing_bucket is set to true
| private RoaringDocIdSet.Builder docIdSetBuilder; | ||
| private BucketCollector deferredCollectors; | ||
|
|
||
| private boolean earlyTerminated; |
This is a spinoff of #48130 that generalizes the proposal to allow early termination with the composite aggregation when leading sources match a prefix or the entire index sort specification.
In such case the composite aggregation can use the index sort natural order to early terminate the collection when it reaches a composite key that is greater than the bottom of the queue.
The optimization is also applicable when a query other than
match_allis provided. However the optimization is deactivated for sources that match the index sort in the following cases:missing_bucketis set to trueI retained the commit from #48130 and merged the original idea that @howardhuanghua described in the pr to early terminate even when the sort is reversed in the leading source.