-
Notifications
You must be signed in to change notification settings - Fork 1.3k
Optimize count() for BooleanQuery disjunction #12358
Description
Description
Context: we (Amazon customer facing product search team, and also AWS) are attempting to understand the amazing performance Tantivy (Rust search engine) has over Lucene, iterating in this GitHub repo. That repo is sort of a merger of Lucene's benchmarking code (luceneutil), including its tasks and enwiki corpus, and the open source Tantivy benchmark. Tantivy is impressively fast :)
This issue is a spinoff from this fascinating comment by @fulmicoton, creator and maintainer of Tantivy.
Tantivy optimizes count() for BooleanQuery disjunctions much like Lucene's BooleanScorer, by scoring in a windowed bitset of N docs at once, and then pop-counting the set bits in each window. This is not technically a sub-linear implementation: it is still linear, but I suspect with a smaller constant factor than the default count() fallback Lucene implements.
Perhaps, for all cases where BooleanQuery uses the windowed BooleanScorer, we could also implement this count() optimization.
From my read of Lucene's BooleanWeight.count, I don't think Lucene has this optimization? Maybe we should port over Tantivy's optimization? It should make disjunctive counting quite a bit faster?