-
Notifications
You must be signed in to change notification settings - Fork 1.3k
Make Lucene smarter about long runs of matches #11915
Description
Description
Lucene's abstractions are good at dealing with long runs of documents that do not match a query, but much less at dealing with long runs of documents that match a query. In such cases, Lucene still needs to linearly scan these matches and do some amount of work for each of the matches.
Is it actually common to have long runs of matches? For full-text indexes, maybe not so much, only stop words may have runs of adjacent matches. For string fields, this may happen if the field has a default value that is the value of most documents in the collection. Also it's possible for users to use index sorting in order to cluster similar documents together, which increases the likelihood to have long runs of adjacent matches.
One idea would be to augment the DocIdSetIterator API to add a new peekNextNonMatchingDocID method, which would return the next doc ID that may not be a match. The default implementation could return docID() + 1. We'd need to implement this API in postings, doc-value iterators and a few other important DocIdSetIterators like BitSetIterator and DocIdSetIterator#all. And propagate this information in disjunctions and conjunctions. Then we could leverage this API in a number of places:
ReqExclScorercould ask the prohibited clause for the next doc ID that is worth evaluating on the required clause.- Conjunctions could ignore non-scoring clauses over ranges of doc IDs that all match.
FixedBitSet#orcould set ranges of doc IDs at once, which would in-turn speed upDocIdSetBuilder.