Impl intoBitset for IndexedDISI and Docvalues#14529
Conversation
|
To keep things as simple as possible, I'm wondering about:
|
|
Thanks for feedback,
I won't assume But I'm not sure how much performance this approach gains, i agree to keep it simple as a first step.
I'm a bit confused on this, current patch has already been loading docs into a in-memory bitset and reuse |
I was wondering if we should go one step further and load the bitset into heap eagerly just after reading the block header, and then using it for nextDoc/advance too, instead of doing it lazily and just for intoBitSet. |
|
I'm not sure, would this cause heavy overreading in scenarios where only advance/advanceExact is required a few times in this block? E.g. we saw aggregation on hundreds of metrics with small total hits, which could be the case harmed by the overreading. |
|
I personally prefer not to eagerly load docs for advance/advanceExact, or maybe leave it to another issue/PR. I update code to one of the version during my local iteration, which reduces the complexity by using a single |
| method = Method.SPARSE; | ||
| blockEnd = slice.getFilePointer() + (numValues << 1); | ||
| nextExistDocInBlock = -1; | ||
| exists = false; |
There was a problem hiding this comment.
Is this a pre-existing bug?
There was a problem hiding this comment.
This is to make intoBitset work with advanceExact, which is not really needed. I removed this.
| bitSet.set(disi.doc - offset); | ||
| } | ||
|
|
||
| for (int i = disi.index, to = disi.nextBlockIndex; i < to; i++) { |
There was a problem hiding this comment.
I am a bit surprised that this loop starts at disi.index instead of disi.index + 1 since the doc at disi.index was already handled above?
There was a problem hiding this comment.
I changed this to make it clearer.
|
|
||
| @Override | ||
| public void intoBitSet(int upTo, FixedBitSet bitSet, int offset) throws IOException { | ||
| assert doc >= offset; |
There was a problem hiding this comment.
Can you also assert something like doc == NOT_MORE_DOCS || advanceExact(doc)? It shouldn't be legal to call intoBitSet after advanceExact returns false (or maybe a better place for this would be asserting doc values).
It looks like some implementations of intoBitsetWithinBlock try to cover the case when the current doc doesn't exist, maybe they can be simplified?
There was a problem hiding this comment.
(or maybe a better place for this would be asserting doc values
+1
It looks like some implementations of intoBitsetWithinBlock try to cover the case when the current doc doesn't exist, maybe they can be simplified?
Done :)
| } | ||
| bitSet.set(disi.doc - offset); | ||
|
|
||
| for (int i = disi.index + 1, to = disi.nextBlockIndex; i <= to; i++) { |
There was a problem hiding this comment.
I'm still a bit confused, why is disi.nextBlockIndex included in the loop when it's excluded in advanceWithinBlock?
There was a problem hiding this comment.
In advanceWithinBlock, disi.index++ is executed in the loop body before if (doc >= targetInBlock) so it is not actually excluded. advanceWithinBlock can return true when disi.index equals disi.nextBlockIndex.
There was a problem hiding this comment.
I update to the pattern like advanceWithinBlock
| public void intoBitSet(int upTo, FixedBitSet bitSet, int offset) throws IOException { | ||
| assert doc >= offset; | ||
| while (method.intoBitsetWithinBlock(this, upTo, bitSet, offset) == false) { | ||
| readBlockHeader(); |
There was a problem hiding this comment.
I wonder if we should call something like below after reading the block header (similar to what advance(target) does):
boolean found = method.advanceWithinBlock(this, block);
assert found;
This would guarantee that disi.doc is always a doc ID of the block when intoBitsetWithinBlock is called, which could help simplify some implementations of intoBitsetWithinBlock? (I'm looking at DENSE in particular)
There was a problem hiding this comment.
I like the simplification!
| * false return, fp of disi#slice is at blockEnd and disi#index is correct but other status vars | ||
| * are undefined. Caller should decode the header of next block by {@link #readBlockHeader()}. | ||
| */ | ||
| abstract boolean intoBitsetWithinBlock( |
There was a problem hiding this comment.
Nit: for consistency with how it's called on DocIdSetIterator.
| abstract boolean intoBitsetWithinBlock( | |
| abstract boolean intoBitSetWithinBlock( |
jpountz
left a comment
There was a problem hiding this comment.
I left some minor comments, other than that it looks good to me!
| boolean intoBitSetWithinBlock(IndexedDISI disi, int upTo, FixedBitSet bitSet, int offset) | ||
| throws IOException { | ||
| if (disi.doc >= upTo) { | ||
| advanceWithinBlock(disi, disi.doc); |
There was a problem hiding this comment.
Do we really need to call advanceWithinBlock? If disi.doc is already >= upTo, there shouldn't be anything to do?
There was a problem hiding this comment.
This was useful before we advancing disi.block, but not now. I like the simplification more :)
| int sourceTo = Math.min(upTo - disi.block, BLOCK_SIZE); | ||
| int numWords = FixedBitSet.bits2words(sourceTo) - sourceStartWord; | ||
|
|
||
| if (disi.bitSet == null || disi.bitSet.getBits().length < numWords) { |
There was a problem hiding this comment.
Should we always create a bit set of size BLOCK_SIZE when it's null for simplicity? This is what users will get in the worst-case scenario anyway, and it's not crazy (65k bits = 8kB)?
| } | ||
| assertEquals(DocIdSetIterator.NO_MORE_DOCS, actual.nextDoc()); | ||
| set1.clear(); | ||
| set2.clear(); |
There was a problem hiding this comment.
Can you also check that nextDoc() returns the expected value after intoBitSet returns? To make sure that all required state is set correctly?
There was a problem hiding this comment.
This helps me find a bug: upTo could be a number in last block when RANGE do intoBitSetWithinBlock, i'm missing a doc > upTo check there.
|
|
||
| int sourceFrom = disi.doc & 0xFFFF; | ||
| int sourceTo = Math.min(upTo - disi.block, BLOCK_SIZE); | ||
| int destFrom = disi.block - offset + sourceFrom; |
There was a problem hiding this comment.
Is this the same as disi.doc - offset?
| int destFrom = disi.block - offset + sourceFrom; | ||
|
|
||
| long fp = disi.slice.getFilePointer(); | ||
| disi.slice.seek(fp - Long.BYTES); |
There was a problem hiding this comment.
nit: maybe leave a small comment explaining why we're seeking backwards, I believe that it's to include the word that contains the current doc, right?
Apply changes made between 10.2.2 and 10.3.1 versions:
```
$ g log --oneline releases/lucene/10.2.2..releases/lucene/10.3.1 lucene/core/src/java/org/apache/lucene/codecs/lucene90/{Lucene90DocValuesProducer.java,Lucene90DocValuesConsumer.java,Lucene90DocValuesFormat.java}
d176a42b659 Implement IndexedDISI#docIDRunEnd (#14753)
223d7a41e3c [10.x] Change uses of withReadAdvice to use hints instead (#14629) (#14510)
bb3167e57c6 Impl intoBitset for IndexedDISI and Docvalues (#14529)
```
Relevant PRs:
apache/lucene#14753
apache/lucene#14510
apache/lucene#14529
Follows: #18545
Apply changes made between 10.2.2 and 10.3.1 versions:
```
$ g log --oneline releases/lucene/10.2.2..releases/lucene/10.3.1 lucene/core/src/java/org/apache/lucene/codecs/lucene90/{Lucene90DocValuesProducer.java,Lucene90DocValuesConsumer.java,Lucene90DocValuesFormat.java}
d176a42b659 Implement IndexedDISI#docIDRunEnd (#14753)
223d7a41e3c [10.x] Change uses of withReadAdvice to use hints instead (#14629) (#14510)
bb3167e57c6 Impl intoBitset for IndexedDISI and Docvalues (#14529)
```
Relevant PRs:
apache/lucene#14753
apache/lucene#14510
apache/lucene#14529
Follows: #18545
Apply changes made between 10.2.2 and 10.3.1 versions:
```
$ g log --oneline releases/lucene/10.2.2..releases/lucene/10.3.1 lucene/core/src/java/org/apache/lucene/codecs/lucene90/{Lucene90DocValuesProducer.java,Lucene90DocValuesConsumer.java,Lucene90DocValuesFormat.java}
d176a42b659 Implement IndexedDISI#docIDRunEnd (#14753)
223d7a41e3c [10.x] Change uses of withReadAdvice to use hints instead (#14629) (#14510)
bb3167e57c6 Impl intoBitset for IndexedDISI and Docvalues (#14529)
```
Relevant PRs:
apache/lucene#14753
apache/lucene#14510
apache/lucene#14529
Follows: #18545
Apply changes made between 10.2.2 and 10.3.1 versions:
```
$ g log --oneline releases/lucene/10.2.2..releases/lucene/10.3.1 lucene/core/src/java/org/apache/lucene/codecs/lucene90/{Lucene90DocValuesProducer.java,Lucene90DocValuesConsumer.java,Lucene90DocValuesFormat.java}
d176a42b659 Implement IndexedDISI#docIDRunEnd (#14753)
223d7a41e3c [10.x] Change uses of withReadAdvice to use hints instead (#14629) (#14510)
bb3167e57c6 Impl intoBitset for IndexedDISI and Docvalues (#14529)
```
Relevant PRs:
apache/lucene#14753
apache/lucene#14510
apache/lucene#14529
Follows: #18545
Apply changes made between 10.2.2 and 10.3.1 versions:
```
$ g log --oneline releases/lucene/10.2.2..releases/lucene/10.3.1 lucene/core/src/java/org/apache/lucene/codecs/lucene90/{Lucene90DocValuesProducer.java,Lucene90DocValuesConsumer.java,Lucene90DocValuesFormat.java}
d176a42b659 Implement IndexedDISI#docIDRunEnd (#14753)
223d7a41e3c [10.x] Change uses of withReadAdvice to use hints instead (#14629) (#14510)
bb3167e57c6 Impl intoBitset for IndexedDISI and Docvalues (#14529)
```
Relevant PRs:
apache/lucene#14753
apache/lucene#14510
apache/lucene#14529
Follows: #18545
Apply changes made between 10.2.2 and 10.3.1 versions:
```
$ g log --oneline releases/lucene/10.2.2..releases/lucene/10.3.1 lucene/core/src/java/org/apache/lucene/codecs/lucene90/{Lucene90DocValuesProducer.java,Lucene90DocValuesConsumer.java,Lucene90DocValuesFormat.java}
d176a42b659 Implement IndexedDISI#docIDRunEnd (#14753)
223d7a41e3c [10.x] Change uses of withReadAdvice to use hints instead (#14629) (#14510)
bb3167e57c6 Impl intoBitset for IndexedDISI and Docvalues (#14529)
```
Relevant PRs:
apache/lucene#14753
apache/lucene#14510
apache/lucene#14529
Follows: #18545
Implement
intoBitsetforIndexedDISIand Docvalues.intoBitsetof Docvalues has already been called in competitive iterators, and can also be used to speed up soft delete operations.Logic of this impl is different from default impl when calling
intoBitsetafteradvanceExactreturn false, where default impl starts from a not-existing doc but this impl starts from the next doc. I do not know which one is intended, maybe this should just be undefined.relates: #14521