Skip to content

Impl intoBitset for IndexedDISI and Docvalues#14529

Merged
gf2121 merged 19 commits intoapache:mainfrom
gf2121:dv_into_bitset
Apr 24, 2025
Merged

Impl intoBitset for IndexedDISI and Docvalues#14529
gf2121 merged 19 commits intoapache:mainfrom
gf2121:dv_into_bitset

Conversation

@gf2121
Copy link
Contributor

@gf2121 gf2121 commented Apr 21, 2025

Implement intoBitset for IndexedDISI and Docvalues.

intoBitset of Docvalues has already been called in competitive iterators, and can also be used to speed up soft delete operations.

Logic of this impl is different from default impl when calling intoBitset after advanceExact return false, where default impl starts from a not-existing doc but this impl starts from the next doc. I do not know which one is intended, maybe this should just be undefined.

relates: #14521

@jpountz
Copy link
Contributor

jpountz commented Apr 22, 2025

To keep things as simple as possible, I'm wondering about:

  • Using a very simple/robust impl in the SPARSE case that moves to the next doc in a loop until either the end of block or upTo, since it should never be a bottleneck (especially as DenseConjunctionBulkScorer is only used on dense iterators)?
  • Refactoring DENSE to load doc IDs into an in-memory bit set all the time, so that intoBitSet could directly reuse FixedBitSet.orRange.

@gf2121
Copy link
Contributor Author

gf2121 commented Apr 22, 2025

Thanks for feedback,

Using a very simple/robust impl in the SPARSE case that moves to the next doc in a loop until either the end of block or upTo, since it should never be a bottleneck (especially as DenseConjunctionBulkScorer is only used on dense iterators)?

I won't assume intoBitset will only be called by DenseConjunctionBulkScorer. I was also thinking about using it for soft delete (#14521) or docvalues' merge. IndexedDISI#writeBitset will always load the whole block of docs into a bitset.

But I'm not sure how much performance this approach gains, i agree to keep it simple as a first step.

Refactoring DENSE to load doc IDs into an in-memory bit set all the time, so that intoBitSet could directly reuse FixedBitSet.orRange.

I'm a bit confused on this, current patch has already been loading docs into a in-memory bitset and reuse orRange, could you be more specific?

@jpountz
Copy link
Contributor

jpountz commented Apr 22, 2025

I'm a bit confused on this, current patch has already been loading docs into a in-memory bitset and reuse orRange, could you be more specific?

I was wondering if we should go one step further and load the bitset into heap eagerly just after reading the block header, and then using it for nextDoc/advance too, instead of doing it lazily and just for intoBitSet.

@gf2121
Copy link
Contributor Author

gf2121 commented Apr 23, 2025

I'm not sure, would this cause heavy overreading in scenarios where only advance/advanceExact is required a few times in this block? E.g. we saw aggregation on hundreds of metrics with small total hits, which could be the case harmed by the overreading.

@gf2121
Copy link
Contributor Author

gf2121 commented Apr 23, 2025

I personally prefer not to eagerly load docs for advance/advanceExact, or maybe leave it to another issue/PR.

I update code to one of the version during my local iteration, which reduces the complexity by using a single orRange instead of multiple round of orRange. I wonder if this will make sense to you.

method = Method.SPARSE;
blockEnd = slice.getFilePointer() + (numValues << 1);
nextExistDocInBlock = -1;
exists = false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a pre-existing bug?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is to make intoBitset work with advanceExact, which is not really needed. I removed this.

bitSet.set(disi.doc - offset);
}

for (int i = disi.index, to = disi.nextBlockIndex; i < to; i++) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a bit surprised that this loop starts at disi.index instead of disi.index + 1 since the doc at disi.index was already handled above?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed this to make it clearer.


@Override
public void intoBitSet(int upTo, FixedBitSet bitSet, int offset) throws IOException {
assert doc >= offset;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also assert something like doc == NOT_MORE_DOCS || advanceExact(doc)? It shouldn't be legal to call intoBitSet after advanceExact returns false (or maybe a better place for this would be asserting doc values).

It looks like some implementations of intoBitsetWithinBlock try to cover the case when the current doc doesn't exist, maybe they can be simplified?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(or maybe a better place for this would be asserting doc values

+1

It looks like some implementations of intoBitsetWithinBlock try to cover the case when the current doc doesn't exist, maybe they can be simplified?

Done :)

}
bitSet.set(disi.doc - offset);

for (int i = disi.index + 1, to = disi.nextBlockIndex; i <= to; i++) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still a bit confused, why is disi.nextBlockIndex included in the loop when it's excluded in advanceWithinBlock?

Copy link
Contributor Author

@gf2121 gf2121 Apr 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In advanceWithinBlock, disi.index++ is executed in the loop body before if (doc >= targetInBlock) so it is not actually excluded. advanceWithinBlock can return true when disi.index equals disi.nextBlockIndex.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I update to the pattern like advanceWithinBlock

public void intoBitSet(int upTo, FixedBitSet bitSet, int offset) throws IOException {
assert doc >= offset;
while (method.intoBitsetWithinBlock(this, upTo, bitSet, offset) == false) {
readBlockHeader();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should call something like below after reading the block header (similar to what advance(target) does):

boolean found = method.advanceWithinBlock(this, block);
assert found;

This would guarantee that disi.doc is always a doc ID of the block when intoBitsetWithinBlock is called, which could help simplify some implementations of intoBitsetWithinBlock? (I'm looking at DENSE in particular)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the simplification!

* false return, fp of disi#slice is at blockEnd and disi#index is correct but other status vars
* are undefined. Caller should decode the header of next block by {@link #readBlockHeader()}.
*/
abstract boolean intoBitsetWithinBlock(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: for consistency with how it's called on DocIdSetIterator.

Suggested change
abstract boolean intoBitsetWithinBlock(
abstract boolean intoBitSetWithinBlock(

Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some minor comments, other than that it looks good to me!

boolean intoBitSetWithinBlock(IndexedDISI disi, int upTo, FixedBitSet bitSet, int offset)
throws IOException {
if (disi.doc >= upTo) {
advanceWithinBlock(disi, disi.doc);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need to call advanceWithinBlock? If disi.doc is already >= upTo, there shouldn't be anything to do?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was useful before we advancing disi.block, but not now. I like the simplification more :)

int sourceTo = Math.min(upTo - disi.block, BLOCK_SIZE);
int numWords = FixedBitSet.bits2words(sourceTo) - sourceStartWord;

if (disi.bitSet == null || disi.bitSet.getBits().length < numWords) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we always create a bit set of size BLOCK_SIZE when it's null for simplicity? This is what users will get in the worst-case scenario anyway, and it's not crazy (65k bits = 8kB)?

}
assertEquals(DocIdSetIterator.NO_MORE_DOCS, actual.nextDoc());
set1.clear();
set2.clear();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also check that nextDoc() returns the expected value after intoBitSet returns? To make sure that all required state is set correctly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This helps me find a bug: upTo could be a number in last block when RANGE do intoBitSetWithinBlock, i'm missing a doc > upTo check there.

@gf2121 gf2121 requested a review from jpountz April 24, 2025 11:28

int sourceFrom = disi.doc & 0xFFFF;
int sourceTo = Math.min(upTo - disi.block, BLOCK_SIZE);
int destFrom = disi.block - offset + sourceFrom;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the same as disi.doc - offset?

int destFrom = disi.block - offset + sourceFrom;

long fp = disi.slice.getFilePointer();
disi.slice.seek(fp - Long.BYTES);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe leave a small comment explaining why we're seeking backwards, I believe that it's to include the word that contains the current doc, right?

@gf2121 gf2121 merged commit e2e44bc into apache:main Apr 24, 2025
7 checks passed
matriv added a commit to crate/crate that referenced this pull request Oct 21, 2025
Apply changes made between 10.2.2 and 10.3.1 versions:
```
$ g log --oneline releases/lucene/10.2.2..releases/lucene/10.3.1 lucene/core/src/java/org/apache/lucene/codecs/lucene90/{Lucene90DocValuesProducer.java,Lucene90DocValuesConsumer.java,Lucene90DocValuesFormat.java}
d176a42b659 Implement IndexedDISI#docIDRunEnd (#14753)
223d7a41e3c [10.x] Change uses of withReadAdvice to use hints instead (#14629) (#14510)
bb3167e57c6 Impl intoBitset for IndexedDISI and Docvalues (#14529)
```

Relevant PRs:
apache/lucene#14753
apache/lucene#14510
apache/lucene#14529

Follows: #18545
matriv added a commit to crate/crate that referenced this pull request Oct 21, 2025
Apply changes made between 10.2.2 and 10.3.1 versions:
```
$ g log --oneline releases/lucene/10.2.2..releases/lucene/10.3.1 lucene/core/src/java/org/apache/lucene/codecs/lucene90/{Lucene90DocValuesProducer.java,Lucene90DocValuesConsumer.java,Lucene90DocValuesFormat.java}
d176a42b659 Implement IndexedDISI#docIDRunEnd (#14753)
223d7a41e3c [10.x] Change uses of withReadAdvice to use hints instead (#14629) (#14510)
bb3167e57c6 Impl intoBitset for IndexedDISI and Docvalues (#14529)
```

Relevant PRs:
apache/lucene#14753
apache/lucene#14510
apache/lucene#14529

Follows: #18545
matriv added a commit to crate/crate that referenced this pull request Oct 21, 2025
Apply changes made between 10.2.2 and 10.3.1 versions:
```
$ g log --oneline releases/lucene/10.2.2..releases/lucene/10.3.1 lucene/core/src/java/org/apache/lucene/codecs/lucene90/{Lucene90DocValuesProducer.java,Lucene90DocValuesConsumer.java,Lucene90DocValuesFormat.java}
d176a42b659 Implement IndexedDISI#docIDRunEnd (#14753)
223d7a41e3c [10.x] Change uses of withReadAdvice to use hints instead (#14629) (#14510)
bb3167e57c6 Impl intoBitset for IndexedDISI and Docvalues (#14529)
```

Relevant PRs:
apache/lucene#14753
apache/lucene#14510
apache/lucene#14529

Follows: #18545
matriv added a commit to crate/crate that referenced this pull request Oct 21, 2025
Apply changes made between 10.2.2 and 10.3.1 versions:
```
$ g log --oneline releases/lucene/10.2.2..releases/lucene/10.3.1 lucene/core/src/java/org/apache/lucene/codecs/lucene90/{Lucene90DocValuesProducer.java,Lucene90DocValuesConsumer.java,Lucene90DocValuesFormat.java}
d176a42b659 Implement IndexedDISI#docIDRunEnd (#14753)
223d7a41e3c [10.x] Change uses of withReadAdvice to use hints instead (#14629) (#14510)
bb3167e57c6 Impl intoBitset for IndexedDISI and Docvalues (#14529)
```

Relevant PRs:
apache/lucene#14753
apache/lucene#14510
apache/lucene#14529

Follows: #18545
matriv added a commit to crate/crate that referenced this pull request Oct 21, 2025
Apply changes made between 10.2.2 and 10.3.1 versions:
```
$ g log --oneline releases/lucene/10.2.2..releases/lucene/10.3.1 lucene/core/src/java/org/apache/lucene/codecs/lucene90/{Lucene90DocValuesProducer.java,Lucene90DocValuesConsumer.java,Lucene90DocValuesFormat.java}
d176a42b659 Implement IndexedDISI#docIDRunEnd (#14753)
223d7a41e3c [10.x] Change uses of withReadAdvice to use hints instead (#14629) (#14510)
bb3167e57c6 Impl intoBitset for IndexedDISI and Docvalues (#14529)
```

Relevant PRs:
apache/lucene#14753
apache/lucene#14510
apache/lucene#14529

Follows: #18545
mergify bot pushed a commit to crate/crate that referenced this pull request Oct 21, 2025
Apply changes made between 10.2.2 and 10.3.1 versions:
```
$ g log --oneline releases/lucene/10.2.2..releases/lucene/10.3.1 lucene/core/src/java/org/apache/lucene/codecs/lucene90/{Lucene90DocValuesProducer.java,Lucene90DocValuesConsumer.java,Lucene90DocValuesFormat.java}
d176a42b659 Implement IndexedDISI#docIDRunEnd (#14753)
223d7a41e3c [10.x] Change uses of withReadAdvice to use hints instead (#14629) (#14510)
bb3167e57c6 Impl intoBitset for IndexedDISI and Docvalues (#14529)
```

Relevant PRs:
apache/lucene#14753
apache/lucene#14510
apache/lucene#14529

Follows: #18545
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants