-
Notifications
You must be signed in to change notification settings - Fork 584
RoaringBitmap to BitSet/long[]/byte[] #713
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RoaringBitmap to BitSet/long[]/byte[] #713
Conversation
b532659 to
d49e7e2
Compare
d49e7e2 to
fbce977
Compare
richardstartin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks very reasonable and useful, and will work without issue for most practical uses of the data structure (which will limit themselves to positive int values), but some care needs to be taken because a RoaringBitmap can contain values up to 2^32. In regard to this, I expressed some concerns about the potential to reduce allocation pressure, but the possibility of silently producing BitSets which behave strangely is my major concern.
| long[] words = toLongArray(bitmap); | ||
| ByteBuffer buffer = ByteBuffer.allocate(words.length * Long.SIZE) | ||
| .order(ByteOrder.LITTLE_ENDIAN); | ||
| buffer.asLongBuffer().put(words); | ||
| return buffer.array(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The target representation is very memory intensive anyway, but I feel like it would be better to have a specialised implementation to halve the allocation pressure here, since the actual implementation isn't too complicated.
Note that because a concern was mentioned in relation to BitSet, there are no correctness concerns about storing large values because a byte[] can store 8 * ((2^31 - 1) - 2) bits ~ 2^34 > 2^32 - 1 (largest value a RoaringBitmap can store).
| * Equivalent to calling {@code BitSet.valueOf(BitSetUtil.toLongArray(bitmap))}. | ||
| */ | ||
| public static BitSet bitsetOf(RoaringBitmap bitmap) { | ||
| return BitSet.valueOf(toLongArray(bitmap)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BitSet.valueOf copies the array, but the BitSet can be told the max capacity up front, so it may be worth considering a specialised implementation here. It depends which is worse: updating a presized BitSet one bit at a time, or the allocation and the copy. I don't know which is better or worse.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I had written up a long description of a thorny issue here which is the most important part of the review, which somehow got lost when "commenting" the review, I'm not sure I can write it out in full again, but tl;dr BitSet does weird things when constructed from long[] containing values >= Integer.MAX_VALUE, which toLongArray() can produce. The impact includes BitSet#cardinality() being correct, but inconsistent with iteration, idiomatic iteration loops throwing exceptions and so on. This can be demonstrated with this code:
long[] longArray = new long[(Integer.MAX_VALUE >>> 6) + 2];
longArray[longArray.length - 1] = -1;
longArray[0] = -1;
BitSet bitSet = BitSet.valueOf(longArray);
assertEquals(128, bitSet.cardinality());
int computedCardinality = 0;
int next = 0;
while (true) {
// will throw at the next iteration after producing -2147483648
next = bitSet.nextSetBit(next + 1);
if (next == -1) {
break;
}
computedCardinality++;
}
// will not get here, but this would fail no matter how the loop works around what's above
assertEquals(bitSet.cardinality(), computedCardinality);This is apparently already known about: JDK-8311905 but it doesn't appear to be easy to fix, so we need to avoid creating BitSets which behave strangely by checking the last value in the bitmap is < Integer.MAX_VALUE. This can be done by checking the 16th bit of the last key in the RoaringArray is absent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's context I wasn't aware of, super useful, thanks. I'll add a check for that to make sure we don't get into trouble. I thought about the signed values, but figured that if you're using BitSet it's for interoperability/speed and you're already under Integer.MAX_VALUE.
I went with the allocation/copy because I wanted valueOf to be fast as possible. I'm using this code already with MethodHandles to get at the bitmap field in my own code. I can add a variant that uses forEach(IntConsumer), it looks like that should be the next fastest approach available.
We're using Daniel's approach for iterating over BitSet because it's about three times faster than the example for nextSetBit even with the array copy (for our use case, our largest bitmap is 7,439 words long):
public final void bitIndexes(IntConsumer consumer) {
long[] words = toLongArray();
for (int k = 0; k < words.length; ++k) {
long word = words[k];
while (word != 0) {
long t = word & Long.lowestOneBit(word);
int r = Long.numberOfTrailingZeros(t);
consumer.accept(k * Long.SIZE + r);
word ^= t;
}
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For reference, you could speed that up quite a lot by replacing the call to lowestOneBit (which isn’t an intrinsic and used quite a few instructions to isolate the bit last time I checked) with just numberOfTrailingZeros, as they are equivalent, then replacing the xor with word &= (word - 1) to unset the lowest bit, which is compiled to blsr on x86.
on the review, yes. I think adding the check and throwing when there is a value too large would suffice.
| return new long[0]; | ||
| } | ||
|
|
||
| int lastBit = Math.max(bitmap.last(), Long.SIZE); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bitmap.last() can produce a negative int for bitmaps with values >= 2^31, so it should be masked with 0xFFFF_FFFFL and the result should be a long, with minor consequences for the arithmetic below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sound like I can address both comments by just bailing if bitmap.last() < 0?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It depends. A long[] has enough bits to store any RoaringBitmap, so it could produce a result here, in which case your arithmetic needs to handle values in the [2^31, 2^32) range. Or you could just check and bail and someone else can add support for larger bitmaps if they really want it down the line.
| assert pointer.getContainer() == null; | ||
| assert position == wordsInUse; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😍
|
@DanielThomas @richardstartin How does this PR differs from another outstanding PR... #661 |
…that avoids an intermediate array
The |
|
Let us run tests. Once they pass, we can merge and release. I have closed the other issue. This looks like a fantastic contribution. |
|
Merging. This will be part of the next release. |
SUMMARY
Add methods to allow direct conversion of a RoaringBitmap to a BitSet. This turned out to not be as simple as it first appeared, and the existing tests were a godsend.
This performs quite well too, I see over 5GB/s on my M1 MacBook Pro for 86,884 ops/s.
Automated Checks
./gradlew testand made sure that my PR does not break any unit test../gradlew checkstyleMainor the equivalent and corrected the formatting warnings reported.