-
Notifications
You must be signed in to change notification settings - Fork 584
Roaring bitmap to JDK bit set conversion #661
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| * @param bitmap original bitmap | ||
| * @return bit set equivalent to roaring bitmap | ||
| */ | ||
| public static BitSet bitSetOf(ImmutableRoaringBitmap bitmap) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we implement this for ImmutableRoaringBitmap, we should implement it for RoaringBitmap.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about to use their common ancestor ImmutableBitmapDataProvider?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ImmutableBitmapDataProvider is an interface.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean use one of OOP principle "code against interface, not implementation", but I you suggest, I will duplicate the code for RoaringBitmap.
RoaringBitmap/src/test/java/org/roaringbitmap/buffer/TestBitSetUtil.java
Outdated
Show resolved
Hide resolved
|
|
||
| /** | ||
| * Converts an immutable roaring bitmap to JDK bit set. | ||
| * |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be documented as being an inefficient implementation, until we optimize it. I am fine with a 'convenient' but slow routine, but we need to tell the users that it is slow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have created some benchmark to optimize conversion. There is trade off between too much code and performance. What variant do you suggest to choose? Should I create separate PR for this or it can be included here?
| Benchmark | inputData | Mode | Cnt | Score | Error | Units |
|---|---|---|---|---|---|---|
| iterateThroughContainers | CONSECUTIVE | avgt | 5 | 52.782 | ±3.963 | ms/op |
| iterateThroughContainers | RANDOM | avgt | 5 | 66.945 | ±8.262 | ms/op |
| negativeValuesSeparately | CONSECUTIVE | avgt | 5 | 62.152 | ±1.104 | ms/op |
| negativeValuesSeparately | RANDOM | avgt | 5 | 85.505 | ±12.954 | ms/op |
| negativeValuesSeparatelyRegisterPreviousValue | CONSECUTIVE | avgt | 5 | 49.961 | ±8.429 | ms/op |
| negativeValuesSeparatelyRegisterPreviousValue | RANDOM | avgt | 5 | 185.567 | ±5.139 | ms/op |
| preallocated | CONSECUTIVE | avgt | 5 | 67.221 | ±1.658 | ms/op |
| preallocated | RANDOM | avgt | 5 | 78.673 | ±5.566 | ms/op |
| original | CONSECUTIVE | avgt | 5 | 75.124 | ±27.983 | ms/op |
| original | RANDOM | avgt | 5 | 84.717 | ±3.251 | ms/op |
| streamingApi | CONSECUTIVE | avgt | 5 | 38 | ± | ms/op |
| streamingApi | RANDOM | avgt | 5 | 85 | ± | ms/op |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The variant using streaming API inspired by #556 was added afterwards and it is magically the fastest for consecutive values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is trade off between too much code and performance. What variant do you suggest to choose?
My concern here was that the library user should know what to expect. If we provide a convenience function that is possibly slow, we should warn them... otherwise it can be considered a performance bug.
Not everything needs to be super optimized with complex code, but the user should get some warning if the performance can be poor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And what variant you suggest we should use? As I wrote earlier, I have created benchmarks for this issue, where are already optimized implementations. Why I ask you what to choose? If you look on benchmark results, there is no solution dominating for both tested cases - for consecutive values is the best variant using streamingApi and for random values iterateThroughContainers. Your suggestions can be included also. After that I can merge chosen one into this PR and do some code clean up - at least comments as they are wrong if I remember right.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. The issue is that your implementation is fine, but you do bit-by-bit conversion.
If the RoaringBitmap contains a bitset container, for example. Then you can just use valueOf(long[] longs) to map directly, at a highspeed, the content of the Roaring container to a JDK BitSet. You are obviously aware of this because you have it in a comment.
I don't think we even need benchmarks, it is going to be much faster...
Now. It is fine not implementing this, but I would request that you then document the performance in the document of the function. (E.g., "this function may set the bits one by one and may be computationally intensive".) That's important because some users might decide to systematically convert their Roaring bitmaps to BitSet. We need to warn them about the performance implications.
The Go roaring implementation and the C Roaring implementations have equivalent functionality btw.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is the C implementation: https://github.com/RoaringBitmap/CRoaring/blob/daad140f013cbb4e8b515da264403843d8dc1755/src/roaring.c#L3306
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is the Go implementation:
https://github.com/RoaringBitmap/roaring/blob/88e9207dc61ee814891a2a7cce4aa8abf1fbac1f/roaring.go#L186
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not asking you to use these sorts of implementations. They require more testing and more effort. I am just asking you to send the users a signal regarding how it is currently implemented.
lemire
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could merge this, but I'd like you to add a comment (maybe just one sentence) in the function description, so that the users know that this do bit-by-bit construction. Further, we should warn them that the memory usage of the result could be significant.
Further, we would want to have the equivalent for ImmutableRoaringBitmap.
|
@xtonik Let me repeat my request. If I were a user of this library, I would assume that a library-provided function that converts a roaring bitmap into a BitSet would try, at least in some cases, to be highly efficient. The implementation you propose is fine, but it will have quite low efficiency in some instances. We need to warn the users so that they cannot complain later if their performance is poor. |
|
We are going to go with PR #713 as it provides better performance. Thanks for all your contributions, and keep them coming !!! |
SUMMARY
Added missing routine for conversion RoaringBitmap to BitSet from JDK. Done in very simple - iterating value by value, it is far from optimal possible solution.
Automated Checks
./gradlew testand made sure that my PR does not break any unit test../gradlew checkstyleMainor the equivalent and corrected the formatting warnings reported.