Skip to content

Conversation

@xtonik
Copy link
Contributor

@xtonik xtonik commented Oct 29, 2023

SUMMARY

Added missing routine for conversion RoaringBitmap to BitSet from JDK. Done in very simple - iterating value by value, it is far from optimal possible solution.

Automated Checks

  • I have run ./gradlew test and made sure that my PR does not break any unit test.
  • I have run ./gradlew checkstyleMain or the equivalent and corrected the formatting warnings reported.

* @param bitmap original bitmap
* @return bit set equivalent to roaring bitmap
*/
public static BitSet bitSetOf(ImmutableRoaringBitmap bitmap) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we implement this for ImmutableRoaringBitmap, we should implement it for RoaringBitmap.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about to use their common ancestor ImmutableBitmapDataProvider?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ImmutableBitmapDataProvider is an interface.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean use one of OOP principle "code against interface, not implementation", but I you suggest, I will duplicate the code for RoaringBitmap.


/**
* Converts an immutable roaring bitmap to JDK bit set.
*
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be documented as being an inefficient implementation, until we optimize it. I am fine with a 'convenient' but slow routine, but we need to tell the users that it is slow.

Copy link
Contributor Author

@xtonik xtonik Nov 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have created some benchmark to optimize conversion. There is trade off between too much code and performance. What variant do you suggest to choose? Should I create separate PR for this or it can be included here?

Benchmark inputData Mode Cnt Score Error Units
iterateThroughContainers CONSECUTIVE avgt 5 52.782 ±3.963 ms/op
iterateThroughContainers RANDOM avgt 5 66.945 ±8.262 ms/op
negativeValuesSeparately CONSECUTIVE avgt 5 62.152 ±1.104 ms/op
negativeValuesSeparately RANDOM avgt 5 85.505 ±12.954 ms/op
negativeValuesSeparatelyRegisterPreviousValue CONSECUTIVE avgt 5 49.961 ±8.429 ms/op
negativeValuesSeparatelyRegisterPreviousValue RANDOM avgt 5 185.567 ±5.139 ms/op
preallocated CONSECUTIVE avgt 5 67.221 ±1.658 ms/op
preallocated RANDOM avgt 5 78.673 ±5.566 ms/op
original CONSECUTIVE avgt 5 75.124 ±27.983 ms/op
original RANDOM avgt 5 84.717 ±3.251 ms/op
streamingApi CONSECUTIVE avgt 5 38 ± ms/op
streamingApi RANDOM avgt 5 85 ± ms/op

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The variant using streaming API inspired by #556 was added afterwards and it is magically the fastest for consecutive values.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is trade off between too much code and performance. What variant do you suggest to choose?

My concern here was that the library user should know what to expect. If we provide a convenience function that is possibly slow, we should warn them... otherwise it can be considered a performance bug.

Not everything needs to be super optimized with complex code, but the user should get some warning if the performance can be poor.

Copy link
Contributor Author

@xtonik xtonik Jan 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And what variant you suggest we should use? As I wrote earlier, I have created benchmarks for this issue, where are already optimized implementations. Why I ask you what to choose? If you look on benchmark results, there is no solution dominating for both tested cases - for consecutive values is the best variant using streamingApi and for random values iterateThroughContainers. Your suggestions can be included also. After that I can merge chosen one into this PR and do some code clean up - at least comments as they are wrong if I remember right.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. The issue is that your implementation is fine, but you do bit-by-bit conversion.
If the RoaringBitmap contains a bitset container, for example. Then you can just use valueOf(long[] longs) to map directly, at a highspeed, the content of the Roaring container to a JDK BitSet. You are obviously aware of this because you have it in a comment.

I don't think we even need benchmarks, it is going to be much faster...

Now. It is fine not implementing this, but I would request that you then document the performance in the document of the function. (E.g., "this function may set the bits one by one and may be computationally intensive".) That's important because some users might decide to systematically convert their Roaring bitmaps to BitSet. We need to warn them about the performance implications.

The Go roaring implementation and the C Roaring implementations have equivalent functionality btw.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not asking you to use these sorts of implementations. They require more testing and more effort. I am just asking you to send the users a signal regarding how it is currently implemented.

Copy link
Member

@lemire lemire left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could merge this, but I'd like you to add a comment (maybe just one sentence) in the function description, so that the users know that this do bit-by-bit construction. Further, we should warn them that the memory usage of the result could be significant.

Further, we would want to have the equivalent for ImmutableRoaringBitmap.

@lemire
Copy link
Member

lemire commented Jan 28, 2024

@xtonik Let me repeat my request. If I were a user of this library, I would assume that a library-provided function that converts a roaring bitmap into a BitSet would try, at least in some cases, to be highly efficient. The implementation you propose is fine, but it will have quite low efficiency in some instances. We need to warn the users so that they cannot complain later if their performance is poor.

@lemire
Copy link
Member

lemire commented Mar 25, 2024

We are going to go with PR #713 as it provides better performance. Thanks for all your contributions, and keep them coming !!!

@lemire lemire closed this Mar 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants