ARROW-9029: [C++] Implement BitBlockCounter for much faster block popcounts of bitmaps #7346
Conversation
@cyb70289 since you've been doing bitmap stuff lately I would welcome your code review of this
emkornfield
left a comment
Mostly changes around naming, which I think will make the semantics of the class clearer and its usage easier to read.
This is likely to be a common pattern; you might consider exposing it via either callbacks on the class or an adapter.
Yes, I think this can be refined as it relates to kernel execution
cpp/src/arrow/util/bit_util.cc
will this have different semantics on big endian?
This shift_word helper is used in other functions -- the unit tests pass on big endian, it seems
I added some more tests with random data
Thanks for the comments, I think this is much better now
class ARROW_EXPORT BitBlockCounter {
 public:
  struct Block {
    int16_t length;
Did you check if this affects benchmarks? I think popcount returns ints, so there might be some minuscule advantage to keeping it at a larger size.
It didn't seem to affect them
emkornfield
left a comment
Thanks, looks reasonable to me.
total_popcount += __builtin_popcountll(shift_word(current, next, offset_));
current = next;
next = load_word(bitmap_ + 32);
total_popcount += __builtin_popcountll(shift_word(current, next, offset_));
I'm okay with the implementation.
Just wondering if it would be faster to drop "shift_word" with the steps below:
- simply accumulate the total popcount of 4 contiguous uint64s
- mask the first byte with an offset bitmask, count its pops, and subtract them from the total
- mask the next byte after the last uint64 with the offset bitmask, count its pops, and add them to the total
I think we can leave deeper optimization as follow-up tasks. And I guess we can handle larger blocks with SIMD more efficiently.
Breaking large blocks into smaller chunks to leverage non-null processing is a great idea.
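For illustration, the masking approach suggested above might look roughly like the sketch below: popcount four aligned 64-bit words, then correct only the two boundary words with an offset mask instead of shifting every word. The helper names (LoadWord, CountBlockBits) and the little-endian bit numbering are assumptions for this sketch, not the PR's actual code.

```cpp
#include <cstdint>
#include <cstring>

// Load a 64-bit word without unaligned-access UB.
static inline uint64_t LoadWord(const uint8_t* bytes) {
  uint64_t word;
  std::memcpy(&word, bytes, sizeof(word));
  return word;
}

// Count the 256 bits starting `offset` bits (0..63) into `bitmap`.
int CountBlockBits(const uint8_t* bitmap, int offset) {
  int total = 0;
  // Step 1: accumulate the popcounts of 4 contiguous uint64s.
  for (int i = 0; i < 4; ++i) {
    total += __builtin_popcountll(LoadWord(bitmap + 8 * i));
  }
  if (offset > 0) {
    const uint64_t mask = (uint64_t(1) << offset) - 1;  // low `offset` bits
    // Step 2: subtract the bits before the offset in the first word.
    total -= __builtin_popcountll(LoadWord(bitmap) & mask);
    // Step 3: add the same number of bits from the following word.
    total += __builtin_popcountll(LoadWord(bitmap + 32) & mask);
  }
  return total;
}
```

With this shape, only two masked popcounts are needed per 256-bit block rather than a shift per word, which is presumably where the speedup would come from.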
This raises an interesting question: should this be tied in with the existing popcount methods, with the improvements applied in both places?
Sounds good to tackle this as follow up, but this is a good starting point in any case.
if (block.length == 0) {
  break;
}
if (block.length == block.popcount) {
So basically you don't need a popcount at all for this optimization. You just need to check whether all bits in the block are 1, or whether all bits in the block are 0. This should be trivial (and could perhaps be folded in VisitBits and/or VisitArrayDataInline).
I need the popcount in order to calculate the number of "on" elements for filters, but yes for this particular optimization between "all-on" and "not-all-on" you don't need the popcount.
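To make the distinction concrete, here is a minimal sketch of the point above: for the all-valid / all-null fast-path decision alone, comparing each 64-bit word against 0 and ~0 suffices, and popcount is only needed when the caller also wants the number of set bits (e.g. to size a filter's output). Word-aligned input is assumed here for brevity; the real bitmap may carry a bit offset.

```cpp
#include <cstdint>

// True iff every bit in the given words is 1 (an all-valid run).
bool AllBitsSet(const uint64_t* words, int num_words) {
  for (int i = 0; i < num_words; ++i) {
    if (words[i] != ~uint64_t(0)) return false;
  }
  return true;
}

// True iff every bit in the given words is 0 (an all-null run).
bool AllBitsClear(const uint64_t* words, int num_words) {
  for (int i = 0; i < num_words; ++i) {
    if (words[i] != 0) return false;
  }
  return true;
}
```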
Note, I would like to have a binary version of this also. I just opened https://issues.apache.org/jira/browse/ARROW-9034
The purpose of this class is to scan validity bitmaps in segments of 256 bits at a time (a "run") and return the number of true values using popcount hardware intrinsics. Processing code can then switch between nullable / non-nullable processing paths -- the non-nullable paths are often much faster as they don't have to branch or check individual bits.
In the benchmark I wrote here, this strategy starts to become faster than using BitmapReader naively at a null density somewhere between 2% and 10%. I implemented a naive "sum non-null values" algorithm using BitBlockCounter versus a similarly naive version that uses BitmapReader. The benchmark state parameter is the average number of array values for each null.
So we can see that the performance is around the same when 1 in 8 values is null, but when 1 out of 512 is null, the block-counter version is 3x faster. And the performance goes up from there, up to 50x faster on data that has nulls but not very many. In my experience, data with < 1% nulls is extremely common, much more so than data with 5% or more nulls. This is obviously a tradeoff but IMHO one worth making.
As a bonus, BitBlockCounter doesn't inline any code. The implementation can probably be improved, and the benchmark as well, so I welcome your collective help with this.