Skip to content

Conversation

@wesm
Copy link
Member

@wesm wesm commented Jun 3, 2020

The purpose of this class is to scan validity bitmaps in segments of 256 bits at a time (a "run") and return the number of true values using popcount hardware instrinsics. Processing code can then switch between nullable / non-nullable processing paths -- the non-nullable paths are often much faster as they don't have to branch or check individual bits.

In the benchmark I wrote here, this strategy starts to become faster than using BitmapReader naively with a null density somewhere between 2% and 10%. I implemented a naive "sum non-null values" algorithm using BitBlockCounter versus a similarly naive version that uses BitmapReader. The benchmark state parameter is the average number of array values for each null

---------------------------------------------------------------------------------
Benchmark                                          Time           CPU Iterations
---------------------------------------------------------------------------------
BitBlockCounterSumNotNull/8                  1566649 ns    1566653 ns        450   638.304M items/s
BitBlockCounterSumNotNull/64                  926258 ns     926257 ns        745   1079.61M items/s
BitBlockCounterSumNotNull/512                 355224 ns     355224 ns       1978   2.74915G items/s
BitBlockCounterSumNotNull/4096                 77070 ns      77070 ns       8597   12.6711G items/s
BitBlockCounterSumNotNull/32768                20838 ns      20838 ns      32902   46.8638G items/s
BitBlockCounterSumNotNull/65536                18538 ns      18538 ns      37300   52.6791G items/s
BitBlockCounterSumNotNullWithOffset/8        1563335 ns    1563346 ns        426   639.654M items/s
BitBlockCounterSumNotNullWithOffset/64        831714 ns     831720 ns        825   1.17415G items/s
BitBlockCounterSumNotNullWithOffset/512       339929 ns     339931 ns       2035   2.87283G items/s
BitBlockCounterSumNotNullWithOffset/4096       75150 ns      75148 ns       9349   12.9952G items/s
BitBlockCounterSumNotNullWithOffset/32768      28726 ns      28727 ns      24736   33.9948G items/s
BitBlockCounterSumNotNullWithOffset/65536      26921 ns      26921 ns      26026   36.2753G items/s
BitmapReaderSumNotNull/8                     1897087 ns    1897098 ns        368   527.121M items/s
BitmapReaderSumNotNull/64                    1050133 ns    1050134 ns        669    952.26M items/s
BitmapReaderSumNotNull/512                    960722 ns     960728 ns        744   1040.88M items/s
BitmapReaderSumNotNull/4096                   949578 ns     949584 ns        727   1053.09M items/s
BitmapReaderSumNotNull/32768                  946948 ns     946955 ns        722   1056.02M items/s
BitmapReaderSumNotNull/65536                  960649 ns     960637 ns        739   1040.98M items/s
BitmapReaderSumNotNullWithOffset/8           1972476 ns    1972457 ns        350   506.982M items/s
BitmapReaderSumNotNullWithOffset/64          1131682 ns    1131691 ns        636   883.634M items/s
BitmapReaderSumNotNullWithOffset/512          991733 ns     991736 ns        729   1008.33M items/s
BitmapReaderSumNotNullWithOffset/4096         982855 ns     982862 ns        719   1017.44M items/s
BitmapReaderSumNotNullWithOffset/32768        983555 ns     983556 ns        691   1016.72M items/s
BitmapReaderSumNotNullWithOffset/65536       1005651 ns    1005658 ns        701   994.374M items/s

So we can see that the performance is around the same when 1 in 8 values is null, but when 1 out of 512 is null, the block-counter version is 3x faster. And the performance goes up from there, up to 50x faster on data that has nulls but not very many. In my experience, data with < 1% nulls is extremely common, much more so than data with 5% or more nulls. This is obviously a tradeoff but IMHO one worth making.

As a bonus, BitBlockCounter doesn't inline any code.

The implementation of this can probably be improved and the benchmark as well so I welcome your collective help with this.

@wesm wesm requested review from fsaintjacques and pitrou June 3, 2020 23:43
@wesm
Copy link
Member Author

wesm commented Jun 3, 2020

@cyb70289 since you've been doing bitmap stuff lately I would welcome your code review of this

@github-actions
Copy link

github-actions bot commented Jun 3, 2020

Copy link
Contributor

@emkornfield emkornfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly changes around naming which I think will make the semantics of the class clearer and reading of its use easier.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is likely to be a common pattern you might consider exposing it via either callbacks on the class or with an adapter.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think this can be refined as it relates to kernel execution

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will this have different semantics on big endian?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shift_word helper is used in other functions -- the unit tests pass on big endian it seems

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added some more tests with random data

@wesm wesm changed the title ARROW-9029: [C++] Implement BitmapScanner for much faster processing of data without many nulls ARROW-9029: [C++] Implement BitBlockReader for much fast block popcounts of bitmaps Jun 4, 2020
@wesm wesm changed the title ARROW-9029: [C++] Implement BitBlockReader for much fast block popcounts of bitmaps ARROW-9029: [C++] Implement BitBlockCounter for much fast block popcounts of bitmaps Jun 4, 2020
@wesm
Copy link
Member Author

wesm commented Jun 4, 2020

Thanks for the comments, I think this is much better now

class ARROW_EXPORT BitBlockCounter {
public:
struct Block {
int16_t length;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did you check if this affects benchmarks, I think popcount returns ints so there might be some miniscule advantage to keep it at a larger size.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It didn't seem to affect them

Copy link
Contributor

@emkornfield emkornfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, looks reasonable to me.

@wesm wesm changed the title ARROW-9029: [C++] Implement BitBlockCounter for much fast block popcounts of bitmaps ARROW-9029: [C++] Implement BitBlockCounter for much faster block popcounts of bitmaps Jun 4, 2020
total_popcount += __builtin_popcountll(shift_word(current, next, offset_));
current = next;
next = load_word(bitmap_ + 32);
total_popcount += __builtin_popcountll(shift_word(current, next, offset_));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm okay with the implementation.

Just thinking if it would be faster to drop "shift_word" with below steps:

  1. simply accumulate total pops of 4 continuous uint64
  2. mask first byte with offset bitmask, count pops, minus from total pops
  3. mask next byte after last uint64 with offset bitmask, count pops, add to total pops

I think we can leave deeper optimization as follow up tasks. And I guess we can handle larger blocks with SIMD more efficiently.
Breaking large blocks to smaller trunks to leverage non-null processing is a great idea.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This raised an interesting question of should this be tied in with existing popcount methods and the improvements applied in both places?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to tackle this as follow up, but this is a good starting point in any case.

if (block.length == 0) {
break;
}
if (block.length == block.popcount) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So basically you don't need a popcount at all for this optimization. You just need to check whether all bits in the block are 1, or whether all bits in the block are 0. This should be trivial (and could perhaps be folded in VisitBits and/or VisitArrayDataInline).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need the popcount in order to calculate the number of "on" elements for filters, but yes for this particular optimization between "all-on" and "not-all-on" you don't need the popcount.

@wesm wesm closed this in c7baa7c Jun 4, 2020
@wesm wesm deleted the ARROW-9029 branch June 4, 2020 13:31
@wesm
Copy link
Member Author

wesm commented Jun 4, 2020

Note, I would like to have a binary version of this also. I just opened https://issues.apache.org/jira/browse/ARROW-9034

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants