-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-15239: [C++][Compute] Adding Bloom filter implementation #12067
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
|
628e924 to
d06e693
Compare
westonpace
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm just getting started looking through this, mostly was just learning how it all worked. I have a few questions for my own learning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you link to a paper or description describing the filter you based your implementation on and then briefly describe what's variant about this one (or at least what the goals were)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@michalursa I'm learning your code about bloom filter, but didn't understand very well. Could you provide me some papers or docs about its principles? Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe this one https://www.vldb.org/pvldb/vol12/p502-lang.pdf ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @mdianjun, I wrote a blog post about this bloom filter here: https://save-buffer.github.io/bloom_filter.html
A lot of the ideas for this Bloom filter come from the paper Cache-, Hash- and Space-Efficient Bloom Filters (https://www.cs.amherst.edu/~ccmcgeoch/cs34/papers/cacheefficientbloomfilters-jea.pdf), but I believe Michal invented the patterned approach himself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think @save-buffer is right. You can try to search Split Block Bloom Filter, and Parquet, Impala all uses similiar implementation
westonpace
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for creating this. I think it's great to have.
The test will need to be cleaned up and probably split into tests, focused on assertions, and a benchmark, gathering performance. It is sort of doing both at the moment.
I also think we have two vendored versions of xxhash at the moment so it would probably be good to get that resolved (although it appears that was the case before this PR so it doesn't have to be tied up with this PR per se and we can do it in a follow-up).
cd09176 to
d5e66d4
Compare
westonpace
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still working on this :) I reviewed the hashing tests. I think there is a bit of missing coverage and the test could be simplified. If you want to tackle it then I've added some comments. I also prototyped a solution using the builders to make sure it was possible: https://github.com/michalursa/arrow/compare/ARROW-15239-bloom-filter...westonpace:experiment/simplify-vector-hash-test?expand=1
I can also add the extra tests for the missing coverage next week if you'd prefer.
westonpace
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went through the bloom filter test today and I think, at this point, I've reviewed pretty much all of this PR. Both tests still need a bit of work but both the hashing and the bloom filter are great additions will be very useful, thank you for the contribution. Let me know whatever I can do to help get this merged in.
d5e66d4 to
8ad182a
Compare
|
I merged https://github.com/michalursa/arrow/compare/ARROW-15239-bloom-filter...westonpace:experiment/simplify-vector-hash-test?expand=1 into the hash test code (key_hash_test.cc). |
8ad182a to
5be6f56
Compare
|
Ah, my suggestions have upset the Windows build, here is one more commit to make Windows happy: westonpace@f654725 |
f6c2fd1 to
0996885
Compare
0996885 to
12f9e88
Compare
westonpace
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This latest force-push added some cleanup of the bloom_filter_test. At this point I think my concerns have been addressed and we can move forwards.
I think it would be good to have some benchmarks and info around the utilities in key_hash.h/cc but that can be done later. I've created ARROW-16017 for that follow-up work.
|
Benchmark runs are scheduled for baseline = 18e7cbf and contender = 065236f. 065236f is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
Implementing register blocked Bloom filter to be used in hash join exec node.