Skip to content

Conversation

@michalursa
Copy link
Contributor

Implementing register blocked Bloom filter to be used in hash join exec node.

@github-actions
Copy link

github-actions bot commented Jan 4, 2022

@github-actions
Copy link

github-actions bot commented Jan 4, 2022

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@michalursa michalursa force-pushed the ARROW-15239-bloom-filter branch 6 times, most recently from 628e924 to d06e693 Compare January 4, 2022 17:56
Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm just getting started looking through this, mostly was just learning how it all worked. I have a few questions for my own learning.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you link to a paper or description describing the filter you based your implementation on and then briefly describe what's variant about this one (or at least what the goals were)?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@michalursa I'm learning your code about bloom filter, but didn't understand very well. Could you provide me some papers or docs about its principles? Thanks.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @mdianjun, I wrote a blog post about this bloom filter here: https://save-buffer.github.io/bloom_filter.html

A lot of the ideas for this Bloom filter come from the paper Cache-, Hash- and Space-Efficient Bloom Filters (https://www.cs.amherst.edu/~ccmcgeoch/cs34/papers/cacheefficientbloomfilters-jea.pdf), but I believe Michal invented the patterned approach himself.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think @save-buffer is right. You can try to search Split Block Bloom Filter, and Parquet, Impala all uses similiar implementation

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for creating this. I think it's great to have.

The test will need to be cleaned up and probably split into tests, focused on assertions, and a benchmark, gathering performance. It is sort of doing both at the moment.

I also think we have two vendored versions of xxhash at the moment so it would probably be good to get that resolved (although it appears that was the case before this PR so it doesn't have to be tied up with this PR per se and we can do it in a follow-up).

@michalursa michalursa marked this pull request as ready for review January 25, 2022 08:02
@michalursa michalursa force-pushed the ARROW-15239-bloom-filter branch 4 times, most recently from cd09176 to d5e66d4 Compare March 8, 2022 08:58
Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still working on this :) I reviewed the hashing tests. I think there is a bit of missing coverage and the test could be simplified. If you want to tackle it then I've added some comments. I also prototyped a solution using the builders to make sure it was possible: https://github.com/michalursa/arrow/compare/ARROW-15239-bloom-filter...westonpace:experiment/simplify-vector-hash-test?expand=1

I can also add the extra tests for the missing coverage next week if you'd prefer.

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went through the bloom filter test today and I think, at this point, I've reviewed pretty much all of this PR. Both tests still need a bit of work but both the hashing and the bloom filter are great additions will be very useful, thank you for the contribution. Let me know whatever I can do to help get this merged in.

@michalursa michalursa force-pushed the ARROW-15239-bloom-filter branch from d5e66d4 to 8ad182a Compare March 15, 2022 01:44
@michalursa
Copy link
Contributor Author

@michalursa michalursa force-pushed the ARROW-15239-bloom-filter branch from 8ad182a to 5be6f56 Compare March 15, 2022 05:01
@westonpace
Copy link
Member

Ah, my suggestions have upset the Windows build, here is one more commit to make Windows happy: westonpace@f654725

@michalursa michalursa force-pushed the ARROW-15239-bloom-filter branch 2 times, most recently from f6c2fd1 to 0996885 Compare March 23, 2022 06:35
@michalursa michalursa force-pushed the ARROW-15239-bloom-filter branch from 0996885 to 12f9e88 Compare March 23, 2022 06:56
Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This latest force-push added some cleanup of the bloom_filter_test. At this point I think my concerns have been addressed and we can move forwards.

I think it would be good to have some benchmarks and info around the utilities in key_hash.h/cc but that can be done later. I've created ARROW-16017 for that follow-up work.

@ursabot
Copy link

ursabot commented Mar 24, 2022

Benchmark runs are scheduled for baseline = 18e7cbf and contender = 065236f. 065236f is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Failed ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.54% ⬆️1.55%] test-mac-arm
[Finished ⬇️0.36% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.26% ⬆️1.53%] ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants