ARROW-15498: [C++][Compute] Implement Bloom filter pushdown between hash joins #12289
Conversation
Thanks for opening a pull request! If this is not a minor PR, could you open an issue for it on JIRA? https://issues.apache.org/jira/browse/ARROW Opening JIRAs ahead of time contributes to the openness of the Apache Arrow project. Could you then also rename the pull request title to follow the project's standard format?
Force-pushed 800a969 to ed4d9dc
Force-pushed ed4d9dc to 1ab0bab
Force-pushed c9b57e0 to 02a003a
westonpace left a comment:
Starting to review this. Still need to go through hash_join.cc but I have some initial comments from the periphery.
cpp/src/arrow/compute/exec/util.h (outdated)
Why are we padding here?
It's weird if you Init the TempVectorStack with one size and then it segfaults when you try to alloc that much memory. That's because alloc bumps the stack by PaddedAllocationSize(size) + 2 * sizeof(uint64_t).
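The accounting described above can be sketched as follows. This is a hedged, self-contained illustration of why the usable capacity is smaller than the size passed to Init(): the constants and the names PaddedAllocationSize/StackBytesConsumed/kGuardBytes are assumptions for the sketch, not Arrow's actual implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed padding unit for the sketch (e.g. cache-line alignment).
constexpr std::size_t kPadding = 64;
// Two guard words per allocation, matching the comment above.
constexpr std::size_t kGuardBytes = 2 * sizeof(std::uint64_t);

// Round a requested size up to a multiple of the padding unit.
std::size_t PaddedAllocationSize(std::size_t size) {
  return (size + kPadding - 1) / kPadding * kPadding;
}

// Bytes actually consumed on the stack for a request of `size` bytes.
// Because this exceeds `size`, an Init(n) stack cannot serve an alloc(n).
std::size_t StackBytesConsumed(std::size_t size) {
  return PaddedAllocationSize(size) + kGuardBytes;
}
```

Under these assumed constants, a 100-byte request consumes 144 bytes of stack, which is the mismatch the comment is pointing at.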
Is this still needed after we fixed the other TSAN-related issues?
Yes. I had to make blocks atomic to make the TSAN warnings go away, which we don't want to do.
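The trade-off mentioned above can be sketched as follows: turning a hot member into std::atomic silences TSAN but makes every access a synchronized operation. The struct and field names here are hypothetical stand-ins for the code under review, not Arrow's types.

```cpp
#include <atomic>
#include <cstdint>

// Plain member: cheap to update, but TSAN will flag concurrent access
// as a data race even if the surrounding code makes it benign.
struct BlockCounterPlain {
  std::uint64_t num_blocks_ = 0;
};

// Atomic member: TSAN-clean, but every increment now carries the cost
// of an atomic read-modify-write, which the comment says we want to avoid.
struct BlockCounterAtomic {
  std::atomic<std::uint64_t> num_blocks_{0};

  void Add(std::uint64_t n) {
    num_blocks_.fetch_add(n, std::memory_order_relaxed);
  }
};
```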
Why ignore it? Return it if it isn't ok.
The lambda expression needs to return void here. I can maybe DCHECK_OK it.
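The constraint described above can be illustrated like this: when a callback's signature requires void, a Status cannot be propagated upward, so it is checked in place instead. Status and DCHECK_OK below are simplified stand-ins sketched for the example, not Arrow's real definitions.

```cpp
#include <cassert>

// Minimal stand-in for a Status-like type.
struct Status {
  bool ok_ = true;
  bool ok() const { return ok_; }
};

// Stand-in for a debug-build check: abort instead of silently ignoring.
#define DCHECK_OK(expr) assert((expr).ok())

Status DoWork() { return Status{}; }

void RunCallback() {
  // The lambda must return void, so the Status cannot be returned;
  // DCHECK_OK-ing it is the fallback the comment suggests.
  auto callback = []() -> void { DCHECK_OK(DoWork()); };
  callback();
}
```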
So is the idea here to measure the overhead cost of building the bloom filter?
Well, we benchmark to see what kind of performance impact the Bloom filter has.
But since we currently only do early elimination of rows, and only build on the build side, some types of joins are disqualified, so we check for that here.
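The kind of check being described can be sketched as below: because early elimination drops probe-side rows outright, any join that must still emit unmatched probe-side rows cannot use the filter. The enum, the function, and the exact set of disqualified join types are illustrative assumptions, not Arrow's actual definitions.

```cpp
#include <cassert>

// Hypothetical subset of join types for the sketch.
enum class JoinType { INNER, LEFT_SEMI, LEFT_OUTER, FULL_OUTER };

// Early elimination silently drops probe-side rows that fail the Bloom
// filter, so joins that must output unmatched probe-side rows (e.g. full
// outer) are disqualified, matching the comments later in this thread
// that the Bloom filter is disabled for full outer joins.
bool BloomFilterApplicable(JoinType type) {
  return type == JoinType::INNER || type == JoinType::LEFT_SEMI;
}
```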
Since we operate on these pairs everywhere, why not create:

    struct BloomFilterTarget {
      HashJoinImpl* join_impl;
      std::vector<int> column_map;
    };

It also takes a bit of reading to figure out what the purpose of column_map is, so this could be a place to briefly describe that.
I only use the pair in one spot as far as I can tell. I just use it so that whoever calls GetPushdownTarget can use std::tie. I did add a big comment, though.
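The call-site pattern being defended can be sketched like this: returning a pair lets the caller unpack both values in one statement with std::tie. The types and GetPushdownTarget itself are hypothetical stand-ins for the code under review.

```cpp
#include <cassert>
#include <tuple>
#include <utility>
#include <vector>

struct HashJoinImpl { int id = 0; };

// Hypothetical version of the function discussed above: returns the join
// to push the Bloom filter to, plus a mapping from our key columns to
// that join's key columns.
std::pair<HashJoinImpl*, std::vector<int>> GetPushdownTarget(HashJoinImpl* self) {
  return {self, {0, 1}};
}

// Demonstrates the std::tie unpacking at the call site.
bool DemoTieUnpack() {
  HashJoinImpl join;
  HashJoinImpl* target = nullptr;
  std::vector<int> column_map;
  std::tie(target, column_map) = GetPushdownTarget(&join);
  return target == &join && column_map.size() == 2;
}
```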
Force-pushed 7d4502d to 4b6ccf4
Force-pushed 139dcae to 0575ddf
Force-pushed 653f13b to 619f098
westonpace left a comment:
Some minor suggestions but overall I think this is pretty much ready to go once #13091 merges.
Force-pushed ad92449 to 15f3c37
westonpace left a comment:
Let's rebase on top of master now that the thread scheduler issue is in. Then, assuming CI passes, I think this is ready to go.
Force-pushed 4e0621a to ede7599
OK, I think this is good now. The various failures seem to be caused by other issues, the latter of which is being addressed in #13101.
Force-pushed c8be4c8 to 1a4ae69
I played around with this today and saw similar results. Some Bloom filter combinations were just very slow on my laptop. In general, though, I didn't see any true deadlock, although I can never fully rule that out.
Just to be clear, it's not the Bloom filter that's slow (the slow parts tend to be full outer joins, where the Bloom filter is disabled). It seems to be related to residual filters being slow, in particular KeyEncoder::DecodeNulls.
Yes. The slowdown happened in ProbeQueuedBatches, which would make sense (I think) if it was related to residual filters.
Benchmark runs are scheduled for baseline = 6faee47 and contender = 0742f78. 0742f78 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.

The Python and R benchmarks show a high level of regressions.
This adds Bloom filter pushdown between hash join nodes.
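The overall idea in this PR can be illustrated with a toy sketch: the build side of a hash join inserts its key hashes into a Bloom filter, which is then pushed down so that probe batches of upstream joins can drop definitely-non-matching rows before the more expensive hash-table probe. This is a from-scratch illustration under simplified assumptions (two hash probes into a flat bit vector), not Arrow's actual BlockedBloomFilter.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Toy Bloom filter: sets/tests two bit positions per hash. A real
// implementation uses blocked layouts and SIMD; this only shows the idea.
class ToyBloomFilter {
 public:
  explicit ToyBloomFilter(std::size_t num_bits) : bits_(num_bits, false) {}

  void Insert(std::uint64_t hash) {
    bits_[hash % bits_.size()] = true;
    bits_[(hash >> 17) % bits_.size()] = true;
  }

  // May return false positives, but never false negatives.
  bool MayContain(std::uint64_t hash) const {
    return bits_[hash % bits_.size()] && bits_[(hash >> 17) % bits_.size()];
  }

 private:
  std::vector<bool> bits_;
};

// Drop probe-side rows whose key definitely has no match on the build side.
std::vector<std::string> FilterProbeBatch(const ToyBloomFilter& filter,
                                          const std::vector<std::string>& batch) {
  std::vector<std::string> kept;
  std::hash<std::string> hasher;
  for (const auto& key : batch) {
    if (filter.MayContain(hasher(key))) kept.push_back(key);
  }
  return kept;
}
```

Because Bloom filters admit false positives, surviving rows still go through the normal hash-table probe; the filter only reduces how many rows reach it.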