Skip to content

Conversation

@save-buffer
Copy link
Contributor

This adds Bloom filter pushdown between hash join nodes.

@github-actions
Copy link

Thanks for opening a pull request!

If this is not a minor PR. Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW

Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

@save-buffer save-buffer marked this pull request as draft January 28, 2022 23:11
@save-buffer save-buffer changed the title [C++] Bloom filter pushdown for Hash Join ARROW-15498: [C++][Compute] Implement Bloom filter pushdown between hash joins Jan 28, 2022
@github-actions
Copy link

@github-actions
Copy link

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@westonpace westonpace self-requested a review February 1, 2022 18:33
@save-buffer save-buffer force-pushed the sasha_bloom_pushdown branch 3 times, most recently from 800a969 to ed4d9dc Compare February 8, 2022 18:29
@save-buffer save-buffer force-pushed the sasha_bloom_pushdown branch from ed4d9dc to 1ab0bab Compare April 12, 2022 21:52
@save-buffer save-buffer marked this pull request as ready for review April 12, 2022 21:56
@save-buffer save-buffer force-pushed the sasha_bloom_pushdown branch 4 times, most recently from c9b57e0 to 02a003a Compare April 25, 2022 20:01
Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Starting to review this. Still need to go through hash_join.cc but I have some initial comments from the periphery.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we padding here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's weird if you Init the TempVectorStack with one size and then it segfaults if you try to alloc that much memory. That's because alloc bumps the stack by PaddedAllocationSize(size) + 2 * sizeof(uint64_t)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this still needed after we fixed other TSAN related issues?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I had to make blocks atomic to make TSAN go away, which we don't want to do

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why ignore it? Return it if it isn't ok.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The lambda expression needs to return void here. I can maybe DCHECK_OK it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So is the idea here to measure the overhead cost of building the bloom filter?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well we benchmark to see what kind of performance impact the Bloom filter has.
But since we currently only do early-elimination of rows, and only build on the build side, this disqualifies some types of joins, so we check that here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we operate on these pairs everywhere why not create:

struct BloomFilterTarget {
  HashJoinImpl* join_impl;
  std::vector<int> column_map;
};

It's also takes a bit of reading to figure out what the purpose of column_map is so this could be a place to briefly describe that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only use the pair in one spot as far as I can tell. I just use it so that I can use std::tie on whoever calls GetPushdownTarget. I did add a big comment though

@save-buffer save-buffer force-pushed the sasha_bloom_pushdown branch from 7d4502d to 4b6ccf4 Compare April 26, 2022 19:51
@westonpace westonpace self-requested a review April 28, 2022 23:55
@save-buffer save-buffer force-pushed the sasha_bloom_pushdown branch 2 times, most recently from 139dcae to 0575ddf Compare May 5, 2022 21:09
@save-buffer save-buffer force-pushed the sasha_bloom_pushdown branch 3 times, most recently from 653f13b to 619f098 Compare May 9, 2022 18:19
Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor suggestions but overall I think this is pretty much ready to go once #13091 merges.

@save-buffer save-buffer force-pushed the sasha_bloom_pushdown branch from ad92449 to 15f3c37 Compare May 9, 2022 22:30
Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's rebase on top of master now that the thread scheduler issue is in. Then, assuming CI passes, I think this is ready to go.

@save-buffer save-buffer force-pushed the sasha_bloom_pushdown branch 2 times, most recently from 4e0621a to ede7599 Compare May 16, 2022 20:25
@save-buffer
Copy link
Contributor Author

OK I think this is good now. The various failures seem to be because of :

tests/test_plasma.py::TestPlasmaClient::test_connection_failure_raises_exception

And

685  CMake Error at cmake_modules/ThirdpartyToolchain.cmake:239 (find_package):
686  Could not find a package configuration file provided by "GTest" (requested
687  version 1.10.0) with any of the following names:
688
689    GTestConfig.cmake
690    gtest-config.cmake
691
692  Add the installation prefix of "GTest" to CMAKE_PREFIX_PATH or set
693  "GTest_DIR" to a directory containing one of the above files.  If "GTest"
694  provides a separate development package or SDK, be sure it has been
695  installed.
696

The latter of which is being addressed in #13101

@save-buffer save-buffer force-pushed the sasha_bloom_pushdown branch from c8be4c8 to 1a4ae69 Compare May 17, 2022 22:26
@westonpace
Copy link
Member

I played around with this today and saw similar. Some bloom filter combinations were just very slow on my laptop. In general though I didn't see any true deadlock though I can never fully rule that out.

@save-buffer
Copy link
Contributor Author

Just to be clear, it's not the Bloom filter that's slow (the slow parts tend to be full outer joins, where Bloom filter is disabled). It seems to be related to residual filters being slow, in particular KeyEncoder::DecodeNulls

@westonpace
Copy link
Member

Yes. The slowdown happened in ProbeQueuedBatches which would make sense (I think) if it was related to residual filters.

@ursabot
Copy link

ursabot commented May 18, 2022

Benchmark runs are scheduled for baseline = 6faee47 and contender = 0742f78. 0742f78 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️1.98% ⬆️1.63%] test-mac-arm
[Failed ⬇️0.0% ⬆️11.79%] ursa-i9-9960x
[Finished ⬇️1.66% ⬆️0.32%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 0742f78a ec2-t3-xlarge-us-east-2
[Failed] 0742f78a test-mac-arm
[Failed] 0742f78a ursa-i9-9960x
[Finished] 0742f78a ursa-thinkcentre-m75q
[Finished] 6faee474 ec2-t3-xlarge-us-east-2
[Finished] 6faee474 test-mac-arm
[Finished] 6faee474 ursa-i9-9960x
[Finished] 6faee474 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@ursabot
Copy link

ursabot commented May 18, 2022

['Python', 'R'] benchmarks have high level of regressions.
test-mac-arm

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants