ARROW-16166: [C++][Compute] Utilities for assembling join output #12872
Conversation
save-buffer
left a comment
Looks good to me, I guess I already reviewed a bunch of this code on the other PR.
One thing: is it worth adding Hashing32::HashBatch and Hashing64::HashBatch in this PR? I need those for Bloom filter pushdown; I'm not sure whether they fit better here or there.
remove or uncomment
Ah, my mistake. This test fails, by the way, so I wasn't sure whether that was something we wanted to address in this PR or file a follow-up JIRA for. I think the hash-join PR doesn't operate on null columns so it doesn't need it, but since we have specific code branches to support null in a few spots, it seemed like we would want to support it at some point.
I also see above that dictionary type may be handled?
I removed this part.
westonpace
left a comment
Some thoughts I had as I was working on tests.
cpp/src/arrow/compute/light_array.h
Outdated
How do these compare to a theoretical ExecBatch refactor?
I suppose these are far too limited in terms of data type support.
Yes, this isn't intended as a wholesale replacement of ExecBatch; rather, it's something small enough to get a PR through that improves hash-join performance. Once that is done, we can either grow these tools to support more types, replace them with something more complete and more efficient, or accept multiple variations of kernels/nodes depending on benchmarks and need.
I think, by having these tools here, it will be easier to evaluate and work on potential replacements for ExecBatch, even if someone doesn't know the ins and outs of the join algorithm.
The KeyColumnMetadata and KeyColumnArray classes are used in hash group-by and in the new hash join for interaction with the hash table. The reason to have them was to be able to work on smaller batches than ExecBatch and therefore keep intermediate results of multi-step computations in L1 cache. The goal was to get rid of the overheads of memory allocations and shared pointers (and the atomic instructions for ref counting) when doing things like slicing.
If we make ExecBatch more lightweight in the future, it is likely that the new ExecBatch structures would work well for hash tables too. But the KeyColumn... classes so far restrict the functionality to the minimum needed for hash table keys, in terms of both data type support and abstraction of the data (e.g. the hash table is only interested in data movement; interpretation of the bytes in column values is left to callback functions).
Understood. At the very least it helps show what a replacement might look like.
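(For illustration, here is a much-simplified sketch of the lightweight, non-owning column view described in the exchange above. The names mirror KeyColumnMetadata/KeyColumnArray from light_array.h, but the bodies are illustrative assumptions, not the real implementation.)

```cpp
#include <cstdint>

// Simplified illustration only; the real classes live in
// cpp/src/arrow/compute/light_array.h and carry more state.
struct KeyColumnMetadata {
  bool is_fixed_length;   // fixed-width values vs. offsets + var-length data
  uint32_t fixed_length;  // byte width of a value (0 = bit-packed booleans)
};

class KeyColumnArray {
 public:
  KeyColumnArray(KeyColumnMetadata metadata, int64_t length,
                 const uint8_t* validity, const uint8_t* fixed_or_offsets,
                 const uint8_t* var_data)
      : metadata_(metadata),
        length_(length),
        validity_(validity),
        data_(fixed_or_offsets),
        var_data_(var_data) {}

  // Slicing just adjusts raw pointers: no allocation, no shared_ptr,
  // no atomic ref-count traffic, so small slices stay cheap and the
  // working set can be kept L1-cache sized.
  KeyColumnArray Slice(int64_t offset, int64_t length) const {
    const uint8_t* data = data_;
    if (metadata_.is_fixed_length && metadata_.fixed_length > 0) {
      data += offset * metadata_.fixed_length;
    }
    // (Validity-bitmap slicing at non-byte-aligned offsets is elided here.)
    return KeyColumnArray(metadata_, length, validity_, data, var_data_);
  }

 private:
  KeyColumnMetadata metadata_;
  int64_t length_;
  const uint8_t* validity_;  // non-owning
  const uint8_t* data_;      // fixed-length values or 32-bit offsets
  const uint8_t* var_data_;  // variable-length payload (binary types)
};
```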
cpp/src/arrow/compute/light_array.h
Outdated
Is this worth it compared to the usual Result<> pattern?
I changed it to return Result<>.
In hash join (which will be the first consumer of this interface), data type checks are done during initialization of the ExecNode, so returning Result<> introduces redundant status checks.
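(For context, a minimal sketch of how the Result<> version reads at a call site. The ColumnMetadataFromDataType helper name is an assumption about this PR's interface; ARROW_ASSIGN_OR_RAISE is the standard Arrow macro.)

```cpp
#include <memory>

#include "arrow/compute/light_array.h"  // KeyColumnMetadata (added by this PR)
#include "arrow/result.h"
#include "arrow/status.h"

// Sketch: validate-and-construct via Result<> at every call site.
arrow::Status PrepareKeyColumn(const std::shared_ptr<arrow::DataType>& type) {
  // ARROW_ASSIGN_OR_RAISE propagates an error Status if the type is
  // unsupported: a check the hash join has already performed once
  // during ExecNode initialization, hence the "redundant" concern above.
  ARROW_ASSIGN_OR_RAISE(
      arrow::compute::KeyColumnMetadata metadata,
      arrow::compute::ColumnMetadataFromDataType(type));  // assumed helper
  (void)metadata;  // ... use metadata to drive hash-table layout ...
  return arrow::Status::OK();
}
```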
Why are these commented out?
There is no support for these data types yet (but there probably will be in the future).
I can remove the commented-out part.
Also, dictionary type is not supported in KeyColumn... (the hash table works either with indices from the dictionary or with decoded values). Once the new hash join is ready, there will be a follow-up change to port dictionary type support from the old hash join to the new hash join, which will not have it initially. At that point some of the code from this PR may need to be updated.
> I also see above that dictionary type may be handled?

Hashing seems unrelated to the topic of this PR, but we could possibly make a tiny separate PR just for HashBatch (or put it into Bloom filter pushdown).
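(A toy stand-in for the HashBatch idea discussed above; such a helper does not exist at this point in the thread, so everything below is an assumption about its shape. The real version would take KeyColumnArray inputs and use vectorized hashing rather than scalar FNV-1a.)

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy stand-in for the discussed Hashing32::HashBatch. Assumes
// fixed-length key columns given as raw byte pointers.
struct ToyColumn {
  const uint8_t* data;  // column values, row-major
  size_t width;         // bytes per value
};

void HashBatch(const std::vector<ToyColumn>& key_columns, int64_t num_rows,
               uint32_t* hashes /* out: one hash per row */) {
  for (int64_t row = 0; row < num_rows; ++row) {
    uint32_t h = 2166136261u;  // FNV-1a offset basis
    for (const ToyColumn& col : key_columns) {
      const uint8_t* value = col.data + row * col.width;
      for (size_t i = 0; i < col.width; ++i) {
        h = (h ^ value[i]) * 16777619u;  // FNV-1a prime
      }
    }
    hashes[row] = h;
  }
}
// A Hashing64 variant would fill a uint64_t* hashes array instead.
```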
Force-pushed from a572001 to f507f2b (Compare)
I've added ARROW-16224 for the follow-up to add support for large binary types and dictionary-encoded types as key columns.
westonpace
left a comment
A few more naming nits, but overall I think I'm satisfied. In the future let's try to avoid squash-rebasing after PR reviews have begun, as it makes it more difficult to see the specific changes that were added to address comments.
Force-pushed from f507f2b to e27a1a3 (Compare)
@lidavidm Any further thoughts?
lidavidm
left a comment
I didn't look at the implementation too closely but the API/tests look fine.
Benchmark runs are scheduled for baseline = 2e7acab and contender = b0c75de. b0c75de is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Adding utilities that make it easier to filter, replicate and accumulate rows from potentially multiple sources in a single exec batch.
This is meant to be used in hash join for reducing overheads in producing its output.
Also, moving light-weight array utilities to the same file and adding helper functions for generating them.
Thanks to Weston Pace for updating comments and writing the unit tests.
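(A hedged usage sketch of the accumulation utility this description refers to; treat the ExecBatchBuilder method signatures as assumptions, with the real interface in light_array.h in this PR.)

```cpp
#include <cstdint>
#include <vector>

#include "arrow/compute/exec.h"         // ExecBatch
#include "arrow/compute/light_array.h"  // ExecBatchBuilder (this PR)
#include "arrow/memory_pool.h"
#include "arrow/result.h"
#include "arrow/status.h"

// Usage sketch only: accumulate join-selected rows from several input
// batches into one output batch, avoiding a materialization per source.
arrow::Result<arrow::compute::ExecBatch> AssembleJoinOutput(
    arrow::MemoryPool* pool,
    const std::vector<arrow::compute::ExecBatch>& inputs,
    const std::vector<std::vector<uint16_t>>& matching_row_ids,
    int num_cols) {
  arrow::compute::ExecBatchBuilder builder;
  for (size_t i = 0; i < inputs.size(); ++i) {
    const std::vector<uint16_t>& row_ids = matching_row_ids[i];
    // Filter/replicate: row_ids may skip rows, or repeat a row id to
    // replicate that row for multiple join matches.
    ARROW_RETURN_NOT_OK(builder.AppendSelected(
        pool, inputs[i], static_cast<int>(row_ids.size()), row_ids.data(),
        num_cols));
  }
  return builder.Flush();  // single accumulated output batch
}
```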