ARROW-16166: [C++][Compute] Utilities for assembling join output #12872
Conversation
save-buffer
left a comment
Looks good to me, I guess I already reviewed a bunch of this code on the other PR.
One thing: is it worth adding Hashing32::HashBatch and Hashing64::HashBatch in this PR? I need those for Bloom filter pushdown; I'm not sure whether they fit better here or there.
remove or uncomment
Ah, my mistake. This test fails, by the way, so I wasn't sure whether that was something we wanted to address in this PR or file a follow-up JIRA for. I think the hash-join PR doesn't operate on null columns so it doesn't need it, but since we have specific code branches to support null in a few spots, it seemed like we would want to support it at some point.
I also see above that dictionary type may be handled?
I removed this part.
westonpace
left a comment
Some thoughts I had as I was working on tests.
cpp/src/arrow/compute/light_array.h
Outdated
How do these compare to a theoretical ExecBatch refactor?
I suppose these are far too limited in terms of data type support.
Yes, this isn't intended as a wholesale replacement of ExecBatch; rather, it's something small enough to get a PR through that improves hash-join performance. Once that is done, we can either grow these tools to support more types, replace them with something more complete and more efficient, or accept multiple variations of kernels/nodes depending on benchmarks and need.
I think, by having these tools here, it will be easier to evaluate and work on potential replacements for ExecBatch, even if someone doesn't know the ins and outs of the join algorithm.
The KeyColumnMetadata and KeyColumnArray classes are used in hash group-by and in the new hash join for interaction with the hash table. The reason to have them was to be able to work on smaller batches than ExecBatch and therefore keep intermediate results of multi-step computations in L1 cache. The goal was to get rid of the overheads of memory allocations and shared pointers (and the atomic instructions for ref counting) when doing things like slicing.
If we make ExecBatch more lightweight in the future, it is likely that the new ExecBatch structures would work well for hash tables too. But the KeyColumn... classes so far restrict the functionality to the minimum needed for hash table keys, in terms of both data type support and abstraction of the data (e.g. the hash table is only interested in data movement; interpretation of the bytes in column values is left to callback functions).
Understood. At the very least it helps show what a replacement might look like.
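(For illustration, here is a much-simplified sketch of the lightweight, non-owning column view described in the exchange above. The names mirror KeyColumnMetadata/KeyColumnArray from light_array.h, but the bodies are illustrative assumptions, not the real implementation.)

```cpp
#include <cstdint>

// Simplified illustration only; the real classes live in
// cpp/src/arrow/compute/light_array.h and carry more state.
struct KeyColumnMetadata {
  bool is_fixed_length;   // fixed-width values vs. offsets + var-length data
  uint32_t fixed_length;  // byte width of a value (0 = bit-packed booleans)
};

class KeyColumnArray {
 public:
  KeyColumnArray(KeyColumnMetadata metadata, int64_t length,
                 const uint8_t* validity, const uint8_t* fixed_or_offsets,
                 const uint8_t* var_data)
      : metadata_(metadata),
        length_(length),
        validity_(validity),
        data_(fixed_or_offsets),
        var_data_(var_data) {}

  // Slicing just adjusts raw pointers: no allocation, no shared_ptr,
  // no atomic ref-count traffic, so small slices stay cheap and the
  // working set can be kept L1-cache sized.
  KeyColumnArray Slice(int64_t offset, int64_t length) const {
    const uint8_t* data = data_;
    if (metadata_.is_fixed_length && metadata_.fixed_length > 0) {
      data += offset * metadata_.fixed_length;
    }
    // (Validity-bitmap slicing at non-byte-aligned offsets is elided here.)
    return KeyColumnArray(metadata_, length, validity_, data, var_data_);
  }

 private:
  KeyColumnMetadata metadata_;
  int64_t length_;
  const uint8_t* validity_;  // non-owning
  const uint8_t* data_;      // fixed-length values or 32-bit offsets
  const uint8_t* var_data_;  // variable-length payload (binary types)
};
```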
cpp/src/arrow/compute/light_array.h
Outdated
Is this worth it compared to the usual Result<> pattern?
I changed it to return Result<>.
In hash join (which will be the first consumer of this interface), data type checks are done during initialization of the ExecNode, so returning Result<> introduces redundant status checks.
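(For context, a minimal sketch of how the Result<> version reads at a call site. The ColumnMetadataFromDataType helper name is an assumption about this PR's interface; ARROW_ASSIGN_OR_RAISE is the standard Arrow macro.)

```cpp
#include <memory>

#include "arrow/compute/light_array.h"  // KeyColumnMetadata (added by this PR)
#include "arrow/result.h"
#include "arrow/status.h"

// Sketch: validate-and-construct via Result<> at every call site.
arrow::Status PrepareKeyColumn(const std::shared_ptr<arrow::DataType>& type) {
  // ARROW_ASSIGN_OR_RAISE propagates an error Status if the type is
  // unsupported: a check the hash join has already performed once
  // during ExecNode initialization, hence the "redundant" concern above.
  ARROW_ASSIGN_OR_RAISE(
      arrow::compute::KeyColumnMetadata metadata,
      arrow::compute::ColumnMetadataFromDataType(type));  // assumed helper
  (void)metadata;  // ... use metadata to drive hash-table layout ...
  return arrow::Status::OK();
}
```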
Why are these commented out?
There is no support for these data types yet (but there probably will be in the future).
I can remove the commented-out part.
Also, dictionary type is not supported in KeyColumn... (the hash table works either with indices from the dictionary or with decoded values). Once the new hash join is ready, there will be a follow-up change to port dictionary type support from the old hash join to the new hash join, which will not have it initially. At that point some of the code from this PR may need to be updated.
> I also see above that dictionary type may be handled?

Hashing seems unrelated to the topic of this PR, but we could possibly make a tiny separate PR just for HashBatch (or put it into Bloom filter pushdown).
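(A toy stand-in for the HashBatch idea discussed above; such a helper does not exist at this point in the thread, so everything below is an assumption about its shape. The real version would take KeyColumnArray inputs and use vectorized hashing rather than scalar FNV-1a.)

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy stand-in for the discussed Hashing32::HashBatch. Assumes
// fixed-length key columns given as raw byte pointers.
struct ToyColumn {
  const uint8_t* data;  // column values, row-major
  size_t width;         // bytes per value
};

void HashBatch(const std::vector<ToyColumn>& key_columns, int64_t num_rows,
               uint32_t* hashes /* out: one hash per row */) {
  for (int64_t row = 0; row < num_rows; ++row) {
    uint32_t h = 2166136261u;  // FNV-1a offset basis
    for (const ToyColumn& col : key_columns) {
      const uint8_t* value = col.data + row * col.width;
      for (size_t i = 0; i < col.width; ++i) {
        h = (h ^ value[i]) * 16777619u;  // FNV-1a prime
      }
    }
    hashes[row] = h;
  }
}
// A Hashing64 variant would fill a uint64_t* hashes array instead.
```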
Force-pushed from a572001 to f507f2b (Compare)
I've added ARROW-16224 for the follow-up to add support for large binary types and dictionary-encoded types as key columns.
westonpace
left a comment
A few more naming nits, but overall I think I'm satisfied. In the future let's try to avoid squash-rebasing after PR reviews have begun, as it makes it more difficult to see the specific changes that were added to address comments.
Force-pushed from f507f2b to e27a1a3 (Compare)
@lidavidm Any further thoughts?
lidavidm
left a comment
I didn't look at the implementation too closely but the API/tests look fine.
Benchmark runs are scheduled for baseline = 2e7acab and contender = b0c75de. b0c75de is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Adding utilities that make it easier to filter, replicate and accumulate rows from potentially multiple sources in a single exec batch.
This is meant to be used in hash join for reducing overheads in producing its output.
Also, moving light-weight array utilities to the same file and adding helper functions for generating them.
Thanks to Weston Pace for updating comments and writing the unit tests.
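(A hedged usage sketch of the accumulation utility this description refers to; treat the ExecBatchBuilder method signatures as assumptions, with the real interface in light_array.h in this PR.)

```cpp
#include <cstdint>
#include <vector>

#include "arrow/compute/exec.h"         // ExecBatch
#include "arrow/compute/light_array.h"  // ExecBatchBuilder (this PR)
#include "arrow/memory_pool.h"
#include "arrow/result.h"
#include "arrow/status.h"

// Usage sketch only: accumulate join-selected rows from several input
// batches into one output batch, avoiding a materialization per source.
arrow::Result<arrow::compute::ExecBatch> AssembleJoinOutput(
    arrow::MemoryPool* pool,
    const std::vector<arrow::compute::ExecBatch>& inputs,
    const std::vector<std::vector<uint16_t>>& matching_row_ids,
    int num_cols) {
  arrow::compute::ExecBatchBuilder builder;
  for (size_t i = 0; i < inputs.size(); ++i) {
    const std::vector<uint16_t>& row_ids = matching_row_ids[i];
    // Filter/replicate: row_ids may skip rows, or repeat a row id to
    // replicate that row for multiple join matches.
    ARROW_RETURN_NOT_OK(builder.AppendSelected(
        pool, inputs[i], static_cast<int>(row_ids.size()), row_ids.data(),
        num_cols));
  }
  return builder.Flush();  // single accumulated output batch
}
```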