ARROW-14181: [C++][Compute] Support for dictionaries in hash join #11446
lidavidm left a comment
Overall, this looks good to me, but I lack the context to fully review the changes to the join code itself (in hash_join.cc).
I noticed generally we always pass a validity buffer around even in situations where it could be omitted - presumably it's cleaner implementation-wise to do this?
FWIW, thanks for the detailed comments in this file - they help a lot in understanding what's going on here.
nealrichardson left a comment
I updated the R tests to include dictionaries and they pass
Benchmark runs are scheduled for baseline = 528625e and contender = 230afef. 230afef is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Supporting dictionary arrays and dictionary scalars as inputs to hash join on both its sides, in key columns and non-key columns.

A key column from probe side of the join can be matched against a key column from build side of the join, as long as the underlying value types are equal, that means that:
- dictionary column (on either side) can be matched against non-dictionary column (on the other side) if underlying value types are equal
- dictionary column can be matched against dictionary column with a different index type, and potentially using a different dictionary, as long as the underlying value types are equal

We keep the same limitation that is present in hash group by with respect to dictionaries, that is the same dictionary must be used for a given column in all input exec batches. The values in the dictionary do not have to be unique - it can contain duplicate entries and/or null entries.

Closes #11446 from michalursa/ARROW-14181-hash-join-dict

Lead-authored-by: michalursa <michal@ursacomputing.com>
Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>
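The matching rule described in the PR can be illustrated with a small pure-Python sketch. This is not the Arrow C++ implementation and uses no Arrow API: `DictColumn`, `decode`, and `inner_join_rows` are hypothetical names introduced only for illustration. It assumes that keys are compared by their decoded values and that null keys never match, as under a plain equality comparison.

```python
from typing import Any, Dict, List, Optional, Sequence, Tuple

# Hypothetical stand-in for a dictionary-encoded column: (indices, dictionary).
# An index of None marks a null row; a dictionary entry may itself be None.
DictColumn = Tuple[Sequence[Optional[int]], Sequence[Any]]


def decode(column: DictColumn) -> List[Any]:
    """Expand a dictionary-encoded column into its plain values."""
    indices, dictionary = column
    return [None if i is None else dictionary[i] for i in indices]


def inner_join_rows(
    probe_keys: Sequence[Any], build_keys: Sequence[Any]
) -> List[Tuple[int, int]]:
    """Return (probe_row, build_row) pairs whose key values are equal.

    Assumption: null keys never match (plain equality semantics).
    """
    # Build phase: map each non-null key value to the build rows holding it.
    table: Dict[Any, List[int]] = {}
    for row, key in enumerate(build_keys):
        if key is not None:
            table.setdefault(key, []).append(row)
    # Probe phase: look up each non-null probe key in the hash table.
    return [
        (probe_row, build_row)
        for probe_row, key in enumerate(probe_keys)
        if key is not None
        for build_row in table.get(key, [])
    ]


# Probe side: a dictionary column whose dictionary contains a duplicate
# entry ("a" twice) and a null entry.
probe = ([0, 1, 2, 3, None], ["a", "b", "a", None])
# Build side: a plain (non-dictionary) column with the same value type.
build = ["b", "a", None]

pairs = inner_join_rows(decode(probe), build)
```

Because matching happens on decoded values, the two dictionary indices that both point at "a" join to the same build row, and both the null dictionary entry and the null index are treated as null keys.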