@michalursa michalursa commented Oct 18, 2021

This PR adds support for dictionary arrays and dictionary scalars as inputs to hash join, on both sides of the join and in both key and non-key columns.

A key column from the probe side of the join can be matched against a key column from the build side as long as the underlying value types are equal. This means that:

  • a dictionary column (on either side) can be matched against a non-dictionary column (on the other side) if the underlying value types are equal
  • a dictionary column can be matched against a dictionary column with a different index type, and potentially a different dictionary, as long as the underlying value types are equal
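
The matching rule above can be illustrated with a small stdlib-only Python sketch. It is not the code from hash_join.cc: `DictColumn`, `decode` and `inner_join_keys` are hypothetical names, string keys are assumed, and null keys are assumed never to match. The point it shows is only that rows are compared by their underlying values, so the encoding and the dictionary contents on each side do not have to agree.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import List, Optional, Tuple, Union


@dataclass
class DictColumn:
    """Hypothetical stand-in for a dictionary-encoded column."""
    indices: List[Optional[int]]     # None marks a null row
    dictionary: List[Optional[str]]  # values that the indices point to


# A key column is either dictionary-encoded or a plain list of values.
Column = Union[DictColumn, List[Optional[str]]]


def decode(col: Column) -> List[Optional[str]]:
    """Return the underlying values, whatever the encoding."""
    if isinstance(col, DictColumn):
        return [None if i is None else col.dictionary[i] for i in col.indices]
    return list(col)


def inner_join_keys(build: Column, probe: Column) -> List[Tuple[int, int]]:
    """Return (probe_row, build_row) pairs whose key values are equal."""
    table = defaultdict(list)
    for row, value in enumerate(decode(build)):
        if value is not None:  # assumption: null keys never match
            table[value].append(row)
    return [
        (p, b)
        for p, value in enumerate(decode(probe))
        if value is not None
        for b in table.get(value, [])
    ]


build = DictColumn(indices=[0, 1, 0], dictionary=["a", "b"])
probe_plain = ["b", "a", "c"]  # non-dictionary column
probe_dict = DictColumn(indices=[2, 0, None], dictionary=["b", "x", "a"])

pairs_plain = inner_join_keys(build, probe_plain)
pairs_dict = inner_join_keys(build, probe_dict)
```

Here `probe_dict` uses a different dictionary than `build`, yet both probes join against it because only the decoded values are compared.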

We keep the same limitation that is present in hash group by with respect to dictionaries: the same dictionary must be used for a given column in all input exec batches. The values in the dictionary do not have to be unique; the dictionary can contain duplicate entries and/or null entries.
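
As a rough illustration of these two rules (again a hypothetical Python sketch, not the actual implementation), the helper `canonical_ids` below shows one way duplicate and null dictionary entries can be tolerated: every dictionary slot is mapped to an id shared by all equal values, and null slots map to `None`. `check_same_dictionary` sketches the stated limitation by rejecting batches whose dictionary differs from the first one. Both function names are assumptions made for this example.

```python
from typing import Dict, List, Optional, Sequence


def canonical_ids(dictionary: Sequence[Optional[str]]) -> List[Optional[int]]:
    """Map each dictionary slot to an id shared by equal values.

    Duplicate entries get the same id; null entries map to None.
    """
    first_seen: Dict[str, int] = {}
    ids: List[Optional[int]] = []
    for value in dictionary:
        if value is None:
            ids.append(None)
        else:
            # setdefault assigns the next free id only on first sight.
            ids.append(first_seen.setdefault(value, len(first_seen)))
    return ids


def check_same_dictionary(batches: Sequence[Sequence[Optional[str]]]) -> None:
    """Raise if the dictionary of a column differs between batches."""
    for i, d in enumerate(batches[1:], start=1):
        if list(d) != list(batches[0]):
            raise ValueError(f"batch {i} uses a different dictionary")


dictionary = ["x", "y", "x", None]
ids = canonical_ids(dictionary)
```

With this mapping, slots 0 and 2 (both `"x"`) resolve to the same id, so rows that reference either slot compare equal.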

@github-actions

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@michalursa michalursa force-pushed the ARROW-14181-hash-join-dict branch 2 times, most recently from b381cdf to 8c9faf2 Compare October 19, 2021 22:21
@michalursa michalursa force-pushed the ARROW-14181-hash-join-dict branch from 8c9faf2 to 53124cb Compare November 2, 2021 06:02

@lidavidm lidavidm left a comment

Overall, this looks good to me, but I lack the context to fully review the changes to the join code itself (in hash_join.cc).

I noticed that we generally pass a validity buffer around even in situations where it could be omitted. Presumably that keeps the implementation cleaner?
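
To make the trade-off concrete, here is a hypothetical Python sketch (not taken from the PR). In the Arrow format a validity bitmap is LSB-first and may be absent when a column has no nulls. `is_valid` has to branch on the missing bitmap, while `materialize_validity` builds an explicit all-valid bitmap once so that downstream code needs no such branch. Both helper names are assumptions for illustration.

```python
from typing import Optional


def is_valid(validity: Optional[bytes], i: int) -> bool:
    """Read bit i of an LSB-first bitmap; None means every slot is valid."""
    if validity is None:
        return True
    return bool((validity[i // 8] >> (i % 8)) & 1)


def materialize_validity(validity: Optional[bytes], length: int) -> bytes:
    """Return an explicit bitmap so callers need no None branch."""
    if validity is not None:
        return validity
    full, rest = divmod(length, 8)
    # Set only the low `rest` bits in the final partial byte.
    tail = bytes([(1 << rest) - 1]) if rest else b""
    return b"\xff" * full + tail


bitmap = materialize_validity(None, 10)
```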

FWIW, thanks for the detailed comments in this file - they help a lot in understanding what's going on here.

@michalursa michalursa force-pushed the ARROW-14181-hash-join-dict branch from bb81d08 to 5c52e41 Compare November 3, 2021 00:17
@michalursa michalursa force-pushed the ARROW-14181-hash-join-dict branch from 5c52e41 to 2f42618 Compare November 4, 2021 20:41

@nealrichardson nealrichardson left a comment

I updated the R tests to include dictionaries, and they pass.

ursabot commented Nov 5, 2021

Benchmark runs are scheduled for baseline = 528625e and contender = 230afef. 230afef is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️3.08% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.27% ⬆️0.04%] ursa-thinkcentre-m75q
Supported benchmarks:
ursa-i9-9960x: langs = Python, R, JavaScript
ursa-thinkcentre-m75q: langs = C++, Java
ec2-t3-xlarge-us-east-2: cloud = True

kou pushed a commit that referenced this pull request Nov 10, 2021

Closes #11446 from michalursa/ARROW-14181-hash-join-dict

Lead-authored-by: michalursa <michal@ursacomputing.com>
Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>