-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
The ordinary hash join (OHJ) is a great solution when one side of the data is static and can fit into memory. However, the sort-merge join (SMJ) is more effective when keys in the join condition are already sorted. If the join filter expression has order guarantees, but not the join key, both OHJ, and SMJ can result in suboptimal performance.
This is where Symmetric Hash Join (SHJ) comes in. SHJ addresses the gap in join use cases by introducing support for filter expressions with order guarantees, such as sliding windows.
For example, consider the following query:
SELECT * FROM left_table, right_table
WHERE
left_key = right_key AND
a > b + 3 AND
a < b + 10In this scenario, the columns a and b are sorted. In this case, SMJ wouldn't be effective and OHJ may struggle with low cardinality join keys.
SHJ extends the join capabilities of Datafusion by handling such use cases efficiently. While ordinary hash join typically remains the preferable option when both sources are finite, the join type can be changed to SHJ using a PipelineFixer sub-rule when both sources are unbounded.
Describe the solution you'd like
At skeleton implementation of SHJ that can be improved on, maybe first limited to partitioned mode only and lacking full support for output order information, but extensible enough so that these capabilities can be implemented later on. In detail:
- Provide a way to support sliding window semantics in
PhysicalExprs - Add a sub-rule to
PipelineFixerto replace theHashJoinif necessary.
Describe alternatives you've considered
NA
Additional context
NA