Skip to content

Support for Sliding Windows Joins with Symmetric Hash Join (SHJ) #5321

@metesynnada

Description

@metesynnada

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

The ordinary hash join (OHJ) is a great solution when one side of the data is static and can fit into memory. However, the sort-merge join (SMJ) is more effective when keys in the join condition are already sorted. If the join filter expression has order guarantees, but not the join key, both OHJ, and SMJ can result in suboptimal performance.

This is where Symmetric Hash Join (SHJ) comes in. SHJ addresses the gap in join use cases by introducing support for filter expressions with order guarantees, such as sliding windows.

For example, consider the following query:

SELECT * FROM left_table, right_table
WHERE
  left_key = right_key AND
  a > b + 3 AND
  a < b + 10

In this scenario, the columns a and b are sorted. In this case, SMJ wouldn't be effective and OHJ may struggle with low cardinality join keys.

SHJ extends the join capabilities of Datafusion by handling such use cases efficiently. While ordinary hash join typically remains the preferable option when both sources are finite, the join type can be changed to SHJ using a PipelineFixer sub-rule when both sources are unbounded.

Describe the solution you'd like
At skeleton implementation of SHJ that can be improved on, maybe first limited to partitioned mode only and lacking full support for output order information, but extensible enough so that these capabilities can be implemented later on. In detail:

  • Provide a way to support sliding window semantics in PhysicalExprs
  • Add a sub-rule to PipelineFixer to replace the HashJoin if necessary.

Describe alternatives you've considered
NA

Additional context
NA

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions