[C++] Support hash-join on larger than memory datasets #31769

@asfimport

Description

The current implementation of the hash-join node queues the hash table, the entire build-side input, and the entire probe-side input (i.e. the entire dataset) in memory. This means it will run out of memory and crash if the input dataset is larger than the memory available on the system.

By spilling to disk when memory starts to fill up, we can allow the hash-join node to process datasets larger than the available memory on the machine.
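
To make the idea concrete, here is a minimal sketch of one common way to do this, a partitioned (Grace-style) hash join. It is not the design proposed for the Arrow hash-join node, and it makes several simplifying assumptions: `Row`, `JoinedRow`, `SpillPartitions`, and `SpillingHashJoin` are hypothetical names, rows are plain structs rather than Arrow batches, the inputs are passed as vectors standing in for streams, and every row is spilled up front instead of only once memory starts to fill up. Only one partition's hash table is held in memory at a time.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

namespace fs = std::filesystem;

// Hypothetical row types for illustration only; these are not Arrow types.
struct Row { std::int64_t key; std::int64_t value; };
struct JoinedRow { std::int64_t key; std::int64_t build_value; std::int64_t probe_value; };

// Write each row to the spill file of the partition its key hashes to.
// Rows with equal keys always land in the same partition index.
inline std::vector<fs::path> SpillPartitions(const std::vector<Row>& rows,
                                             std::size_t num_partitions,
                                             const fs::path& dir,
                                             const std::string& prefix) {
  assert(num_partitions > 0);
  std::vector<fs::path> paths;
  std::vector<std::ofstream> files;
  for (std::size_t p = 0; p < num_partitions; ++p) {
    paths.push_back(dir / (prefix + "_" + std::to_string(p) + ".bin"));
    files.emplace_back(paths.back(), std::ios::binary | std::ios::trunc);
  }
  for (const Row& row : rows) {
    std::size_t p = std::hash<std::int64_t>{}(row.key) % num_partitions;
    files[p].write(reinterpret_cast<const char*>(&row), sizeof(Row));
  }
  return paths;  // the ofstreams flush and close when `files` is destroyed
}

// Inner join on `key`, keeping only one partition's hash table in memory.
inline std::vector<JoinedRow> SpillingHashJoin(const std::vector<Row>& build,
                                               const std::vector<Row>& probe,
                                               std::size_t num_partitions,
                                               const fs::path& spill_dir) {
  fs::create_directories(spill_dir);
  auto build_files = SpillPartitions(build, num_partitions, spill_dir, "build");
  auto probe_files = SpillPartitions(probe, num_partitions, spill_dir, "probe");

  std::vector<JoinedRow> out;
  for (std::size_t p = 0; p < num_partitions; ++p) {
    // Load this partition's build rows into an in-memory hash table.
    std::unordered_multimap<std::int64_t, std::int64_t> table;
    Row row{};
    std::ifstream build_in(build_files[p], std::ios::binary);
    while (build_in.read(reinterpret_cast<char*>(&row), sizeof(Row))) {
      table.emplace(row.key, row.value);
    }
    // Stream the matching probe partition and emit every key match.
    std::ifstream probe_in(probe_files[p], std::ios::binary);
    while (probe_in.read(reinterpret_cast<char*>(&row), sizeof(Row))) {
      auto range = table.equal_range(row.key);
      for (auto it = range.first; it != range.second; ++it) {
        out.push_back({row.key, it->second, row.value});
      }
    }
    // Close before deleting so the spill files can be removed on any platform.
    build_in.close();
    probe_in.close();
    fs::remove(build_files[p]);
    fs::remove(probe_files[p]);
  }
  return out;
}
```

In this sketch, peak memory for the hash table is bounded by the largest build-side partition rather than by the whole build side, so a larger `num_partitions` trades more spill files for smaller tables. A real implementation would also need to handle skewed keys (a single partition that still does not fit) and would likely keep some partitions in memory when space allows.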

Reporter: Weston Pace / @westonpace

Related issues:

PRs and other links:

Note: This issue was originally created as ARROW-16389. Please see the migration documentation for further details.
