[C++] Support hash-join on larger than memory datasets

The current implementation of the hash-join node queues the hashtable, the entire build side input, and the entire probe side input (i.e. the entire dataset) in memory. This means it will run out of memory and crash if the input dataset is larger than the memory on the system.

By spilling to disk when memory starts to fill up we can allow the hash-join node to process datasets larger than the available memory on the machine.
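To make the idea concrete, here is a minimal sketch of one common way to spill a hash join: partition both inputs by key hash into files on disk, then join one partition at a time so that only a single partition's hashtable has to fit in memory. This is an illustration only, not the Arrow hash-join node or its API. The `Row` and `Joined` types and the `SpillPartitions`, `ReadPartition`, and `SpillingHashJoin` functions are hypothetical names, the inputs are plain vectors rather than streamed batches, and every row is spilled unconditionally instead of only when memory starts to fill up.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

namespace fs = std::filesystem;

// Hypothetical row types for illustration; these are not Arrow types.
struct Row { int64_t key; int64_t payload; };
struct Joined { int64_t key; int64_t build_payload; int64_t probe_payload; };

// Append each row to the spill file of the partition its key hashes to.
inline void SpillPartitions(const std::vector<Row>& rows, const fs::path& dir,
                            const std::string& side, size_t num_partitions) {
  std::vector<std::ofstream> files;
  for (size_t p = 0; p < num_partitions; ++p) {
    files.emplace_back(dir / (side + "_" + std::to_string(p) + ".bin"),
                       std::ios::binary);
  }
  for (const Row& r : rows) {
    size_t p = std::hash<int64_t>{}(r.key) % num_partitions;
    files[p].write(reinterpret_cast<const char*>(&r), sizeof(Row));
  }
}  // The files are flushed and closed when the vector is destroyed.

// Load one spilled partition back into memory.
inline std::vector<Row> ReadPartition(const fs::path& file) {
  std::vector<Row> rows;
  std::ifstream in(file, std::ios::binary);
  Row r;
  while (in.read(reinterpret_cast<char*>(&r), sizeof(Row))) rows.push_back(r);
  return rows;
}

// Inner join on `key`. Only one build partition's hashtable is resident at a time.
inline std::vector<Joined> SpillingHashJoin(const std::vector<Row>& build,
                                            const std::vector<Row>& probe,
                                            size_t num_partitions,
                                            const fs::path& dir) {
  assert(num_partitions > 0);
  fs::create_directories(dir);
  SpillPartitions(build, dir, "build", num_partitions);
  SpillPartitions(probe, dir, "probe", num_partitions);

  std::vector<Joined> out;
  for (size_t p = 0; p < num_partitions; ++p) {
    const fs::path build_file = dir / ("build_" + std::to_string(p) + ".bin");
    const fs::path probe_file = dir / ("probe_" + std::to_string(p) + ".bin");
    {
      // Build the hashtable for this partition only.
      std::unordered_multimap<int64_t, int64_t> table;
      for (const Row& r : ReadPartition(build_file)) table.emplace(r.key, r.payload);

      // Stream the matching probe partition from disk, one row at a time.
      std::ifstream in(probe_file, std::ios::binary);
      Row r;
      while (in.read(reinterpret_cast<char*>(&r), sizeof(Row))) {
        auto range = table.equal_range(r.key);
        for (auto it = range.first; it != range.second; ++it) {
          out.push_back({r.key, it->second, r.payload});
        }
      }
    }
    // Delete the spill files once the partition has been joined.
    fs::remove(build_file);
    fs::remove(probe_file);
  }
  return out;
}
```

Because matching keys always hash to the same partition, joining the partitions independently produces the same rows as a single in-memory join, although the output order may differ. A real implementation would also need to handle a partition that is itself too large for memory, for example by partitioning it again.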

**Reporter**: [Weston Pace](https://issues.apache.org/jira/browse/ARROW-16389) / @westonpace
#### Related issues:
- [[C++] Naive spillover implementation for join](https://github.com/apache/arrow/issues/29750) (supersedes)
#### PRs and other links:
- [GitHub Pull Request #13669](https://github.com/apache/arrow/pull/13669)

<sub>**Note**: *This issue was originally created as [ARROW-16389](https://issues.apache.org/jira/browse/ARROW-16389). Please see the [migration documentation](https://github.com/apache/arrow/issues/14542) for further details.*</sub>

[C++] Support hash-join on larger than memory datasets #31769
