Add a RecordBatch::split to split large batches into a set of smaller batches #343

@alamb

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Sometimes it is advantageous to split one large RecordBatch into smaller batches for processing (for example, so that the smaller RecordBatches can be processed in parallel).

So instead of 1 RecordBatch with 1M rows, we could have 100 RecordBatches with 10,000 rows each that could be processed in parallel.

@tustvold implemented such a function in https://github.com/apache/arrow-datafusion/pull/379/files

    fn split_batch(sorted: &RecordBatch, batch_size: usize) -> Vec<RecordBatch> {

Describe the solution you'd like
Port the split_batch function into RecordBatch::split(batch_size) or something similar, and add appropriate tests.
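
To make the intended behavior concrete, here is a minimal sketch of the row-range arithmetic such a split would need. It is not the code from the linked PR. `split_ranges` is a hypothetical helper name, and it uses only the standard library so it can be checked in isolation. Each `(offset, length)` pair it returns would then be handed to whatever slicing call the arrow crate offers (for example a `slice(offset, length)` method on the batch or on each column array, assuming the arrow version in use provides one).

```rust
/// Compute the `(offset, length)` row ranges that cover `num_rows` rows
/// in chunks of at most `batch_size` rows.
///
/// Hypothetical helper for illustration; not part of the arrow crate.
fn split_ranges(num_rows: usize, batch_size: usize) -> Vec<(usize, usize)> {
    // A zero batch size would never make progress, so reject it up front.
    assert!(batch_size > 0, "batch_size must be non-zero");

    // Upper bound on the number of chunks; avoids reallocation while pushing.
    let mut ranges = Vec::with_capacity(num_rows / batch_size + 1);

    let mut offset = 0;
    while offset < num_rows {
        // The final chunk may be shorter than `batch_size`.
        let length = batch_size.min(num_rows - offset);
        ranges.push((offset, length));
        offset += length;
    }
    ranges
}
```

With the numbers from the description, `split_ranges(1_000_000, 10_000)` yields 100 ranges of 10,000 rows each. In this sketch an empty input produces an empty list rather than a single empty range; whether `RecordBatch::split` should instead return one empty batch in that case is a design question the tests should pin down.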

cc @jorgecarleitao @nevi-me

Labels: enhancement, good first issue
