Add Schema::project and RecordBatch project function to project / select a subset of columns #1014

@alamb

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
It is common to "project" (pick a subset of) the columns of a Schema, and then of the corresponding RecordBatch, for processing.

https://github.com/apache/arrow-datafusion/blob/299ab7d1c37c707fcd500d3428abbdbe4dc5399b/datafusion/src/datasource/empty.rs#L65-L71

https://github.com/apache/arrow-datafusion/blob/0facd4d483e8c289ee4e3a89487d0cd1ede1a110/datafusion/src/physical_plan/file_format/mod.rs#L83-L93

There are many instances of projection, for example:

            // apply projection
            match &self.projection {
                Some(columns) => Some(RecordBatch::try_new(
                    self.schema.clone(),
                    columns.iter().map(|i| batch.column(*i).clone()).collect(),
                )),
                None => Some(Ok(batch.clone())),
            }

Many (most) instances of projection don't handle metadata, leading to bugs like apache/datafusion#1361
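
To make the metadata problem concrete, here is a minimal sketch using a hypothetical stand-in type (`MiniSchema` is invented for illustration and is not the arrow-rs `Schema`). It contrasts a projection that rebuilds the schema from the selected fields alone with one that also carries the schema-level metadata over:

```rust
use std::collections::HashMap;

// Hypothetical stand-in for an Arrow schema: field names plus schema-level metadata.
#[derive(Debug, Clone, PartialEq)]
struct MiniSchema {
    fields: Vec<String>,
    metadata: HashMap<String, String>,
}

// Naive projection: rebuilds the schema from the selected fields only,
// so the schema-level metadata is silently lost.
// Panics on an out-of-range index (kept simple on purpose).
fn project_naive(schema: &MiniSchema, indices: &[usize]) -> MiniSchema {
    MiniSchema {
        fields: indices.iter().map(|&i| schema.fields[i].clone()).collect(),
        metadata: HashMap::new(),
    }
}

// Metadata-aware projection: same field selection, but the metadata is carried over.
fn project_keep_metadata(schema: &MiniSchema, indices: &[usize]) -> MiniSchema {
    MiniSchema {
        fields: indices.iter().map(|&i| schema.fields[i].clone()).collect(),
        metadata: schema.metadata.clone(),
    }
}
```

Both functions select the same fields; the only difference is whether the `metadata` map survives, which is easy to miss when projection is re-implemented at each call site.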

Describe the solution you'd like
Add projection functions to Schema and RecordBatch structs in the arrow-rs crate that properly handle metadata.

Proposed signatures:

/// Returns a new schema consisting of only the specified columns
///
/// So if a schema had Fields A, B and C, schema.project([2,1]) would return a new
/// schema with Fields C and B (indices are zero-based)
///
/// TODO example
fn Schema::project(&self, indices: impl IntoIterator<Item=usize>) -> Result<Schema> {
...
}
/// Returns a new RecordBatch consisting of only the specified columns
///
/// So if a RecordBatch had Columns A, B and C, batch.project([2,1]) would return a new
/// RecordBatch with Columns C and B (indices are zero-based)
///
/// TODO example
fn RecordBatch::project(&self, indices: impl IntoIterator<Item=usize>) -> Result<RecordBatch> {
...
}
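
As a rough sketch of how the two proposed functions could behave, the following uses simplified, hypothetical stand-in types (`SketchSchema` and `SketchBatch` are invented here and are not the arrow-rs types; columns are plain `Vec<i64>` and errors are plain `String`s). It assumes zero-based indices, returns an error for an out-of-range index, and keeps the schema metadata:

```rust
use std::collections::HashMap;

/// Hypothetical, simplified stand-in for a schema (not the arrow-rs API).
#[derive(Debug, Clone, PartialEq)]
struct SketchSchema {
    fields: Vec<String>,
    metadata: HashMap<String, String>,
}

/// Hypothetical stand-in for a record batch; real batches hold Arrow arrays.
#[derive(Debug, Clone, PartialEq)]
struct SketchBatch {
    schema: SketchSchema,
    columns: Vec<Vec<i64>>,
}

impl SketchSchema {
    /// Returns a new schema with only the fields at `indices`, in the given
    /// order, keeping the schema-level metadata.
    fn project(&self, indices: impl IntoIterator<Item = usize>) -> Result<SketchSchema, String> {
        let fields = indices
            .into_iter()
            .map(|i| {
                self.fields.get(i).cloned().ok_or_else(|| {
                    format!("index {} out of bounds for {} fields", i, self.fields.len())
                })
            })
            .collect::<Result<Vec<_>, _>>()?;
        Ok(SketchSchema { fields, metadata: self.metadata.clone() })
    }
}

impl SketchBatch {
    /// Returns a new batch with only the columns at `indices`, in the given
    /// order, together with the matching projected schema.
    fn project(&self, indices: impl IntoIterator<Item = usize>) -> Result<SketchBatch, String> {
        // Collect once so the same indices drive both the schema and the columns.
        let indices: Vec<usize> = indices.into_iter().collect();
        let schema = self.schema.project(indices.iter().copied())?;
        let columns = indices
            .iter()
            .map(|&i| {
                self.columns.get(i).cloned().ok_or_else(|| {
                    format!("index {} out of bounds for {} columns", i, self.columns.len())
                })
            })
            .collect::<Result<Vec<_>, _>>()?;
        Ok(SketchBatch { schema, columns })
    }
}
```

With a batch holding columns A, B and C, calling `project(vec![2, 1])` on this sketch yields columns C and B in that order, with the metadata of the original schema preserved. Delegating from the batch to the schema keeps the metadata handling in one place.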

Describe alternatives you've considered

Additional context
@hntd187 added this feature in DataFusion in apache/datafusion#1378

Labels: enhancement, good first issue
