Consolidate Projection for Schema and RecordBatch #1425

@alamb

Description

Background

@hntd187 fixed #1361 via #1378, but while reviewing that code I found several other places that project `RecordBatch`es and `Schema`s and may have the same subtle issue of losing metadata. I am not aware of any concrete bugs caused by this yet, but I fear they are lurking.

The basic idea is to add functions like the following, which handle metadata correctly following the pattern in #1361:

```rust
fn project_schema(schema: &Schema, projection: &[usize]) -> Result<Schema> {
    ...
}

fn project_batch(batch: &RecordBatch, projection: &[usize]) -> Result<RecordBatch> {
    ...
}
```
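A minimal sketch of what these helpers might look like, assuming DataFusion's `Result`/`DataFusionError` types and arrow's `Schema::new_with_metadata` constructor (the import paths here are illustrative, not the final location):

```rust
use std::sync::Arc;

use arrow::datatypes::Schema;
use arrow::record_batch::RecordBatch;
// Assumed import paths for illustration only:
use datafusion::error::{DataFusionError, Result};

fn project_schema(schema: &Schema, projection: &[usize]) -> Result<Schema> {
    let fields = projection
        .iter()
        .map(|i| {
            if *i < schema.fields().len() {
                Ok(schema.field(*i).clone())
            } else {
                Err(DataFusionError::Internal(
                    "Projection index out of range".to_string(),
                ))
            }
        })
        .collect::<Result<Vec<_>>>()?;
    // Unlike Schema::new, Schema::new_with_metadata carries the
    // schema-level metadata map through the projection.
    Ok(Schema::new_with_metadata(fields, schema.metadata().clone()))
}

fn project_batch(batch: &RecordBatch, projection: &[usize]) -> Result<RecordBatch> {
    // Project the schema first so the batch keeps its metadata too.
    let projected_schema = Arc::new(project_schema(batch.schema().as_ref(), projection)?);
    let columns = projection.iter().map(|i| batch.column(*i).clone()).collect();
    RecordBatch::try_new(projected_schema, columns).map_err(DataFusionError::from)
}
```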

These helpers would then replace duplicated code like

```rust
let projected_schema = match &projection {
    Some(columns) => {
        let fields: Result<Vec<Field>> = columns
            .iter()
            .map(|i| {
                if *i < schema.fields().len() {
                    Ok(schema.field(*i).clone())
                } else {
                    Err(DataFusionError::Internal(
                        "Projection index out of range".to_string(),
                    ))
                }
            })
            .collect();
        Arc::new(Schema::new(fields?))
    }
    None => Arc::clone(&schema),
};
```

And

```rust
Some(columns) => Some(RecordBatch::try_new(
    self.schema.clone(),
    columns.iter().map(|i| batch.column(*i).clone()).collect(),
)),
```

all over the DataFusion codebase.
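With such helpers in place, each call site could collapse to something like this (a sketch, assuming the same `Option<Vec<usize>>`-style projection used in the snippets above):

```rust
// Schema projection at a call site: delegate to the shared helper so
// metadata handling lives in exactly one place.
let projected_schema = match &projection {
    Some(columns) => Arc::new(project_schema(&schema, columns)?),
    None => Arc::clone(&schema),
};

// RecordBatch projection, likewise.
let projected_batch = match &projection {
    Some(columns) => project_batch(&batch, columns)?,
    None => batch,
};
```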

Additional context
Here is a corresponding ticket for putting the logic into arrow-rs: apache/arrow-rs#1014
