Avoid per-batch field lookups in SchemaMapping #6563
    let rows_num = batch.num_rows();
    let mapped_batch = mapping.map_batch(batch).unwrap();
    let projected = batch.project(&projection).unwrap();
This highlights the major change: the schema adaptor assumes that the projection it outputs has already been applied to the file_schema batches.
    /// to the table schema where possible.
    ///
    /// Returns a [`SchemaMapping`] that can be applied to the output batch
    /// along with an ordered list of columns to project from the file
The ordering is important, as parquet::ProjectionMask is not order-preserving.
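To illustrate why an ordered list is needed, here is a minimal sketch using only the standard library. It does not use the real parquet API: `mask_from_indices` and `apply_mask` are hypothetical stand-ins that model a mask as one boolean per file column, which is assumed to be enough to show the effect described above.

```rust
/// Hypothetical stand-in for an order-insensitive mask: one flag per file
/// column. The order in which `indices` were requested is lost here.
fn mask_from_indices(num_columns: usize, indices: &[usize]) -> Vec<bool> {
    let mut mask = vec![false; num_columns];
    for &i in indices {
        mask[i] = true;
    }
    mask
}

/// Selects the flagged columns. The output always follows file order,
/// regardless of the order in which the indices were requested.
fn apply_mask<'a>(columns: &[&'a str], mask: &[bool]) -> Vec<&'a str> {
    columns
        .iter()
        .zip(mask)
        .filter(|(_, keep)| **keep)
        .map(|(c, _)| *c)
        .collect()
}
```

With file columns `a, b, c`, requesting indices `[2, 0]` and `[0, 2]` produces the same mask, so both select `a, c`. Anything that consumes the selected columns therefore has to index them by their position in the ordered projection, not by the order of the original request.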
alamb left a comment
Thank you @tustvold -- this looks like a nice cleanup to me
    .zip(&self.field_mappings)
    .map(|(field, file_idx)| match file_idx {
        Some(batch_idx) => cast(&batch_cols[*batch_idx], field.data_type()),
        None => Ok(new_null_array(field.data_type(), batch_rows)),
👍
Which issue does this PR close?
Closes #.
Rationale for this change
Follow-up to #6458. This reworks the mapping logic so that column lookups are no longer needed for each batch.
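The idea can be sketched roughly as follows, using plain vectors instead of Arrow types. This is a simplified illustration, not the code in this PR: `SimpleMapping`, `build_mapping`, and `project` are hypothetical names, columns are modelled as `Vec<i64>`, and type casting is left out. The name lookup happens once when the mapping is built; mapping a batch afterwards only indexes into the already projected columns.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified mapping: for each table column, the index of the
/// matching column in the *projected* file batch, or `None` if the file
/// does not contain that column.
struct SimpleMapping {
    field_mappings: Vec<Option<usize>>,
}

/// Builds the mapping once per file. Also returns the ordered list of file
/// column indices that must be projected before `map_batch` is called.
fn build_mapping(table_columns: &[&str], file_columns: &[&str]) -> (SimpleMapping, Vec<usize>) {
    let table_index: HashMap<&str, usize> = table_columns
        .iter()
        .enumerate()
        .map(|(i, name)| (*name, i))
        .collect();

    let mut field_mappings = vec![None; table_columns.len()];
    let mut projection = Vec::new();
    for (file_idx, name) in file_columns.iter().enumerate() {
        if let Some(&table_idx) = table_index.get(name) {
            // Position of this column within the projected batch.
            field_mappings[table_idx] = Some(projection.len());
            projection.push(file_idx);
        }
    }
    (SimpleMapping { field_mappings }, projection)
}

/// Applies the ordered projection to a batch read with the file schema.
fn project(file_batch: &[Vec<i64>], projection: &[usize]) -> Vec<Vec<i64>> {
    projection.iter().map(|&i| file_batch[i].clone()).collect()
}

impl SimpleMapping {
    /// Maps an already projected batch to the table layout. No name lookups
    /// happen here; missing columns are filled with nulls (`None`).
    fn map_batch(&self, projected: &[Vec<i64>], num_rows: usize) -> Vec<Vec<Option<i64>>> {
        self.field_mappings
            .iter()
            .map(|mapping| match mapping {
                Some(batch_idx) => projected[*batch_idx]
                    .iter()
                    .map(|v| Some(*v))
                    .collect::<Vec<Option<i64>>>(),
                None => vec![None; num_rows],
            })
            .collect()
    }
}
```

For a table with columns `c, a, x` and a file with columns `a, b, c`, `build_mapping` yields the projection `[0, 2]` and the mappings `[Some(1), Some(0), None]`. Passing an unprojected batch to `map_batch` would read the wrong columns, which is why the projection has to be applied first.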
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?