Skip to content

Conversation

@devinjdangelo
Copy link
Contributor

Which issue does this PR close?

Enables apache/datafusion#7632
Related to #1718

Rationale for this change

While working on improving DataFusions async parallel parquet writer (draft pr: apache/datafusion#7632), some changes were needed at the arrow level. The DataFusion parallel writer writes separate partitions to different RowGroups in parallel and later concatenates them into one file. Using ArrowRowGroupWriter directly is more efficient for this use case than ArrowWriter and allows retaining column indexes/bloom filters.

What changes are included in this PR?

  • ArrowRowGroupWriter and related structs/methods are marked public
  • OnCloseRowGroup closure is now required to be Send

Are there any user-facing changes?

ArrowRowGroupWriter can now be used directly by downstream arrow users.

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me


/// A [`Read`] for an iterator of [`Bytes`]
struct ChainReader(Peekable<IntoIter<Bytes>>);
pub struct ChainReader(Peekable<IntoIter<Bytes>>);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we probably want to rename this to ArrowColumnChunkReader or something, as it will now be public

@tustvold tustvold added the api-change Changes to the arrow API label Sep 23, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

api-change Changes to the arrow API parquet Changes to the parquet crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants