-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Support multi-threaded writing of Parquet files with modular encryption #8029
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Co-authored-by: Adam Reeve <adreeve@gmail.com>
79bbf7c to
49c5d5b
Compare
49c5d5b to
3d93a49
Compare
Co-authored-by: Adam Reeve <adreeve@gmail.com>
Co-authored-by: Adam Reeve <adreeve@gmail.com>
Co-authored-by: Adam Reeve <adreeve@gmail.com>
adamreeve
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
75c47d1 to
d959021
Compare
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @rok and @adamreeve -- I think this PR and API looks very nice to me
cc @etseidl @mbutrovich as you may have an interest here
| self.finish() | ||
| } | ||
|
|
||
| /// Create a new row group writer and return its column writers. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this API 👌
|
See apache/datafusion#16738 (comment), I'm worried this API is a bit too limiting for what DataFusion needs. Maybe we should have held off merging this until we were sure it worked with DataFusion, but this didn't go into the 56.0.0 release so I guess we've got a bit of time to change things if needed? |
It hasn't been released yet I don't think. It is scheduled for sometime later this month. Is there a ticket or something we can file to track the resolution of whatever you have found downstream? If so we can add it to the release ticket and I'll make sure to follow up before release |
…uet (#8162) - Closes #8115. - Closes #8260 - Closes #8259 # Rationale for this change #8029 introduced `pub ArrowWriter.get_column_writers` and `pub ArrowWriter.append_row_group` to enable multi-threaded parquet encrypted writing. However testing downstream showed the API is not feasible, see #8115. # What changes are included in this PR? This introduces `pub ArrowWriter.into_serialized_writer` and deprecates `pub ArrowWriter.get_column_writers` and `pub ArrowWriter.append_row_group`. It also makes `ArrowRowGroupWriterFactory` public and adds a `pub ArrowRowGroupWriterFactory.create_column_writers`. # Are these changes tested? This includes a DataFusion inspired test for concurrent writing across columns and row groups to make sure parallel writing is and remains possible with `ArrowWriter`s API. Further we created a draft PR in DataFusion apache/datafusion#16738 to test for multithreaded writing support. # Are there any user-facing changes? See description of changes. --------- Co-authored-by: Adam Reeve <adreeve@gmail.com> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Rationale for this change
This is to enable concurrent column writing with encryption downstream (e.g. with datafusion). See #7359 for more.
See https://github.com/apache/arrow-rs/pull/7111/files#r2015196618
What changes are included in this PR?
ArrowWriternow has apub get_column_writersmethod that can be used to write columns concurrently.Are these changes tested?
Yes.
Are there any user-facing changes?
pub ArrowWriter.get_column_writersandpub ArrowWriter.append_row_groupare added. Both to enable concurrent use of column writers.WriterPropertiesBuildernow implementsDefault.