Skip to content

Conversation

@rok
Copy link
Member

@rok rok commented Jul 30, 2025

Rationale for this change

This is to enable concurrent column writing with encryption downstream (e.g. with datafusion). See #7359 for more.
See https://github.com/apache/arrow-rs/pull/7111/files#r2015196618

What changes are included in this PR?

  • ArrowWriter now has a pub get_column_writers method that can be used to write columns concurrently.
  • Minor change to how encryption tests read test data.

Are these changes tested?

Yes.

Are there any user-facing changes?

pub ArrowWriter.get_column_writers and pub ArrowWriter.append_row_group are added. Both to enable concurrent use of column writers. WriterPropertiesBuilder now implements Default.

@rok rok marked this pull request as ready for review July 30, 2025 17:41
@rok rok force-pushed the multi-threaded_encrypted_writing_2 branch from 79bbf7c to 49c5d5b Compare July 31, 2025 12:25
@rok rok force-pushed the multi-threaded_encrypted_writing_2 branch from 49c5d5b to 3d93a49 Compare July 31, 2025 12:28
@rok rok requested review from adamreeve and alamb July 31, 2025 12:30
rok and others added 5 commits August 1, 2025 13:30
Co-authored-by: Adam Reeve <adreeve@gmail.com>
Co-authored-by: Adam Reeve <adreeve@gmail.com>
Co-authored-by: Adam Reeve <adreeve@gmail.com>
@rok rok requested a review from adamreeve August 1, 2025 16:28
Copy link
Contributor

@adamreeve adamreeve left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me thanks @rok. Does this approach look more compatible with your plans for refactoring ArrowRowGroupWriter @alamb?

@rok rok force-pushed the multi-threaded_encrypted_writing_2 branch from 75c47d1 to d959021 Compare August 4, 2025 13:30
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @rok and @adamreeve -- I think this PR and API looks very nice to me

cc @etseidl @mbutrovich as you may have an interest here

self.finish()
}

/// Create a new row group writer and return its column writers.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this API 👌

@alamb alamb merged commit 5036ca8 into apache:main Aug 7, 2025
16 checks passed
@adamreeve
Copy link
Contributor

See apache/datafusion#16738 (comment), I'm worried this API is a bit too limiting for what DataFusion needs.

Maybe we should have held off merging this until we were sure it worked with DataFusion, but this didn't go into the 56.0.0 release so I guess we've got a bit of time to change things if needed?

@alamb
Copy link
Contributor

alamb commented Aug 12, 2025

See apache/datafusion#16738 (comment), I'm worried this API is a bit too limiting for what DataFusion needs.

Maybe we should have held off merging this until we were sure it worked with DataFusion, but this didn't go into the 56.0.0 release so I guess we've got a bit of time to change things if needed?

It hasn't been released yet I don't think. It is scheduled for sometime later this month.

Is there a ticket or something we can file to track the resolution of whatever you have found downstream? If so we can add it to the release ticket and I'll make sure to follow up before release

@rok
Copy link
Member Author

rok commented Aug 12, 2025

@alamb I opened #8115 to track work here.

alamb added a commit that referenced this pull request Sep 19, 2025
…uet (#8162)

- Closes #8115.
- Closes #8260
- Closes #8259

# Rationale for this change

#8029 introduced `pub
ArrowWriter.get_column_writers` and `pub ArrowWriter.append_row_group`
to enable multi-threaded parquet encrypted writing. However testing
downstream showed the API is not feasible, see #8115.

# What changes are included in this PR?

This introduces `pub ArrowWriter.into_serialized_writer` and deprecates
`pub ArrowWriter.get_column_writers` and `pub
ArrowWriter.append_row_group`. It also makes
`ArrowRowGroupWriterFactory` public and adds a `pub
ArrowRowGroupWriterFactory.create_column_writers`.

# Are these changes tested?

This includes a DataFusion inspired test for concurrent writing across
columns and row groups to make sure parallel writing is and remains
possible with `ArrowWriter`s API. Further we created a draft PR in
DataFusion apache/datafusion#16738 to test for
multithreaded writing support.

# Are there any user-facing changes?

See description of changes.

---------

Co-authored-by: Adam Reeve <adreeve@gmail.com>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

parquet Changes to the parquet crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support multi-threaded writing of Parquet files with modular encryption

3 participants