GH-35287: [C++][Parquet] Add CodecOptions to customize the compression parameter#35886
GH-35287: [C++][Parquet] Add CodecOptions to customize the compression parameter#35886pitrou merged 16 commits intoapache:mainfrom
Conversation
|
|
mapleFU
left a comment
There was a problem hiding this comment.
Seems parquet part change all compression level to options. Do we need to keep api compatible?
1e5c232 to
fdfb1e8
Compare
cpp/src/parquet/column_writer.h
Outdated
There was a problem hiding this comment.
This should be put to the end for capability
There was a problem hiding this comment.
Also, unless the PageWriter needs to keep ownership of the codec options, this should simply be const CodecOptions& = {}.
There was a problem hiding this comment.
In addition to the comments above, should be consolidate Compression::type codec and const std::shared_ptr<CodecOptions>& codec_options? codec_options should know codec.
cpp/src/parquet/properties.h
Outdated
There was a problem hiding this comment.
Since Codec contains compression_level_, would this be duplicated or useless?
There was a problem hiding this comment.
Now we have a compression_level and a CodecOptions. Previously, user can set compression_level, and the compression level would be changed.
After this patch, if user change the compression level, the codec will not be changed.
Can we make CodecOptions mutate it's compression_level, and make set_compression_level set the level in CodecOptions directly?
There was a problem hiding this comment.
Can we just get it from codec_options_?
There was a problem hiding this comment.
+1 on this. compression_level_ is duplicated.
cpp/src/arrow/util/compression.h
Outdated
There was a problem hiding this comment.
This is a breaking change. It would be better to add a new overload. Then implement this one based on the new overload and label it as deprecated.
There was a problem hiding this comment.
Yes, that could solve the compatibility issue of different APIs. Just to make it clear, you suggest marking the newly added API with CodecOptions as deprecated right? Or the previous one?
There was a problem hiding this comment.
I mean deprecating the existing one.
There was a problem hiding this comment.
Great, that makes sense :)
cpp/src/parquet/column_writer.cc
Outdated
There was a problem hiding this comment.
I am not sure if we are able to move codec into codec_options as well. I don't have a preference here.
There was a problem hiding this comment.
Those changes shouldn't be necessary? (also, why iostream?)
There was a problem hiding this comment.
Let's make the error message respect the macro values:
| return Status::Invalid("window_bits should be within 10 ~ 24"); | |
| return Status::Invalid("window_bits should be between ", BROTLI_MIN_WINDOW_BITS, | |
| " and ", BROTLI_MAX_WINDOW_BITS); |
There was a problem hiding this comment.
Nit, but can you undo the include ordering change? It's a bit gratuitous. Also, it makes things more readable to separate stdlib imports from third-party library imports.
There was a problem hiding this comment.
Sure, that might be due to some local format auto-change, will revert those.
There was a problem hiding this comment.
This is a bit convoluted
| int CompressionWindowBitsForFormat(GZipFormat::type format, int input_window_bits) { | |
| int window_bits = input_window_bits; | |
| int CompressionWindowBitsForFormat(GZipFormat::type format, int window_bits) { |
cpp/src/parquet/types.h
Outdated
cpp/src/parquet/types.h
Outdated
There was a problem hiding this comment.
Again, I'm not sure it's useful to deprecate this one.
r/src/compression.cpp
Outdated
|
@yyang52 There are Python test failures on CI. Also, is it possible to pass a base |
Yes, I think it supports that (as checked in TEST(TestCodecMisc, SpecifyCompressionLevel) to pass base CodecOptions to different Codecs) |
mapleFU
left a comment
There was a problem hiding this comment.
I'm general ok with the parquet side changes now, there are some comments.
cpp/src/arrow/util/compression.h
Outdated
There was a problem hiding this comment.
Can it use BROTLI_DEFAULT_WINDOW?
There was a problem hiding this comment.
yes, but then we need to include the brotli header here as well?
There was a problem hiding this comment.
Yes you're right. I'm not familiar with compression algorithms, so maybe I'm asking a stupid question. but I'm afraid that brorli might change it's default bits.
Would it be ok for a optional window_bits, if it's set, use it, otherwise using default value?
There was a problem hiding this comment.
yes, I think that's just how it is implemented now in GZip/BrotliCodecOptions?
There was a problem hiding this comment.
Sure, I notice that this patch hard code the encoder's internal into compression.h, so I'm afraid if they're changed in the 3rd libraries, we'll get confused here.
There was a problem hiding this comment.
yeah it seems better to use their BROTLI_DEFAULT_WINDOW
There was a problem hiding this comment.
Same with my previous comment. We can use BROTLI_DEFAULT_WINDOW in the .cc file
cpp/src/parquet/column_writer.cc
Outdated
There was a problem hiding this comment.
I think just *codec_options is ok?
cpp/src/parquet/properties.h
Outdated
There was a problem hiding this comment.
Now we have a compression_level and a CodecOptions. Previously, user can set compression_level, and the compression level would be changed.
After this patch, if user change the compression level, the codec will not be changed.
Can we make CodecOptions mutate it's compression_level, and make set_compression_level set the level in CodecOptions directly?
May you help to trigger the remaining CI checks again? Thanks! @pitrou |
cpp/src/parquet/column_writer.h
Outdated
| OffsetIndexBuilder* offset_index_builder = NULLPTR, | ||
| const CodecOptions& codec_options = CodecOptions{}); | ||
|
|
||
| ARROW_DEPRECATED("Use the one with CodecOptions instead") |
There was a problem hiding this comment.
| ARROW_DEPRECATED("Use the one with CodecOptions instead") | |
| ARROW_DEPRECATED("Deprecated in 13.0.0. Use CodecOptions-taking overload instead.") |
cpp/src/arrow/util/compression.h
Outdated
| } | ||
| virtual ~CodecOptions() = default; | ||
|
|
||
| int compression_level_ = kUseDefaultCompressionLevel; |
There was a problem hiding this comment.
Forgot this, but style nit (this is a public member, it's weird to add a trailing underscore):
| int compression_level_ = kUseDefaultCompressionLevel; | |
| int compression_level = kUseDefaultCompressionLevel; |
|
@github-actions crossbow submit -g cpp |
|
Revision: 73b4018 Submitted crossbow builds: ursacomputing/crossbow @ actions-e0a8b0150b |
|
Thanks a lot for your contribution @yyang52 ! I hope you didn't mind the delays. |
|
After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit a9035cb. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about possible false positives for unstable benchmarks that are known to sometimes produce them. |
| if (!codec_options) { | ||
| pager = PageWriter::Open(sink_, properties_->compression(path), col_meta, | ||
| row_group_ordinal_, static_cast<int16_t>(column_ordinal), | ||
| properties_->memory_pool(), false, meta_encryptor, |
There was a problem hiding this comment.
This part lost buffered_row_group_, and set it to false. Sorry that I didn't found it at first time
There was a problem hiding this comment.
Sorry for this missing and thanks for the fix!
… Parquet column (#38025) ### Rationale for this change After the changes in #35886, getting the compression level for a Parquet column segfaults if the compression level or other options weren't previously set ### What changes are included in this PR? Adds a null check on the codec options of the column properties before trying to access the compression level. ### Are these changes tested? Yes, I added a unit test. ### Are there any user-facing changes? This fixes a regression added after 13.0.0 so isn't a user-facing fix * Closes: #38039 Authored-by: Adam Reeve <adreeve@gmail.com> Signed-off-by: Antoine Pitrou <antoine@python.org>
… for a Parquet column (apache#38025) ### Rationale for this change After the changes in apache#35886, getting the compression level for a Parquet column segfaults if the compression level or other options weren't previously set ### What changes are included in this PR? Adds a null check on the codec options of the column properties before trying to access the compression level. ### Are these changes tested? Yes, I added a unit test. ### Are there any user-facing changes? This fixes a regression added after 13.0.0 so isn't a user-facing fix * Closes: apache#38039 Authored-by: Adam Reeve <adreeve@gmail.com> Signed-off-by: Antoine Pitrou <antoine@python.org>
… for a Parquet column (apache#38025) ### Rationale for this change After the changes in apache#35886, getting the compression level for a Parquet column segfaults if the compression level or other options weren't previously set ### What changes are included in this PR? Adds a null check on the codec options of the column properties before trying to access the compression level. ### Are these changes tested? Yes, I added a unit test. ### Are there any user-facing changes? This fixes a regression added after 13.0.0 so isn't a user-facing fix * Closes: apache#38039 Authored-by: Adam Reeve <adreeve@gmail.com> Signed-off-by: Antoine Pitrou <antoine@python.org>
Rationale for this change
Based on #35287, we'd like to add a CodecOptions to make more compression parameters (such as window_bits) customizable when creating the Codec for parquet writer.
Authored-by: Yang Yang yang10.yang@intel.com
Co-authored-by: Rambacher, Mark mark.rambacher@intel.com
What changes are included in this PR?
Add CodecOptions and replace
compression_levelwhen creating the Codec. The design is basically based on previous discussions.Are these changes tested?
Yes
Are there any user-facing changes?
Yes, when user creates the
WriterProperties