GH-43790: [Go][Parquet] Add support for LZ4_RAW compression codec#43835
GH-43790: [Go][Parquet] Add support for LZ4_RAW compression codec#43835joellubi merged 3 commits intoapache:mainfrom
Conversation
|
|
zeroshade
left a comment
There was a problem hiding this comment.
LGTM
How do the benchmarks look?
mapleFU
left a comment
There was a problem hiding this comment.
+1 for me but I'm not familiar with go impl
| // see: http://mail-archives.apache.org/mod_mbox/arrow-dev/202007.mbox/%3CCAAri41v24xuA8MGHLDvgSnE+7AAgOhiEukemW_oPNHMvfMmrWw@mail.gmail.com%3E | ||
| Lz4 Compression | ||
| Zstd Compression | ||
| Lz4 Compression |
There was a problem hiding this comment.
Nit: maybe we can rename this to LZ4_HADOOP to prevent from misunderstanding
There was a problem hiding this comment.
The names here match the parquet spec names exactly, but I agree it would be helpful to limit confusion about this. I just pushed a change that returns a descriptive error when one of the unsupported codecs is used. Let me know what you think.
There was a problem hiding this comment.
Both is ok for me, we can draft a separate pr for that?
There was a problem hiding this comment.
Ok sure. I just reverted that commit, will add it back in a follow-up PR.
LZ4_RAW seems to be competitive with SNAPPY in terms of throughput, lagging behind it by a relatively small margin. Some research suggests that LZ4 should be faster than Snappy in the general case, but a number of benchmarks on columnar data specifically tend to still skew towards Snappy for speed. So I'd say these results are consistent with expectations. |
This reverts commit 105166c.
|
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 6502f0e. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. |
…ec (apache#43835) ### Rationale for this change Fixes: apache#43790 The LZ4 compression codec for Parquet is no longer ambiguous, as it has been superceded by the [LZ4_RAW](https://github.com/apache/parquet-format/blob/master/Compression.md#lz4_raw) spec. ### What changes are included in this PR? - Add `LZ4Raw` compression codec - Split out `StreamingCodec` methods from core `Codec` interface - Various conformance/roundtrip tests - Set of benchmarks for reading/writing an Arrow table to/from Parquet, using each compression codec ### Are these changes tested? Yes ### Are there any user-facing changes? - New codec `LZ4Raw` is available - `Codec` interface no long provides the following methods, which are now part of `StreamingCodec`: - `NewReader` - `NewWriter` - `NewWriterLevel` * GitHub Issue: apache#43790 Authored-by: Joel Lubinitsky <joellubi@gmail.com> Signed-off-by: Joel Lubinitsky <joellubi@gmail.com>
Rationale for this change
Fixes: #43790
The LZ4 compression codec for Parquet is no longer ambiguous, as it has been superceded by the LZ4_RAW spec.
What changes are included in this PR?
LZ4Rawcompression codecStreamingCodecmethods from coreCodecinterfaceAre these changes tested?
Yes
Are there any user-facing changes?
LZ4Rawis availableCodecinterface no long provides the following methods, which are now part ofStreamingCodec:NewReaderNewWriterNewWriterLevel