Skip to content

GH-43790: [Go][Parquet] Add support for LZ4_RAW compression codec#43835

Merged
joellubi merged 3 commits intoapache:mainfrom
joellubi:gh-43790
Aug 27, 2024
Merged

GH-43790: [Go][Parquet] Add support for LZ4_RAW compression codec#43835
joellubi merged 3 commits intoapache:mainfrom
joellubi:gh-43790

Conversation

@joellubi
Copy link
Copy Markdown
Member

@joellubi joellubi commented Aug 26, 2024

Rationale for this change

Fixes: #43790

The LZ4 compression codec for Parquet is no longer ambiguous, as it has been superceded by the LZ4_RAW spec.

What changes are included in this PR?

  • Add LZ4Raw compression codec
  • Split out StreamingCodec methods from core Codec interface
  • Various conformance/roundtrip tests
  • Set of benchmarks for reading/writing an Arrow table to/from Parquet, using each compression codec

Are these changes tested?

Yes

Are there any user-facing changes?

  • New codec LZ4Raw is available
  • Codec interface no long provides the following methods, which are now part of StreamingCodec:
    • NewReader
    • NewWriter
    • NewWriterLevel

@github-actions
Copy link
Copy Markdown

⚠️ GitHub issue #43790 has been automatically assigned in GitHub to PR creator.

@joellubi joellubi marked this pull request as ready for review August 26, 2024 22:04
@joellubi joellubi requested a review from zeroshade as a code owner August 26, 2024 22:04
Copy link
Copy Markdown
Member

@zeroshade zeroshade left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

How do the benchmarks look?

@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting committer review Awaiting committer review labels Aug 27, 2024
Copy link
Copy Markdown
Member

@mapleFU mapleFU left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for me but I'm not familiar with go impl

// see: http://mail-archives.apache.org/mod_mbox/arrow-dev/202007.mbox/%3CCAAri41v24xuA8MGHLDvgSnE+7AAgOhiEukemW_oPNHMvfMmrWw@mail.gmail.com%3E
Lz4 Compression
Zstd Compression
Lz4 Compression
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: maybe we can rename this to LZ4_HADOOP to prevent from misunderstanding

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The names here match the parquet spec names exactly, but I agree it would be helpful to limit confusion about this. I just pushed a change that returns a descriptive error when one of the unsupported codecs is used. Let me know what you think.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both is ok for me, we can draft a separate pr for that?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok sure. I just reverted that commit, will add it back in a follow-up PR.

@joellubi
Copy link
Copy Markdown
Member Author

LGTM

How do the benchmarks look?

goos: darwin
goarch: arm64
pkg: github.com/apache/arrow/go/v18/parquet/pqarrow
BenchmarkWriteTableCompressed/codec=UNCOMPRESSED-14                   13         101496837 ns/op         546.93 MB/s    222180261 B/op   1297249 allocs/op
BenchmarkWriteTableCompressed/codec=SNAPPY-14                         10         105285342 ns/op         527.25 MB/s    190905955 B/op   1297284 allocs/op
BenchmarkWriteTableCompressed/codec=GZIP-14                            7         163399286 ns/op         339.73 MB/s    196390286 B/op   1297654 allocs/op
BenchmarkWriteTableCompressed/codec=BROTLI-14                          2         503086730 ns/op         110.34 MB/s    698402444 B/op   1298624 allocs/op
BenchmarkWriteTableCompressed/codec=ZSTD-14                            8         139438708 ns/op         398.11 MB/s    1014671664 B/op  1298644 allocs/op
BenchmarkWriteTableCompressed/codec=LZ4_RAW-14                         9         115800444 ns/op         479.38 MB/s    167152016 B/op   1297259 allocs/op
BenchmarkReadTableCompressed/codec=UNCOMPRESSED-14                    42          34890312 ns/op         561.85 MB/s    235618446 B/op      2431 allocs/op
BenchmarkReadTableCompressed/codec=SNAPPY-14                          38          33923055 ns/op         272.67 MB/s    210971180 B/op      2428 allocs/op
BenchmarkReadTableCompressed/codec=GZIP-14                            22          55319667 ns/op          80.78 MB/s    202261534 B/op      2721 allocs/op
BenchmarkReadTableCompressed/codec=BROTLI-14                          16          71177099 ns/op          28.35 MB/s    233693905 B/op      2779 allocs/op
BenchmarkReadTableCompressed/codec=ZSTD-14                            21          56364889 ns/op          37.02 MB/s    196485768 B/op      2444 allocs/op
BenchmarkReadTableCompressed/codec=LZ4_RAW-14                         36          33904137 ns/op         267.63 MB/s    210578330 B/op      2427 allocs/op
PASS
ok      github.com/apache/arrow/go/v18/parquet/pqarrow  20.220s

LZ4_RAW seems to be competitive with SNAPPY in terms of throughput, lagging behind it by a relatively small margin. Some research suggests that LZ4 should be faster than Snappy in the general case, but a number of benchmarks on columnar data specifically tend to still skew towards Snappy for speed. So I'd say these results are consistent with expectations.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting merge Awaiting merge labels Aug 27, 2024
@github-actions github-actions bot added awaiting change review Awaiting change review awaiting changes Awaiting changes and removed awaiting changes Awaiting changes awaiting change review Awaiting change review labels Aug 27, 2024
@joellubi joellubi merged commit 6502f0e into apache:main Aug 27, 2024
@joellubi joellubi removed the awaiting changes Awaiting changes label Aug 27, 2024
@conbench-apache-arrow
Copy link
Copy Markdown

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 6502f0e.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

khwilson pushed a commit to khwilson/arrow that referenced this pull request Sep 14, 2024
…ec (apache#43835)

### Rationale for this change

Fixes: apache#43790

The LZ4 compression codec for Parquet is no longer ambiguous, as it has been superceded by the [LZ4_RAW](https://github.com/apache/parquet-format/blob/master/Compression.md#lz4_raw) spec.

### What changes are included in this PR?

- Add `LZ4Raw` compression codec
- Split out `StreamingCodec` methods from core `Codec` interface
- Various conformance/roundtrip tests
- Set of benchmarks for reading/writing an Arrow table to/from Parquet, using each compression codec

### Are these changes tested?

Yes

### Are there any user-facing changes?

- New codec `LZ4Raw` is available
- `Codec` interface no long provides the following methods, which are now part of `StreamingCodec`:
  - `NewReader`
  - `NewWriter`
  - `NewWriterLevel`

* GitHub Issue: apache#43790

Authored-by: Joel Lubinitsky <joellubi@gmail.com>
Signed-off-by: Joel Lubinitsky <joellubi@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Go][Parquet] Add Support for LZ4_RAW Codec

3 participants