
GH-39978: [C++][Parquet] Expand BYTE_STREAM_SPLIT to support FIXED_LEN_BYTE_ARRAY, INT32 and INT64#40094

Merged
pitrou merged 3 commits into apache:main from pitrou:gh39978-byte-stream-split
Mar 19, 2024

Conversation

@pitrou
Member

@pitrou pitrou commented Feb 15, 2024

What changes are included in this PR?

Implement the format addition described in https://issues.apache.org/jira/browse/PARQUET-2414 .

Are these changes tested?

Yes.

Are there any user-facing changes?

Yes (additional types supported for Parquet encoding).
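To illustrate what the format addition enables, here is a minimal sketch (not the Arrow implementation; `Split`/`Unsplit` are hypothetical names): BYTE_STREAM_SPLIT transposes N fixed-width values into `width` separate byte streams, one per byte position, and with this change it applies to INT32, INT64 and FIXED_LEN_BYTE_ARRAY in addition to FLOAT and DOUBLE.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical sketch of the BYTE_STREAM_SPLIT transposition: byte k of
// value i goes to position i of stream k. No compression happens here; the
// layout just makes the data friendlier to a downstream compressor.
std::vector<uint8_t> Split(const uint8_t* raw, int width, int64_t n) {
  std::vector<uint8_t> out(static_cast<size_t>(width) * n);
  for (int64_t i = 0; i < n; ++i) {
    for (int k = 0; k < width; ++k) {
      out[k * n + i] = raw[i * width + k];  // byte k of value i -> stream k
    }
  }
  return out;
}

std::vector<uint8_t> Unsplit(const uint8_t* enc, int width, int64_t n) {
  std::vector<uint8_t> out(static_cast<size_t>(width) * n);
  for (int64_t i = 0; i < n; ++i) {
    for (int k = 0; k < width; ++k) {
      out[i * width + k] = enc[k * n + i];  // inverse transposition
    }
  }
  return out;
}
```

For similar values, each stream then holds many near-identical bytes (e.g. the high bytes of small INT32 values are all zero), which is what makes the encoded column compress better.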

@pitrou
Member Author

pitrou commented Feb 15, 2024

@github-actions crossbow submit -g cpp

@github-actions

This comment was marked as outdated.

@pitrou pitrou marked this pull request as ready for review February 15, 2024 17:45
@pitrou pitrou requested a review from wgtmac as a code owner February 15, 2024 17:45
@pitrou pitrou requested a review from mapleFU February 15, 2024 17:47
@pitrou
Member Author

pitrou commented Feb 15, 2024

The i386 test failure is related.

Member

So only special FLBA widths would reach this case? I'm OK with the code, but I guess this path is rarely exercised.

Member

Same question here: wouldn't it be faster to apply PLAIN encoding for the single-byte FLBA type?

Member Author

PlainEncoder<FLBAType> is implemented similarly to this. The only additional step here is ByteStreamSplitEncode, which is skipped for the single-byte FLBA type.
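A minimal sketch of the point being made (hypothetical function, not Arrow's actual code): with a single byte stream the transposition is the identity, so skipping the encode step for width 1 yields output byte-for-byte identical to PLAIN.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical sketch of ByteStreamSplitEncode: scatter the k-th byte of
// every value into the k-th output stream. For width == 1 there is only one
// stream, so the "transposition" is a straight memcpy, i.e. exactly PLAIN.
void ByteStreamSplitEncodeSketch(const uint8_t* raw, int width, int64_t n,
                                 uint8_t* out) {
  if (width == 1) {
    std::memcpy(out, raw, static_cast<size_t>(n));  // same bytes as PLAIN
    return;
  }
  for (int64_t i = 0; i < n; ++i) {
    for (int k = 0; k < width; ++k) {
      out[k * n + i] = raw[i * width + k];
    }
  }
}
```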

Member

Since it's a base class, isn't it unnecessary to have a default pool here?

Member Author

Not necessarily.

Member

Can we print the size? That could be helpful for debugging.

Member Author

Will do.

Member

Hmm, why force NextPower2 here?

Member Author

So that the decode buffer is resized less often when the number of values varies slightly from batch to batch.
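To illustrate the growth policy being discussed (`NextPower2Sketch` is a stand-in for Arrow's bit-util helper, with assumed round-up-to-power-of-two behavior): sizing the buffer to the next power of two means a later batch that is only slightly larger or smaller still fits the existing allocation.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for a next-power-of-two helper: round n up to the smallest power
// of two >= n. A decode buffer sized this way absorbs small fluctuations in
// the per-batch value count without reallocating every time.
uint64_t NextPower2Sketch(uint64_t n) {
  uint64_t p = 1;
  while (p < n) p <<= 1;
  return p;
}
```

For example, batches of 1000, 1010 and 990 values would all share one 1024-slot buffer instead of triggering three resizes.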

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Feb 17, 2024
@mapleFU
Member

mapleFU commented Feb 17, 2024

Sorry for the late reply; I was on my Spring Festival vacation. This generally LGTM, just some minor comments. Also, do we have benchmark data for this change?


@pitrou pitrou force-pushed the gh39978-byte-stream-split branch from bb540f3 to 7a19483 Compare February 19, 2024 10:59
@pitrou
Member Author

pitrou commented Feb 19, 2024

@github-actions crossbow submit -g cpp

@pitrou
Member Author

pitrou commented Feb 19, 2024

@ursabot please benchmark

@ursabot

ursabot commented Feb 19, 2024

Benchmark runs are scheduled for commit 7a19483faa454e37a5782958fa41693754b3d5b0. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.

@github-actions

This comment was marked as outdated.

Member

Just found that this doesn't have a memory_pool. Should we have one here?

Member Author

This is a temporary buffer, so I'm not sure.

Member

Oh, only DecodeArrow uses this.

Member

@mapleFU mapleFU left a comment

Will wait for the benchmark results.

@pitrou
Member Author

pitrou commented Feb 19, 2024

I still need to fix the i386 failure.

@pitrou
Member Author

pitrou commented Feb 19, 2024

@github-actions crossbow submit -g cpp

@github-actions

This comment was marked as outdated.

@pitrou pitrou changed the title GH-39978: [C++][Parquet] Expand BYTE_STREAM_SPLIT to support FIXED_LEN_BYTE_ARRAY, INT32 and INT64 DO NOT MERGE: GH-39978: [C++][Parquet] Expand BYTE_STREAM_SPLIT to support FIXED_LEN_BYTE_ARRAY, INT32 and INT64 Feb 19, 2024
@conbench-apache-arrow

Thanks for your patience. Conbench analyzed the 6 benchmarking runs that have been run so far on PR commit 7a19483faa454e37a5782958fa41693754b3d5b0.

There were 43 benchmark results indicating a performance regression:

The full Conbench report has more details.

@pitrou
Member Author

pitrou commented Feb 19, 2024

The regressions above are because I made data generation more realistic and therefore less trivially compressible:
https://github.com/apache/arrow/pull/40094/files#diff-e0338b1864949c8001485874b6249a5a756c83c9671e4a31ee523289035487e5R171

@pitrou
Member Author

pitrou commented Feb 29, 2024

Also, I've generated a test file here: apache/parquet-testing#46

@pitrou pitrou force-pushed the gh39978-byte-stream-split branch from 771ad51 to b8bd382 Compare March 18, 2024 12:47
@pitrou pitrou changed the title DO NOT MERGE: GH-39978: [C++][Parquet] Expand BYTE_STREAM_SPLIT to support FIXED_LEN_BYTE_ARRAY, INT32 and INT64 GH-39978: [C++][Parquet] Expand BYTE_STREAM_SPLIT to support FIXED_LEN_BYTE_ARRAY, INT32 and INT64 Mar 18, 2024
@pitrou
Member Author

pitrou commented Mar 18, 2024

@github-actions crossbow submit -g cpp

@github-actions

This comment was marked as outdated.

@pitrou
Member Author

pitrou commented Mar 18, 2024

@github-actions crossbow submit -g cpp

@github-actions

Revision: 93ebd84

Submitted crossbow builds: ursacomputing/crossbow @ actions-f4c9c5c778

Task: Status
test-alpine-linux-cpp: GitHub Actions
test-build-cpp-fuzz: GitHub Actions
test-conda-cpp: GitHub Actions
test-conda-cpp-valgrind: Azure
test-cuda-cpp: GitHub Actions
test-debian-12-cpp-amd64: GitHub Actions
test-debian-12-cpp-i386: GitHub Actions
test-fedora-39-cpp: GitHub Actions
test-ubuntu-20.04-cpp: GitHub Actions
test-ubuntu-20.04-cpp-bundled: GitHub Actions
test-ubuntu-20.04-cpp-minimal-with-formats: GitHub Actions
test-ubuntu-20.04-cpp-thread-sanitizer: GitHub Actions
test-ubuntu-22.04-cpp: GitHub Actions
test-ubuntu-22.04-cpp-20: GitHub Actions
test-ubuntu-22.04-cpp-no-threading: GitHub Actions
test-ubuntu-24.04-cpp: GitHub Actions
test-ubuntu-24.04-cpp-gcc-14: GitHub Actions

@pitrou pitrou requested a review from wgtmac March 18, 2024 15:58
Member

@wgtmac wgtmac left a comment

+1

#endif
switch (width) {
case 1:
memcpy(out, raw_values, num_values);
Member

I'm OK with this, but it seems equivalent to PLAIN 🤔?

Member Author

Well, yes, by definition.


inline void ByteStreamSplitEncodeScalarDynamic(const uint8_t* raw_values, int width,
const int64_t num_values, uint8_t* out) {
::arrow::internal::SmallVector<uint8_t*, 16> dest_streams;
Member

Would this benefit performance?

Member Author

Potentially, though perhaps not in a micro-benchmark where allocations are reused efficiently.
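A sketch of the per-stream-pointer idea in the quoted snippet (function name assumed; std::vector stands in for Arrow's SmallVector, which avoids a heap allocation for small widths): keeping one running destination pointer per byte stream removes the `k * num_values + i` index arithmetic from the inner loop.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Sketch of a dynamic-width scalar byte-stream-split encoder: one running
// destination pointer per byte stream, advanced as input bytes are consumed,
// instead of recomputing the stream offset for every byte.
void EncodeScalarDynamicSketch(const uint8_t* raw_values, int width,
                               int64_t num_values, uint8_t* out) {
  std::vector<uint8_t*> dest_streams(width);  // Arrow uses SmallVector here
  for (int k = 0; k < width; ++k) {
    dest_streams[k] = out + static_cast<int64_t>(k) * num_values;
  }
  for (int64_t i = 0; i < num_values; ++i) {
    for (int k = 0; k < width; ++k) {
      *dest_streams[k]++ = *raw_values++;  // next byte of value i -> stream k
    }
  }
}
```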

Member

@mapleFU mapleFU left a comment

LGTM; below are some minor questions.

@pitrou pitrou merged commit a364e4a into apache:main Mar 19, 2024
@pitrou pitrou removed the awaiting committer review Awaiting committer review label Mar 19, 2024
@pitrou pitrou deleted the gh39978-byte-stream-split branch March 19, 2024 11:07
@conbench-apache-arrow

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit a364e4a.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 62 possible false positives for unstable benchmarks that are known to sometimes produce them.

pitrou pushed a commit that referenced this pull request May 7, 2024
…amSplitDecoder (#41565)

### Rationale for this change

This problem was introduced by #40094. The original bug was fixed in #34140, but it regressed in #40094.

### What changes are included in this PR?

Refine checking

### Are these changes tested?

* [x] Will add

### Are there any user-facing changes?

Bugfix

* GitHub Issue: #41562

Authored-by: mwish <maplewish117@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
raulcd pushed a commit that referenced this pull request May 8, 2024
…amSplitDecoder (#41565)

### Rationale for this change

This problem was introduced by #40094. The original bug was fixed in #34140, but it regressed in #40094.

### What changes are included in this PR?

Refine checking

### Are these changes tested?

* [x] Will add

### Are there any user-facing changes?

Bugfix

* GitHub Issue: #41562

Authored-by: mwish <maplewish117@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
jorisvandenbossche pushed a commit that referenced this pull request May 22, 2024
…rite_table() docstring (#41759)

### Rationale for this change

In PR #40094 (issue GH-39978), we forgot to update the `write_table` docstring with an accurate description of the supported data types for BYTE_STREAM_SPLIT.

### Are these changes tested?

No (only a doc change).

### Are there any user-facing changes?

No.
* GitHub Issue: #41748

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
vibhatha pushed a commit to vibhatha/arrow that referenced this pull request May 25, 2024
…teStreamSplitDecoder (apache#41565)

### Rationale for this change

This problem was introduced by apache#40094. The original bug was fixed in apache#34140, but it regressed in apache#40094.

### What changes are included in this PR?

Refine checking

### Are these changes tested?

* [x] Will add

### Are there any user-facing changes?

Bugfix

* GitHub Issue: apache#41562

Authored-by: mwish <maplewish117@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
vibhatha pushed a commit to vibhatha/arrow that referenced this pull request May 25, 2024
…n in write_table() docstring (apache#41759)

### Rationale for this change

In PR apache#40094 (issue apacheGH-39978), we forgot to update the `write_table` docstring with an accurate description of the supported data types for BYTE_STREAM_SPLIT.

### Are these changes tested?

No (only a doc change).

### Are there any user-facing changes?

No.
* GitHub Issue: apache#41748

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
rapids-bot bot pushed a commit to rapidsai/cudf that referenced this pull request Jun 24, 2024
…ders (#15832)

BYTE_STREAM_SPLIT encoding was recently added to cuDF (#15311). The Parquet specification was recently changed (apache/parquet-format#229) to extend the datatypes that can be encoded as BYTE_STREAM_SPLIT, and this was only recently implemented in arrow (apache/arrow#40094). This PR adds a check that cuDF and arrow can produce compatible files using BYTE_STREAM_SPLIT encoding.

Authors:
  - Ed Seidl (https://github.com/etseidl)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)

URL: #15832
