GH-39978: [C++][Parquet] Expand BYTE_STREAM_SPLIT to support FIXED_LEN_BYTE_ARRAY, INT32 and INT64 (#40094)
Conversation
@github-actions crossbow submit -g cpp |
The i386 test failure is related.
cpp/src/parquet/encoding.cc
So only special FLBA would hit this case? I'm OK with the code, but I guess it's rarely touched.
Same question here, wouldn't it be faster to apply PLAIN encoding for single-byte FLBA type?
PlainEncoder<FLBAType> is implemented similarly as this. The only additional step here is ByteStreamSplitEncode, which is skipped for single-byte FLBA type.
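To illustrate the point being made here, below is a minimal stand-alone sketch of the byte-stream-split transposition (this is not Arrow's actual implementation, and the function name is made up): byte `b` of each fixed-width value goes into stream `b`, so for `width == 1` there is a single stream and the output is a straight copy, i.e. identical to PLAIN.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical sketch of BYTE_STREAM_SPLIT encoding for a fixed width.
std::vector<uint8_t> ByteStreamSplitEncode(const uint8_t* raw, int width,
                                           int64_t num_values) {
  std::vector<uint8_t> out(static_cast<size_t>(width) * num_values);
  if (width == 1) {
    // Degenerate single-stream case: the encoded form equals PLAIN.
    std::memcpy(out.data(), raw, static_cast<size_t>(num_values));
    return out;
  }
  for (int64_t v = 0; v < num_values; ++v) {
    for (int b = 0; b < width; ++b) {
      // Byte b of value v is stored at position v of stream b.
      out[static_cast<size_t>(b) * num_values + v] = raw[v * width + b];
    }
  }
  return out;
}
```

The only per-value work beyond PLAIN is the transposition loop, which is what the single-byte path skips.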
cpp/src/parquet/encoding.cc
Outdated
Since it's a base class, it's unnecessary to have a default pool here?
cpp/src/parquet/encoding.cc
Can we print the size? That could be helpful for debugging.
cpp/src/parquet/encoding.cc
Hmm, why force NextPower2 here?
So that the decode buffer is resized less often, if the number of values varies a bit every time.
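A hedged sketch of the growth policy being discussed (the helper name `NextPower2` is taken from the thread; this is not Arrow's exact code): rounding the requested buffer size up to the next power of two means small fluctuations in the number of values between pages rarely cross a size threshold, so the decode buffer is reallocated far less often.

```cpp
#include <cassert>
#include <cstdint>

// Round n up to the nearest power of two (sketch of the assumed helper).
uint64_t NextPower2(uint64_t n) {
  if (n <= 1) return 1;
  uint64_t p = 1;
  while (p < n) p <<= 1;  // doubles until p >= n
  return p;
}
```

For example, page sizes of 1000, 1010, and 990 values all map to a 1024-slot buffer, so only the first page triggers a resize.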
Sorry for being late; I was previously on my Spring Festival vacation. This generally LGTM, just some minor comments. And do we have benchmark data for this change?
Force-pushed bb540f3 to 7a19483
@github-actions crossbow submit -g cpp |
@ursabot please benchmark |
Benchmark runs are scheduled for commit 7a19483faa454e37a5782958fa41693754b3d5b0. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.
cpp/src/parquet/encoding.cc
Just found that this doesn't have a memory_pool. Should we have one here?
This is a temporary buffer, so I'm not sure.
mapleFU left a comment:
Will wait for the benchmark results.
I still need to fix the i386 failure.
@github-actions crossbow submit -g cpp |
Thanks for your patience. Conbench analyzed the 6 benchmarking runs that have been run so far on PR commit 7a19483faa454e37a5782958fa41693754b3d5b0. There were 43 benchmark results indicating a performance regression.
The full Conbench report has more details.
The regressions above are because I made data generation more realistic and therefore less trivially compressible:
Also, I've generated a test file here: apache/parquet-testing#46 |
Force-pushed 771ad51 to b8bd382
@github-actions crossbow submit -g cpp |
@github-actions crossbow submit -g cpp |
Revision: 93ebd84
Submitted crossbow builds: ursacomputing/crossbow @ actions-f4c9c5c778
#endif
  switch (width) {
    case 1:
      memcpy(out, raw_values, num_values);
I'm OK with this, but it seems equal to PLAIN 🤔?
Well, yes, by definition.
inline void ByteStreamSplitEncodeScalarDynamic(const uint8_t* raw_values, int width,
                                               const int64_t num_values, uint8_t* out) {
  ::arrow::internal::SmallVector<uint8_t*, 16> dest_streams;
Would this benefit performance?
Potentially, though perhaps not in a micro-benchmark where allocations are reused efficiently.
mapleFU left a comment:
This LGTM; below are some minor questions.
After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit a364e4a. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 62 possible false positives for unstable benchmarks that are known to sometimes produce them.
…amSplitDecoder (#41565)

### Rationale for this change
This problem was raised in #40094. The original bug was fixed in #34140, but was broken again by #40094.

### What changes are included in this PR?
Refine the checking.

### Are these changes tested?
* [x] Will add

### Are there any user-facing changes?
Bugfix

* GitHub Issue: #41562

Authored-by: mwish <maplewish117@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
…rite_table() docstring (#41759)

### Rationale for this change
In PR #40094 (issue GH-39978), we forgot to update the `write_table` docstring with an accurate description of the supported data types for BYTE_STREAM_SPLIT.

### Are these changes tested?
No (only a doc change).

### Are there any user-facing changes?
No.

* GitHub Issue: #41748

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
…ders (#15832)

BYTE_STREAM_SPLIT encoding was recently added to cuDF (#15311). The Parquet specification was recently changed (apache/parquet-format#229) to extend the datatypes that can be encoded as BYTE_STREAM_SPLIT, and this was only recently implemented in arrow (apache/arrow#40094). This PR adds a check that cuDF and arrow can produce compatible files using BYTE_STREAM_SPLIT encoding.

Authors:
- Ed Seidl (https://github.com/etseidl)

Approvers:
- Lawrence Mitchell (https://github.com/wence-)

URL: #15832
What changes are included in this PR?
Implement the format addition described in https://issues.apache.org/jira/browse/PARQUET-2414 .
Are these changes tested?
Yes.
Are there any user-facing changes?
Yes (additional types supported for Parquet encoding).