Add BYTE_STREAM_SPLIT support to Parquet#15311
Add BYTE_STREAM_SPLIT support to Parquet#15311rapids-bot[bot] merged 55 commits intorapidsai:branch-24.06from
Conversation
|
/ok to test |
|
|
|
/ok to test |
mhaseeb123
left a comment
There was a problem hiding this comment.
A few comments about the use of ts_scale in TIME_MILLIS type as seconds and days are encoded as millis in Parquet plus corresponding time units in tests.
|
Looks good to me. Thanks for the effort @etseidl |
|
/ok to test |
|
/ok to test |
vuule
left a comment
There was a problem hiding this comment.
looks great, just a few small comments
cpp/src/io/parquet/page_enc.cu
Outdated
| is_split_stream ? Encoding::BYTE_STREAM_SPLIT | ||
| : determine_encoding( | ||
| s->page.page_type, physical_type, s->ck.use_dictionary, write_v2_headers); |
There was a problem hiding this comment.
optional: consider moving the is_split_stream check into determine_encoding. This way we have the entire logic in determine_encoding.
|
/ok to test |
|
/ok to test |
|
/ok to test |
|
/merge |
…ders (#15832) BYTE_STREAM_SPLIT encoding was recently added to cuDF (#15311). The Parquet specification was recently changed (apache/parquet-format#229) to extend the datatypes that can be encoded as BYTE_STREAM_SPLIT, and this was only recently implemented in arrow (apache/arrow#40094). This PR adds a check that cuDF and arrow can produce compatible files using BYTE_STREAM_SPLIT encoding. Authors: - Ed Seidl (https://github.com/etseidl) Approvers: - Lawrence Mitchell (https://github.com/wence-) URL: #15832
Description
Closes #15226. Part of #13501. Adds support for reading and writing
BYTE_STREAM_SPLITencoded Parquet data. Includes a "microkernel" version like those introduced by #15159.Checklist