Use page statistics in Parquet reader#14973
Use page statistics in Parquet reader#14973rapids-bot[bot] merged 29 commits intorapidsai:branch-24.04from
Conversation
|
parquet_read_decode benchmark |
|
parquet_read_chunks benchmark and the same but forcing PLAIN encoding |
|
/ok to test |
vuule
left a comment
There was a problem hiding this comment.
I already review the previous version of this feature, but going over it again since I didn't quite approve it back then (I almost did, IIRC).
Few nitpicks so far.
vuule
left a comment
There was a problem hiding this comment.
Looks good, just one comment that reads a bit odd to me.
|
|
||
| // calculate num_nulls if not available from column index | ||
| if (not pg_info.num_nulls.has_value()) { | ||
| pg_info.num_nulls = std::reduce(h, h + max_def_level); |
There was a problem hiding this comment.
the least verbose use of a standard algorithm I've ever seen :D
| // info is missing or insufficient, then just return without modifying the row_group_info. | ||
| if (not pg_info.num_nulls.has_value() or not pg_info.num_valid.has_value()) { return; } | ||
|
|
||
| // if using older page indexes that lack size info, don't use |
There was a problem hiding this comment.
| // if using older page indexes that lack size info, don't use | |
| // if using older page indexes that lack size info, don't use `pg_info` |
There was a problem hiding this comment.
Cleaned this up and added a more detailed TODO...will push once CI finishes
There was a problem hiding this comment.
looking at a few other PRs, this might be an "if CI finishes" situation 😬
|
/ok to test |
|
/ok to test |
|
Looks like #15020 broke this (db 3 ets 0 😮 😭 🤣). Regrouping... |
Co-authored-by: Yunsong Wang <yunsongw@nvidia.com>
|
/ok to test |
|
/merge |
Description
#14000 added the ability to write new page statistics to the Parquet writer. This PR uses these new statistics to avoid some string size computations. Benchmarks show an improvement in read times of up to 20%.
Checklist