Skip to content

Conversation

@jhorstmann
Copy link
Contributor

@jhorstmann jhorstmann commented Mar 28, 2025

Which issue does this PR close?

Rationale for this change

PageEncodingStats can be useful to determine up front whether all pages of a column are dictionary encoded. Keeping track of the counts per page type and encoding is also quite cheap, and their metadata size is small, so they can be enabled always independent of the EnabledStatistics setting.

What changes are included in this PR?

Are there any user-facing changes?

No api changes, written files will now include PageEncodingStats which previously were always None.

@alamb
Copy link
Contributor

alamb commented Mar 28, 2025

Thanks @jhorstmann -- I am running the write benchmarks on this PR to see if there is any performance impact

@etseidl
Copy link
Contributor

etseidl commented Mar 28, 2025

I'll take a look soon. I had no idea this wasn't supported 😅. Sounds like something to add to the implementation status page.

@alamb
Copy link
Contributor

alamb commented Mar 28, 2025

My benchmark results show no significant change in performance

++ critcmp main write-page-encoding-stats
group                                                                     main                                   write-page-encoding-stats
-----                                                                     ----                                   -------------------------
write_batch nested/4096 values primitive list                             1.02      7.0±0.24ms   304.9 MB/sec    1.00      6.8±0.27ms   312.2 MB/sec
write_batch nested/4096 values primitive list non-null                    1.03      8.4±0.33ms   253.8 MB/sec    1.00      8.2±0.29ms   260.3 MB/sec
write_batch primitive/4096 values bool                                    1.01  1022.3±42.00µs  1062.3 KB/sec    1.00  1014.0±33.50µs  1070.9 KB/sec
write_batch primitive/4096 values bool non-null                           1.00   944.9±25.45µs   620.1 KB/sec    1.00   940.3±32.14µs   623.2 KB/sec
write_batch primitive/4096 values float with NaNs                         1.01  1841.7±48.57µs    29.8 MB/sec    1.00  1831.0±68.81µs    30.0 MB/sec
write_batch primitive/4096 values primitive                               1.02      2.9±0.13ms    60.0 MB/sec    1.00      2.9±0.10ms    61.2 MB/sec
write_batch primitive/4096 values primitive non-null                      1.00      2.8±0.54ms    61.7 MB/sec    1.01      2.8±0.71ms    60.8 MB/sec
write_batch primitive/4096 values primitive non-null with bloom filter    1.00     21.6±0.69ms     8.0 MB/sec    1.00     21.5±0.36ms     8.0 MB/sec
write_batch primitive/4096 values primitive with bloom filter             1.00     21.3±0.31ms     8.2 MB/sec    1.01     21.5±1.25ms     8.2 MB/sec
write_batch primitive/4096 values string                                  1.00  1734.3±48.79µs    72.8 MB/sec    1.00  1742.8±59.87µs    72.4 MB/sec
write_batch primitive/4096 values string dictionary                       1.00      4.1±0.16ms   253.3 MB/sec    1.02      4.1±0.19ms   249.2 MB/sec
write_batch primitive/4096 values string dictionary with bloom filter     1.02      6.0±0.37ms   172.2 MB/sec    1.00      5.9±0.40ms   174.8 MB/sec
write_batch primitive/4096 values string non-null                         1.00      7.3±0.36ms   281.0 MB/sec    1.00      7.3±0.44ms   281.0 MB/sec
write_batch primitive/4096 values string non-null with bloom filter       1.05     11.4±0.41ms   180.2 MB/sec    1.00     10.9±0.44ms   188.4 MB/sec
write_batch primitive/4096 values string with bloom filter                1.02     11.5±0.47ms   177.5 MB/sec    1.00     11.3±0.44ms   181.7 MB/sec

Copy link
Contributor

@etseidl etseidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Thanks @jhorstmann for noticing this was missing!

@alamb
Copy link
Contributor

alamb commented Apr 3, 2025

I merged up from main to resolve some merge conflicts

@alamb
Copy link
Contributor

alamb commented Apr 3, 2025

Merged to get clippy fixes for new Clippy version

@alamb
Copy link
Contributor

alamb commented Apr 3, 2025

🚀 we have a clean CI run

Thanks again @jhorstmann and @etseidl

@alamb alamb merged commit abb2eef into apache:main Apr 3, 2025
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

parquet Changes to the parquet crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support ColumnMetaData encoding_stats in Parquet Writing

3 participants