Skip to content

Conversation

@etseidl
Copy link
Contributor

@etseidl etseidl commented Jun 2, 2025

Which issue does this PR close?

Rationale for this change

Reduces metadata bloat by not writing redundant statistics to each Parquet page header.

What changes are included in this PR?

Changes the meaning of EnabledStatistics::Page. Currently this level means statistics will be written to the column chunk, page header, and column index. With this PR Page now means writing to the column chunk and column index. Writing to the page header can be enabled using an added write_page_header_statistics writer option.

Also adds some command line switches to the parquet-rewrite tool.

Are there any user-facing changes?

No breaking API changes, but an added option and behavior change.

Specifically, statistics are no longer written to data Page headers by default. If you want them (not common) you will have to explictly enable doing so

@github-actions github-actions bot added the parquet Changes to the parquet crate label Jun 2, 2025
@etseidl
Copy link
Contributor Author

etseidl commented Jun 2, 2025

Setting this to draft for now since it depends on #7555.

@etseidl etseidl added the enhancement Any new improvement worthy of a entry in the changelog label Jun 2, 2025
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @etseidl -- I went over this PR carefully and I think it looks really nice.

I couldn't help myself so I also merged up from main to resolve a conflict, added an explicit test for writing page statistics, and tweaked some docs.

I think this is going to make a non trivial difference for people who store data using parquet-rs and enable page level statistics. THANK YOU

@alamb alamb added the next-major-release the PR has API changes and it waiting on the next major version label Jun 9, 2025
@alamb
Copy link
Contributor

alamb commented Jun 9, 2025

I think we should wait for the next major release before merging this PR

@alamb alamb added the api-change Changes to the arrow API label Jun 9, 2025
@alamb alamb changed the title Add option to control writing of statistics to Parquet page headers Stop writing statistics to Parquet page headers by default, add option to enable Jun 9, 2025
@etseidl
Copy link
Contributor Author

etseidl commented Jun 9, 2025

I couldn't help myself so I also merged up from main to resolve a conflict, added an explicit test for writing page statistics, and tweaked some docs.

🤣 Thanks @alamb! Changes look great.

@etseidl etseidl merged commit c316974 into apache:main Jun 27, 2025
16 checks passed
@etseidl etseidl deleted the no_page_header_stats branch June 27, 2025 16:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

api-change Changes to the arrow API enhancement Any new improvement worthy of a entry in the changelog next-major-release the PR has API changes and it waiting on the next major version parquet Changes to the parquet crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Some way to avoid writing redundant statistics into data page headers

2 participants