@alamb alamb commented May 31, 2025

Which issue does this PR close?

Rationale for this change

Statistics for large columns (e.g. large strings) are typically not useful for min/max value pruning.

However, the current defaults in parquet-rs store the entire min and max values.

For large binary/string columns (think JSON blobs), this means that two potentially large values (a min and a max) are stored both in the file-level metadata and in each data page header.

What changes are included in this PR?

Change the default statistics truncation length to 64 bytes, matching the default for truncating PageIndex statistics.
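To illustrate why truncated statistics remain usable for pruning, here is a minimal byte-wise sketch. It is not the parquet-rs implementation: `truncate_min` and `truncate_max` are hypothetical helper names, and the sketch ignores UTF-8 boundaries, which a real writer has to handle for string columns. The idea is that a truncated min must still sort at or before the real min, and a truncated max must still sort at or after the real max.

```rust
/// Truncate a minimum statistic to at most `max_len` bytes.
/// A prefix of a byte string never sorts after the full string,
/// so the plain prefix is still a valid lower bound.
/// (Hypothetical helper, for illustration only.)
fn truncate_min(value: &[u8], max_len: usize) -> Vec<u8> {
    value[..value.len().min(max_len)].to_vec()
}

/// Truncate a maximum statistic to at most `max_len` bytes.
/// A plain prefix would sort *before* the original, so the prefix is
/// bumped up: trailing 0xFF bytes are dropped and the last remaining
/// byte is incremented. Returns `None` when no shorter upper bound
/// exists (every byte of the prefix is 0xFF, or `max_len` is 0).
/// (Hypothetical helper, for illustration only.)
fn truncate_max(value: &[u8], max_len: usize) -> Option<Vec<u8>> {
    if value.len() <= max_len {
        // Already short enough: keep the exact value.
        return Some(value.to_vec());
    }
    let mut prefix = value[..max_len].to_vec();
    while let Some(last) = prefix.pop() {
        if last < u8::MAX {
            prefix.push(last + 1);
            return Some(prefix);
        }
        // `last` was 0xFF and cannot be incremented: drop it and retry.
    }
    None
}
```

With a 64-byte limit, a 1000-byte blob contributes at most 64 bytes per bound instead of 1000, while readers can still safely skip data whose truncated bounds fall outside the predicate's range.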

Are there any user-facing changes?

This is a user-facing change. I expect users will see:

  1. Smaller parquet metadata (and thus smaller parquet files)
  2. Faster load times (as the metadata is smaller)

It is an API change, so we should wait to merge this until the next major release.

@alamb alamb added the next-major-release the PR has API changes and it waiting on the next major version label May 31, 2025
@github-actions github-actions bot added the parquet Changes to the parquet crate label May 31, 2025
@alamb alamb added the api-change Changes to the arrow API label May 31, 2025
@etseidl etseidl left a comment

+1

@alamb alamb changed the title Change default statistics truncation to be 64 bytes Change default parquet statistics truncation to be 64 bytes Jun 1, 2025
etseidl commented Jun 26, 2025

@alamb would you like me to start merging the deferred to 56.0.0 PRs, or do you prefer to do that yourself?

alamb commented Jun 27, 2025

> @alamb would you like me to start merging the deferred to 56.0.0 PRs, or do you prefer to do that yourself?

Please do!

@etseidl etseidl merged commit 06cbc33 into apache:main Jun 27, 2025
16 checks passed

Development

Successfully merging this pull request may close these issues.

Consider a default max_statistics_truncate_length
