Skip to content

Truncate stats from Parquet files#254

Merged
rdblue merged 3 commits into
apache:masterfrom
vgankidi:parquetmetricstruncate
Jul 6, 2019
Merged

Truncate stats from Parquet files#254
rdblue merged 3 commits into
apache:masterfrom
vgankidi:parquetmetricstruncate

Conversation

@vgankidi

@vgankidi vgankidi commented Jul 4, 2019

Copy link
Copy Markdown
Contributor

Lower and upper bound values from Parquet files are not currently truncated, which takes more space than necessary in manifests. Truncating strings and binary values will probably improve performance for large tables.

This PR adds a configurable table property "write.metadata.truncate-length" with a default value of 16. Default behavior is to truncate binary values to <= 16 bytes and strings to <= 16 unicode characters.

Resolves #113

@rdblue

rdblue commented Jul 5, 2019

Copy link
Copy Markdown
Contributor

@vgankidi, it looks like this is based on an old version of your branch. Could you update it to the current?

Comment thread api/src/main/java/org/apache/iceberg/types/Comparators.java Outdated
@rdblue

rdblue commented Jul 5, 2019

Copy link
Copy Markdown
Contributor

@edgarRd, FYI. This PR has methods to truncate binary and string values to avoid writing huge stats in table metadata. Probably something you'll want to use for ORC.

}

public static Metrics footerMetrics(ParquetMetadata metadata) {
return footerMetrics(metadata, TableProperties.WRITE_METADATA_TRUNCATE_BYTES_DEFAULT);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there is a need for this to be left. The truncate length should always be included when getting metrics.

The version that uses the default is called from two places:

  • ParquetWriteAdapter that is created by Parquet.write when the internal ParquetWriter is not used
  • SparkTableUtil that reads Parquet footers when converting Hive tables to Iceberg metadata

I think both should be updated. The write adapter should use the config setting, and the Spark util method can default this inline. Then we can get rid of this method.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, would be good to get rid of this. I updated the PR with these changes.

Comment thread parquet/src/main/java/org/apache/iceberg/parquet/ParquetUtil.java
Comment thread parquet/src/main/java/org/apache/iceberg/parquet/Parquet.java
@rdblue rdblue merged commit 3287fee into apache:master Jul 6, 2019
@rdblue

rdblue commented Jul 6, 2019

Copy link
Copy Markdown
Contributor

Thanks for the quick fixes, @vgankidi! Merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Truncate stats from Parquet files

2 participants