Expose SortingColumn when reading and writing parquet metadata #3090

@alamb

Description

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Storing sorted data in parquet is often a key performance technique, as it "clusters" data in ways that can make predicate evaluation and other query techniques faster.

The parquet file format contains a way to encode the sortedness of data stored there using a "SortingColumn" in the format
https://github.com/apache/parquet-format/blob/54e53e5d7794d383529dd30746378f19a12afd58/src/main/thrift/parquet.thrift#L685-L698

A list of these is then stored in the RowGroup metadata:
https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L829-L832
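For readers who do not want to open the Thrift file, here is a rough Rust sketch of the shape of that metadata. The field names `column_idx`, `descending`, `nulls_first` and `sorting_columns` follow the Thrift definition; the Rust types `RowGroupSketch` and `describe_ordering` are hypothetical, made up for illustration, and are not part of the parquet crate.

```rust
/// Hypothetical Rust mirror of the Thrift `SortingColumn` struct.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SortingColumn {
    /// Index of the column (within the row group) that the data is sorted by.
    pub column_idx: i32,
    /// True if the column is sorted in descending order.
    pub descending: bool,
    /// True if nulls are placed before non-null values.
    pub nulls_first: bool,
}

/// Hypothetical, heavily trimmed stand-in for row group metadata.
#[derive(Debug, Clone, Default)]
pub struct RowGroupSketch {
    pub num_rows: i64,
    /// `None` means the writer made no claim about ordering.
    pub sorting_columns: Option<Vec<SortingColumn>>,
}

/// Render the declared ordering roughly like an ORDER BY clause.
pub fn describe_ordering(rg: &RowGroupSketch) -> String {
    match &rg.sorting_columns {
        None => "unknown".to_string(),
        Some(cols) => cols
            .iter()
            .map(|c| {
                format!(
                    "col{} {} {}",
                    c.column_idx,
                    if c.descending { "DESC" } else { "ASC" },
                    if c.nulls_first { "NULLS FIRST" } else { "NULLS LAST" }
                )
            })
            .collect::<Vec<_>>()
            .join(", "),
    }
}
```

The list is ordered: the first entry is the primary sort key, and later entries break ties.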

However, I did not find any code to read/write this metadata yet in the parquet crate
https://sourcegraph.com/search?q=context:global+repo:%5Egithub%5C.com/apache/arrow-rs%24+SortingColumn&patternType=standard

Describe the solution you'd like

I would like some way to provide the SortingColumn list to the parquet writer so that it is written into each RowGroupMetaData it creates.

Perhaps we could add something to the WriterProperties

https://docs.rs/parquet/26.0.0/parquet/file/properties/struct.WriterProperties.html

Likewise, I would like a way to get the relevant SortingColumn list from RowGroupMetadata:
https://docs.rs/parquet/26.0.0/parquet/file/metadata/struct.RowGroupMetaData.html
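To make the request concrete, here is a minimal sketch of what such an API could look like. Everything in it is an assumption for discussion: `WriterPropsSketch`, `RowGroupMetaSketch`, `set_sorting_columns` and `sorting_columns` are hypothetical names, not existing parquet crate APIs, and the code uses stand-in types so it compiles on its own.

```rust
/// Hypothetical Rust mirror of the Thrift `SortingColumn` struct.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SortingColumn {
    pub column_idx: i32,
    pub descending: bool,
    pub nulls_first: bool,
}

/// Stand-in for `WriterProperties`, carrying the caller's sortedness claim.
#[derive(Debug, Clone, Default)]
pub struct WriterPropsSketch {
    sorting_columns: Option<Vec<SortingColumn>>,
}

/// Stand-in for the writer properties builder.
#[derive(Debug, Default)]
pub struct WriterPropsSketchBuilder {
    sorting_columns: Option<Vec<SortingColumn>>,
}

impl WriterPropsSketch {
    pub fn builder() -> WriterPropsSketchBuilder {
        WriterPropsSketchBuilder::default()
    }
}

impl WriterPropsSketchBuilder {
    /// Hypothetical setter: the caller asserts the data is already sorted this way.
    pub fn set_sorting_columns(mut self, cols: Option<Vec<SortingColumn>>) -> Self {
        self.sorting_columns = cols;
        self
    }

    pub fn build(self) -> WriterPropsSketch {
        WriterPropsSketch { sorting_columns: self.sorting_columns }
    }
}

/// Stand-in for `RowGroupMetaData` on the read side.
#[derive(Debug, Clone)]
pub struct RowGroupMetaSketch {
    sorting_columns: Option<Vec<SortingColumn>>,
}

impl RowGroupMetaSketch {
    /// What a writer might do when closing a row group: copy the claim through.
    pub fn from_props(props: &WriterPropsSketch) -> Self {
        Self { sorting_columns: props.sorting_columns.clone() }
    }

    /// Hypothetical getter: `None` if the file does not declare an ordering.
    pub fn sorting_columns(&self) -> Option<&[SortingColumn]> {
        self.sorting_columns.as_deref()
    }
}
```

In this sketch the writer does not check the claim; it only copies what the caller supplied into the row group metadata, which keeps the write path cheap but relies on the caller being correct.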

Describe alternatives you've considered
It might be worth considering having the parquet writer determine automatically whether the data is sorted (maybe this would be better than requiring the caller to verify it). However, verifying in the writer would likely be a significant performance hit, since every value in the sort columns would have to be compared with its neighbor.
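As a rough illustration of what writer-side verification would involve, here is a sketch for a single nullable `i64` column. The function names are hypothetical and a real writer would have to handle every physical type, multi-column sort keys, and row group boundaries; the point is only that the check compares every adjacent pair of values.

```rust
use std::cmp::Ordering;

/// Compare two nullable values under the given direction and null placement.
fn compare(a: Option<i64>, b: Option<i64>, descending: bool, nulls_first: bool) -> Ordering {
    match (a, b) {
        (None, None) => Ordering::Equal,
        (None, Some(_)) => {
            if nulls_first { Ordering::Less } else { Ordering::Greater }
        }
        (Some(_), None) => {
            if nulls_first { Ordering::Greater } else { Ordering::Less }
        }
        (Some(x), Some(y)) => {
            if descending { y.cmp(&x) } else { x.cmp(&y) }
        }
    }
}

/// Returns true if no adjacent pair is out of order.
/// Every value is visited, so the cost grows linearly with the row count.
pub fn is_sorted(values: &[Option<i64>], descending: bool, nulls_first: bool) -> bool {
    values
        .windows(2)
        .all(|w| compare(w[0], w[1], descending, nulls_first) != Ordering::Greater)
}
```

A caller that has just sorted the data already knows the answer, which is one argument for letting the caller declare the ordering instead.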

Additional context
DataFusion is getting more sophisticated in its ability to track and use sortedness information (e.g. apache/datafusion#4122). If this metadata were included in the parquet file, DataFusion might be able to take more advantage of it: apache/datafusion#4177.

There is more discussion about this topic here apache/datafusion#4169 (comment)

Labels: enhancement, good first issue, parquet