Description
See related mailing list discussion: https://lists.apache.org/thread/kd3k4q691lp5c4q3r767zb8jltrm9z33
Background
In apache/parquet-site#34 we are adding an "implementation status" matrix for different parquet implementations, to help people understand the supported feature sets of various parquet implementations across the ecosystem.
As we work to fill out this matrix, the question arises: what does "supports a particular Parquet feature" mean, precisely?
One way to provide a precise definition is to provide a way to automate the check for each feature.
Prior Art
parquet-testing
The parquet-testing repository contains example parquet files written with various different features.
The README.md file contains brief descriptions of the contents of these files, but there is no machine readable description of the data contained within those files.
Apache Arrow
Apache Arrow has a similar feature chart: https://arrow.apache.org/docs/status.html

Part of maintaining this chart is a comprehensive integration suite which programmatically checks whether data created by one implementation of Arrow can be read by the others.
The suite is implemented using a single integration tool called archery, maintained by the Arrow project in the apache/arrow-testing github repo. Each implementation of Arrow provides a driver program that accepts inputs / generates outputs in a known format, and archery orchestrates running those driver programs.
There are also a number of known "gold files" here, which contain JSON representations of the data stored in gold master arrow files.
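The gold-file check described above can be sketched roughly as follows. Note that the `matches_gold` helper and the `{"batches": ...}` JSON layout are assumptions for illustration only, not the actual format of the Arrow gold files:

```python
import json

def matches_gold(gold_json_path: str, decoded_batches: list) -> bool:
    """Compare record batches decoded by an implementation under test
    against a JSON "gold" representation of the same file.

    The {"batches": [...]} layout here is a hypothetical schema for
    illustration, not the real gold-file format.
    """
    with open(gold_json_path) as f:
        gold = json.load(f)
    # The implementation passes if its decoded data matches exactly.
    return decoded_batches == gold["batches"]
```

A driver program would decode the binary file with the implementation under test and call a check like this against the matching gold file.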
Note that Arrow is somewhat different from Parquet in that most Arrow implementations are maintained by the Apache Arrow project itself. In comparison, I believe most Parquet implementations are maintained by projects / teams other than Apache Parquet.
Options
Here are some ideas of what a Parquet compatibility test might look like:
Option 1: integration harness similar to archery
In this case, an integration harness similar to archery would handle automatically verifying the different implementations. The harness could orchestrate workflows such as reading gold parquet files, as well as writing parquet data with one implementation, reading it with another, and verifying their compatibility.
Pros:
- Likely the most comprehensive and flexible of options (many different tests possible)
- Could actually verify interoperability between implementations
- Could potentially catch regressions in parquet implementations
Cons:
- Unclear who would create, own, and maintain the test harness code
- Unclear who would keep it up to date / triage errors, build problems, etc
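As a rough sketch of what such a harness could orchestrate, consider the loop below. The `parquet-driver-<impl>` CLI names and their `write` / `read-verify` subcommands are hypothetical, invented here to illustrate the shape of the workflow:

```python
import itertools
import subprocess

# Hypothetical: each participating implementation ships a driver CLI
# named "parquet-driver-<impl>" with "write" and "read-verify" subcommands.
IMPLEMENTATIONS = ["parquet-java", "parquet-cpp", "arrow-rs"]

def integration_pairs():
    """Every (writer, reader) combination to exercise, including an
    implementation reading back its own output."""
    return list(itertools.product(IMPLEMENTATIONS, repeat=2))

def run_scenario(scenario: str, writer: str, reader: str) -> bool:
    """Write a scenario with one implementation, verify it with another."""
    out = f"{scenario}.{writer}.parquet"
    subprocess.run(
        [f"parquet-driver-{writer}", "write", scenario, out], check=True)
    verify = subprocess.run(
        [f"parquet-driver-{reader}", "read-verify", scenario, out])
    return verify.returncode == 0
```

With three implementations this already yields nine writer/reader pairs per scenario, which hints at why ownership and CI cost are the main concerns with this option.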
Option 2: Add golden files to parquet-testing
In this option, we would:
- Add golden files to the parquet-testing repo (e.g. JSON formatted) corresponding to each existing .parquet file
- Document the format of the golden files
- To test supporting a feature on write, an implementation could verify that it can produce a .parquet file that, when read back, yields the same .golden file again
Each implementation could then check compatibility by creating its own driver program.
This approach has a (very) rough prototype here: apache/arrow-rs#5956
parquet-testing
|- data
| |- README.md # textual description of the contents of each file
| |- all_types.plain.parquet
| |- all_types.plain.parquet.json # JSON file with expected contents of all_types.plain.parquet
...
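Locating the golden file for each .parquet file under this sibling-file convention is mechanical. A minimal sketch (the `golden_pairs` helper is hypothetical, not part of any existing tooling):

```python
from pathlib import Path

def golden_pairs(data_dir: str):
    """Yield (parquet_path, golden_json_path) pairs, following the
    convention that all_types.plain.parquet is described by
    all_types.plain.parquet.json in the same directory."""
    for pq in sorted(Path(data_dir).glob("*.parquet")):
        golden = Path(str(pq) + ".json")
        # Skip .parquet files that have no golden file (yet).
        if golden.exists():
            yield pq, golden
```

A driver program could iterate over these pairs, decode each .parquet file, and compare the result against the sibling .json file.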
Pros:
- Distributes maintenance burden: implementations that are interested in advertising their compatibility would be responsible for verifying themselves against the golden files
Cons:
- Wouldn't be able to catch actual integration errors (where parquet produced by one implementation is not readable by another)
Option 3: Add golden files and files written by other implementations to parquet-testing
@pitrou suggested what I think is an extension of option 2 on apache/arrow-rs#5956 (comment)
My alternative proposal would be a directory tree with pre-generated integration files, something like:
parquet-integration
|- all_types.plain.uncompressed
| |- README.md # textual description of this integration scenario
| |- parquet-java_1.0.pq # file generated by parquet-java 1.0 for said scenario
| |- parquet-java_2.5.pq # file generated by parquet-java 2.5
| |- parquet-cpp_16.0.1.pq # file generated by parquet-cpp 16.0.1
|- all_types.dictionary.uncompressed
| ...
... which allows us to have many different scenarios without the scaling problem of having all implementations run within the same CI job.
The textual README.md could of course be supplemented by a machine-readable JSON format, if there's a reasonable way to cover all expected variations with it.
I think this mechanism would allow for cross-implementation integration testing without requiring a unified harness.
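One possible shape for such a machine-readable supplement, sketched here as a Python dict for illustration (the field names are assumptions, not an agreed-upon format):

```python
# Hypothetical manifest describing one integration scenario; field
# names are illustrative only, not an agreed-upon format.
manifest = {
    "scenario": "all_types.plain.uncompressed",
    "features": ["plain-encoding", "uncompressed"],
    "files": [
        {"path": "parquet-java_1.0.pq", "writer": "parquet-java", "version": "1.0"},
        {"path": "parquet-cpp_16.0.1.pq", "writer": "parquet-cpp", "version": "16.0.1"},
    ],
}

def writers_covered(manifest: dict) -> set:
    """Implementations that have contributed a file for this scenario."""
    return {f["writer"] for f in manifest["files"]}
```

A manifest like this would let each implementation's CI discover which files exist for a scenario without parsing the README, and make gaps in coverage visible mechanically.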