Parquet compatibility / integration testing #441

@alamb

Description

See related mailing list discussion: https://lists.apache.org/thread/kd3k4q691lp5c4q3r767zb8jltrm9z33

Background

In apache/parquet-site#34 we are adding an "implementation status" matrix to help people understand the feature sets supported by the various Parquet implementations across the ecosystem.

As we work to fill out this matrix for various Parquet implementations, the question arises: what does "supports a particular Parquet feature" mean, precisely?

One way to provide a precise definition is to provide a way to automate the check for each feature.

Prior Art

parquet-testing

The parquet-testing repository contains example Parquet files exercising various features.

The README.md file contains brief descriptions of the contents of these files, but there is no machine-readable description of the data contained within them.

Apache Arrow

Apache Arrow has a similar feature chart: https://arrow.apache.org/docs/status.html

Part of maintaining this chart is a comprehensive integration suite which programmatically checks whether data created by one implementation of Arrow can be read by the others.

The suite is implemented using a single integration tool called archery, maintained by the Arrow project in the apache/arrow-testing github repo. Each implementation of Arrow provides a driver program that accepts inputs / generates outputs in a known format, and archery orchestrates running those driver programs.

There are also a number of known "gold files" here which contain JSON representations of the data stored in gold master Arrow files.

Note that Arrow is somewhat different from Parquet in that most of the Arrow implementations are maintained by the Apache Arrow project itself. In comparison, I believe most Parquet implementations are maintained by projects / teams other than Apache Parquet.

Options

Here are some ideas of what a Parquet compatibility test might look like.

Option 1: integration harness similar to archery

In this case, an integration harness similar to archery would automatically verify different implementations. This harness could orchestrate workflows such as reading gold Parquet files, as well as writing Parquet data with one implementation, reading it with another, and verifying the results match.

Pros:

  • Likely the most comprehensive and flexible of the options (many different tests possible)
  • Could actually verify interoperability between implementations
  • Could potentially catch regressions in parquet implementations

Cons:

  • Unclear who would create, own, and maintain the test harness code
  • Unclear who would keep it up to date / triage errors, build problems, etc

Option 2: Add golden files to parquet-testing

In this option, we would:

  1. Add golden files to the parquet-testing repo (e.g. JSON formatted) corresponding to each existing .parquet file
  2. Document the format of the golden files
  3. To test write support for a feature, an implementation could verify that the .parquet file it produces reads back to the same .golden file again

Each implementation could then check compatibility by creating its own driver program.

This approach has a (very) rough prototype here: apache/arrow-rs#5956

parquet-testing
|- data
|  |- README.md   # textual description of the contents of each file
|  |- all_types.plain.parquet
|  |- all_types.plain.parquet.json # JSON file with expected contents of all_types.plain.parquet
...
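A driver program for this scheme could be quite small. The sketch below is one possible shape, not a settled design: the golden-file layout (a JSON array of row objects) and the `read_parquet_rows` hook are assumptions; each implementation would plug in its own Parquet reader behind that hook.

```python
import json


def read_parquet_rows(path):
    # Hypothetical hook: each implementation would plug in its own
    # Parquet reader here, returning the file's rows as a list of dicts.
    raise NotImplementedError


def check_against_golden(rows, golden_path):
    """Compare decoded rows against the expected contents in a golden JSON
    file (assumed here to be a JSON array of row objects)."""
    with open(golden_path) as f:
        expected = json.load(f)
    # Collect (row index, got, want) triples for every differing row.
    mismatches = [
        (i, got, want)
        for i, (got, want) in enumerate(zip(rows, expected))
        if got != want
    ]
    if len(rows) != len(expected):
        mismatches.append(("row count", len(rows), len(expected)))
    return not mismatches, mismatches
```

Reporting the specific mismatching rows (rather than a bare pass/fail) would make it easier to triage which feature an implementation mishandles.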

Pros:

  • Distributes maintenance burden: implementations that want to advertise their compatibility would be responsible for verifying themselves against the golden files

Cons:

  • Wouldn't be able to catch actual integration errors (where parquet produced by one implementation is not readable by another)

Option 3: Add golden files and files written by other implementations to parquet-testing

@pitrou suggested what I think is an extension of option 2 on apache/arrow-rs#5956 (comment)

My alternative proposal would be a directory tree with pre-generated integration files, something like:

parquet-integration
|- all_types.plain.uncompressed
|  |- README.md   # textual description of this integration scenario
|  |- parquet-java_1.0.pq  # file generated by parquet-java 1.0 for said scenario
|  |- parquet-java_2.5.pq  # file generated by parquet-java 2.5
|  |- parquet-cpp_16.0.1.pq  # file generated by parquet-cpp 16.0.1
|- all_types.dictionary.uncompressed
| ...

... which allows us to have many different scenarios without the scaling problem of having all implementations run within the same CI job.

The textual README.md could of course be supplemented by a machine-readable JSON format if there's a reasonable way to cover all expected variations with it.

I think this mechanism would allow for cross-implementation integration testing without requiring a unified harness.
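Under this layout, each implementation's checker reduces to walking the scenario tree and attempting to read every pre-generated file with the reader under test. A hedged sketch, assuming the directory layout quoted above, a `.pq` file extension, and an optional per-scenario `expected.json` golden file (none of which are settled formats):

```python
import json
from pathlib import Path


def scan_scenarios(root, read_rows):
    """Walk a parquet-integration tree and try to read every pre-generated
    file with `read_rows`, the reader under test.

    Returns {scenario name: {file name: outcome}}, where the outcome is
    True (read and matched the golden data, if any), False (read but
    mismatched), or the repr of the exception raised while reading.
    """
    results = {}
    for scenario in sorted(Path(root).iterdir()):
        if not scenario.is_dir():
            continue
        # Hypothetical golden file: expected contents for this scenario.
        golden = scenario / "expected.json"
        expected = json.loads(golden.read_text()) if golden.exists() else None
        outcome = {}
        for pq in sorted(scenario.glob("*.pq")):
            try:
                rows = read_rows(pq)
                outcome[pq.name] = expected is None or rows == expected
            except Exception as e:  # a crash is itself a useful result
                outcome[pq.name] = repr(e)
        results[scenario.name] = outcome
    return results
```

The resulting per-file outcomes map naturally onto cells of the implementation status matrix: one row per scenario, one column per producing implementation.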
