Parquet compatibility / integration testing #441

@alamb

Description

See related mailing list discussion: https://lists.apache.org/thread/kd3k4q691lp5c4q3r767zb8jltrm9z33

Background

In apache/parquet-site#34 we are adding an "implementation status" matrix to help people understand the feature sets supported by the various Parquet implementations across the ecosystem.

As we work to fill out this matrix for various Parquet implementations, the question arises: what does "supports a particular Parquet feature" mean, precisely?

One way to provide a precise definition is to provide a way to automate the check for each feature.

Prior Art

parquet-testing

The parquet-testing repository contains example Parquet files exercising various features.

The README.md file contains brief descriptions of the contents of these files, but there is no machine-readable description of the data contained within them.

Apache Arrow

Apache Arrow has a similar feature chart: https://arrow.apache.org/docs/status.html

Part of maintaining this chart is a comprehensive integration suite which programmatically checks whether data created by one implementation of Arrow can be read by the others.

The suite is implemented using a single integration tool called archery, maintained by the Arrow project in the apache/arrow-testing github repo. Each implementation of Arrow provides a driver program that accepts inputs / generates outputs in a known format, and archery orchestrates running those driver programs.

There are also a number of known "gold files" here which contain JSON representations of the data stored in gold master Arrow files.

Note that Arrow is somewhat different from Parquet in that most of the Arrow implementations are maintained by the Apache Arrow project itself. In comparison, I believe most Parquet implementations are maintained by projects / teams other than Apache Parquet.

Options

Here are some ideas of what a Parquet compatibility test might look like.

Option 1: integration harness similar to archery

In this case, an integration harness similar to archery would automatically verify different implementations. This harness could orchestrate workflows such as reading gold Parquet files, as well as writing Parquet data with one implementation, reading it with another, and verifying the results match.

Pros:

  • Likely the most comprehensive and flexible of the options (many different tests possible)
  • Could actually verify interoperability between implementations
  • Could potentially catch regressions in parquet implementations

Cons:

  • Unclear who would create, own, and maintain the test harness code
  • Unclear who would keep it up to date / triage errors, build problems, etc

Option 2: Add golden files to parquet-testing

In this option, we would:

  1. Add golden files to the parquet-testing repo (e.g. JSON formatted) corresponding to each existing .parquet file
  2. Document the format of the golden files
  3. To test write support for a feature, an implementation could verify that the .parquet file it produces reads back to the same .golden file again

Each implementation could then check compatibility by creating its own driver program.

This approach has a (very) rough prototype here: apache/arrow-rs#5956

parquet-testing
|- data
|  |- README.md   # textual description of the contents of each file
|  |- all_types.plain.parquet
|  |- all_types.plain.parquet.json # JSON file with expected contents of all_types.plain.parquet
...
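A driver program for this scheme could be quite small. The sketch below is one possible shape, not a settled design: the golden-file layout (a JSON array of row objects) and the `read_parquet_rows` hook are assumptions; each implementation would plug in its own Parquet reader behind that hook.

```python
import json


def read_parquet_rows(path):
    # Hypothetical hook: each implementation would plug in its own
    # Parquet reader here, returning the file's rows as a list of dicts.
    raise NotImplementedError


def check_against_golden(rows, golden_path):
    """Compare decoded rows against the expected contents in a golden JSON
    file (assumed here to be a JSON array of row objects)."""
    with open(golden_path) as f:
        expected = json.load(f)
    # Collect (row index, got, want) triples for every differing row.
    mismatches = [
        (i, got, want)
        for i, (got, want) in enumerate(zip(rows, expected))
        if got != want
    ]
    if len(rows) != len(expected):
        mismatches.append(("row count", len(rows), len(expected)))
    return not mismatches, mismatches
```

Reporting the specific mismatching rows (rather than a bare pass/fail) would make it easier to triage which feature an implementation mishandles.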

Pros:

  • Distributes maintenance burden: implementations that want to advertise their compatibility would be responsible for verifying themselves against the golden files

Cons:

  • Wouldn't be able to catch actual integration errors (where parquet produced by one implementation is not readable by another)

Option 3: Add golden files and files written by other implementations to parquet-testing

@pitrou suggested what I think is an extension of option 2 on apache/arrow-rs#5956 (comment)

My alternative proposal would be a directory tree with pre-generated integration files, something like:

parquet-integration
|- all_types.plain.uncompressed
|  |- README.md   # textual description of this integration scenario
|  |- parquet-java_1.0.pq  # file generated by parquet-java 1.0 for said scenario
|  |- parquet-java_2.5.pq  # file generated by parquet-java 2.5
|  |- parquet-cpp_16.0.1.pq  # file generated by parquet-cpp 16.0.1
|- all_types.dictionary.uncompressed
| ...

... which allows us to have many different scenarios without the scaling problem of having all implementations run within the same CI job.

The textual README.md could of course be supplemented by a machine-readable JSON format if there's a reasonable way to cover all expected variations with it.

I think this mechanism would allow for cross-implementation integration testing without requiring a unified harness.
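Under this layout, each implementation's checker reduces to walking the scenario tree and attempting to read every pre-generated file with the reader under test. A hedged sketch, assuming the directory layout quoted above, a `.pq` file extension, and an optional per-scenario `expected.json` golden file (none of which are settled formats):

```python
import json
from pathlib import Path


def scan_scenarios(root, read_rows):
    """Walk a parquet-integration tree and try to read every pre-generated
    file with `read_rows`, the reader under test.

    Returns {scenario name: {file name: outcome}}, where the outcome is
    True (read and matched the golden data, if any), False (read but
    mismatched), or the repr of the exception raised while reading.
    """
    results = {}
    for scenario in sorted(Path(root).iterdir()):
        if not scenario.is_dir():
            continue
        # Hypothetical golden file: expected contents for this scenario.
        golden = scenario / "expected.json"
        expected = json.loads(golden.read_text()) if golden.exists() else None
        outcome = {}
        for pq in sorted(scenario.glob("*.pq")):
            try:
                rows = read_rows(pq)
                outcome[pq.name] = expected is None or rows == expected
            except Exception as e:  # a crash is itself a useful result
                outcome[pq.name] = repr(e)
        results[scenario.name] = outcome
    return results
```

The resulting per-file outcomes map naturally onto cells of the implementation status matrix: one row per scenario, one column per producing implementation.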
