Description
Use Case (What are you trying to do?)
We are trying to organize the implementation of Variant in the Rust implementations of Parquet and Arrow:
We would like to make sure the Rust implementation is compatible with other implementations (which seem mostly JVM / Spark focused at the moment).
From what I can tell, the JVM based implementations are tested by verifying that values round-trip to and from JSON. For example, the ParquetVariantShreddingSuite:
https://github.com/apache/spark/blob/418cfd1f78014698ac4baac21156341a11b771b3/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetVariantShreddingSuite.scala#L30
There are several limitations with this approach:
- it doesn't ensure compatibility across language implementations (it only ensures consistency between a single implementation's reader and writer)
- VARIANTs have more types than JSON (e.g. timestamps, binary), so using JSON limits the range of testable types (see the sketch below)
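To make the second limitation concrete, here is a minimal spark-shell sketch (assuming Spark 4.0, which provides parse_json and schema_of_variant). A timestamp that has been serialized to JSON text comes back from parse_json as a Variant string, so a JSON round trip never exercises the Variant timestamp type:
scala> // schema_of_variant should report ts as STRING rather than TIMESTAMP,
scala> // because JSON itself has no timestamp type to preserve.
scala> spark.sql("""select schema_of_variant(parse_json('{"ts": "2024-01-01T00:00:00Z"}')) s""").show(false)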
What I want
I would like example data in the parquet-testing repository that contains:
- Example binary variant data (e.g. metadata and data fields)
- A parquet file with a column that stores variant data (but does not "shred" any of the fields)
- A parquet file with the same data as the previous file, but that stores several of the fields "shredded" (i.e. some of the fields in their own columns, as described in the VariantShredding spec for storing Variant in Parquet files)
Each of the above should:
- include some sort of human-interpretable description of the encoded values to help verify comparisons (e.g. text, markdown or JSON)
- cover the variant scalar types
- cover the variant nested types (object, array)
I recommend keeping the scalar and nested types in separate files / columns to make it easier to implement variant support incrementally (starting with non-nested types and then moving on to nested types).
Having the above data would permit other parquet implementations to start with a reader that can handle the basic types and then move on to more complex parts (like nested types and shredding). This is similar to how alltypes_plain.parquet is used today.
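As a sketch of how another implementation's tests might consume these files (the file name and column name here are hypothetical, since no repository layout has been agreed on yet), one could render each Variant back to JSON text and compare against the human-readable description shipped alongside the file, e.g. in spark-shell:
scala> // "variant/primitive.parquet" and column "v" are hypothetical names.
scala> val expected = spark.read.parquet("variant/primitive.parquet")
scala> // to_json renders a Variant value as JSON text, which can be diffed
scala> // against the human-readable expected values (for JSON-expressible types).
scala> expected.selectExpr("to_json(v)").show(false)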
Suggestions
@cashmand David Cashman suggests on the Parquet Dev list: https://lists.apache.org/thread/22dvcnm7v5d30slzc3hp8d9qq8syj1dq
Hi Andrew, you should be able to create shredded files using OSS Spark
4.0. I think the only issue is that it doesn't have the logical type
annotation yet, so readers wouldn't be able to distinguish it from a
non-variant struct that happens to have the same schema. (Spark is
able to infer that it is a Variant from the
org.apache.spark.sql.parquet.row.metadata metadata.)
The ParquetVariantShreddingSuite in Spark has some tests that write
and read shredded parquet files. Below is an example that translates
the first test into code that runs in spark-shell and writes a Parquet
file. The shredding schema is set via conf. If you want to test types
that Spark doesn't infer in parse_json (e.g. timestamp, binary), you
can use to_variant_object to cast structured values to Variant.
I won't have time to work on this in the next couple of weeks, but am
happy to answer any questions.
Thanks,
David
scala> import org.apache.spark.sql.internal.SQLConf
scala> spark.conf.set(SQLConf.VARIANT_WRITE_SHREDDING_ENABLED.key, true)
scala> spark.conf.set(SQLConf.VARIANT_ALLOW_READING_SHREDDED.key, true)
scala> spark.conf.set(SQLConf.VARIANT_FORCE_SHREDDING_SCHEMA_FOR_TEST.key, "a int, b string, c decimal(15, 1)")
scala> val df = spark.sql(
     | """
     | | select case
     | |   when id = 0 then parse_json('{"a": 1, "b": "2", "c": 3.3, "d": 4.4}')
     | |   when id = 1 then parse_json('{"a": [1,2,3], "b": "hello", "c": {"x": 0}}')
     | |   when id = 2 then parse_json('{"A": 1, "c": 1.23}')
     | | end v from range(3)
     | |""".stripMargin)
scala> df.write.mode("overwrite").parquet("/tmp/shredded_test")
scala> spark.read.parquet("/tmp/shredded_test").show
+--------------------+
| v|
+--------------------+
|{"a":1,"b":"2","c...|
|{"a":[1,2,3],"b":...|
| {"A":1,"c":1.23}|
+--------------------+
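Following the to_variant_object suggestion above, a hedged sketch (the literals and output path are illustrative, not from the original thread) for producing Variant values with types that parse_json cannot infer, such as timestamp and binary:
scala> // named_struct builds a typed struct; to_variant_object casts it to a
scala> // Variant, preserving the timestamp and binary types that JSON lacks.
scala> val df2 = spark.sql(
     |   "select to_variant_object(named_struct('ts', timestamp'2024-01-01 12:00:00', 'bin', x'CAFE')) v from range(1)")
scala> df2.write.mode("overwrite").parquet("/tmp/variant_typed_test")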
Subtasks
- Add example binary variant data and regeneration scripts #76
- Add example nested Variant values that are not JSON encodeable #77
- Add example Variant values with larger numbers of fields / array elements #78
- Add example Variant values for primitive types TimeNTZ (Type ID 17), 'timestamp with timezone' (Type ID 18+19), and UUID (Type ID 20) #79
- Potential issues with Null variant example #81
- primitive_int64.value may be an int32 type #82