
Add example Variant data and parquet files #75

@alamb

Description

Use Case (What are you trying to do?)

We are trying to organize the implementation of Variant in the Rust implementation of parquet and arrow:

We would like to make sure the Rust implementation is compatible with other implementations (which seem mostly JVM / Spark focused at the moment).

From what I can tell, the JVM-based implementations are tested by verifying round trips to and from JSON. For example, see the ParquetVariantShreddingSuite:
https://github.com/apache/spark/blob/418cfd1f78014698ac4baac21156341a11b771b3/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetVariantShreddingSuite.scala#L30
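
For illustration, here is a minimal spark-shell sketch of that round-trip style (the path, values, and the use of parse_json / to_json on VARIANT are my assumptions based on Spark 4.0, not code taken from the suite):

scala> // parse JSON into a Variant column, write it, read it back, and render it as JSON again
scala> val original = Seq("""{"a": 1, "b": "hello"}""").toDF("json")
scala> original.selectExpr("parse_json(json) as v").write.mode("overwrite").parquet("/tmp/variant_roundtrip")
scala> spark.read.parquet("/tmp/variant_roundtrip").selectExpr("to_json(v) as json").show(false)

If the reader and writer agree, the final show should print the original values back (possibly with normalized whitespace).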

There are several limitations with this approach:

  1. It doesn't ensure compatibility across language implementations (it only ensures consistency between a single implementation's reader and writer)
  2. VARIANTs have more types than JSON (e.g. timestamps), so using JSON limits the range of types that can be tested (see the small illustration below)
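
A small, hedged illustration of point 2 (assuming Spark 4.0's schema_of_variant helper; the literal value is made up): once a timestamp has been rendered to JSON text, parsing it back produces a string-typed Variant field, so a JSON round trip never exercises the timestamp encoding.

scala> // the inferred Variant schema should report "ts" as a string, not a timestamp
scala> spark.sql("""select schema_of_variant(parse_json('{"ts": "2024-01-01 12:00:00"}')) as s""").show(false)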

What I want

I would like example data in the parquet-testing repository that contains:

  1. Example binary variant data (e.g. metadata and data fields)
  2. A parquet file with a column that stores variant data (but does not "shred" any of the columns)
  3. A parquet file with the same data as 2, but that stores several of the fields "shredded" (i.e. some of the fields in their own columns, as described in 'VariantShredding' for storage in parquet files)

Each of the above should have:

  1. Some sort of human-interpretable description of the encoded values to help verify comparisons (e.g. text, markdown, or JSON)
  2. Coverage of the Variant scalar types
  3. Coverage of the Variant nested types (objects, arrays, etc.)

I recommend keeping the scalar and nested types in separate files / columns to make it easier to implement Variant support incrementally (starting with non-nested types and then moving on to nested types).
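
As a hedged sketch of that split (the file names, column names, and exact type coverage below are illustrative only), the scalar and nested cases could be generated separately in spark-shell; types JSON cannot express (timestamps, binary, etc.) would need the to_variant_object / casting approach David describes below:

scala> // scalar-only Variant values, one column per scalar case
scala> val scalars = spark.sql("""
     |   select
     |     parse_json('1')        as v_int,
     |     parse_json('1.5')      as v_number,
     |     parse_json('"hello"')  as v_string,
     |     parse_json('true')     as v_bool,
     |     parse_json('null')     as v_null
     | """)
scala> // nested Variant values (objects and arrays) in a separate file
scala> val nested = spark.sql("""
     |   select
     |     parse_json('{"a": 1, "b": [1, 2, 3]}')  as v_object,
     |     parse_json('[{"x": 1}, {"x": 2}]')      as v_array
     | """)
scala> scalars.write.mode("overwrite").parquet("/tmp/variant_scalars")
scala> nested.write.mode("overwrite").parquet("/tmp/variant_nested")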

Having the above data would permit other parquet implementations to start with a reader that can handle the basic types and then move on to more complex parts (like nested types and shredding). This is similar to how alltypes_plain.parquet is used today.

Suggestions

@cashmand (David Cashman) suggested the following on the Parquet dev list: https://lists.apache.org/thread/22dvcnm7v5d30slzc3hp8d9qq8syj1dq

Hi Andrew, you should be able to create shredded files using OSS Spark
4.0. I think the only issue is that it doesn't have the logical type
annotation yet, so readers wouldn't be able to distinguish it from a
non-variant struct that happens to have the same schema. (Spark is
able to infer that it is a Variant from the
org.apache.spark.sql.parquet.row.metadata metadata.)

The ParquetVariantShreddingSuite in Spark has some tests that write
and read shredded parquet files. Below is an example that translates
the first test into code that runs in spark-shell and writes a Parquet
file. The shredding schema is set via conf. If you want to test types
that Spark doesn't infer in parse_json (e.g. timestamp, binary), you
can use to_variant_object to cast structured values to Variant.

I won't have time to work on this in the next couple of weeks, but am
happy to answer any questions.

Thanks,
David

scala> import org.apache.spark.sql.internal.SQLConf
scala> spark.conf.set(SQLConf.VARIANT_WRITE_SHREDDING_ENABLED.key, true)
scala> spark.conf.set(SQLConf.VARIANT_ALLOW_READING_SHREDDED.key, true)
scala> spark.conf.set(SQLConf.VARIANT_FORCE_SHREDDING_SCHEMA_FOR_TEST.key, "a int, b string, c decimal(15, 1)")
scala> val df = spark.sql(
     |       """
     |         | select case
     |         | when id = 0 then parse_json('{"a": 1, "b": "2", "c": 3.3, "d": 4.4}')
     |         | when id = 1 then parse_json('{"a": [1,2,3], "b": "hello", "c": {"x": 0}}')
     |         | when id = 2 then parse_json('{"A": 1, "c": 1.23}')
     |         | end v from range(3)
     |         |""".stripMargin)
scala> df.write.mode("overwrite").parquet("/tmp/shredded_test")
scala> spark.read.parquet("/tmp/shredded_test").show
+--------------------+
|                   v|
+--------------------+
|{"a":1,"b":"2","c...|
|{"a":[1,2,3],"b":...|
|    {"A":1,"c":1.23}|
+--------------------+
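
To check what was physically written (for example, whether the forced shredding schema actually produced typed_value columns), one option is to print the Parquet footer schema from the same shell. This is only a sketch: it uses parquet-java classes that happen to be on Spark's classpath and assumes the /tmp/shredded_test path from the example above.

scala> import org.apache.hadoop.conf.Configuration
scala> import org.apache.hadoop.fs.Path
scala> import org.apache.parquet.hadoop.ParquetFileReader
scala> import org.apache.parquet.hadoop.util.HadoopInputFile
scala> // pick one of the part files Spark just wrote
scala> val part = new java.io.File("/tmp/shredded_test").listFiles.map(_.getPath).filter(_.endsWith(".parquet")).head
scala> val reader = ParquetFileReader.open(HadoopInputFile.fromPath(new Path(part), new Configuration()))
scala> println(reader.getFooter.getFileMetaData.getSchema)  // prints the physical message type
scala> reader.close()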
