Description
Use Case (What are you trying to do?)
We are trying to organize the implementation of Variant in the Rust implementations of Parquet and Arrow:
We would like to make sure the Rust implementation is compatible with other implementations (which seem mostly JVM / Spark focused at the moment).
From what I can tell, the JVM based implementations are tested by verifying that values round-trip to and from JSON. For example, the ParquetVariantShreddingSuite:
https://github.com/apache/spark/blob/418cfd1f78014698ac4baac21156341a11b771b3/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetVariantShreddingSuite.scala#L30
There are several limitations with this approach:
- it doesn't ensure compatibility across language implementations (it only ensures consistency between a single implementation's reader and writer)
- VARIANTs have more types than JSON (e.g. timestamps, binary), so using JSON limits the range of testable types (see the sketch below)
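To make the second limitation concrete, here is a minimal spark-shell sketch (assuming Spark 4.0, which provides parse_json and schema_of_variant). A timestamp that has been serialized to JSON text comes back from parse_json as a Variant string, so a JSON round trip never exercises the Variant timestamp type:
scala> // schema_of_variant should report ts as STRING rather than TIMESTAMP,
scala> // because JSON itself has no timestamp type to preserve.
scala> spark.sql("""select schema_of_variant(parse_json('{"ts": "2024-01-01T00:00:00Z"}')) s""").show(false)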
What I want
I would like example data in the parquet-testing repository that contains:
- Example binary variant data (e.g. metadata and data fields)
- A parquet file with a column that stores variant data (but does not "shred" any of the fields)
- A parquet file with the same data as the previous file, but that stores several of the fields "shredded" (i.e. some of the fields in their own columns, as described in the VariantShredding spec for storing Variant in Parquet files)
Each of the above should:
- include some sort of human-interpretable description of the encoded values to help verify comparisons (e.g. text, markdown or JSON)
- cover the variant scalar types
- cover the variant nested types (object, array)
I recommend keeping the scalar and nested types in separate files / columns to make it easier to implement variant support incrementally (starting with non-nested types and then moving on to nested types).
Having the above data would permit other parquet implementations to start with a reader that can handle the basic types and then move on to more complex parts (like nested types and shredding). This is similar to how alltypes_plain.parquet is used today.
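As a sketch of how another implementation's tests might consume these files (the file name and column name here are hypothetical, since no repository layout has been agreed on yet), one could render each Variant back to JSON text and compare against the human-readable description shipped alongside the file, e.g. in spark-shell:
scala> // "variant/primitive.parquet" and column "v" are hypothetical names.
scala> val expected = spark.read.parquet("variant/primitive.parquet")
scala> // to_json renders a Variant value as JSON text, which can be diffed
scala> // against the human-readable expected values (for JSON-expressible types).
scala> expected.selectExpr("to_json(v)").show(false)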
Suggestions
@cashmand David Cashman suggests on the Parquet Dev list: https://lists.apache.org/thread/22dvcnm7v5d30slzc3hp8d9qq8syj1dq
Hi Andrew, you should be able to create shredded files using OSS Spark
4.0. I think the only issue is that it doesn't have the logical type
annotation yet, so readers wouldn't be able to distinguish it from a
non-variant struct that happens to have the same schema. (Spark is
able to infer that it is a Variant from the
org.apache.spark.sql.parquet.row.metadata metadata.)
The ParquetVariantShreddingSuite in Spark has some tests that write
and read shredded parquet files. Below is an example that translates
the first test into code that runs in spark-shell and writes a Parquet
file. The shredding schema is set via conf. If you want to test types
that Spark doesn't infer in parse_json (e.g. timestamp, binary), you
can use to_variant_object to cast structured values to Variant.
I won't have time to work on this in the next couple of weeks, but am
happy to answer any questions.
Thanks,
David
scala> import org.apache.spark.sql.internal.SQLConf
scala> spark.conf.set(SQLConf.VARIANT_WRITE_SHREDDING_ENABLED.key, true)
scala> spark.conf.set(SQLConf.VARIANT_ALLOW_READING_SHREDDED.key, true)
scala> spark.conf.set(SQLConf.VARIANT_FORCE_SHREDDING_SCHEMA_FOR_TEST.key, "a int, b string, c decimal(15, 1)")
scala> val df = spark.sql(
     | """
     | | select case
     | |   when id = 0 then parse_json('{"a": 1, "b": "2", "c": 3.3, "d": 4.4}')
     | |   when id = 1 then parse_json('{"a": [1,2,3], "b": "hello", "c": {"x": 0}}')
     | |   when id = 2 then parse_json('{"A": 1, "c": 1.23}')
     | | end v from range(3)
     | |""".stripMargin)
scala> df.write.mode("overwrite").parquet("/tmp/shredded_test")
scala> spark.read.parquet("/tmp/shredded_test").show
+--------------------+
| v|
+--------------------+
|{"a":1,"b":"2","c...|
|{"a":[1,2,3],"b":...|
| {"A":1,"c":1.23}|
+--------------------+
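Following the to_variant_object suggestion above, a hedged sketch (the literals and output path are illustrative, not from the original thread) for producing Variant values with types that parse_json cannot infer, such as timestamp and binary:
scala> // named_struct builds a typed struct; to_variant_object casts it to a
scala> // Variant, preserving the timestamp and binary types that JSON lacks.
scala> val df2 = spark.sql(
     |   "select to_variant_object(named_struct('ts', timestamp'2024-01-01 12:00:00', 'bin', x'CAFE')) v from range(1)")
scala> df2.write.mode("overwrite").parquet("/tmp/variant_typed_test")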
Subtasks
- Add example binary variant data and regeneration scripts #76
- Add example nested Variant values that are not JSON encodeable #77
- Add example Variant values with larger numbers of fields / array elements #78
- Add example Variant values for primitive types TimeNTZ (Type ID 17), 'timestamp with timezone' (Type ID 18+19), and UUID (Type ID 20) #79
- Potential issues with Null variant example #81
- primitive_int64.value may be an int32 type #82