
[VL][Spark 3.3+] Support matching columns by fieldIds in native scan instead of falling back #2619

@gaoyangxiaozhu

Description

Currently the native Parquet scan does not support matching columns by fieldId when reading, so the example case below either falls back (once PR #2563 is ready) or returns null.

This issue tracks making the native Parquet scan able to match columns by fieldId.

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, Metadata, MetadataBuilder, StringType, StructType}
import scala.collection.JavaConverters._

val FIELD_ID_METADATA_KEY = "parquet.field.id"

def withId(id: Int): Metadata =
  new MetadataBuilder().putLong(FIELD_ID_METADATA_KEY, id).build()

// Read schema: the column names differ from the file's, but the field IDs match.
val readSchema = new StructType()
  .add("a", StringType, true, withId(0))
  .add("b", IntegerType, true, withId(1))

// Write schema: field IDs 0 and 1 are attached to "random" and "name".
val writeSchema = new StructType()
  .add("random", IntegerType, true, withId(0))
  .add("name", StringType, true, withId(1))

val writeData = Seq(Row(100, "text"), Row(200, "more"))
spark.createDataFrame(writeData.asJava, writeSchema)
  .write.mode("overwrite").parquet("/tmp/spark1/data")
val df = spark.read.schema(readSchema).parquet("/tmp/spark1/data")
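
For reference, vanilla Spark 3.3+ gates this behavior behind a reader configuration, so a sketch of the intended end state might look like the following (the config name comes from upstream Spark; whether the native scan honors the same flag is an assumption of this sketch, not something the issue states):

// Hypothetical continuation of the repro above. With field-ID matching
// enabled, Spark's reader resolves read-schema columns against the file
// by parquet.field.id instead of by column name.
spark.conf.set("spark.sql.parquet.fieldId.read.enabled", "true")

// "a" (field ID 0) now resolves to the file column "random", and
// "b" (field ID 1) to "name", despite the differing names.
val byId = spark.read.schema(readSchema).parquet("/tmp/spark1/data")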
