Error reading Parquet files after schema evolution #1527

@capkurmagati

Describe the bug
(I'm not sure whether this is an arrow-rs or an arrow-datafusion bug.)
Reading Parquet files whose schema has evolved can fail with an error raised at
https://github.com/apache/arrow-rs/blob/6.0.0/parquet/src/schema/types.rs#L886-L895
It seems that the physical plan doesn't pass the desired schema to the Parquet reader
https://github.com/apache/arrow-datafusion/blob/6.0.0/datafusion/src/physical_plan/file_format/parquet.rs#L408-L422
and that ParquetFileArrowReader can only infer the schema from the file itself
https://github.com/apache/arrow-rs/blob/6.4.0/parquet/src/arrow/arrow_reader.rs#L86-L92

To Reproduce
Steps to reproduce the behavior:

  1. Create a parquet file with schema col_1 int
  2. Create another parquet file with schema col_1 int, col_2 int
  3. Implement a TableProvider that uses ParquetExec and also specifies the schema col_1 int, col_2 int in scan
  4. Register the table and run select * from the_table (the * expands to include col_2, but some of the files don't have that column)

Or

  1. Create a parquet file with schema col_1 int
  2. Create another parquet file with schema col_1 int, col_2 int
  3. Create external table via cli and select * from the_table
Either way, the query fails with the following error:

Parquet reader thread terminated due to error: ParquetError(General("Invalid Parquet file. Corrupt footer"))

Expected behavior
The query gets executed without error and returns NULL for col_2 if the file doesn't contain the data.
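
To make the expected behavior concrete, here is a minimal sketch in plain Rust (standard library only, no Arrow or DataFusion types). It is not the arrow-rs or DataFusion API: `Column`, `map_columns`, and `adapt_batch` are hypothetical names invented for illustration. It only shows the idea of mapping the table schema onto each file's schema by column name and padding columns the file lacks with NULLs.

```rust
/// One column of an in-memory batch; `None` stands for SQL NULL.
/// (Hypothetical stand-in for an Arrow array, assumed to hold ints only.)
type Column = Vec<Option<i32>>;

/// For each column of the table schema, find its position in the file schema.
/// `None` means the file was written before that column existed.
fn map_columns(table_schema: &[&str], file_schema: &[&str]) -> Vec<Option<usize>> {
    table_schema
        .iter()
        .map(|name| file_schema.iter().position(|f| f == name))
        .collect()
}

/// Reorder the file's columns to match the table schema and fill any
/// column that is missing from the file with `num_rows` NULLs.
fn adapt_batch(
    table_schema: &[&str],
    file_schema: &[&str],
    file_columns: &[Column],
    num_rows: usize,
) -> Vec<Column> {
    map_columns(table_schema, file_schema)
        .into_iter()
        .map(|idx| match idx {
            Some(i) => file_columns[i].clone(),
            None => vec![None; num_rows],
        })
        .collect()
}

fn main() {
    // Table schema after evolution: col_1 int, col_2 int.
    let table_schema = ["col_1", "col_2"];
    // The older file only contains col_1.
    let old_file_schema = ["col_1"];
    let old_file_columns = vec![vec![Some(1), Some(2)]];

    let batch = adapt_batch(&table_schema, &old_file_schema, &old_file_columns, 2);
    for (name, column) in table_schema.iter().zip(&batch) {
        println!("{name}: {column:?}");
    }
}
```

Matching by name rather than by position is a design choice in this sketch; it also covers files whose columns appear in a different order than in the table schema.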
