Description
Describe the bug
(I'm not sure if it's an arrow-rs or an arrow-datafusion bug.)
Reading parquet files with an evolved schema can produce an error at
https://github.com/apache/arrow-rs/blob/6.0.0/parquet/src/schema/types.rs#L886-L895
It seems that the physical plan doesn't pass the desired schema to the parquet reader
https://github.com/apache/arrow-datafusion/blob/6.0.0/datafusion/src/physical_plan/file_format/parquet.rs#L408-L422
and the ParquetFileArrowReader can only infer the schema from the file itself
https://github.com/apache/arrow-rs/blob/6.4.0/parquet/src/arrow/arrow_reader.rs#L86-L92
To Reproduce
Steps to reproduce the behavior:
- Create a parquet file with schema `col_1 int`
- Create another parquet file with schema `col_1 int, col_2 int`
- Implement a `TableProvider` that uses `ParquetExec` and also specifies the schema `col_1 int, col_2 int` in `scan`
- Register the table and `select * from the_table` (since `*` includes `col_2`, but some of the files don't have it)
Or
- Create a parquet file with schema `col_1 int`
- Create another parquet file with schema `col_1 int, col_2 int`
- Create an external table via the cli and `select * from the_table`
Either way, the query fails with the following error:
Parquet reader thread terminated due to error: ParquetError(General("Invalid Parquet file. Corrupt footer"))
Expected behavior
The query gets executed without error and returns NULL for col_2 if the file doesn't contain the data.
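To make the expected behavior concrete, here is a minimal sketch of the adaptation step, written against a simplified stand-in model rather than the real arrow-rs or DataFusion types. The `Column` struct and the `adapt_to_table_schema` helper are hypothetical names invented for this illustration. The idea is to match columns by name against the table schema and fill any column missing from a file with NULLs.

```rust
// Simplified stand-in model: NOT the arrow-rs or DataFusion API.
// A column is just a name plus nullable i32 values.
#[derive(Debug, Clone, PartialEq)]
struct Column {
    name: String,
    values: Vec<Option<i32>>,
}

/// Hypothetical helper: arrange the columns read from one file so they match
/// the table schema. Columns are matched by name; a column that the file does
/// not contain is produced as `num_rows` NULLs.
fn adapt_to_table_schema(
    table_schema: &[&str],
    file_columns: &[Column],
    num_rows: usize,
) -> Vec<Column> {
    table_schema
        .iter()
        .map(|name| match file_columns.iter().find(|c| c.name == *name) {
            // The file has this column: use its data as-is.
            Some(col) => col.clone(),
            // The file predates this column: fill it with NULLs.
            None => Column {
                name: name.to_string(),
                values: vec![None; num_rows],
            },
        })
        .collect()
}

fn main() {
    // Table schema after evolution: col_1 int, col_2 int.
    let table_schema = ["col_1", "col_2"];

    // Older file written with only col_1.
    let old_file = vec![Column {
        name: "col_1".to_string(),
        values: vec![Some(1), Some(2)],
    }];

    let adapted = adapt_to_table_schema(&table_schema, &old_file, 2);
    println!("{:?}", adapted);
}
```

In this model, the older file yields `col_1` unchanged and a `col_2` made up entirely of NULLs, which is what the query above would be expected to return for rows coming from that file.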