Describe the bug
When writing a column of type List that contains a Struct, the parquet writer fails with Error: ParquetError(General("Incorrect number of rows, expected 4 != 0 rows")). This appears to be a regression: it works in datafusion v32.0.0 but fails in v33 and v34. Writing the same data with write_json instead of write_parquet succeeds.
Example dataframe:
+------------------------------------+
| filters |
+------------------------------------+
| [{filterTypeId: 3, label: LABEL3}] |
| [{filterTypeId: 2, label: LABEL2}] |
+------------------------------------+
To Reproduce
dependencies (working):
[dependencies]
tokio = { version = "1.35.1", features = ["macros"] }
datafusion = { version = "32.0.0", features = ["backtrace"] }
dependencies (broken):
[dependencies]
tokio = { version = "1.35.1", features = ["macros"] }
datafusion = { version = "33.0.0", features = ["backtrace"] }
example.json
{"filters":[{"filterTypeId":3,"label":"LABEL3"}]}
{"filters":[{"filterTypeId":2,"label":"LABEL2"}]}
main.rs
use datafusion::{dataframe::DataFrameWriteOptions, error::DataFusionError, prelude::*};
#[tokio::main]
async fn main() -> Result<(), DataFusionError> {
let ctx = SessionContext::new();
let df = ctx
.read_json("example.json", NdJsonReadOptions::default())
.await?;
df.write_parquet("result", DataFrameWriteOptions::default(), None)
.await?;
Ok(())
}
Expected behavior
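As noted above, write_json succeeds where write_parquet fails, so the same frame can be written as NDJSON as a stopgap. A minimal sketch of that workaround, assuming datafusion v33's two-argument write_json signature and a hypothetical output path "result_json" (neither is confirmed by the report):

```rust
use datafusion::{dataframe::DataFrameWriteOptions, error::DataFusionError, prelude::*};

#[tokio::main]
async fn main() -> Result<(), DataFusionError> {
    let ctx = SessionContext::new();
    // Read the same example.json used in the repro above
    let df = ctx
        .read_json("example.json", NdJsonReadOptions::default())
        .await?;
    // write_json handles the List-of-Struct column that write_parquet rejects
    df.write_json("result_json", DataFrameWriteOptions::default())
        .await?;
    Ok(())
}
```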
The parquet writer should support writing this datatype, as it did in v32.
Additional context
Maybe related to: apache/arrow-rs#1744
I found this issue while debugging a different one that came up when upgrading from v32 to v34. If the struct contains a timestamp, the error instead becomes Error: Internal("Unable to send array to writer!") with the source error internal error: entered unreachable code: cannot downcast Int64 to byte array.
An example of such a df:
+---------------------------------------------------------------------------------+
| filters |
+---------------------------------------------------------------------------------+
| [{assignmentStartTs: 2023-11-11T11:11:11.000Z, filterTypeId: 3, label: LABEL1}] |
| [{assignmentStartTs: 2023-11-11T11:11:11.000Z, filterTypeId: 2, label: LABEL2}] |
+---------------------------------------------------------------------------------+
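For reference, an input file producing a frame like this might look as follows. The field values are hypothetical, reconstructed from the rows shown above, and note that default NdJsonReadOptions schema inference may read assignmentStartTs as a string rather than a timestamp, in which case an explicit schema would be needed to hit the timestamp variant of the error:

```json
{"filters":[{"assignmentStartTs":"2023-11-11T11:11:11.000Z","filterTypeId":3,"label":"LABEL1"}]}
{"filters":[{"assignmentStartTs":"2023-11-11T11:11:11.000Z","filterTypeId":2,"label":"LABEL2"}]}
```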
I tried to debug this myself by looking into the arrow-rs implementation, but I didn't manage to find the commit that changed this behavior. I also wasn't sure whether to open this bug here or in the arrow-rs project, so I hope this is ok 😃.