Cannot write a column of type DataType::List containing a DataType::Struct to parquet with parallel writing #8851

@vigimite

Description

Describe the bug

When writing a column of type List containing a Struct, the parquet writer fails with Error: ParquetError(General("Incorrect number of rows, expected 4 != 0 rows")). This appears to be a regression: the same write succeeds in datafusion v32.0.0 but fails in v33 and v34. Writing with write_json instead of write_parquet also works.

Example dataframe:

+------------------------------------+
| filters                            |
+------------------------------------+
| [{filterTypeId: 3, label: LABEL3}] |
| [{filterTypeId: 2, label: LABEL2}] |
+------------------------------------+
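For clarity, the Arrow type of this column is List&lt;Struct&lt;filterTypeId: Int64, label: Utf8&gt;&gt;. A minimal sketch that builds an equivalent array directly with the arrow crate re-exported by datafusion v33 (field names taken from the example above; treat the exact constructor signatures as my reading of that API version):

use std::sync::Arc;

use datafusion::arrow::array::{ArrayRef, Int64Array, ListArray, StringArray, StructArray};
use datafusion::arrow::buffer::OffsetBuffer;
use datafusion::arrow::datatypes::{DataType, Field, Fields};

fn example_column() -> ListArray {
    // Struct children: filterTypeId (Int64) and label (Utf8)
    let fields = Fields::from(vec![
        Field::new("filterTypeId", DataType::Int64, true),
        Field::new("label", DataType::Utf8, true),
    ]);
    let values = StructArray::new(
        fields.clone(),
        vec![
            Arc::new(Int64Array::from(vec![3, 2])) as ArrayRef,
            Arc::new(StringArray::from(vec!["LABEL3", "LABEL2"])) as ArrayRef,
        ],
        None,
    );
    // Two rows with one struct element each, hence offsets [0, 1, 2]
    let offsets = OffsetBuffer::new(vec![0, 1, 2].into());
    ListArray::new(
        Arc::new(Field::new("item", DataType::Struct(fields), true)),
        offsets,
        Arc::new(values),
        None,
    )
}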

To Reproduce

dependencies (working):

[dependencies]
tokio = { version = "1.35.1", features = ["macros"] }
datafusion = { version = "32.0.0", features = ["backtrace"] }

dependencies (broken):

[dependencies]
tokio = { version = "1.35.1", features = ["macros"] }
datafusion = { version = "33.0.0", features = ["backtrace"] }

example.json

{"filters":[{"filterTypeId":3,"label":"LABEL3"}]}
{"filters":[{"filterTypeId":2,"label":"LABEL2"}]}

main.rs

use datafusion::{dataframe::DataFrameWriteOptions, error::DataFusionError, prelude::*};

#[tokio::main]
async fn main() -> Result<(), DataFusionError> {
    let ctx = SessionContext::new();

    let df = ctx
        .read_json("example.json", NdJsonReadOptions::default())
        .await?;

    df.write_parquet("result", DataFrameWriteOptions::default(), None)
        .await?;
    Ok(())
}
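For comparison, the same program with write_json in place of write_parquet completes without error (a sketch against the v33 DataFrame API; the output path is arbitrary):

use datafusion::{dataframe::DataFrameWriteOptions, error::DataFusionError, prelude::*};

#[tokio::main]
async fn main() -> Result<(), DataFusionError> {
    let ctx = SessionContext::new();

    let df = ctx
        .read_json("example.json", NdJsonReadOptions::default())
        .await?;

    // Succeeds on the same data where write_parquet fails
    df.write_json("result_json", DataFrameWriteOptions::default())
        .await?;
    Ok(())
}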

Expected behavior

The parquet writer should write this data type successfully, as it did in v32.

Additional context

Maybe related to: apache/arrow-rs#1744

I found this issue while trying to debug a different one that came up when upgrading from v32 to v34. If the struct contains a timestamp, the error instead becomes Error: Internal("Unable to send array to writer!"), with the source error internal error: entered unreachable code: cannot downcast Int64 to byte array.

An example of such a df:

+---------------------------------------------------------------------------------+
| filters                                                                          |
+---------------------------------------------------------------------------------+
| [{assignmentStartTs: 2023-11-11T11:11:11.000Z, filterTypeId: 3, label: LABEL1}] |
| [{assignmentStartTs: 2023-11-11T11:11:11.000Z, filterTypeId: 2, label: LABEL2}] |
+---------------------------------------------------------------------------------+
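To reproduce this variant, the JSON reader needs an explicit schema, since schema inference would otherwise read the ISO-8601 string as Utf8 rather than Timestamp. A sketch under that assumption (the file name example_ts.json and the millisecond time unit are my choices):

example_ts.json

{"filters":[{"assignmentStartTs":"2023-11-11T11:11:11.000Z","filterTypeId":3,"label":"LABEL1"}]}
{"filters":[{"assignmentStartTs":"2023-11-11T11:11:11.000Z","filterTypeId":2,"label":"LABEL2"}]}

main.rs

use std::sync::Arc;

use datafusion::arrow::datatypes::{DataType, Field, Fields, Schema, TimeUnit};
use datafusion::{dataframe::DataFrameWriteOptions, error::DataFusionError, prelude::*};

#[tokio::main]
async fn main() -> Result<(), DataFusionError> {
    let ctx = SessionContext::new();

    // filters: List<Struct<assignmentStartTs: Timestamp(ms), filterTypeId: Int64, label: Utf8>>
    let struct_fields = Fields::from(vec![
        Field::new(
            "assignmentStartTs",
            DataType::Timestamp(TimeUnit::Millisecond, None),
            true,
        ),
        Field::new("filterTypeId", DataType::Int64, true),
        Field::new("label", DataType::Utf8, true),
    ]);
    let schema = Schema::new(vec![Field::new(
        "filters",
        DataType::List(Arc::new(Field::new(
            "item",
            DataType::Struct(struct_fields),
            true,
        ))),
        true,
    )]);

    let df = ctx
        .read_json("example_ts.json", NdJsonReadOptions::default().schema(&schema))
        .await?;

    df.write_parquet("result_ts", DataFrameWriteOptions::default(), None)
        .await?;
    Ok(())
}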

I tried to debug this myself by looking into the arrow-rs implementation, but I didn't manage to find the commit that changed this behavior. I also wasn't sure whether to open this bug here or in the arrow-rs project, so I hope this is OK 😃.
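If the regression is tied to the new parallel writer, disabling it per session might work around it. A sketch; the key datafusion.execution.parquet.allow_single_file_parallelism is taken from the DataFusion configuration docs, and I'm not certain it is the relevant knob here:

use datafusion::{error::DataFusionError, prelude::*};

#[tokio::main]
async fn main() -> Result<(), DataFusionError> {
    let ctx = SessionContext::new();

    // Disable the parallel single-file parquet writer for this session
    ctx.sql("SET datafusion.execution.parquet.allow_single_file_parallelism = false")
        .await?;
    Ok(())
}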
