
Handle BYTE_ARRAY physical type in arrow-json (be able to load files output from pandas with no dtypes) #3373

@ehiggs

Description


Goal

arrow-json should be able to load Parquet files written by Python pandas with no dtypes specified.

Use case

Given the following python code:

import pandas as pd
data = '[{"a": 1, "b": "Hello", "c": {"d": "something"}, "e": [1,2,3]}]'
df = pd.read_json(data, dtype=False, orient='records')
df.to_parquet("test.parquet", engine="fastparquet", object_encoding="json", stats=False)
df2 = pd.read_parquet("test.parquet", engine="fastparquet")
print(df2)
print(df2.dtypes)

This outputs:

   a      b                   c          e
0  1  Hello  {'d': 'something'}  [1, 2, 3]
a     int64
b    object
c    object
e    object
dtype: object

The types aren't great, but pandas can write the file and read it back. ✅
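As an aside, a hedged illustration (not fastparquet's actual code path) of what `object_encoding="json"` implies: each object cell is serialized to its JSON text and stored as UTF-8 bytes, which is why the columns come out as `BYTE_ARRAY` with the `JSON` converted type in the schema below:

```python
import json

# Hedged illustration, not fastparquet internals: with
# object_encoding="json", an object cell is written as its JSON text,
# stored as UTF-8 bytes in a BYTE_ARRAY column.
cell = {"d": "something"}
encoded = json.dumps(cell).encode("utf-8")      # what lands in the column
decoded = json.loads(encoded.decode("utf-8"))   # what a reader recovers
print(decoded)  # → {'d': 'something'}
```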

Using the VSCode parquet-viewer plugin (TypeScript), we can see the loaded data:

[screenshot: parquet-viewer rendering of test.parquet]

The TypeScript/JavaScript implementation is able to load the file. ✅

However, when I try to load this using arrow-json, the following code fails:

use arrow::json::LineDelimitedWriter;
use futures::TryStreamExt;
use parquet::arrow::async_reader::{AsyncFileReader, ParquetRecordBatchStreamBuilder};

async fn parquet_to_json<T>(data: T)
where
    T: AsyncFileReader + Send + Unpin + 'static,
{

    let builder = ParquetRecordBatchStreamBuilder::new(data)
        .await
        .unwrap()
        .with_batch_size(3);
    let file_metadata = builder.metadata().file_metadata();
    println!("schema: {:?}", file_metadata.schema_descr());

    let stream = builder.build().unwrap();
    let results = stream.try_collect::<Vec<_>>().await.unwrap();
    let mut out_buf = Vec::new();
    let mut writer = LineDelimitedWriter::new(&mut out_buf);
    writer
        .write_batches(&results)
        .expect("could not write batches");
    let json_out = String::from_utf8_lossy(&out_buf);
    println!("result: {}", json_out);
}
This panics with:

thread 'main' panicked at 'could not write batches: JsonError("data type Binary not supported in nested map for json writer")'

The schema as arrow-rs knows it:

schema: SchemaDescriptor { schema: GroupType { basic_info: BasicTypeInfo { name: "schema", repetition: None, converted_type: NONE, logical_type: None, id: None }, fields: [PrimitiveType { basic_info: BasicTypeInfo { name: "a", repetition: Some(OPTIONAL), converted_type: NONE, logical_type: None, id: None }, physical_type: INT64, type_length: 64, scale: -1, precision: -1 }, PrimitiveType { basic_info: BasicTypeInfo { name: "b", repetition: Some(OPTIONAL), converted_type: JSON, logical_type: None, id: None }, physical_type: BYTE_ARRAY, type_length: -1, scale: -1, precision: -1 }, PrimitiveType { basic_info: BasicTypeInfo { name: "c", repetition: Some(OPTIONAL), converted_type: JSON, logical_type: None, id: None }, physical_type: BYTE_ARRAY, type_length: -1, scale: -1, precision: -1 }, PrimitiveType { basic_info: BasicTypeInfo { name: "e", repetition: Some(OPTIONAL), converted_type: JSON, logical_type: None, id: None }, physical_type: BYTE_ARRAY, type_length: -1, scale: -1, precision: -1 }] } }

I don't know what the Parquet spec says here, but basic files like this are loadable by other implementations, and being able to read files written by pandas is surely a significant use case.
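For what it's worth, since the schema marks these `BYTE_ARRAY` columns with the `JSON` converted type, the bytes already hold valid JSON text. A minimal sketch (in Python for brevity, with hypothetical cell values, not the arrow-rs writer itself) of how a line-delimited JSON writer could embed them verbatim instead of erroring:

```python
import json

# Hypothetical cell values as a Parquet reader might surface them:
# JSON-converted BYTE_ARRAY cells are UTF-8 JSON text.
row = {
    "a": 1,
    "b": b'"Hello"',
    "c": b'{"d": "something"}',
    "e": b"[1, 2, 3]",
}

def render_cell(value):
    # Byte cells already contain JSON text, so embed them verbatim;
    # everything else is serialized normally.
    if isinstance(value, bytes):
        return value.decode("utf-8")
    return json.dumps(value)

# One NDJSON record reconstructing the original row.
line = "{" + ", ".join(f"{json.dumps(k)}: {render_cell(v)}" for k, v in row.items()) + "}"
print(line)
```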

Related tickets / PRs:

Related ticket: #154
`BinaryArray` doesn't seem to exist anymore; I only see `Binary` as a `DataType` and `BYTE_ARRAY` in the schema output, so I wasn't sure whether this is the same issue.

There was a previous PR for the above ticket, apache/arrow#8971, which was closed. It looks like it would also have failed to do 'the right thing'.
