Goal
arrow-json should be able to load parquet files output from python pandas with no dtypes.
Use case
Given the following python code:
import pandas as pd
data = '[{"a": 1, "b": "Hello", "c": {"d": "something"}, "e": [1,2,3]}]'
df = pd.read_json(data, dtype=False, orient='records')
df.to_parquet("test.parquet", engine="fastparquet", object_encoding="json", stats=False)
df2 = pd.read_parquet("test.parquet", engine="fastparquet")
print(df2)
print(df2.dtypes)
This outputs:
   a      b                   c          e
0  1  Hello  {'d': 'something'}  [1, 2, 3]

a     int64
b    object
c    object
e    object
dtype: object
The types aren't great, but it can write and the file is loaded. ✅
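For context: with object_encoding="json", fastparquet stores each cell of the object columns as a UTF-8 JSON string in a BYTE_ARRAY column (this is my understanding of its behavior, not verified against the fastparquet source). A stdlib-only sketch of that encode/decode round trip:

```python
import json

# Values as pandas holds them in the three object columns (assumed).
row = {"b": "Hello", "c": {"d": "something"}, "e": [1, 2, 3]}

# On write, each cell is serialized to JSON text and stored as raw bytes.
encoded = {col: json.dumps(val).encode("utf-8") for col, val in row.items()}

# On read, a JSON-aware reader decodes the bytes back to the original values.
decoded = {col: json.loads(raw.decode("utf-8")) for col, raw in encoded.items()}

assert decoded == row
print(encoded["c"])  # b'{"d": "something"}'
```

A reader that treats these columns as opaque binary, rather than JSON text, is what produces the error below.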
Using VSCode parquet-viewer plugin (TypeScript) we can see the loaded data:

The Typescript/Javascript implementation is able to load the file ✅
However, when I try to load this file using arrow-json, I see the following error:
use futures::TryStreamExt;
use arrow::json::LineDelimitedWriter;
use parquet::arrow::async_reader::{AsyncFileReader, ParquetRecordBatchStreamBuilder};

async fn parquet_to_json<T>(data: T)
where
    T: AsyncFileReader + Send + Unpin + 'static,
{
    let builder = ParquetRecordBatchStreamBuilder::new(data)
        .await
        .unwrap()
        .with_batch_size(3);
    let file_metadata = builder.metadata().file_metadata();
    println!("schema: {:?}", file_metadata.schema_descr());
    let stream = builder.build().unwrap();
    let results = stream.try_collect::<Vec<_>>().await.unwrap();
    let mut out_buf = Vec::new();
    let mut writer = LineDelimitedWriter::new(&mut out_buf);
    writer
        .write_batches(&results)
        .expect("could not write batches");
    let json_out = String::from_utf8_lossy(&out_buf);
    println!("result: {}", json_out);
}
thread 'main' panicked at 'could not write batches: JsonError("data type Binary not supported in nested map for json writer")'
The schema as arrow-rs knows it:
schema: SchemaDescriptor { schema: GroupType { basic_info: BasicTypeInfo { name: "schema", repetition: None, converted_type: NONE, logical_type: None, id: None }, fields: [PrimitiveType { basic_info: BasicTypeInfo { name: "a", repetition: Some(OPTIONAL), converted_type: NONE, logical_type: None, id: None }, physical_type: INT64, type_length: 64, scale: -1, precision: -1 }, PrimitiveType { basic_info: BasicTypeInfo { name: "b", repetition: Some(OPTIONAL), converted_type: JSON, logical_type: None, id: None }, physical_type: BYTE_ARRAY, type_length: -1, scale: -1, precision: -1 }, PrimitiveType { basic_info: BasicTypeInfo { name: "c", repetition: Some(OPTIONAL), converted_type: JSON, logical_type: None, id: None }, physical_type: BYTE_ARRAY, type_length: -1, scale: -1, precision: -1 }, PrimitiveType { basic_info: BasicTypeInfo { name: "e", repetition: Some(OPTIONAL), converted_type: JSON, logical_type: None, id: None }, physical_type: BYTE_ARRAY, type_length: -1, scale: -1, precision: -1 }] } }
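Reading that schema dump: `a` is a plain INT64, while `b`, `c`, and `e` are BYTE_ARRAY columns with converted_type JSON. My guess at the failure (an assumption, not confirmed from the arrow-rs source) is that arrow-rs maps those columns to the Arrow Binary data type rather than Utf8, and the JSON writer then rejects Binary. A stdlib-only sketch of the fallback a JSON-aware writer could apply to such a column (the cell contents here are illustrative, not read from the actual file):

```python
import json

# Hypothetical raw BYTE_ARRAY cells for column "e" (converted_type JSON),
# including a null to show how missing values would pass through.
binary_column = [b"[1, 2, 3]", None, b"[4, 5]"]

def binary_json_to_values(cells):
    """Decode JSON-encoded byte cells into Python values, preserving nulls."""
    return [None if c is None else json.loads(c.decode("utf-8")) for c in cells]

print(binary_json_to_values(binary_column))  # [[1, 2, 3], None, [4, 5]]
```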
I don't know what the parquet spec says here, but these basic files are loadable by other implementations, and being able to read files written by pandas must surely be a significant use case.
Related tickets / PRs:
Related ticket: #154
BinaryArray doesn't seem to exist (anymore?): I only see Binary as a DataType and BYTE_ARRAY in the schema output, so I wasn't sure whether this is the same issue.
There was a previous PR for the above ticket, apache/arrow#8971, which was closed. It looks like that change would also have failed to do 'the right thing'.