[C++][Parquet] Writing non-nullable field with nulls to Parquet generates invalid Parquet file #41667
Closed
Description
Describe the bug, including details regarding any error messages, version, and platform.
Platform: macOS 14.5 (23F79)
Version: 15.0.2 and 16.1.0.
import pyarrow as pa
import pandas as pd
import pyarrow.parquet as pq
data = {
    'column1': [1, 2, None],
    'column2': ['a', None, 'c']
}
schema = pa.schema([
    pa.field('column1', pa.int64(), nullable=True),
    pa.field('column2', pa.string(), nullable=False)  # make column2 not nullable
])
table = pa.Table.from_pydict(data, schema=schema) # set up table with data that doesn't match the schema
assert table.schema.equals(schema)
print('table before writing \n')
print(table.to_pandas())
pq.write_table(table, 'output.parquet')
table = pq.read_table('output.parquet')
print('table after writing and reading \n')
print(table.to_pandas())
yields
table before writing

   column1 column2
0      1.0       a
1      2.0    None
2      NaN       c
table after writing and reading

   column1 column2
0      1.0       a
1      2.0       c
2      NaN       a
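which is not correct for column2: after the round trip the null in row 1 has been replaced by 'c', and row 2 reads 'a' instead of 'c'.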
I would expect pa.Table.from_pydict to fail when the table is set up, which is what happens if you replace
table = pa.Table.from_pydict(data, schema=schema)
with
dataframe = pd.DataFrame(data)
table = pa.Table.from_pandas(dataframe, schema=schema)
Component(s)
Python