[C++][Parquet] Writing non-nullable field with nulls to Parquet generates invalid Parquet file #41667

@p-ortmann

Describe the bug, including details regarding any error messages, version, and platform.

Platform: macOS 14.5 (23F79)
Version: pyarrow 15.0.2 and 16.1.0

import pyarrow as pa
import pandas as pd
import pyarrow.parquet as pq
data = {
    'column1': [1, 2, None],
    'column2': ['a', None, 'c']
}

schema = pa.schema([
    pa.field('column1', pa.int64(), nullable=True),
    pa.field('column2', pa.string(), nullable=False)  # make column2 not nullable
])

table = pa.Table.from_pydict(data, schema=schema) # set up table with data that doesn't match the schema
assert table.schema.equals(schema)

print('table before writing \n')
print(table.to_pandas())
pq.write_table(table, 'output.parquet')

table = pq.read_table('output.parquet')

print('table after writing and reading \n')
print(table.to_pandas())

yields

table before writing 

   column1 column2
0      1.0       a
1      2.0    None
2      NaN       c
table after writing and reading 

   column1 column2
0      1.0       a
1      2.0       c
2      NaN       a

which is not correct for column2: the null in row 1 is gone, 'c' has moved up into its place, and row 2 now holds 'a'.

I would expect this to fail when the table is constructed, which is what happens if you replace

table = pa.Table.from_pydict(data, schema=schema)

with

dataframe = pd.DataFrame(data)
table = pa.Table.from_pandas(dataframe, schema=schema)

Component(s)

Python
