[C++][Python][Parquet] Files with very large data page header can't be read with pyarrow #46404

@jonded94

Describe the bug, including details regarding any error messages, version, and platform.

Hello,

internally, we wrote our own library that wraps arrow-rs to make it usable from Python.
A similar wrapper is publicly available as arro3, which I used here for a minimal reproducible example:

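import arro3.io
import pyarrow
import pyarrow.parquet

# Size in bytes of the binary value in each row (each slightly above 8 MiB)
sizes = [8388855, 8388924, 8388853, 8388880, 8388876, 8388879]

schema = pyarrow.schema([pyarrow.field("html", pyarrow.binary())])
rows = [{"html": b"0" * size} for size in sizes]

table = pyarrow.Table.from_pylist(rows, schema=schema)

# Write through arro3 (arrow-rs) with at most 3 rows per row group
path = "/tmp/foo.parquet"
with open(path, "wb") as file:
    for batch in table.to_batches():
        arro3.io.write_parquet(batch, file, max_row_group_size=len(rows) - 3)

# Read the row groups back with pyarrow; this is where the error is raised
reader = pyarrow.parquet.ParquetFile(path)
for i in range(2):
    print(len(reader.read_row_group(i)))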

This code writes a bit of dummy binary data through arrow-rs. Reading that file with pyarrow fails with:

  File "pyarrow/_parquet.pyx", line 1655, in pyarrow._parquet.ParquetReader.read_row_group
  File "pyarrow/_parquet.pyx", line 1691, in pyarrow._parquet.ParquetReader.read_row_groups
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
OSError: Couldn't deserialize thrift: No more data to read.
Deserializing page header failed.

Observations

  • Reading the same file through arro3 or our own internal library wrapping arrow-rs works just fine
  • Reading the same file through duckdb also works just fine
  • Slightly reducing the amount of binary data per row makes the error disappear (the problematic amount seems to be 8_388_855 bytes or more per row, i.e. 25_166_565 bytes or more per row group)
  • The issue is reproducible with pyarrow versions 18.1.0, 19.0.1 and 20.0.0
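
To put the reported sizes in context, here is a small arithmetic sketch using only the numbers from the reproducer above. It assumes the writer really puts exactly three rows into each row group, as requested via max_row_group_size. The comparison against 8 MiB and 16 MiB is only a guess at where a limit might sit, not something confirmed in this report.

```python
# Row sizes in bytes, copied from the reproducer.
row_sizes = [8388855, 8388924, 8388853, 8388880, 8388876, 8388879]
rows_per_group = len(row_sizes) - 3  # same value passed as max_row_group_size

MIB = 1024 * 1024

# Assumption: the writer splits the rows into groups of exactly rows_per_group.
groups = [
    row_sizes[start:start + rows_per_group]
    for start in range(0, len(row_sizes), rows_per_group)
]

for index, group in enumerate(groups):
    total = sum(group)
    print(f"row group {index}: {len(group)} rows, {total} bytes of binary data")

# Smallest per-row size reported as problematic.
threshold_per_row = 8_388_855

# Three rows of that size give the per-row-group figure from the observations.
print("per row group:", rows_per_group * threshold_per_row)

# Hypothetical comparison points (not confirmed by the report):
# how far the threshold lies above 8 MiB, and two such values above 16 MiB.
print("bytes above 8 MiB:", threshold_per_row - 8 * MIB)
print("two values, bytes above 16 MiB:", 2 * threshold_per_row - 16 * MIB)
```

Every row in the reproducer is a few hundred bytes above 8 MiB, so any single value is larger than 8 MiB and any two values together are larger than 16 MiB.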

Component(s)

C++, Python
