[C++][Parquet] Raise an error when reading Parquet data with invalid repetition levels #45185

@adamreeve

Description

Describe the bug, including details regarding any error messages, version, and platform.

When looking into #45073, I found that Arrow doesn't raise an error when reading data with invalid repetition levels into Arrow list arrays.

The encryption test files included an int64 list column with leaf values equal to i * 1,000,000,000,000, where i is the leaf-value index. The repetition level was set to 1 for even leaf indices and 0 for odd indices, so the first repetition level was 1. That is invalid, because a repetition level of 0 marks the start of a new record, so the first value in a column must have repetition level 0. PyArrow nevertheless reads this file without raising any error, and the first leaf value (0) is silently skipped:

pyarrow.Table
int64_field: list<int64_field: int64 not null> not null
  child 0, int64_field: int64 not null
----
int64_field: [[[1000000000000,2000000000000],[3000000000000,4000000000000],...,[97000000000000,98000000000000],[99000000000000]]]

I wouldn't expect an error to be raised when reading the raw values and repetition levels with the lower-level Parquet C++ API, but I think reading this data as an Arrow list should raise an error.
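To make the problem concrete, here is a minimal pure-Python sketch of how leaf values and repetition levels map onto lists for a column with a maximum repetition level of 1. It is only an illustration, not Arrow's actual implementation: the `assemble_lists` function and its `strict` flag are hypothetical names, and the lenient branch is just my assumption about what produces the output shown above.

```python
def assemble_lists(values, rep_levels, strict=True):
    """Group leaf values into lists using repetition levels (max level 1).

    Hypothetical helper for illustration; not part of Arrow or Parquet.
    """
    if len(values) != len(rep_levels):
        raise ValueError("values and repetition levels differ in length")
    # A repetition level of 0 starts a new list, so the first level must be 0.
    if strict and rep_levels and rep_levels[0] != 0:
        raise ValueError(
            "first repetition level must be 0, got %d" % rep_levels[0]
        )
    lists = []
    for value, level in zip(values, rep_levels):
        if level == 0:
            lists.append([value])      # start a new list
        elif lists:
            lists[-1].append(value)    # continue the current list
        # else: no list is open yet, so lenient mode drops the value
    return lists


# Data laid out as in the encryption test file described above.
values = [i * 1_000_000_000_000 for i in range(100)]
rep_levels = [1 if i % 2 == 0 else 0 for i in range(100)]

# Lenient assembly silently loses the first leaf value (0).
lenient = assemble_lists(values, rep_levels, strict=False)
print(lenient[0], lenient[-1], len(lenient))

# Strict assembly rejects the data instead.
try:
    assemble_lists(values, rep_levels)
except ValueError as exc:
    print("error:", exc)
```

With `strict=False` the sketch yields the same lists as the PyArrow output above, starting at `[1000000000000, 2000000000000]` and ending at `[99000000000000]`, while the strict check is the kind of validation I'd expect when building an Arrow list array.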

Component(s)

C++, Parquet
