Closed
Description
Describe the bug, including details regarding any error messages, version, and platform.
When looking into #45073 I found that Arrow doesn't raise an error when reading data with invalid repetition levels into Arrow list arrays.
The encryption test files include an int64 list column whose leaf values equal i * 1,000,000,000,000, where i is the leaf-value index. The repetition level is set to 1 for even leaf indices and 0 for odd indices, so the very first repetition level is 1. That is invalid, because the first value in a column has to start a new record and therefore needs a repetition level of 0. PyArrow nevertheless reads this file without raising any error, and the first leaf value (0) is silently skipped:
pyarrow.Table
int64_field: list<int64_field: int64 not null> not null
child 0, int64_field: int64 not null
----
int64_field: [[[1000000000000,2000000000000],[3000000000000,4000000000000],...,[97000000000000,98000000000000],[99000000000000]]]
I wouldn't expect an error when reading the raw values and repetition levels with the lower-level Parquet C++ API, but I think reading this data as an Arrow list should raise one.
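To make the expected behaviour concrete, here is a minimal pure-Python sketch of list assembly from repetition levels. It is not Arrow's or Parquet C++'s actual implementation: `assemble_lists` and its `strict` flag are hypothetical names, definition levels are ignored (all lists are assumed non-null and non-empty), and the lenient branch only mimics the output shown above rather than reproducing the real code path.

```python
def assemble_lists(values, rep_levels, strict=True):
    """Group leaf values into lists using Parquet-style repetition levels.

    Assumes a single list level (maximum repetition level 1):
    0 starts a new list, 1 appends to the current list.
    """
    if len(values) != len(rep_levels):
        raise ValueError("values and rep_levels must have the same length")

    rows = []
    for index, (value, rep) in enumerate(zip(values, rep_levels)):
        if rep == 0:
            # Start of a new record, so open a new list.
            rows.append([value])
        elif rep == 1:
            if not rows:
                # No list has been opened yet, so there is nothing to continue.
                if strict:
                    raise ValueError(
                        f"invalid repetition level 1 at leaf index {index}: "
                        "the first repetition level must be 0"
                    )
                continue  # lenient mode: drop the value, as in the report
            rows[-1].append(value)
        else:
            raise ValueError(f"unexpected repetition level {rep} at leaf index {index}")
    return rows


# Data shaped like the test file described above (100 leaf values assumed).
values = [i * 1_000_000_000_000 for i in range(100)]
rep_levels = [1 if i % 2 == 0 else 0 for i in range(100)]

lenient = assemble_lists(values, rep_levels, strict=False)
print(lenient[0], lenient[-1])
```

With `strict=False` the leading value 0 is dropped and the lists line up with the PyArrow output above; with `strict=True` the same input fails at leaf index 0, which is the behaviour this issue asks for when reading into an Arrow list.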
Component(s)
C++, Parquet