[Parquet][C++] ByteStreamSplitDecoder broken in presence of nulls

### Describe the bug, including details regarding any error messages, version, and platform.

`ByteStreamSplitDecoder` is [initialized](https://github.com/apache/arrow/blob/ceec7950e8c6e9a63f48d27c284c56938df3598d/cpp/src/parquet/encoding.cc#L2908) with a `num_values` parameter. The code assumes that this is the number of non-null values, but in fact it is the value count including nulls. This breaks decoding in two ways. The first is the size check `num_values * static_cast<int64_t>(sizeof(T)) > len`, whose underlying assumption does not hold when there are nulls: the buffer contains only the non-null values, so `len` is smaller than `num_values * sizeof(T)`. The second is that `num_values`, rather than the non-null value count, is used as the stride when decoding the streams, so even if the aforementioned check is removed, the decoded values will be wrong. The Java code in parquet-mr seems to have similar problems.
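To make the two failure modes concrete, here is a minimal sketch of the layout for `float` values. It is an illustration only, not Arrow's implementation: `EncodeByteStreamSplit`, `ByteOffset` and `SizeCheckTrips` are hypothetical helper names, and the sketch assumes the decoder rejects the page when the quoted expression evaluates to true.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Illustrative encoder (hypothetical, not Arrow's code): byte k of value i
// is written to offset k * n + i, where n is the number of encoded values.
// Nulls are not encoded, so n is the non-null count.
std::vector<uint8_t> EncodeByteStreamSplit(const std::vector<float>& values) {
  const size_t n = values.size();
  std::vector<uint8_t> out(n * sizeof(float));
  for (size_t i = 0; i < n; ++i) {
    uint8_t raw[sizeof(float)];
    std::memcpy(raw, &values[i], sizeof(float));
    for (size_t k = 0; k < sizeof(float); ++k) {
      out[k * n + i] = raw[k];
    }
  }
  return out;
}

// Offset at which a decoder looks for byte k of value i, given its stride.
size_t ByteOffset(size_t k, size_t i, size_t stride) { return k * stride + i; }

// Mirrors the expression quoted in the issue. Assumption: a true result
// means the decoder treats the buffer as too small and rejects the page.
bool SizeCheckTrips(int num_values, int len) {
  return num_values * static_cast<int64_t>(sizeof(float)) > len;
}
```

For a page with five values of which two are null, only three floats are encoded, so the buffer is 12 bytes long. `SizeCheckTrips(5, 12)` is true even though the page is well formed, and with a stride of 5 the decoder looks for byte 1 of value 0 at offset 5 when the encoder wrote it at offset 3.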

I think there are two different approaches to fixing this. The first is to ascertain the non-null value count before decoding; with V2 page headers this can be determined from the header, but for V1 headers this might require decoding all levels first. The second approach is to trust the length of the values data and divide it by the element size to get the stride. This seems to be compatible with the way both the C++ and Java encoders write these pages, but the specification should probably state explicitly that there is no padding and there are no extraneous bytes at the end of the page.
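The second approach could look roughly like the following sketch for `float` values. `DecodeStrideFromLength` is a hypothetical helper, not an Arrow or parquet-mr function, and it relies on the assumption stated above that the values data carries no padding or trailing bytes.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

// Hypothetical helper (not Arrow's API): decode a BYTE_STREAM_SPLIT float
// buffer, deriving the stride from the buffer length rather than from the
// page's num_values, which may include nulls.
std::vector<float> DecodeStrideFromLength(const uint8_t* data, size_t len) {
  constexpr size_t kWidth = sizeof(float);
  // Only valid if the page has no padding or extraneous trailing bytes.
  if (len % kWidth != 0) {
    throw std::invalid_argument("length is not a multiple of the element size");
  }
  const size_t stride = len / kWidth;  // equals the non-null value count
  std::vector<float> out(stride);
  for (size_t i = 0; i < stride; ++i) {
    uint8_t raw[kWidth];
    // Gather byte k of value i from the k-th byte stream.
    for (size_t k = 0; k < kWidth; ++k) {
      raw[k] = data[k * stride + i];
    }
    std::memcpy(&out[i], raw, kWidth);
  }
  return out;
}
```

With this approach the `num_values` passed to the decoder would serve at most as an upper bound on the derived stride, since the non-null count can never exceed the total value count.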

### Component(s)

C++, Parquet

[Parquet][C++] ByteStreamSplitDecoder broken in presence of nulls #15173
