Describe the bug, including details regarding any error messages, version, and platform.
ByteStreamSplitDecoder is initialized
|
void ByteStreamSplitDecoder<DType>::SetData(int num_values, const uint8_t* data, |
with a
num_values parameter. The code assumes that this is the number of non-null values but in fact this is the value count including nulls. This breaks decoding in two ways. The first is the check
num_values * static_cast<int64_t>(sizeof(T)) > len which won't be true if there are nulls. The second is that
num_values is used as the stride parameter for decoding the stream instead of the non-null value count, so even if the aforementioned check is removed, the decoding will be wrong. The java code in parquet-mr seems to have similar problems...
I think there are two different approaches to fixing this. The first is to ascertain the non-null value count before decoding; with V2 page headers this can be determined from the header but for V1 headers this might require decoding all levels first. The second approach is to assume that the length of the values data is what it is and divide this length by the element size to get the stride. This seems to be compatible with the way both the C++ and Java encoders write these pages but it probably should be specified in the specification that there is no padding / extraneous byte at the end of the page.
Component(s)
C++, Parquet
Describe the bug, including details regarding any error messages, version, and platform.
ByteStreamSplitDecoder is initialized
arrow/cpp/src/parquet/encoding.cc
Line 2908 in ceec795
num_valuesparameter. The code assumes that this is the number of non-null values but in fact this is the value count including nulls. This breaks decoding in two ways. The first is the checknum_values * static_cast<int64_t>(sizeof(T)) > lenwhich won't be true if there are nulls. The second is thatnum_valuesis used as the stride parameter for decoding the stream instead of the non-null value count, so even if the aforementioned check is removed, the decoding will be wrong. The java code in parquet-mr seems to have similar problems...I think there are two different approaches to fixing this. The first is to ascertain the non-null value count before decoding; with V2 page headers this can be determined from the header but for V1 headers this might require decoding all levels first. The second approach is to assume that the length of the values data is what it is and divide this length by the element size to get the stride. This seems to be compatible with the way both the C++ and Java encoders write these pages but it probably should be specified in the specification that there is no padding / extraneous byte at the end of the page.
Component(s)
C++, Parquet