[C++][Parquet] Regression reading byte-stream-split encoded floats with null values in Arrow 16.0.0 #41562

@adamreeve

Describe the bug, including details regarding any error messages, version, and platform.

Write byte-stream-split encoded floats containing null values:

import pyarrow as pa
import pyarrow.parquet as pq

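num_rows = 10
# Ten float32 values where the row at index 5 is null.
xs = pa.array(
        [None if i % 10 == 5 else (i / 3.14) for i in range(num_rows)],
        type=pa.float32())

table = pa.Table.from_arrays([xs], names=['x'])
# Disable dictionary encoding so that BYTE_STREAM_SPLIT is actually used.
pq.write_table(
        table, 'data.parquet',
        use_byte_stream_split=True,
        use_dictionary=False)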

And then attempt to read the data back:

import pyarrow as pa
import pyarrow.parquet as pq

table = pq.read_table('data.parquet')
xs = table['x']

num_rows = 10
assert len(xs) == num_rows
for i in range(num_rows):
    value = xs[i]
    if i % 10 == 5:
        assert not value.is_valid
    else:
        assert value.is_valid
        assert value.equals(pa.scalar(i / 3.14, type=pa.float32()))

The above code works with pyarrow 15.0.2 but fails with pyarrow 16.0.0, raising the following exception:

Traceback (most recent call last):
  File "/home/adam/dev/parquet-issues/null-byte-stream-split-regression/read_data.py", line 3, in <module>
    table = pq.read_table('data.parquet')
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/adam/dev/virtualenvs/ml/lib64/python3.12/site-packages/pyarrow/parquet/core.py", line 1811, in read_table
    return dataset.read(columns=columns, use_threads=use_threads,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/adam/dev/virtualenvs/ml/lib64/python3.12/site-packages/pyarrow/parquet/core.py", line 1454, in read
    table = self._dataset.to_table(
            ^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/_dataset.pyx", line 562, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 3804, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
OSError: Data size (36) does not match number of values in BYTE_STREAM_SPLIT (10)
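
The numbers in the error message are consistent with the repro: nine of the ten rows are non-null, and nine 4-byte floats occupy 36 bytes, whereas 10 is the row count including the null. The pure-Python sketch below illustrates this, assuming the layout described in the Parquet format specification (nulls are not stored in the data page, and BYTE_STREAM_SPLIT writes one stream per byte position). `byte_stream_split_encode` is a hypothetical helper for illustration, not Arrow's implementation, and the idea that the reader compares the data size against the total value count is an inference from the message, not something confirmed here.

```python
import struct

num_rows = 10
values = [None if i % 10 == 5 else (i / 3.14) for i in range(num_rows)]
# Nulls are tracked by definition levels, so only non-null values are encoded.
non_null = [v for v in values if v is not None]


def byte_stream_split_encode(floats):
    """Hypothetical helper: split float32 values into 4 byte streams."""
    raw = [struct.pack('<f', v) for v in floats]
    # Stream k holds the k-th byte of every value; streams are concatenated.
    return b''.join(bytes(r[k] for r in raw) for k in range(4))


encoded = byte_stream_split_encode(non_null)
print(len(encoded), len(values))  # 36 10
```

If that reading is right, the check would pass whenever a page has no nulls, which matches the observation that the error only appears when nulls are written.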

Writing the data with pyarrow 15.0.2 and reading with pyarrow 16.0.0 also fails, but writing with 16.0.0 and reading with 15.0.2 works fine. Disabling byte stream split encoding or not writing any nulls also makes the error go away.

This looks related to #28737, although the error there was quite different.

Component(s)

C++, Parquet
