Describe the bug, including details regarding any error messages, version, and platform.
Write byte-stream-split encoded floats containing null values:
import pyarrow as pa
import pyarrow.parquet as pq
num_rows = 10
xs = pa.array(
    [None if i % 10 == 5 else (i / 3.14) for i in range(num_rows)],
    type=pa.float32())
table = pa.Table.from_arrays([xs], names=['x'])
pq.write_table(
    table, 'data.parquet',
    use_byte_stream_split=True,
    use_dictionary=False)
And then attempt to read the data back:
import pyarrow as pa
import pyarrow.parquet as pq
table = pq.read_table('data.parquet')
xs = table['x']
num_rows = 10
assert len(xs) == num_rows
for i in range(num_rows):
    value = xs[i]
    if i % 10 == 5:
        assert not value.is_valid
    else:
        assert value.is_valid
        assert value.equals(pa.scalar(i / 3.14, type=pa.float32()))
The above code works with pyarrow 15.0.2 but fails with pyarrow 16.0.0 with the following exception:
Traceback (most recent call last):
  File "/home/adam/dev/parquet-issues/null-byte-stream-split-regression/read_data.py", line 3, in <module>
    table = pq.read_table('data.parquet')
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/adam/dev/virtualenvs/ml/lib64/python3.12/site-packages/pyarrow/parquet/core.py", line 1811, in read_table
    return dataset.read(columns=columns, use_threads=use_threads,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/adam/dev/virtualenvs/ml/lib64/python3.12/site-packages/pyarrow/parquet/core.py", line 1454, in read
    table = self._dataset.to_table(
            ^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/_dataset.pyx", line 562, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 3804, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
OSError: Data size (36) does not match number of values in BYTE_STREAM_SPLIT (10)
Writing the data with pyarrow 15.0.2 and reading with pyarrow 16.0.0 also fails, but writing with 16.0.0 and reading with 15.0.2 works fine. Disabling byte stream split encoding or not writing any nulls also makes the error go away.
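For what it's worth, the numbers in the error message are consistent with the null simply not being stored in the data page: 9 non-null float32 values take 36 bytes, while the column has 10 rows. The sketch below only illustrates that arithmetic with a hand-rolled byte-stream-split layout using the standard library. It does not use pyarrow and is not the Arrow implementation, and the guess that the reader is checking the data size against the total row count rather than the non-null count is my assumption, not something I have confirmed in the C++ code.

```python
import struct

num_rows = 10
# Same values as the reproduction script: row 5 is null.
values = [None if i % 10 == 5 else (i / 3.14) for i in range(num_rows)]
non_null = [v for v in values if v is not None]

FLOAT32_WIDTH = struct.calcsize("<f")  # 4 bytes per float32

# Byte stream split layout: byte k of every non-null value goes into stream k,
# and the streams are concatenated. Nulls are not part of the encoded values.
packed = [struct.pack("<f", v) for v in non_null]
streams = [bytes(p[k] for p in packed) for k in range(FLOAT32_WIDTH)]
encoded = b"".join(streams)

data_size = len(encoded)
size_if_nulls_counted = num_rows * FLOAT32_WIDTH  # hypothetical check against all rows

print(data_size, len(non_null), size_if_nulls_counted)
```

With these inputs the encoded size is 36 bytes for 9 values, whereas 10 values would need 40 bytes, which matches the "Data size (36)" and "(10)" in the exception.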
This looks related to #28737, although the error there was quite different.
Component(s)
C++, Parquet