Describe the bug, including details regarding any error messages, version, and platform.
The pyarrow.dataset.write_dataset function leaves an empty row group behind in the Parquet file after the writer hits the max_rows_per_file limit. See the reproducible example below:
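import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq
# `data` is not defined in the original report; a single-column table with
# more rows than max_rows_per_file is assumed here.
data = pa.table({"x": [0] * (1024 * 64)})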
ds.write_dataset(data, "test_dataset", max_rows_per_file=1024*32, max_rows_per_group=1024 * 16, min_rows_per_group=8*1024, format='parquet')
metadata = pq.read_metadata("test_dataset/part-0.parquet")
for i in range(metadata.num_row_groups):
    print(metadata.row_group(i))
<pyarrow._parquet.RowGroupMetaData object at 0x7f92b5168180>
  num_columns: 1
  num_rows: 16384
  total_byte_size: 61
<pyarrow._parquet.RowGroupMetaData object at 0x7f92b5185bc0>
  num_columns: 1
  num_rows: 16384
  total_byte_size: 61
<pyarrow._parquet.RowGroupMetaData object at 0x7f92b5185bc0>
  num_columns: 1
  num_rows: 0
  total_byte_size: 14
Component(s)
Python