[Python][Parquet] Empty row groups left behind after hitting max_rows_per_file in ds.write_dataset #39965

@ion-elgreco

Description

Describe the bug, including details regarding any error messages, version, and platform.

The pyarrow.dataset.write_dataset function leaves an empty row group behind in the Parquet file after the writer hits the max_rows_per_file limit. See the reproducible example below:

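import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# Assumption: `data` was not defined in the report; any one-column table larger than max_rows_per_file should do.
data = pa.table({"x": [0] * (1024 * 64)})
ds.write_dataset(data, "test_dataset", max_rows_per_file=1024 * 32, max_rows_per_group=1024 * 16, min_rows_per_group=8 * 1024, format="parquet")

# Print the metadata of every row group in the first output file.
metadata = pq.read_metadata("test_dataset/part-0.parquet")
for i in range(metadata.num_row_groups):
    print(metadata.row_group(i))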

<pyarrow._parquet.RowGroupMetaData object at 0x7f92b5168180>
  num_columns: 1
  num_rows: 16384
  total_byte_size: 61
<pyarrow._parquet.RowGroupMetaData object at 0x7f92b5185bc0>
  num_columns: 1
  num_rows: 16384
  total_byte_size: 61
<pyarrow._parquet.RowGroupMetaData object at 0x7f92b5185bc0>
  num_columns: 1
  num_rows: 0
  total_byte_size: 14
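
To check whether other files in the written dataset show the same symptom, something like the sketch below might help. It only relies on pq.read_metadata, num_row_groups and row_group(i).num_rows, as in the example above. The helper names (empty_row_group_indices, row_group_sizes, report_empty_row_groups) are hypothetical and not part of pyarrow, and the "test_dataset" directory layout is assumed to match the reproduction.

```python
from pathlib import Path


def empty_row_group_indices(row_counts):
    """Return the positions of row groups that hold zero rows."""
    return [i for i, n in enumerate(row_counts) if n == 0]


def row_group_sizes(path):
    """Read num_rows for every row group of a single Parquet file."""
    # Imported here so the pure helper above works without pyarrow installed.
    import pyarrow.parquet as pq

    metadata = pq.read_metadata(str(path))
    return [metadata.row_group(i).num_rows for i in range(metadata.num_row_groups)]


def report_empty_row_groups(dataset_dir):
    """Map each Parquet file under dataset_dir to its empty row-group indices.

    Files without any empty row group are left out of the result.
    """
    report = {}
    for path in sorted(Path(dataset_dir).rglob("*.parquet")):
        empties = empty_row_group_indices(row_group_sizes(path))
        if empties:
            report[str(path)] = empties
    return report
```

Calling report_empty_row_groups("test_dataset") after the write above would list every affected file together with the indices of its zero-row groups. For the output shown here, the row counts 16384, 16384, 0 correspond to index 2.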

Component(s)

Python
