[Go][Parquet] Inaccurate RowGroupTotalCompressedBytes/RowGroupTotalBytesWritten with go parquet file writer #39870

@matthewmcnew

Describe the bug, including details regarding any error messages, version, and platform.

There does not appear to be an accurate way to identify or estimate the size of the current row group with pqarrow.FileWriter.

RowGroupTotalCompressedBytes() provides the total bytes of the data pages created so far, but when the dictionary page size limit is reached, the buffered data pages are flushed and the total is reset to "0". This means RowGroupTotalCompressedBytes() only reports the size of pages created after the dictionary page size limit was reached. Ideally, the total compressed size should include all created data pages.

RowGroupTotalBytesWritten() provides the total bytes of data pages once they are written, but not while a page is buffered because the dictionary page is still being built. This causes RowGroupTotalBytesWritten() to inaccurately report "0" bytes until the dictionary page size limit is reached.
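
To make the two symptoms concrete, here is a minimal toy model of the accounting as I understand it from the behavior described above. It is only a sketch and not the actual arrow/parquet code: the type toyColumnWriter, its fields, and the addPage method are hypothetical names invented for illustration, and the page sizes are made up.

```go
package main

import "fmt"

// toyColumnWriter is a hypothetical, simplified model of the counters
// described in this issue. It is NOT the arrow/parquet implementation.
type toyColumnWriter struct {
	dictLimit       int64   // dictionary page size limit, in bytes
	dictSize        int64   // current dictionary size, in bytes
	fallback        bool    // true once the dictionary limit has been reached
	buffered        []int64 // sizes of pages held back while the dictionary is open
	totalCompressed int64   // models RowGroupTotalCompressedBytes()
	totalWritten    int64   // models RowGroupTotalBytesWritten()
	trueTotal       int64   // what a caller would actually like to know
}

// addPage records one data page of pageSize bytes whose values grew the
// dictionary by dictGrowth bytes.
func (w *toyColumnWriter) addPage(pageSize, dictGrowth int64) {
	w.trueTotal += pageSize
	w.totalCompressed += pageSize
	if w.fallback {
		// Dictionary encoding was abandoned: the page is written directly.
		w.totalWritten += pageSize
		return
	}
	// The dictionary is still open, so the page is only buffered.
	w.buffered = append(w.buffered, pageSize)
	w.dictSize += dictGrowth
	if w.dictSize >= w.dictLimit {
		// Limit reached: flush the buffered pages and reset the compressed total.
		for _, p := range w.buffered {
			w.totalWritten += p
		}
		w.buffered = nil
		w.totalCompressed = 0
		w.fallback = true
	}
}

func main() {
	w := &toyColumnWriter{dictLimit: 100}
	steps := []struct{ page, dict int64 }{{40, 60}, {50, 60}, {30, 0}}
	for _, s := range steps {
		w.addPage(s.page, s.dict)
		fmt.Printf("compressed=%d written=%d true=%d\n",
			w.totalCompressed, w.totalWritten, w.trueTotal)
	}
	// Output:
	// compressed=40 written=0 true=40
	// compressed=0 written=90 true=90
	// compressed=30 written=120 true=120
}
```

In this model, totalWritten stays at 0 while the dictionary is open, and totalCompressed drops back to 0 at the flush, so the compressed counter is misleading once the limit has been reached and the written counter is misleading before it.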

Perhaps related to: #39789.

Component(s)

Go
