[Go][Parquet] Potential inconsistency between TotalBytesWritten tracked by RowGroupWriter and actual bytes written to io.Writer #39789

@joellubi

Describe the bug, including details regarding any error messages, version, and platform.

When a ParquetWriter is configured with the following properties, the sum of the RowGroupTotalBytesWritten() values across all Write() calls does not match the number of bytes actually received by the target io.Writer.

parquetProps := parquet.NewWriterProperties(
	parquet.WithAllocator(memory.DefaultAllocator),
	parquet.WithCompression(compress.Codecs.Snappy),
	parquet.WithCompressionLevel(flate.DefaultCompression),
	parquet.WithDictionaryDefault(false),
	parquet.WithStats(false),
	parquet.WithMaxRowGroupLength(math.MaxInt64),
)
arrowProps := pqarrow.NewArrowWriterProperties(pqarrow.WithAllocator(memory.DefaultAllocator))

In this specific case, the RowGroupTotalBytesWritten() calls reported only about 10 MB for a file that was 13 MB on disk. Some of the discrepancy can be attributed to metadata that is not part of any row group, but this likely doesn't explain the entire difference. We should investigate the root cause and either fix it or document the explanation for future users of this API.

Related to arrow-adbc@1456

Component(s)

Go, Parquet
