Parquet - writing in batches #689

@norberttech

Description

Right now the Writer has only one way of writing parquet files:

Writer::write(string $path, Schema $schema, iterable $rows)

With this approach we can pass a \Generator, and internally RowGroupBuilder will keep reading from it, splitting the data into RowGroups by size.
Once all data is written, the writer closes the file, writing all metadata at the end of it.
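For reference, current usage looks roughly like the sketch below. The exact class and column names (Schema::with(), FlatColumn) are assumptions based on the library's public API and may differ in detail:

```php
<?php

use Flow\Parquet\ParquetFile\Schema;
use Flow\Parquet\ParquetFile\Schema\FlatColumn;
use Flow\Parquet\Writer;

// A generator is consumed lazily; RowGroupBuilder buffers rows and
// flushes a RowGroup whenever it reaches the configured target size.
$schema = Schema::with(FlatColumn::int64('id'), FlatColumn::string('name'));

$rows = static function () : \Generator {
    for ($i = 0; $i < 1_000_000; $i++) {
        yield ['id' => $i, 'name' => 'row-' . $i];
    }
};

// Single call: writes all rows and finalizes the file (metadata + close).
(new Writer())->write('/tmp/example.parquet', $schema, $rows());
```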

This approach does not let us fully integrate the Writer with ETL Loaders, because the load method is executed only for a given chunk of Rows. We would need to keep appending data to the file, but that would create too many small RowGroups, which would hurt parquet reader performance.

What we need is a method that does not close the parquet file and lets us keep adding more batches (the internal RowGroupBuilder should work exactly as it works now).
Once all data is saved, the user should call a close() method that writes the parquet file metadata at the end of the file and closes the file stream.
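A rough sketch of the proposed shape; the method names (open(), writeBatch(), close()) are illustrative only, not a final API:

```php
<?php

use Flow\Parquet\Writer;

// Hypothetical batch-oriented API: the file stays open between batches.
$writer = new Writer();
$writer->open('/tmp/example.parquet', $schema);

foreach ($chunksOfRows as $rows) {
    // Each ETL Loader load() call would forward its chunk here.
    // RowGroupBuilder keeps buffering across batches, so RowGroups
    // still reach their target size instead of one tiny group per chunk.
    $writer->writeBatch($rows);
}

// Writes the file metadata (footer) and closes the underlying stream.
$writer->close();
```

This keeps RowGroup sizing identical to the current single-call path while letting the caller control when the file is finalized.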
