Parquet - Pagination #934

@norberttech

Parquet pagination can be tricky, since the reader only knows where row groups start and end.
However, with properly configured row group and page sizes, we should be able to read the row group/page metadata from the Parquet file metadata and estimate, give or take, where to start reading, even for massive files.
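
To make the idea concrete, here is a minimal sketch of the offset lookup, assuming the per-row-group row counts have already been read from the file metadata. It is written in Python only for brevity, not as the library's API; `locate_offset`, `plan_page` and the `row_counts` list are hypothetical names used for illustration.

```python
from bisect import bisect_right
from itertools import accumulate


def locate_offset(row_counts, offset):
    """Return (row_group_index, rows_to_skip_inside_group) for a global row offset.

    row_counts is assumed to hold the number of rows of each row group,
    in file order, as read from the file metadata.
    """
    if offset < 0:
        raise ValueError("offset must be non-negative")
    # ends[i] is the total number of rows contained in row groups 0..i
    ends = list(accumulate(row_counts))
    if not ends or offset >= ends[-1]:
        raise IndexError("offset is past the last row")
    # First row group whose cumulative end is strictly greater than the offset
    group = bisect_right(ends, offset)
    group_start = ends[group] - row_counts[group]
    return group, offset - group_start


def plan_page(row_counts, offset, limit):
    """List the (row_group_index, skip, take) reads covering rows [offset, offset + limit)."""
    group, skip = locate_offset(row_counts, offset)
    plan = []
    remaining = limit
    while remaining > 0 and group < len(row_counts):
        take = min(remaining, row_counts[group] - skip)
        if take > 0:
            plan.append((group, skip, take))
        remaining -= take
        group += 1
        skip = 0  # only the first row group is entered mid-way
    return plan
```

With row groups of 1000, 1000 and 500 rows, `plan_page([1000, 1000, 500], 950, 100)` touches only the tail of the first group and the head of the second; every row group before the offset is skipped without being read. Within a row group the same lookup could be repeated over page row counts, if the file exposes them.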

Bigger row groups obviously require more memory, since the size of a row group directly drives memory consumption while reading.
Big row groups are perfect for Spark or Hadoop, since all calculations happen in memory, which saves I/O. In the case of PHP, however, smaller row groups and smaller pages can help keep memory consumption under control. The additional I/O from reading more pages/groups from disk is not a big deal.


Status: Done
