Parquet pagination can be tricky, since the reader only knows where row groups start and end.
However, with properly configured row group/page sizes, we should be able to read row group/page metadata from the Parquet file metadata and estimate, more or less, where to start reading, even for massive files.
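A minimal sketch in plain PHP of the idea (no specific reader API assumed; the function name and array shapes are illustrative): row counts taken from the footer metadata are enough to work out which row group a given offset falls into and how many rows to skip inside it.

```php
<?php

declare(strict_types=1);

/**
 * Given row counts per row group (taken from the Parquet footer metadata)
 * and a requested row offset, return the index of the row group to start
 * reading from and how many rows to skip inside that group.
 *
 * @param array<int, int> $rowsPerGroup row counts per row group, in file order
 *
 * @return array{group: int, skip: int}
 */
function locateRowGroup(array $rowsPerGroup, int $offset) : array
{
    $firstRow = 0;

    foreach ($rowsPerGroup as $index => $rows) {
        if ($offset < $firstRow + $rows) {
            return ['group' => $index, 'skip' => $offset - $firstRow];
        }

        $firstRow += $rows;
    }

    throw new \OutOfRangeException("Offset {$offset} is beyond the last row");
}

// Example: three row groups of 100k rows each, page of 50 rows starting at offset 250000.
var_dump(locateRowGroup([100_000, 100_000, 100_000], 250_000));
// => ['group' => 2, 'skip' => 50000] – open row group 2, skip 50k rows, read 50.
```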
Obviously, bigger row groups require more memory; row group size directly affects how much memory the reader needs.
Big row groups are perfect for Spark or Hadoop, since all calculations happen in memory, which saves I/O. In PHP, however, smaller row groups and smaller pages help keep memory consumption under control, and the additional I/O from reading more pages/groups from disk is not a big deal.
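As a rough illustration of the trade-off, a back-of-the-envelope estimate of the memory needed per decompressed row group; the average row size here is an assumption you would measure on real data, not something taken from the file.

```php
<?php

declare(strict_types=1);

/**
 * Rough estimate of the memory needed to hold one decompressed row group.
 * Illustrative only: $avgRowBytes is an assumed average you would measure
 * on a sample of your real data.
 */
function estimateRowGroupMemory(int $rowsPerGroup, int $avgRowBytes) : int
{
    return $rowsPerGroup * $avgRowBytes;
}

$memoryLimit = 128 * 1024 * 1024; // e.g. a 128M PHP memory_limit

// 1M rows * ~200 bytes ≈ 200 MB – does not fit next to the rest of the process ...
var_dump(estimateRowGroupMemory(1_000_000, 200) <= $memoryLimit); // bool(false)

// ... while 100k rows * ~200 bytes ≈ 20 MB leaves plenty of headroom.
var_dump(estimateRowGroupMemory(100_000, 200) <= $memoryLimit);   // bool(true)
```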