Describe the Proposal
Currently ParquetWriter is quite heavy on memory, and the reason comes down to how the writer works.
Right now, whenever we write rows, ParquetWriter internally initializes a RowGroup and keeps adding rows to it until the row group size is greater than or equal to the configured size.
When it reaches the expected size, it flushes the Row Group (writes it to the file).
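To make the current behavior concrete, here is a minimal, purely hypothetical sketch of row-group buffering. The class name, method names, and the size estimation below are illustrative only and are not the actual ParquetWriter API:

```php
<?php

// Illustrative only: raw rows accumulate in memory until the whole
// row group reaches the configured size and is flushed at once.
final class RowGroupBuffer
{
    /** @var array<int, array<string, mixed>> raw rows held in memory */
    private array $rows = [];

    public function __construct(private readonly int $rowGroupSizeBytes)
    {
    }

    public function add(array $row) : void
    {
        $this->rows[] = $row; // every raw row stays in memory until flush
    }

    public function shouldFlush() : bool
    {
        // size estimation is illustrative; the point is that it has to
        // look at all buffered raw rows
        return \strlen(\serialize($this->rows)) >= $this->rowGroupSizeBytes;
    }

    public function flush() : string
    {
        // encoding and compression happen only here, at row group level
        $payload = \gzdeflate(\serialize($this->rows));
        $this->rows = [];

        return $payload;
    }
}
```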
API Adjustments
To improve memory consumption we need to take a slightly different approach.
Instead of holding an entire RowGroup in memory at once (a Row Group has exactly one Column Chunk for each column and multiple Pages per Column Chunk), we should hold only one Page at a time.
Whenever the page size is reached, the page should be flushed.
This means that instead of keeping rows in memory, we should aim to keep only one page at a time, and instead of holding raw values we can write them straight into a binary buffer (which will also be compressed when the page is flushed).
Then we should replicate the same behavior at the ColumnChunk builder level.
So whenever we flush a page, it should be written into a column chunk binary buffer and compressed. This way we keep only compressed pages in memory until the total row group size is reached, at which point all column chunks are flushed to the file and the row group is closed.
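A minimal sketch of the proposed page-at-a-time buffering could look like the following. PageBuilder and ColumnChunkBuilder are hypothetical names, and gzdeflate stands in for whatever compression codec is configured:

```php
<?php

// Illustrative only: values are encoded straight into a page buffer,
// pages are compressed as soon as they are full, and only compressed
// pages are kept until the row group is flushed.
final class PageBuilder
{
    private string $buffer = '';

    public function __construct(private readonly int $pageSizeBytes)
    {
    }

    public function append(string $encodedValue) : void
    {
        // no raw rows are kept, only the binary representation
        $this->buffer .= $encodedValue;
    }

    public function isFull() : bool
    {
        return \strlen($this->buffer) >= $this->pageSizeBytes;
    }

    public function flush() : string
    {
        // compress the page as soon as it is flushed
        $compressed = \gzdeflate($this->buffer);
        $this->buffer = '';

        return $compressed;
    }
}

final class ColumnChunkBuilder
{
    /** binary buffer of already-compressed pages */
    private string $chunk = '';

    public function addPage(string $compressedPage) : void
    {
        $this->chunk .= $compressedPage;
    }

    public function sizeBytes() : int
    {
        // size of this chunk's compressed pages; summing this across all
        // column chunks gives the current row group size
        return \strlen($this->chunk);
    }
}
```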
Just moving to writing data straight into binary buffers instead of keeping raw rows in memory will:
- simplify checking whether the page / row group size limit is reached, as it becomes a simple strlen on the buffer (or we can even keep a running count of written bytes, see the sketch below)
- significantly reduce the amount of data we keep in memory

which alone might be a really good first step.
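For the size check mentioned in the first point, either a strlen on the binary buffer or a running byte counter works; the sketch below (hypothetical names, not an existing API) shows the byte-counter variant:

```php
<?php

// Illustrative only: keep a running count of written bytes so the
// page / row group size check is a plain integer comparison.
final class ByteCountingBuffer
{
    private string $data = '';
    private int $bytesWritten = 0;

    public function append(string $bytes) : void
    {
        $this->data .= $bytes;
        $this->bytesWritten += \strlen($bytes);
    }

    public function bytesWritten() : int
    {
        return $this->bytesWritten;
    }

    public function release() : string
    {
        $data = $this->data;
        $this->data = '';
        $this->bytesWritten = 0;

        return $data;
    }
}
```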
Are you intending to also work on the proposed change?
Yes
Are you interested in sponsoring this change?
None
Integration & Dependencies
No response