Fix Parquet writer to produce evenly-sized row groups #7734
marin-ma wants to merge 1 commit into facebookincubator:main
Conversation
With this patch, the row groups generated in the output file are as follows:
@mbasmanova Could you help review? Thanks!
mbasmanova left a comment:
@marin-ma Rong, thank you for the improvement, but would you explain a bit more the new logic? It will make it easier to review.
@majetideepak Deepak, would you help review this PR?
Thank you! I've included additional explanations in the PR description for clarity.
> This patch ensures the exact number of rows for each row group is flushed if possible.

@marin-ma Can you also explain the motivation for the above requirement? It should be fine if each row group in a Parquet file has a slightly different number of rows. I assume a batch to write is usually much smaller than the row-group row count. Is this assumption invalid in your case?
@majetideepak Thanks. The row-group configuration of the Parquet file produced without this modification was posted in the issue; apologies for any confusion. The key point here is that the Velox Parquet writer delays writing until the input reaches or exceeds the configured row-group size. However, when it flushes the cached RecordBatches through the Arrow Parquet writer and the total row count is larger than the configured row count, the data is divided into two row groups: the first matches the configured row count, while the second comprises the remaining data. This often results in the second row group being significantly smaller than the first.
I think it's not our case. We set the row-group row count to an expected value, and the row-group size to a very large value, to make sure the row-count threshold is reached first. Therefore, the total rows of the cached RecordBatches before flush is always >= the configured row-group row count. But each RecordBatch is small; in our case we set the
@marin-ma Thank you for the clarification. I missed the description in the linked issue.
This pull request has been automatically marked as stale because it has not had recent activity. If you'd still like this PR merged, please comment on the PR, make sure you've addressed reviewer comments, and rebase on the latest main. Thank you for your contributions!
When setting `parquet::DefaultFlushPolicy::rowsInRowGroup_`, the Velox Parquet writer might not produce consistent row groups in the output file. This patch ensures the exact number of rows for each row group is flushed if possible.

This PR makes the following changes:

When a new call to `write(data)` is made, it first flushes the cached batches if the staging rows or bytes meet the flush condition. Whether flushed or not, it then performs the following operations on the input data:

- If the data does not reach `parquet::DefaultFlushPolicy::rowsInRowGroup_`, cache the data and return.
- If the data exceeds `parquet::DefaultFlushPolicy::rowsInRowGroup_`, slice the data in a loop so that each block has exactly `rowsInRowGroup_` rows, until the remaining data <= `rowsInRowGroup_`. After slicing, the data will be like:

  `| stagingRows + sliced data1 | sliced data2 | ... | (remaining data)`

  Except for the remaining data, all blocks are flushed before the next slice. Cache the remaining data and return.
Issue: #7733