Skip to content

Fix Parquet writer to produce evenly-sized row groups#7734

Closed
marin-ma wants to merge 1 commit intofacebookincubator:mainfrom
marin-ma:write-fixed-rows-rowgroup
Closed

Fix Parquet writer to produce evenly-sized row groups#7734
marin-ma wants to merge 1 commit intofacebookincubator:mainfrom
marin-ma:write-fixed-rows-rowgroup

Conversation

@marin-ma
Copy link
Copy Markdown
Collaborator

@marin-ma marin-ma commented Nov 27, 2023

When setting parquet::DefaultFlushPolicy::rowsInRowGroup_, Velox parquet writer might not produce consistent row groups in the output file. This patch ensures the exact number of rows for each row group is flushed if possible.

This PR makes the following changes:

When a new call to write(data) is made, it first flushes the cached batches if the staging rows or bytes meet the flush condition. Whether flushed or not, it performs the following operations on the input data:

  1. If stagingRows + data->size() <= parquet::DefaultFlushPolicy::rowsInRowGroup_, cache the data and return.
  2. If stagingRows + data->size() > parquet::DefaultFlushPolicy::rowsInRowGroup_, slice the data in a loop so that each block has exactly rowsInRowGroup_ rows, until the remainingData <= rowsInRowGrop_.
    After slicing, the data will be like: | stagingRows + sliced data1 | sliced data2 | ... | (remaining data)
    Except for the remaining data, all data will be flushed before the next slice. Cache the remaining data and return.

Issue: #7733

@netlify
Copy link
Copy Markdown

netlify Bot commented Nov 27, 2023

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 130d291
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/656445b56b49fa000846e4b3

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Nov 27, 2023
@marin-ma
Copy link
Copy Markdown
Collaborator Author

With this patch, the row groups generated in the output file are as follows:

row group 1:     RC:1048576 TS:72804882 OFFSET:4
row group 2:     RC:1048576 TS:72791467 OFFSET:32775245
row group 3:     RC:1048576 TS:72787140 OFFSET:65551495
row group 4:     RC:1048576 TS:72822468 OFFSET:98310737
row group 5:     RC:1048576 TS:72822943 OFFSET:131105160
row group 6:     RC:1048576 TS:72809444 OFFSET:163867758
row group 7:     RC:1048576 TS:72812367 OFFSET:196626208
row group 8:     RC:1048576 TS:72818130 OFFSET:229397754
row group 9:     RC:1048576 TS:72818842 OFFSET:262175626
row group 10:    RC:1048576 TS:72810621 OFFSET:294959668
row group 11:    RC:1048576 TS:72787619 OFFSET:327777654
row group 12:    RC:1048576 TS:72809069 OFFSET:360544626
row group 13:    RC:1048576 TS:72823618 OFFSET:393311032
row group 14:    RC:1048576 TS:72819293 OFFSET:426117932
row group 15:    RC:1048576 TS:72823097 OFFSET:458912409
row group 16:    RC:1048576 TS:72824384 OFFSET:491689520
row group 17:    RC:1048576 TS:72828216 OFFSET:524490944
row group 18:    RC:1048576 TS:72819588 OFFSET:557267306
row group 19:    RC:1048576 TS:72815294 OFFSET:590077215
row group 20:    RC:1048576 TS:72822598 OFFSET:622866176
row group 21:    RC:1048576 TS:72831177 OFFSET:655643910
row group 22:    RC:1048576 TS:72809161 OFFSET:688410312
row group 23:    RC:1048576 TS:72814410 OFFSET:721209873
row group 24:    RC:1048576 TS:72812891 OFFSET:753994847
row group 25:    RC:1048576 TS:72812006 OFFSET:786783787
row group 26:    RC:1048576 TS:72815289 OFFSET:819560617
row group 27:    RC:1048576 TS:72819100 OFFSET:852365581
row group 28:    RC:1048576 TS:72818776 OFFSET:885155164
row group 29:    RC:1048576 TS:72813158 OFFSET:917951654
row group 30:    RC:841376 TS:58670551 OFFSET:950747217

@marin-ma
Copy link
Copy Markdown
Collaborator Author

@mbasmanova Could you help to review? Thanks!

@mbasmanova mbasmanova changed the title Fix for inconsistent row group sizes in velox parquet writer Fix inconsistent row group sizes in Parquet writer Nov 27, 2023
@mbasmanova mbasmanova changed the title Fix inconsistent row group sizes in Parquet writer Fix Parquet writer to produce evenly-sized row groups Nov 27, 2023
Copy link
Copy Markdown
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@marin-ma Rong, thank you for the improvement, but would you explain a bit more the new logic? It will make it easier to review.

@majetideepak Deepak, would you help review this PR?

@pedroerp pedroerp requested a review from yingsu00 November 27, 2023 16:54
@marin-ma
Copy link
Copy Markdown
Collaborator Author

@marin-ma Rong, thank you for the improvement, but would you explain a bit more the new logic? It will make it easier to review.

@majetideepak Deepak, would you help review this PR?

Thanks you! I've included additional explanations in the PR description for clarity.

@majetideepak
Copy link
Copy Markdown
Collaborator

majetideepak commented Nov 28, 2023

| This patch ensures the exact number of rows for each row group is flushed if possible.

@marin-ma Can you also explain the motivation for the above requirement? It should be fine if each row group in a Parquet file has a slightly different number of rows. I assume a batch to write is usually much smaller than the Row Group row count. Is this assumption invalid in your case?
The downside of this patch is the additional computation of chunking/slicing.
What does the exported Parquet file look like without this change?

@marin-ma
Copy link
Copy Markdown
Collaborator Author

marin-ma commented Nov 28, 2023

| This patch ensures the exact number of rows for each row group is flushed if possible.

@marin-ma Can you also explain the motivation for the above requirement? It should be fine if each row group in a Parquet file has a slightly different number of rows. I assume a batch to write is usually much smaller than the Row Group row count. Is this assumption invalid in your case? The downside of this patch is the additional computation of chunking/slicing. What does the exported Parquet file look like without this change?

@majetideepak Thanks. The row group configuration in the parquet file without this modification was posted in the issue; apologies for any confusion.

The key point here is that the Velox Parquet writer delays writing until the input reaches or exceeds the set row group size. However, when it flushes the cached RecordBatches through the Arrow Parquet writer, and the total rows is larger than the configured row number, the data is divided into two row groups. The first matches the configured row number, while the second comprises the remaining data. This often results in the second row group being significantly smaller than usual.

I assume a batch to write is usually much smaller than the Row Group row count.

I think it's not our case. We set the Row Group row count to an expected value, and Row Group size to a very large value to make sure the "Row Group row count" threshold being reached first. Therefore, the total rows of the cached recordbatches before flush is always >= "Row Group row count". But each recordbatch is small, in our case we set the core::QueryConfig::kPreferredOutputBatchRows to 4096, and "Row Group row count" to 1M.

@majetideepak
Copy link
Copy Markdown
Collaborator

@marin-ma Thank you for the clarification. I missed the description of the issue linked.
The main problem is that the last chunk can be very small leading to a small rowgroup.
This issue is also due to the usage of the WriteTable API that completely writes the input table.
I feel we should simplify the writer code path now that the Arrow Parquet Writer code has been moved to Velox.
With the current PR, there will be two places where the chunking will happen. The new chunking that is added here and the chunking inside WriteTable.
If we instead use the NewRowGroup, WriteColumnChunk APIs, we can simply the code paths significantly.
We can copy the chunking inside TableWrite to flush() and skip flushing the last chunk based on the flush property.

@stale
Copy link
Copy Markdown

stale Bot commented Feb 27, 2024

This pull request has been automatically marked as stale because it has not had recent activity. If you'd still like this PR merged, please comment on the PR, make sure you've addressed reviewer comments, and rebase on the latest main. Thank you for your contributions!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. stale

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants