Skip to content

fix(parquet): Max target file size not working#16389

Closed
PingLiuPing wants to merge 2 commits intofacebookincubator:mainfrom
PingLiuPing:lp_add_parquet_target_file_size
Closed

fix(parquet): Max target file size not working#16389
PingLiuPing wants to merge 2 commits intofacebookincubator:mainfrom
PingLiuPing:lp_add_parquet_target_file_size

Conversation

@PingLiuPing
Copy link
Copy Markdown
Collaborator

Previously, writer rotation did not behave as expected when max-target-file-size was set to a value smaller than the default flush policy threshold.
Parquet only updates write stats on row group flush, so small target sizes were ignored.
For example when max-target-file-size < kDefaultBytesInRowGroup, we could not generating files close to the expected size.
We now cap Parquet’s row group byte threshold so stats advance during writes and rotation can happen.

And fixes a side effect where during writer rotation it could leave an empty file (for parquet it is an empty file with 0B size, for dwrf, it is a non-empty header only file).
The root cause is that prior to introducing writer rotation, writers were created strictly in response to actual input. After enabling rotation, closing a full writer would immediately create a new writer, regardless of whether more input was available. If no additional data arrived, it creates an empty file.
This fix makes the rotation lazy: after closing a full file it only advance the file sequence number but wait to open the next writer until new input arrives, which avoids creating an empty file.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Feb 13, 2026
@netlify
Copy link
Copy Markdown

netlify Bot commented Feb 13, 2026

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit fc7e533
🔍 Latest deploy log https://app.netlify.com/projects/meta-velox/deploys/69a55f31fb64e700083fa166

@FelixYBW
Copy link
Copy Markdown

is it related to #15751 fix?

@PingLiuPing PingLiuPing changed the title fix(parquet): Max target file size now working fix(parquet): Max target file size now working Feb 13, 2026
@PingLiuPing
Copy link
Copy Markdown
Collaborator Author

is it related to #15751 fix?

@FelixYBW No, this PR fixes a scenario that when max-target-file-size < kDefaultBytesInRowGroup, the writer rotation does not working and hence it cannot generate a file that close to max-target-file-size.

@PingLiuPing PingLiuPing changed the title fix(parquet): Max target file size now working fix(parquet): Max target file size not working Feb 16, 2026
@PingLiuPing
Copy link
Copy Markdown
Collaborator Author

@prashantgolash Could you help take a look at this part?

And fixes a side effect where during writer rotation it could leave an empty file (for parquet it is an empty file with 0B size, for dwrf, it is a non-empty header only file).

However, dwrf reader is able to read "non-empty header only file" without error, is this behavior (empty file) intentional by design?

@prashantgolash
Copy link
Copy Markdown
Contributor

prashantgolash commented Feb 17, 2026

@prashantgolash Could you help take a look at this part?

And fixes a side effect where during writer rotation it could leave an empty file (for parquet it is an empty file with 0B size, for dwrf, it is a non-empty header only file).

However, dwrf reader is able to read "non-empty header only file" without error, is this behavior (empty file) intentional by design?

I earlier thought about this but it worked for DWRF as you find and I realized this is corner case. Looks like for parquet we may need to fix this. Thanks for making the change. I will look into the diff in more detail by tomorrow EOD

Copy link
Copy Markdown
Contributor

@prashantgolash prashantgolash left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lgtm overall, have a comment around ensureFiles property

Copy link
Copy Markdown
Contributor

@prashantgolash prashantgolash Feb 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For case when ensureFiles is true, we will unconditionally create Writer. I think writer close should write footer file but that is not explicit from Writer contract. What will be behavior with parquet writer? Do we need to check for this also in finalizeWriter path to make sure this is explicitly handled.

The property intends to have files. Per comment

  // When this option is set the HiveDataSink will always write a file even
      // if there's no data. This is useful when the table is bucketed, but the
      // engine handles ensuring a 1 to 1 mapping from task to bucket.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also I think we need to rebase this on your other diff?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @prashantgolash

What will be behavior with parquet writer?

For parquet, it will generate an empty file (0 bytes).

Do we need to check for this also in finalizeWriter path to make sure this is explicitly handled.

I think I understand your point here, I'm trying to figure out if this is valid requirement for parquet. I have a fix for this to make it align with dwrf behaviour, basically I'm passing ensureFiles to writer option and during parquet writer close(), check this option together with other states and then write the metadata to parquet file.

How about land this PR first and I'll submit another PR to fix this if necessary.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opened an issue to track this #16527

@@ -1049,28 +1032,10 @@ uint32_t HiveDataSink::appendWriter(const HiveWriterId& id) {
std::move(writerPool),
std::move(sinkPool),
std::move(sortPool)));
// Set current file names for the initial file (sequence 0).
auto& newInfo = writerInfo_.back();
newInfo->currentWriteFileName = newInfo->writerParameters.writeFileName();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this change the file name? We don't need sequence number in this case

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Yuhta Thanks for the comment.
When sequence number is 0, makeSequencedFileName returns original file name.

std::string makeSequencedFileName(
const std::string& filename,
uint32_t sequenceNumber) {
if (sequenceNumber == 0) {
return filename;
}

@PingLiuPing PingLiuPing force-pushed the lp_add_parquet_target_file_size branch from fc6e786 to fc7e533 Compare March 2, 2026 09:58
@PingLiuPing
Copy link
Copy Markdown
Collaborator Author

@Yuhta Could you help to take another look at your convenience? Thank you.

Copy link
Copy Markdown
Contributor

@Yuhta Yuhta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intentional Behavioral Changes

  1. No empty trailing files after rotation: After rotateWriter(), the writer is nullptr until the next write() call. If no more data arrives for that partition, no file is created. Previously, the next writer/file was created eagerly in rotateWriter(), potentially producing an empty file.

  2. Empty files excluded from writtenFiles metadata: finalizeWriterFile() now skips recording HiveFileInfo when currentFileBytes == 0. Previously, a zero-byte file entry would be included in the commit metadata.

No Unintended Behavioral Changes

Concern Analysis Verdict
File naming for initial file makeSequencedFileName(name, 0) returns name unchanged — identical to the old code which used writerParameters.writeFileName() directly Safe
createWriterOptions() vs createWriterOptions(index) The no-arg overload is just return createWriterOptions(writerInfo_.size() - 1). In appendWriter(), the writer was just pushed, so size() - 1 equals the new writer's index. Both paths resolve to the same indexed overload with the same index. Safe
Non-reclaimable guard on lazy writer creation Old rotateWriter() created the new writer without a non-reclaimable guard. New path creates it inside write() which already holds WRITER_NON_RECLAIMABLE_SECTION_GUARD(index), plus createWriterForIndex() adds its own. The macro creates a local RAII variable (nonReclaimableGuard) — the nested one in createWriterForIndex shadows the outer one but both reference the same nonReclaimableSectionHolder, so this is harmless. This is actually more protective than the old code. Safe (improved)
Writer creation timing / memory After rotation, memory allocation for the new writer is deferred until the next write(). This changes when memory is consumed but not the total amount. Minor timing shift, no correctness issue

Summary

The only behavioral changes are the two intentional ones: no empty trailing files after rotation, and no zero-byte entries in commit metadata. The refactoring (createWriterForIndex, maybeCreateBucketSortWriter taking an index) is behavior-preserving.

@Yuhta Yuhta added the ready-to-merge PR that have been reviewed and are ready for merging. PRs with this tag notify the Velox Meta oncall label Mar 3, 2026
@meta-codesync
Copy link
Copy Markdown

meta-codesync Bot commented Mar 3, 2026

@kKPulla has imported this pull request. If you are a Meta employee, you can view this in D95063363.

@meta-codesync meta-codesync Bot closed this in e4a431e Mar 4, 2026
@meta-codesync
Copy link
Copy Markdown

meta-codesync Bot commented Mar 4, 2026

@kKPulla merged this pull request in e4a431e.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. Merged ready-to-merge PR that have been reviewed and are ready for merging. PRs with this tag notify the Velox Meta oncall

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants