[Managed Iceberg] Make manifest file writes and commits more efficient#32666
ahmedabu98 merged 4 commits into apache:master
Added as a release blocker because these are update-incompatible changes. Streaming writes will be officially supported in 2.60.0, so this should go in with that release to avoid breaking pipeline updates.
Hi, a kind ping on the status of this PR. Since this was added to 2.60.0, could you please request an expedited review?
chamikaramj left a comment:
Thanks. LGTM.
We can merge after open comments are addressed.
Will merge when tests go green.
apache#32666)
* group all data files before writing a manifest file
* add to changes md
* add data file roundtrip equality test
* address comments
When writing to Iceberg, we need to write just one manifest file per snapshot.
However, we currently write one manifest file per bundle (or one per GIB batch for streaming writes), which is far more often than needed. In medium or large streaming jobs, this can produce thousands of extra manifest files. For an Iceberg table, the effect of this inefficiency is felt in two ways:
Solution:
Continue writing bundles/batches to data files, but stop writing manifest files at that frequency. Instead, group data files by destination, then write and commit just one manifest file per destination. In effect, the number of manifest files should be 1:1 with snapshots/commits (currently, it is roughly 1:1 with data files).
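
To illustrate the grouping step, here is a minimal sketch in plain Java. It is not the code from this PR and does not use the Beam or Iceberg APIs: `ManifestGroupingSketch`, `WrittenFile`, `manifestsPerBundle`, and `groupByDestination` are hypothetical names, and table names and file paths are made up. It only shows how grouping data files by destination brings the manifest count down from one per data file to one per destination.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for illustration only; these are not Beam or Iceberg classes.
final class ManifestGroupingSketch {

  /** A data file that has already been written, tagged with its destination table. */
  static final class WrittenFile {
    final String destination;
    final String path;

    WrittenFile(String destination, String path) {
      this.destination = destination;
      this.path = path;
    }
  }

  /** Old behavior (assumed model): one manifest per bundle/batch, so roughly one per data file. */
  static int manifestsPerBundle(List<WrittenFile> files) {
    return files.size();
  }

  /** New behavior: group data files by destination so each destination gets one manifest. */
  static Map<String, List<String>> groupByDestination(List<WrittenFile> files) {
    // LinkedHashMap keeps destinations in first-seen order, which makes output predictable.
    Map<String, List<String>> grouped = new LinkedHashMap<>();
    for (WrittenFile f : files) {
      grouped.computeIfAbsent(f.destination, d -> new ArrayList<>()).add(f.path);
    }
    return grouped;
  }

  public static void main(String[] args) {
    List<WrittenFile> files = Arrays.asList(
        new WrittenFile("db.orders", "orders-0001.parquet"),
        new WrittenFile("db.users", "users-0001.parquet"),
        new WrittenFile("db.orders", "orders-0002.parquet"));

    Map<String, List<String>> grouped = groupByDestination(files);

    System.out.println("manifests before: " + manifestsPerBundle(files));
    System.out.println("manifests after:  " + grouped.size());
    for (Map.Entry<String, List<String>> e : grouped.entrySet()) {
      // In a real pipeline, this is where a single manifest would be written and committed.
      System.out.println(e.getKey() + " -> " + e.getValue());
    }
  }
}
```

In a real pipeline the grouping would happen across workers rather than in a single in-memory map, but the invariant is the same: each commit to a destination carries exactly one manifest.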