feat(table): add support for merge-on-read delete#721

Merged
zeroshade merged 14 commits intoapache:mainfrom
alexandre-normand:alex.normand/merge-on-read-delete
Feb 26, 2026

Conversation


@alexandre-normand alexandre-normand commented Feb 12, 2026

This adds support for merge-on-read deletes. It offers an alternative to copy-on-write, generating position delete files instead of rewriting existing data files.

I'm not very confident in the elegance of my solution, as I'm still new to the internals of iceberg-go, but at a high level:

  • Reuse the classification code from the existing delete implementation to split affected files into fully dropped files vs. files with partial deletes.
  • Reuse the Arrow scanning facilities to filter records from the data files with partial deletes and emit position delete records with file path and position.
    • This is done by reusing the pipeline code and making the first stage one that enriches the RecordBatch with the file path and position, before the original position is lost to filtering.
    • After filtering, the RecordBatch is projected to the position delete schema (i.e., the original schema fields are dropped).
  • Once we have the filtered PositionDelete records that need to be emitted, we reuse the record-to-file writing to generate position delete files.
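The enrich-then-filter-then-project flow above can be sketched row-wise in plain Go. This is a simplified illustration, not the actual columnar iceberg-go code: `record`, `enrich`, and `positionDeletes` are hypothetical names, and the real pipeline operates on Arrow RecordBatches.

```go
package main

import "fmt"

// record is a stand-in for one row of an Arrow RecordBatch; the real code
// works on columnar batches. All names here are illustrative, not iceberg-go's.
type record struct {
	filePath string
	pos      int64
	value    int
}

// positionDelete is the projected (file_path, pos) shape of a position
// delete row; the original data columns are dropped.
type positionDelete struct {
	FilePath string
	Pos      int64
}

// enrich tags each row with its file path and ordinal position *before* any
// filtering, so the original positions survive.
func enrich(path string, values []int) []record {
	out := make([]record, len(values))
	for i, v := range values {
		out[i] = record{filePath: path, pos: int64(i), value: v}
	}
	return out
}

// positionDeletes filters enriched rows with a delete predicate
// (here: value < 10) and projects survivors to (file_path, pos).
func positionDeletes(path string, values []int) []positionDelete {
	var deletes []positionDelete
	for _, r := range enrich(path, values) {
		if r.value < 10 {
			deletes = append(deletes, positionDelete{r.filePath, r.pos})
		}
	}
	return deletes
}

func main() {
	for _, d := range positionDeletes("s3://bucket/data/file-a.parquet", []int{5, 42, 7, 99}) {
		fmt.Printf("%s pos=%d\n", d.FilePath, d.Pos)
	}
}
```

The key ordering constraint is that enrichment runs before filtering; applying the filter first would renumber the surviving rows and produce wrong positions.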

Testing

Integration tests were added to exercise both the partitioned and unpartitioned paths. The data is shaped so that it actually produces a position delete file rather than just taking the quick path that drops an entire file because all of its records are gone.

Indirect fixes

While working on this change and adding tests for partitioned table deletions, I realized that manifest evaluation was not built correctly when the filter touched a field that was part of a partition spec. It needs to use code similar to what's done during scanning: build projections and a manifest evaluator per partition spec. This is fixed in this PR, but it technically also applies to the copy-on-write and overwrite paths, so the fix goes beyond the scope of merge-on-read.

Fixes #487.
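The per-spec evaluator fix can be sketched as a small cache keyed by partition spec ID. This is a hypothetical illustration of the pattern, not the iceberg-go API: `buildEvaluator` stands in for the real projection of the row filter onto a spec's partition fields.

```go
package main

import "fmt"

// manifest carries the partition spec ID it was written under.
type manifest struct {
	specID int
	path   string
}

type evaluator func(manifest) bool

// buildEvaluator stands in for projecting the row filter onto the partition
// fields of spec specID; the same filter projects differently per spec.
// This stub keeps everything.
func buildEvaluator(specID int) evaluator {
	return func(manifest) bool { return true }
}

// prune builds one evaluator per partition spec ID, cached, and keeps the
// manifests whose partitions may match the filter.
func prune(manifests []manifest) []string {
	cache := map[int]evaluator{}
	var keep []string
	for _, m := range manifests {
		eval, ok := cache[m.specID]
		if !ok {
			eval = buildEvaluator(m.specID)
			cache[m.specID] = eval
		}
		if eval(m) {
			keep = append(keep, m.path)
		}
	}
	return keep
}

func main() {
	fmt.Println(prune([]manifest{{specID: 0, path: "m0.avro"}, {specID: 1, path: "m1.avro"}}))
}
```

Using a single evaluator built from one spec is exactly the bug described above: manifests written under an older or newer spec would be evaluated against the wrong partition fields.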

@alexandre-normand force-pushed the alex.normand/merge-on-read-delete branch 5 times, most recently from 5079248 to 114fc57 on February 12, 2026 23:42
@alexandre-normand force-pushed the alex.normand/merge-on-read-delete branch 5 times, most recently from c9e30c4 to a7a7ce6 on February 15, 2026 06:45
@alexandre-normand marked this pull request as ready for review on February 15, 2026 06:46
@alexandre-normand force-pushed the alex.normand/merge-on-read-delete branch from a7a7ce6 to 6d41651 on February 16, 2026 19:35

@laskoviymishka laskoviymishka left a comment


Overall, this looks correct (I mostly compared it with Iceberg Java), but I think a couple of things need attention:

  • panic/recover is used in normal write-path error handling;
  • position-delete fanout needs additional tests;
  • missing focused local unit tests for invariant-heavy code (enrichRecordsWithPosDeleteFields, position-delete fanout).

Regarding the API change: the ToDataFile change is internal (package scope), but the ManifestWriter.ToManifestFile signature change is public and should be explicitly called out in the compatibility/release notes.

I would advocate for not changing this API and instead adding a new one with the extra argument, but this decision, in my opinion, is not critical.
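The suggestion above, keeping the public signature stable and adding a new function with the extra argument, can be sketched as follows. All names here (including the extra sequence-number parameter) are hypothetical and not the actual iceberg-go API.

```go
package main

import "fmt"

type ManifestFile struct {
	Path   string
	SeqNum int64
}

// ToManifestFile keeps its original signature and delegates to the new API
// with a default value, so existing callers keep compiling.
func ToManifestFile(path string) ManifestFile {
	return ToManifestFileWithSeq(path, 0)
}

// ToManifestFileWithSeq is the new entry point carrying the extra argument.
func ToManifestFileWithSeq(path string, seq int64) ManifestFile {
	return ManifestFile{Path: path, SeqNum: seq}
}

func main() {
	fmt.Println(ToManifestFile("m.avro"), ToManifestFileWithSeq("m.avro", 7))
}
```

The trade-off is API surface growth versus a breaking signature change; as noted, either choice is defensible here as long as it is called out in the release notes.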

@alexandre-normand force-pushed the alex.normand/merge-on-read-delete branch 8 times, most recently from ec1a905 to abb1cf4 on February 21, 2026 04:38

@zeroshade zeroshade left a comment


This is looking good to me. There's a conflict to resolve in README.md, and I'll wait for @laskoviymishka to give feedback before merging.

@alexandre-normand force-pushed the alex.normand/merge-on-read-delete branch from 83c50c6 to cf4aaa8 on February 25, 2026 02:50
@alexandre-normand force-pushed the alex.normand/merge-on-read-delete branch 2 times, most recently from a27e3da to aaeb40b on February 25, 2026 22:30
@alexandre-normand force-pushed the alex.normand/merge-on-read-delete branch from aaeb40b to 1311f05 on February 25, 2026 22:34
@alexandre-normand

Here's the updated status of the TODO list:

Blockers (must fix before merge)

  • JSON syntax error in TestDeleteMergeOnReadPartitioned
    Missing comma in array.TableFromJSON.
    The partitioned integration test currently panics and never runs.
  • summary() does not account for positionDeleteFiles (snapshot_producers.go)
    appendPositionDeleteFile() was added, but summary() was not updated.
    MoR delete snapshots currently report zero delete metrics.

Follow-ups (can be addressed separately)

  • Partition routing via path.Split instead of (spec, partition) pair
    (writePositionDeletesForFiles, processBatch)
  • Position delete files always written under CurrentSpec()
    (positionDeleteRecordsToDataFiles)
    At minimum, should be documented as safe only for single-spec tables.
  • writeUUID not threaded from commitUUID
    (writePositionDeletesForFiles)
  • dropFile=true sentinel batch uses data schema instead of PositionalDeleteArrowSchema
    (producePosDeletesFromTask)
    Harmless today, but a latent schema mismatch.
  • No sort by (file_path, pos)
    Performance gap vs Java’s SortedPositionDeleteWriter.
  • referenced_data_file not populated in delete manifest entries
    Optional spec optimization.
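For the missing sort noted above, the target ordering is by (file_path, pos), which is what Java's SortedPositionDeleteWriter produces. A minimal sketch with the standard library (the `positionDelete` struct is illustrative):

```go
package main

import (
	"cmp"
	"fmt"
	"slices"
)

// positionDelete mirrors the two required position-delete columns.
type positionDelete struct {
	FilePath string
	Pos      int64
}

// sortDeletes orders rows by file_path, then pos, the ordering expected
// within a position delete file.
func sortDeletes(deletes []positionDelete) {
	slices.SortFunc(deletes, func(a, b positionDelete) int {
		if c := cmp.Compare(a.FilePath, b.FilePath); c != 0 {
			return c
		}
		return cmp.Compare(a.Pos, b.Pos)
	})
}

func main() {
	deletes := []positionDelete{
		{"b.parquet", 3}, {"a.parquet", 9}, {"a.parquet", 1},
	}
	sortDeletes(deletes)
	fmt.Println(deletes) // [{a.parquet 1} {a.parquet 9} {b.parquet 3}]
}
```

As the thread notes, doing this natively over Arrow record batches needs sort support in arrow-go, which is why it is deferred to a follow-up.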

Thanks for the in-depth review, @laskoviymishka.

  • For the lack of sorting, this would be a bigger effort because arrow-go doesn't yet have support for sorting (there's a related issue, which, I think, would be the right place to put that code).

  • For the referenced_data_file, I'm going to call it out of scope for this one. I'm a little out of breath from this work and the in-the-trenches hunt for the long tail of missing little pieces, so I think this is probably a good milestone to call this one done?


@laskoviymishka laskoviymishka left a comment


Reviewed the latest changes — the core logic looks correct:

  • MoR positional delete pipeline ordering is sound (positions assigned before filtering, remain valid through projection)
  • Manifest evaluator fix for partition spec evolution is a genuine correctness improvement, applies to CoW paths too
  • pos type fix (Int32 → Int64) aligns with the Iceberg spec

Two minor issues noted, but I'll address them in follow-up PRs to keep this moving:

  1. Arrow array refcount leak in enrichRecordsWithPosDeleteFields — NewArray() results are never Release()d after NewRecordBatch retains them
  2. Goroutine leak from iter.Pull in the partitioned path of positionDeleteRecordsToDataFiles — stopCount not called in the partitioned branch (same pattern from #718)

Neither blocks correctness for typical use. LGTM to merge.
:shipit:

@zeroshade

Thanks to both of you!

@zeroshade zeroshade merged commit 8f3c302 into apache:main Feb 26, 2026
13 checks passed
@alexandre-normand

Thanks for the review, @laskoviymishka and @zeroshade !

laskoviymishka added a commit to laskoviymishka/iceberg-go that referenced this pull request Feb 28, 2026
…teFields

Arrays returned by NewArray() have refcount=1. NewRecordBatch calls
Retain() on each column, bumping to refcount=2. Without an explicit
Release() on the temporary arrays, the count never drops back to 1
when the record batch is released by the caller.

Fix by assigning NewArray() results to local variables and deferring
their Release(), so the lifecycle is: NewArray() -> refcount 1,
NewRecordBatch Retain() -> refcount 2, deferred Release() -> refcount 1
(owned by outData), caller releases outData -> refcount 0 -> freed.

Also extend TestEnrichRecordsWithPosDeleteFields to use
memory.NewCheckedAllocator with mem.AssertSize(t, 0) to catch this
class of leak going forward.

Fixes leak introduced in apache#721.
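The lifecycle in the commit message above can be modeled with a toy refcounted type; arrow-go arrays and record batches behave analogously, but the types below are purely illustrative.

```go
package main

import "fmt"

// array is a toy stand-in for an arrow array: born with refcount 1.
type array struct{ refs int }

func newArray() *array { return &array{refs: 1} } // like NewArray()

func (a *array) Retain()  { a.refs++ }
func (a *array) Release() { a.refs-- }

// recordBatch retains its columns, like arrow's NewRecordBatch.
type recordBatch struct{ cols []*array }

func newRecordBatch(cols ...*array) *recordBatch {
	for _, c := range cols {
		c.Retain()
	}
	return &recordBatch{cols: cols}
}

func (r *recordBatch) Release() {
	for _, c := range r.cols {
		c.Release()
	}
}

func main() {
	col := newArray()            // refcount 1
	batch := newRecordBatch(col) // Retain -> refcount 2
	col.Release()                // the fix: release the temporary -> refcount 1
	batch.Release()              // caller releases the batch -> refcount 0: freed
	fmt.Println("final refcount:", col.refs)
}
```

Dropping the `col.Release()` line reproduces the leak: the count bottoms out at 1 and the memory is never returned, which is what the CheckedAllocator-based test catches.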
ferhatelmas added a commit to ferhatelmas/apache-iceberg-go that referenced this pull request Mar 2, 2026
related to apache#721

* remove premature decoder close in the constructor
so that reader can actually read the entries

* add explicit close method for resource cleanup

* call close in ReadManifest to prevent leak

* add zstd codec based regression test

Signed-off-by: ferhat elmas <elmas.ferhat@gmail.com>
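The fix pattern in the commit message above, deferring cleanup from the constructor to an explicit Close invoked by the caller, can be sketched as follows. All names are illustrative stand-ins, not the real iceberg-go types.

```go
package main

import "fmt"

type decoder struct{ closed bool }

func (d *decoder) Close() error { d.closed = true; return nil }

func (d *decoder) Next() (string, bool) {
	if d.closed {
		return "", false // a prematurely closed decoder yields nothing
	}
	return "entry", true
}

type manifestReader struct{ dec *decoder }

// newManifestReader no longer closes the decoder here, so entries remain
// readable until the caller is done (the bug was a close in the constructor).
func newManifestReader() *manifestReader {
	return &manifestReader{dec: &decoder{}}
}

// Close is the new explicit cleanup method.
func (r *manifestReader) Close() error { return r.dec.Close() }

// readManifest owns the reader's lifecycle: read, then close to free the
// decoder's resources and prevent the leak.
func readManifest() []string {
	r := newManifestReader()
	defer r.Close()
	entry, ok := r.dec.Next()
	if !ok {
		return nil
	}
	return []string{entry}
}

func main() {
	fmt.Println(readManifest()) // [entry]
}
```

With the premature close still in the constructor, `Next` would immediately report exhaustion, which is the behavior the zstd regression test guards against.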
Comment on lines +608 to +610
defer func() {
_ = dec.Close()
}()

I think there is a regression here to fix the leak: #766

zeroshade pushed a commit that referenced this pull request Mar 2, 2026
}

func (p *positionDeletePartitionedFanoutWriter) partitionPath(partitionContext partitionContext) (string, error) {
data := partitionRecord(slices.Collect(maps.Values(partitionContext.partitionData)))

I think order matters here: #767

zeroshade pushed a commit that referenced this pull request Mar 4, 2026
laskoviymishka added a commit to laskoviymishka/iceberg-go that referenced this pull request Mar 4, 2026

Successfully merging this pull request may close these issues.

Support for Write Positional Delete files

4 participants