Incorrect query results after performing file-level data deletion on a table partitioned by columns of binary type #15128

@hantangwangd

Apache Iceberg version

1.10.1 (latest release)

Query engine

None

Please describe the bug 🐞

Running the following statements in Spark on an Iceberg table returns results that do not match the expected output:

CREATE TABLE test_table (id bigint NOT NULL, data binary) USING iceberg PARTITIONED BY (data);
INSERT INTO TABLE test_table VALUES(1, X'e3bcd1'), (2, X'bcd1');
DELETE FROM test_table WHERE data = X'bcd1';
SELECT * FROM test_table WHERE data = X'e3bcd1';

The expected result is the one remaining row (id = 1, data = X'e3bcd1'), but the query returns an empty result. The cause is that the partition bounds for the binary column in the newly generated manifest file are computed incorrectly, so the corresponding data file is pruned during the planning phase.

PrestoDB encounters the same issue when using DeleteFiles.deleteFromRowFilter to support file-level deletion.

Digging deeper, the root cause is that, during a DeleteFiles.deleteFromRowFilter call, the min/max fields of PartitionFieldStats directly reference a reusable byte array. The ManifestReader reuses this array when it processes multiple files, so the recorded bounds can end up holding the value of a later entry instead of the value they were computed from.
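
The aliasing described above can be reproduced in isolation. The following is a minimal sketch, not Iceberg's actual code: `ReusingReader`, `FieldStats`, and `survivesPruning` are hypothetical stand-ins for the reader that reuses its buffer and for the min/max tracking. The order of operations (the kept file is read first, then the deleted file overwrites the shared buffer) is an assumption chosen to mirror the SQL example. It compares bytes as unsigned values and shows that copying the value before storing it as a bound avoids the problem.

```java
import java.nio.ByteBuffer;

class BoundsAliasingSketch {

  // Hypothetical reader that decodes every partition value into one shared buffer.
  static final class ReusingReader {
    private final ByteBuffer shared = ByteBuffer.allocate(16);

    ByteBuffer read(byte[] encoded) {
      shared.clear();
      shared.put(encoded);
      shared.flip();
      return shared; // same object on every call
    }
  }

  // Hypothetical min/max tracker; copyOnUpdate toggles the defensive copy.
  static final class FieldStats {
    private final boolean copyOnUpdate;
    private ByteBuffer min;
    private ByteBuffer max;

    FieldStats(boolean copyOnUpdate) {
      this.copyOnUpdate = copyOnUpdate;
    }

    void update(ByteBuffer value) {
      ByteBuffer v = copyOnUpdate ? copyOf(value) : value;
      if (min == null || compareUnsigned(v, min) < 0) min = v;
      if (max == null || compareUnsigned(v, max) > 0) max = v;
    }

    // True if a file with these bounds could contain the probe value.
    boolean mightContain(byte[] probe) {
      ByteBuffer p = ByteBuffer.wrap(probe);
      return min != null && compareUnsigned(p, min) >= 0 && compareUnsigned(p, max) <= 0;
    }
  }

  // Copies the remaining bytes without moving the source buffer's position.
  static ByteBuffer copyOf(ByteBuffer src) {
    ByteBuffer dup = src.duplicate();
    byte[] bytes = new byte[dup.remaining()];
    dup.get(bytes);
    return ByteBuffer.wrap(bytes);
  }

  // Lexicographic comparison that treats each byte as unsigned.
  static int compareUnsigned(ByteBuffer a, ByteBuffer b) {
    int n = Math.min(a.remaining(), b.remaining());
    for (int i = 0; i < n; i++) {
      int x = a.get(a.position() + i) & 0xFF;
      int y = b.get(b.position() + i) & 0xFF;
      if (x != y) return Integer.compare(x, y);
    }
    return Integer.compare(a.remaining(), b.remaining());
  }

  // Simulates rewriting a manifest with two entries, one kept and one deleted.
  static boolean survivesPruning(boolean copyOnUpdate) {
    ReusingReader reader = new ReusingReader();
    FieldStats stats = new FieldStats(copyOnUpdate);
    byte[] kept = {(byte) 0xe3, (byte) 0xbc, (byte) 0xd1};
    byte[] deleted = {(byte) 0xbc, (byte) 0xd1};

    stats.update(reader.read(kept)); // entry that stays in the new manifest
    reader.read(deleted);            // entry removed by the filter; overwrites the shared buffer
    return stats.mightContain(kept);
  }

  public static void main(String[] args) {
    System.out.println("aliased bounds keep the file: " + survivesPruning(false));
    System.out.println("copied bounds keep the file:  " + survivesPruning(true));
  }
}
```

In this sketch the aliased variant ends up with bounds equal to X'bcd1', so a lookup for X'e3bcd1' falls outside them and the file is pruned, while the copying variant keeps bounds of X'e3bcd1' and the file survives. A fix in Iceberg itself would presumably copy the value at the point where the bounds are recorded, but the exact place is not confirmed here.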

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time

Labels: bug
