GH-3040: DictionaryFilter.canDrop may return false positive result when dict size exceeds 8k #3041
Fokko merged 4 commits into apache:master
Conversation
```java
super(buf);
}

// In practice, some implementations always return 0 even if they have more data
```
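To illustrate the pitfall this comment describes, here is a standalone sketch (pure JDK; the class and method names are my own, not from the PR) showing that a copy loop keyed on `available()` silently drops data when the stream always reports 0, while reading until `-1` recovers everything:

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

public class AvailablePitfall {
    // Hypothetical wrapper: available() always reports 0, which the
    // InputStream contract permits even when more data can be read.
    static InputStream zeroAvailable(byte[] data) {
        return new FilterInputStream(new ByteArrayInputStream(data)) {
            @Override
            public int available() {
                return 0;
            }
        };
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[9 * 1024];

        // Naive loop keyed on available(): stops immediately, copies nothing.
        InputStream in = zeroAvailable(data);
        int naive = 0;
        while (in.available() > 0) {
            if (in.read() < 0) break;
            naive++;
        }

        // Correct loop: read until the stream signals EOF with -1.
        in = zeroAvailable(data);
        int correct = 0;
        while (in.read() >= 0) {
            correct++;
        }

        System.out.println(naive + " " + correct); // prints "0 9216"
    }
}
```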
In my case, the underlying InputStream is H1SeekableInputStream.
Out of curiosity. Why are you using H1SeekableInputStream? This one is related to Hadoop 1.
My test code is as follows (sorry, the file contains private data so I cannot share it):

```java
import static org.apache.parquet.filter2.dictionarylevel.DictionaryFilter.canDrop;
import static org.apache.parquet.filter2.predicate.FilterApi.binaryColumn;
import static org.apache.parquet.filter2.predicate.FilterApi.eq;
import static org.junit.Assert.assertFalse;

import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.page.DictionaryPageReadStore;
import org.apache.parquet.filter2.predicate.FilterPredicate;
import org.apache.parquet.filter2.predicate.Operators.BinaryColumn;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;
import org.apache.parquet.io.api.Binary;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;

public class MyDictionaryFilterTest {

  private static final Configuration conf = new Configuration();

  List<ColumnChunkMetaData> ccmd;
  ParquetFileReader reader;
  DictionaryPageReadStore dictionaries;

  private Path file = new Path("/Users/chengpan/Temp/part-2bb8404a-f6e5-4e9f-9161-f749c4bf46d0-2-2222");

  @Before
  public void setUp() throws Exception {
    reader = ParquetFileReader.open(conf, file);
    ParquetMetadata meta = reader.getFooter();
    ccmd = meta.getBlocks().get(0).getColumns();
    dictionaries = reader.getDictionaryReader(meta.getBlocks().get(0));
  }

  @After
  public void tearDown() throws Exception {
    reader.close();
  }

  @Test
  public void testEqBinary() throws Exception {
    BinaryColumn b = binaryColumn("source_id");
    FilterPredicate pred = eq(b, Binary.fromString("5059661515"));
    assertFalse(canDrop(pred, ccmd, dictionaries));
  }
}
```
Thanks, this is very helpful!
```java
byte[] input = new byte[data.length + 10];
RANDOM.nextBytes(input);
System.arraycopy(data, 0, input, 0, data.length);
Supplier<BytesInput> factory = () -> BytesInput.from(new AvailableAgnosticInputStream(input), 9 * 1024);
```
What about using an anonymous class here instead of adding a new file?
I tend to use a new file since this is a fairly common case that needs to be tested; it might be used in other places in the future.
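For readers following along, here is a minimal reconstruction of what such a helper could look like (a sketch under my own assumptions, not the actual class from the PR): a `FilterInputStream` that delegates all reads but always reports 0 from `available()`:

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical reconstruction of the test helper: it delegates all reads to
// the wrapped stream but always reports 0 available bytes, mimicking real
// streams that never implement available() usefully.
public class AvailableAgnosticInputStream extends FilterInputStream {

    public AvailableAgnosticInputStream(byte[] data) {
        super(new ByteArrayInputStream(data));
    }

    @Override
    public int available() {
        return 0; // always 0, even when more data is readable
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {1, 2, 3, 4, 5};
        try (InputStream in = new AvailableAgnosticInputStream(data)) {
            System.out.println(in.available()); // prints 0 despite 5 readable bytes
            int count = 0;
            while (in.read() >= 0) {
                count++;
            }
            System.out.println(count); // all 5 bytes were still readable
        }
    }
}
```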
```java
ReadableByteChannel channel = Channels.newChannel(in);
int remaining = byteCount;
while (remaining > 0) {
  remaining -= channel.read(workBuf);
```
Is `remaining` reliable? Should we check the return value of `channel.read(workBuf)`?
Added a check to detect the EOF case.
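A sketch of what such an EOF check can look like (this is an assumed shape, not the exact patch): `ReadableByteChannel.read` returns -1 at end-of-stream, so subtracting its result blindly would increase `remaining` and spin forever:

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;

public class ReadFully {

    // Sketch of a fixed loop: channel.read returns -1 at EOF, so the return
    // value must be checked before it is subtracted from remaining.
    static void readFully(ReadableByteChannel channel, ByteBuffer workBuf, int byteCount)
            throws IOException {
        int remaining = byteCount;
        while (remaining > 0) {
            int bytesRead = channel.read(workBuf);
            if (bytesRead < 0) {
                throw new EOFException("Reached end of stream with " + remaining + " bytes left to read");
            }
            remaining -= bytesRead;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[100];
        ByteBuffer buf = ByteBuffer.allocate(100);
        readFully(Channels.newChannel(new ByteArrayInputStream(data)), buf, 100);
        System.out.println(buf.position()); // prints 100: the buffer was fully filled
    }
}
```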
```java
ByteBuffer workBuf = buffer.duplicate();
int pos = buffer.position();
workBuf.limit(pos + byteCount);
Channels.newChannel(in).read(workBuf);
```
Is there any other place that uses this pattern?
I went through the original PR and found nothing else; it would be great if others could double-check.
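To show why a single unchecked `read` call at a site like this is suspect, here is a self-contained sketch (JDK only; the trickling stream is hypothetical) where one `channel.read` call fills only part of the requested range:

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;

public class SingleReadPitfall {
    public static void main(String[] args) throws IOException {
        byte[] data = new byte[100];

        // Hypothetical trickling stream: hands out at most 10 bytes per read
        // call, which the InputStream contract allows.
        InputStream trickle = new FilterInputStream(new ByteArrayInputStream(data)) {
            @Override
            public int read(byte[] b, int off, int len) throws IOException {
                return super.read(b, off, Math.min(len, 10));
            }
        };

        // A channel read "might not fill the buffer" (ReadableByteChannel
        // javadoc), so a single call can leave most of the range uncopied.
        ReadableByteChannel channel = Channels.newChannel(trickle);
        ByteBuffer buf = ByteBuffer.allocate(100);
        int n = channel.read(buf);
        System.out.println(n); // fewer than the 100 bytes requested
    }
}
```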
Thanks @pan3793 for finding and fixing this, and thanks @wgtmac @ConeyLiu and @gszadovszky for the review 🙌
Rationale for this change
Fixes the data loss issue reported in #3040.
What changes are included in this PR?
Ensure that `StreamBytesInput#writeInto(ByteBuffer buffer)` copies data fully, even if the underlying `InputStream` does not report `available()` correctly.

Are these changes tested?
Unit tests are added; I also tested the fix with an internal production data loss case.
Are there any user-facing changes?
Yes, this fixes some data loss cases. I acknowledge that the bug affects Spark 4.0.0 preview2, which ships with Parquet 1.14.2.
Closes #3040