PARQUET-251: Binary column statistics error when reuse byte[] among rows by SinghAsDev · Pull Request #197 · apache/parquet-java

SinghAsDev · 2015-05-16T20:41:12Z

No description provided.

SinghAsDev · 2015-05-27T04:23:50Z

@rdblue, could you take a look?

rdblue · 2015-05-27T16:25:46Z

parquet-column/src/main/java/org/apache/parquet/column/statistics/BinaryStatistics.java

Why not use Binary.copy?

Ah, because the method I was looking at is DictionaryValuesWriter.copy. I note below that this should be added to the binary API.

kostya-sh · 2015-05-27T16:34:23Z

A possible optimization: FromStringBinary does not need to copy the byte array in getBytes(), as its byte array is already a copy (String.getBytes()) and cannot be modified outside of the parquet code base.

rdblue · 2015-05-27T16:35:12Z

I think this should implement Julien's suggestion on the JIRA issue: Binary should know the intent of the producer. If the producer doesn't intend to change the binary, then the copy behavior can change. I think this should update the factory methods, fromByteBuffer etc., to signal that intent. I wouldn't use "immutable" because this still violates that definition. Maybe fromReusedByteBuffer and fromConstantByteBuffer?

Don't worry about the naming of getBytes and getBytesUnsafe for now. We'll be able to rename them easily when we agree what they should be named.
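To make the "producer signals intent" idea concrete, here is a minimal sketch. It is not the parquet-mr Binary class: IntentBinary is a hypothetical stand-in, the byte[] factory names are modeled on the ones proposed in this thread, and the copy-only-when-reused rule is an assumption about how such a flag could be used.

```java
import java.util.Arrays;

// Hypothetical, simplified stand-in for Binary; names and behavior are assumptions.
final class IntentBinary {
    private final byte[] bytes;
    private final boolean backingBytesReused;

    private IntentBinary(byte[] bytes, boolean backingBytesReused) {
        this.bytes = bytes;
        this.backingBytesReused = backingBytesReused;
    }

    // The producer may overwrite `bytes` after this call (for example, a reused row buffer).
    static IntentBinary fromReusedByteArray(byte[] bytes) {
        return new IntentBinary(bytes, true);
    }

    // The producer promises not to write to `bytes` again.
    static IntentBinary fromConstantByteArray(byte[] bytes) {
        return new IntentBinary(bytes, false);
    }

    boolean isBackingBytesReused() {
        return backingBytesReused;
    }

    // Returns a value that is safe to hold on to: copies only when the producer may reuse the array.
    IntentBinary copy() {
        if (!backingBytesReused) {
            return this;
        }
        return new IntentBinary(Arrays.copyOf(bytes, bytes.length), false);
    }

    // Exposes the backing array without copying; callers must not modify it.
    byte[] getBytesUnsafe() {
        return bytes;
    }
}
```

With this shape, code that needs to retain a value calls copy() and pays for the allocation only when the producer declared the buffer as reused.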

SinghAsDev · 2015-05-28T01:47:39Z

Thanks for the review, guys. I have updated the PR.

@kostya-sh, FromStringBinary extends ByteArrayBackedBinary and it is possible for a consumer to modify the underlying bytes if getBytes() were to return the underlying bytes. To avoid this, having getBytes() provide a copy makes sense.

kostya-sh · 2015-05-28T04:35:48Z

@SinghAsDev, yes, you are right, the consumer can modify the bytes. However, given that the consumer is always code in the parquet-mr library and this code never modifies the bytes, my proposal is still a valid optimization.

The original reason for getBytes() is to handle the case when a producer modifies the underlying bytes.

I quite like the idea to make a producer signal an intent (fromReusedByteBuffer and fromConstantByteBuffer) as it makes it very clear that copying is required to protect from a producer modifying the underlying array, not a consumer.
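The producer-side hazard described here can be reproduced in a few lines. This is a sketch only: MaxTracker is a hypothetical stand-in for a statistics accumulator, and the unsigned lexicographic comparison is an assumption chosen for the demo, not necessarily the ordering parquet-mr uses.

```java
import java.util.Arrays;

final class ReusedBufferStatsDemo {
    // Unsigned lexicographic comparison (an assumption for this demo).
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) {
                return d;
            }
        }
        return a.length - b.length;
    }

    // Tracks a running maximum. Without copyOnUpdate it keeps a reference to the caller's array.
    static final class MaxTracker {
        private final boolean copyOnUpdate;
        private byte[] max;

        MaxTracker(boolean copyOnUpdate) {
            this.copyOnUpdate = copyOnUpdate;
        }

        void update(byte[] value) {
            if (max == null || compare(value, max) > 0) {
                max = copyOnUpdate ? Arrays.copyOf(value, value.length) : value;
            }
        }

        byte[] max() {
            return max;
        }
    }

    // Writes the first character of each row into one shared array, as a reusing producer would.
    static byte[] maxOf(boolean copyOnUpdate, String... rows) {
        byte[] shared = new byte[1];
        MaxTracker tracker = new MaxTracker(copyOnUpdate);
        for (String row : rows) {
            shared[0] = (byte) row.charAt(0);
            tracker.update(shared);
        }
        return tracker.max();
    }

    public static void main(String[] args) {
        // Aliased: the stored max follows the shared array, so the last row wins.
        System.out.println((char) maxOf(false, "c", "a")[0]);
        // Copied: the stored max is a snapshot of the largest row seen.
        System.out.println((char) maxOf(true, "c", "a")[0]);
    }
}
```

In the aliased case the tracker ends up comparing the shared array with itself, so the recorded maximum silently becomes whatever the producer wrote last.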

kostya-sh · 2015-05-28T04:45:45Z

Also, it looks like parquet-mr doesn't always put a numeric version into the file metadata.

E.g., for version 1.6.0rc7, the producer string looks like parquet-mr (build ec6f200b4943cfcbc8be5a8e53fdebf0). 1.6.0 also doesn't put the numeric version.
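As an illustration of why a missing version complicates a version check, here is a rough sketch of parsing such a producer string. The class name, the regular expression, and the "version X" form of the string are assumptions for illustration; only the "parquet-mr (build ...)" form is quoted in this thread.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CreatedBySketch {
    // Assumed shape: "<app> version <version> (build <hash>)", with the version part optional.
    private static final Pattern CREATED_BY =
        Pattern.compile("^(.+?)(?: version (\\S+))? \\(build ?(.*)\\)$");

    // Returns the version token if the string carries one, otherwise empty.
    static Optional<String> numericVersion(String createdBy) {
        Matcher m = CREATED_BY.matcher(createdBy);
        if (!m.matches()) {
            return Optional.empty();
        }
        return Optional.ofNullable(m.group(2));
    }
}
```

A caller that gates behavior on the writer version has to decide what to do with the empty case, since a build hash alone does not say which release wrote the file.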

rdblue · 2015-05-28T18:04:50Z

@kostya-sh, the commit hash behavior happens in real releases, not just RCs? If so, that makes this significantly harder.

rdblue · 2015-05-30T00:00:21Z

parquet-avro/src/main/java/org/apache/parquet/avro/AvroWriteSupport.java

I don't think the intent is captured correctly here. Avro objects like Fixed and Utf8 will be reused if reading from an Avro file and writing to Parquet. In that case, shouldn't these methods use Binary.fromReusedByteArray and similar?

That is a good point, @rdblue. I think you are correct and these should instead be fromReusedByteBuffer. However, in the case of Utf8, we can still use fromUnmodifiedByteBuffer as it passes a copy of the buffer, utf8.getBytes().

Actually, on digging a bit more, it looks like Utf8 can be constructed by passing a string, in which case it keeps a copy of the bytes, or by passing a byte[], in which case it keeps the original byte[]. I think this should be fixed on the Avro side to have consistent behavior for getBytes(). Anyway, for now I think it would be better and safer to use fromReusedByteArray for Utf8 as well.

A few more questions. Reading from an Avro file is not controlled by the Parquet project, right? I could not find code related to reading Avro files here. If I am missing something, please point me to the right place. Assuming that reading is not controlled by the Parquet project, I think it would be safe to assume that objects are being reused by the Avro file reader. With this in mind, I am changing even the fromUnmodifiedByteBuffer calls to fromReusedByteBuffer.

You're right: there is no code in Parquet that does the reading from Avro as I describe. I'm thinking of the common use case of copying from Avro to Parquet for a conversion. Avro will reuse objects to avoid creation and garbage collection overheads, so we need to assume that the objects passed in are reused. I'm all for making this kind of improvement to Avro, too, but right now I think the right thing to do is assume that the objects are reused and wrap them accordingly.

Other object models should use similar logic. On the write side, we don't know whether the objects are being reused, so we should assume that they are unless they claim to be Immutable.

SinghAsDev · 2015-05-30T00:36:39Z

@kostya-sh, I agree with you that FromStringBinary always gets a copy of the bytes. However, the consumer can still modify it. As you mentioned, consumers are not modifying it as of now, but that does not mean it won't be modified in the future, maybe by mistake. If the consumer does not need to modify the bytes, then it should be using getBytesUnsafe() instead of getBytes(). Let me know if I missed updating getBytes() to getBytesUnsafe() somewhere in the code.
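The consumer-side trade-off being debated can be sketched as follows. StringBackedBinarySketch is a hypothetical class, not the parquet-mr FromStringBinary; it only illustrates a defensive-copy accessor next to a zero-copy one.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical stand-in for a string-backed binary value.
final class StringBackedBinarySketch {
    private final byte[] bytes;

    StringBackedBinarySketch(String value) {
        // String.getBytes already returns a fresh array that the producer cannot reach.
        this.bytes = value.getBytes(StandardCharsets.UTF_8);
    }

    // Defensive copy: a consumer may modify the result without affecting this object.
    byte[] getBytes() {
        return Arrays.copyOf(bytes, bytes.length);
    }

    // Zero-copy view: cheaper, but the caller must treat the array as read-only.
    byte[] getBytesUnsafe() {
        return bytes;
    }
}
```

Read-only consumers can take the cheaper accessor, while the copying accessor keeps the object safe from a consumer that later starts writing to the array.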

rdblue · 2015-05-30T00:40:01Z

parquet-column/src/main/java/org/apache/parquet/io/api/Binary.java

I don't think fromUnmodifiedByteBuffer is clear enough. Unmodified makes it sound like it is not modified when it is created, but might be afterward. I think const or constant gets closer, but what we want is to signal that this will not be reused or rewritten. Immutable is too strong because it has a formal definition in Java.

What about using fromReusableByteBuffer and fromConstantByteBuffer?

Sounds good to me.

isnotinvain · 2015-06-29T23:34:48Z

parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java

This isn't guarded by a check for fileMetaData being null, but it should be, right? (use of the deprecated constructor).

isnotinvain · 2015-06-30T00:13:45Z

OK, I made some tiny changes here: SinghAsDev#4
If that gets merged, I'm +1 on this PR.

@SinghAsDev @rdblue, do you think this is ready?

rdblue · 2015-06-30T00:21:08Z

I'll review when the PR gets merged. @SinghAsDev, please ping me when it is ready.

SinghAsDev · 2015-06-30T00:32:45Z

@isnotinvain thanks for making the changes. Looks good!

@rdblue should be ready for your review now.

rdblue · 2015-06-30T00:51:47Z

parquet-column/src/main/java/org/apache/parquet/io/api/Binary.java

This is a breaking change. I just tracked down why semver didn't catch it, but we need to add these methods back and deprecate them.
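One common way to restore removed public methods without losing the new intent-carrying API is a deprecated shim that delegates to the conservative variant. The sketch below is hypothetical: CompatBinarySketch is not the real class, and the old method name fromByteArray is an assumption, since the thread does not list which methods were removed.

```java
// Hypothetical sketch of a backward-compatible factory; names are assumptions.
final class CompatBinarySketch {
    private final byte[] bytes;
    private final boolean backingBytesReused;

    private CompatBinarySketch(byte[] bytes, boolean backingBytesReused) {
        this.bytes = bytes;
        this.backingBytesReused = backingBytesReused;
    }

    static CompatBinarySketch fromReusedByteArray(byte[] bytes) {
        return new CompatBinarySketch(bytes, true);
    }

    static CompatBinarySketch fromConstantByteArray(byte[] bytes) {
        return new CompatBinarySketch(bytes, false);
    }

    /**
     * Old entry point kept so existing callers still compile and link.
     *
     * @deprecated use fromReusedByteArray or fromConstantByteArray to state the intent.
     */
    @Deprecated
    static CompatBinarySketch fromByteArray(byte[] bytes) {
        // Callers of the old method never declared intent, so assume the array may be reused.
        return fromReusedByteArray(bytes);
    }

    boolean isBackingBytesReused() {
        return backingBytesReused;
    }
}
```

Delegating to the reused variant is the cautious default: it may cost an extra copy later, but it never trusts an array that an old caller might overwrite.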

isnotinvain · 2015-06-30T03:37:56Z

parquet-column/src/main/java/org/apache/parquet/io/api/Binary.java

This is a constructor in a private class, so I think you can safely delete the deprecated constructor.

True, removed

isnotinvain · 2015-06-30T03:59:56Z

+1 -- @rdblue ready to merge?

rdblue · 2015-06-30T16:43:05Z

+1

asingh and others added 8 commits June 29, 2015 12:45

Remove redundant junit dependency

857141a

Rebase over latest trunk

d2ad939

Rename isReused => isBackingBytesReused

e00d9b7

Generalize tests, make Binary.fromString reused=false

5af9142

Split out version checks to separate files, add some tests

2838cc9

put the headers in the right location

89ab4ee

Address PR feedback

af43d28

Remove test for stats getting ingnored for version 160 when type is i…

7570035

…nt64

SinghAsDev force-pushed the PARQUET-251 branch from f6cfd76 to 7570035 Compare June 29, 2015 20:08

Some minor cleanup

9826ee6

Merge pull request #4 from isnotinvain/PR-197-3

fbe873f

Add comment for BinaryStatistics.setMinMaxFromBytes

0e71728

Add removed public methods in Binary and deprecate them

67e4e5f

Remove deprecated constructors from private classes

68e0eae

asfgit closed this in e3b9502 Jul 1, 2015

asfimport mentioned this pull request Jun 23, 2024

Binary column statistics error when reuse byte[] among rows #1433

Closed
