PARQUET-251: Binary column statistics error when reuse byte[] among rows#197
PARQUET-251: Binary column statistics error when reuse byte[] among rows#197SinghAsDev wants to merge 28 commits intoapache:masterfrom
Conversation
|
@rdblue could you take a look. |
There was a problem hiding this comment.
Ah, because the method I was looking at is DictionaryValuesWriter.copy. I note below that this should be added to the binary API.
|
A possible optimization: |
|
I think this should implement Julien's suggestion on the JIRA issue: Binary should know the intent of the producer. If the producer doesn't intend to change the binary, then the copy behavior can change. I think this should update the factory methods, Don't worry about the naming of |
|
Thanks for the review guys. I have updated the PR. @kostya-sh, FromStringBinary extends ByteArrayBackedBinary and it is possible for a consumer to modify the underlying bytes if getBytes() were to return the underlying bytes. To avoid this, having getBytes() provide a copy makes sense. |
|
@SinghAsDev, yes you are right, the consumer can modify the bytes. However given that consumer is always code in parquet-mr library and this code never modifies bytes my proposal is still valid optimization. The original reason for I quite like the idea to make a producer signal an intent ( |
|
Also It looks like parquet-mr doesn't always puts a numeric version into file metadata. E.g for version 1.6.0rc7, the producer string looks like |
|
@kostya-sh, the commit hash behavior happens in real releases, not just RCs? If so, that makes this significantly harder. |
There was a problem hiding this comment.
I don't think the intent is captured correctly here. Avro objects like Fixed and Utf8 will be reused if reading from an Avro file and writing to Parquet. In that case, shouldn't these methods use Binary.fromReusedByteArray and similar?
There was a problem hiding this comment.
That is a good point @rdblue. I think you are correct and these should instead be fromReusedByteBuffer. However, in the case of UTF8, we can still use fromUnmodifiedByteBuffer as it passes copy of the buffer, utf8.getBytes().
There was a problem hiding this comment.
Actually on digging a bit more, looks like Utf8 can be constructed by passing a string, in which case it keeps a copy of bytes, or passing byte[], in which case it keeps the original byte[]. I think this should be fixed on avro side to have consistent behavior for getBytes(). Anyways, for now I think it would be better and safe to use fromResusedByteArray for Utf8 as well.
There was a problem hiding this comment.
Few more questions. Reading from Avro file is not controlled by parquet project, right? I could not find code related to reading Avro file here. If I am missing something, please point me to the right place. Assuming that reading is not controlled by parquet project, I think it would be safe to assume that objects are being re-used by Avro file reader. With this thought I am changing even the fromUnmodifiedByteBuffer to fromReusedByteBuffer.
There was a problem hiding this comment.
You're right: there is no code in Parquet that does the reading from Avro as I describe. I'm thinking of the common use case of copying from Avro to Parquet for a conversion. Avro will reuse objects to avoid creation and garbage collection overheads, so we need to assume that the objects passed in are reused. I'm all for making this kind of improvement to Avro, too, but right now I think the right thing to do is assume that the objects are reused and wrap them accordingly.
Other object models should use similar logic. On the write side, we don't know whether the objects are being reused, so we should assume that they are unless they claim to be Immutable.
|
@kostya-sh I agree with you that FromStringBytes always get a copy of bytes. However, the consumer can still modify it. As you mentioned, as of now consumers are not modifying it, but that does not mean it won't be modified in future. May be by mistake. If the consumer does not need to modify the bytes() then it should be using getBytesUnsafe() instead of getBytes(). Let me know if I am missing updating of getBytes() to getBytesUnsafe somewhere in the code. |
There was a problem hiding this comment.
I don't think fromUnmodifiedByteBuffer is clear enough. Unmodified makes it sound like it is not modified when it is created, but might be afterward. I think const or constant gets closer, but what we want is to signal that this will not be reused or rewritten. Immutable is too strong because it has a formal definition in Java.
What about using fromReusableByteBuffer and fromConstantByteBuffer?
There was a problem hiding this comment.
Sounds good to me.
There was a problem hiding this comment.
this isn't guarded by a check for fileMetaData being null, but it should be, right? (use of the deprecated constructor).
|
OK, I made some tiny changes here: SinghAsDev#4 @SinghAsDev @rdblue think this this is ready? |
Some minor cleanup
|
I'll review when the PR gets merged. @SinghAsDev, please ping me when it is ready, |
|
@isnotinvain thanks for making the changes. Looks good! @rdblue should be ready for your review now. |
There was a problem hiding this comment.
This is a breaking change. I just tracked down why semver didn't catch it, but we need to add these methods back and deprecate them.
There was a problem hiding this comment.
This is a constructor in a private class, so I think you can safely delete the deprecated constructor.
|
+1 -- @rdblue ready to merge? |
|
+1 |
No description provided.