Skip to content

PARQUET-251: Binary column statistics error when reuse byte[] among rows#197

Closed
SinghAsDev wants to merge 28 commits intoapache:masterfrom
SinghAsDev:PARQUET-251
Closed

PARQUET-251: Binary column statistics error when reuse byte[] among rows#197
SinghAsDev wants to merge 28 commits intoapache:masterfrom
SinghAsDev:PARQUET-251

Conversation

@SinghAsDev
Copy link
Copy Markdown
Contributor

No description provided.

@SinghAsDev
Copy link
Copy Markdown
Contributor Author

@rdblue could you take a look.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use Binary.copy?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, because the method I was looking at is DictionaryValuesWriter.copy. I note below that this should be added to the binary API.

@kostya-sh
Copy link
Copy Markdown

A possible optimization: FromStringBinary does not need to copy byte array in getBytes() as its byte array is already a copy (String.getBytes()) and cannot be modified outside of parquet code base.

@rdblue
Copy link
Copy Markdown
Contributor

rdblue commented May 27, 2015

I think this should implement Julien's suggestion on the JIRA issue: Binary should know the intent of the producer. If the producer doesn't intend to change the binary, then the copy behavior can change. I think this should update the factory methods, fromByteBuffer etc., to signal that intent. I wouldn't use "immutable" because this still violates that definition. Maybe fromReusedByteBuffer and fromConstantByteBuffer?

Don't worry about the naming of getBytes and getBytesUnsafe for now. We'll be able to rename them easily when we agree what they should be named.

@SinghAsDev
Copy link
Copy Markdown
Contributor Author

Thanks for the review guys. I have updated the PR.

@kostya-sh, FromStringBinary extends ByteArrayBackedBinary and it is possible for a consumer to modify the underlying bytes if getBytes() were to return the underlying bytes. To avoid this, having getBytes() provide a copy makes sense.

@kostya-sh
Copy link
Copy Markdown

@SinghAsDev, yes you are right, the consumer can modify the bytes. However given that consumer is always code in parquet-mr library and this code never modifies bytes my proposal is still valid optimization.

The original reason for getBytes() is to handle the case when a producer modifies the underlying bytes.

I quite like the idea to make a producer signal an intent (fromReusedByteBuffer and fromConstantByteBuffer) as it makes it very clear that copying is required to protect from a producer modifying the underlying array, not a consumer.

@kostya-sh
Copy link
Copy Markdown

Also It looks like parquet-mr doesn't always puts a numeric version into file metadata.

E.g for version 1.6.0rc7, the producer string looks like parquet-mr (build ec6f200b4943cfcbc8be5a8e53fdebf0). 1.6.0 also doesn't put the numeric version.

@rdblue
Copy link
Copy Markdown
Contributor

rdblue commented May 28, 2015

@kostya-sh, the commit hash behavior happens in real releases, not just RCs? If so, that makes this significantly harder.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the intent is captured correctly here. Avro objects like Fixed and Utf8 will be reused if reading from an Avro file and writing to Parquet. In that case, shouldn't these methods use Binary.fromReusedByteArray and similar?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a good point @rdblue. I think you are correct and these should instead be fromReusedByteBuffer. However, in the case of UTF8, we can still use fromUnmodifiedByteBuffer as it passes copy of the buffer, utf8.getBytes().

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually on digging a bit more, looks like Utf8 can be constructed by passing a string, in which case it keeps a copy of bytes, or passing byte[], in which case it keeps the original byte[]. I think this should be fixed on avro side to have consistent behavior for getBytes(). Anyways, for now I think it would be better and safe to use fromResusedByteArray for Utf8 as well.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few more questions. Reading from Avro file is not controlled by parquet project, right? I could not find code related to reading Avro file here. If I am missing something, please point me to the right place. Assuming that reading is not controlled by parquet project, I think it would be safe to assume that objects are being re-used by Avro file reader. With this thought I am changing even the fromUnmodifiedByteBuffer to fromReusedByteBuffer.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right: there is no code in Parquet that does the reading from Avro as I describe. I'm thinking of the common use case of copying from Avro to Parquet for a conversion. Avro will reuse objects to avoid creation and garbage collection overheads, so we need to assume that the objects passed in are reused. I'm all for making this kind of improvement to Avro, too, but right now I think the right thing to do is assume that the objects are reused and wrap them accordingly.

Other object models should use similar logic. On the write side, we don't know whether the objects are being reused, so we should assume that they are unless they claim to be Immutable.

@SinghAsDev
Copy link
Copy Markdown
Contributor Author

@kostya-sh I agree with you that FromStringBytes always get a copy of bytes. However, the consumer can still modify it. As you mentioned, as of now consumers are not modifying it, but that does not mean it won't be modified in future. May be by mistake. If the consumer does not need to modify the bytes() then it should be using getBytesUnsafe() instead of getBytes(). Let me know if I am missing updating of getBytes() to getBytesUnsafe somewhere in the code.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think fromUnmodifiedByteBuffer is clear enough. Unmodified makes it sound like it is not modified when it is created, but might be afterward. I think const or constant gets closer, but what we want is to signal that this will not be reused or rewritten. Immutable is too strong because it has a formal definition in Java.

What about using fromReusableByteBuffer and fromConstantByteBuffer?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this isn't guarded by a check for fileMetaData being null, but it should be, right? (use of the deprecated constructor).

@isnotinvain
Copy link
Copy Markdown
Contributor

OK, I made some tiny changes here: SinghAsDev#4
If that gets merged, I'm +1 on this PR.

@SinghAsDev @rdblue think this this is ready?

@rdblue
Copy link
Copy Markdown
Contributor

rdblue commented Jun 30, 2015

I'll review when the PR gets merged. @SinghAsDev, please ping me when it is ready,

@SinghAsDev
Copy link
Copy Markdown
Contributor Author

@isnotinvain thanks for making the changes. Looks good!

@rdblue should be ready for your review now.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a breaking change. I just tracked down why semver didn't catch it, but we need to add these methods back and deprecate them.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a constructor in a private class, so I think you can safely delete the deprecated constructor.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, removed

@isnotinvain
Copy link
Copy Markdown
Contributor

+1 -- @rdblue ready to merge?

@rdblue
Copy link
Copy Markdown
Contributor

rdblue commented Jun 30, 2015

+1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants