Skip to content

PARQUET-1228: Format Structures encryption#613

Merged
nandorKollar merged 10 commits intoapache:encryptionfrom
ggershinsky:p1228-format-structures-encryption
Aug 27, 2019
Merged

PARQUET-1228: Format Structures encryption#613
nandorKollar merged 10 commits intoapache:encryptionfrom
ggershinsky:p1228-format-structures-encryption

Conversation

@ggershinsky
Copy link
Copy Markdown
Contributor

No description provided.

@gszadovszky
Copy link
Copy Markdown
Contributor

To have a passing travis on your feature branch you may use the dev script feature introduced in PARQUET-1490. See https://github.com/apache/parquet-mr/blob/bloom-filter/dev/travis-before_install-bloom-filter.sh as an example.

@ggershinsky
Copy link
Copy Markdown
Contributor Author

@gszadovszky Thanks, but this wont help, because the format pull request is not merged yet in that feature branch,
apache/parquet-format#124
Would appreciate if you could check if this can be accelerated..

Copy link
Copy Markdown
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM in general, can you do an autoformat? A lot of weird indenting and trailing whitespace.

}
}
// Serialize and encrypt the structure
TMemoryBuffer thriftMemoryBuffer = new TMemoryBuffer(100);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one implements Closeable, can we wrap a try-with-resources around this? And why 100 buffer size? Maybe move this to a constant, instead of having a magic variable.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

try-with-resources sounds good.
100 is the initial alloc. Usually enough for page headers etc, but will auto-grow for larger ones.
Will make it a constant.


private static <T extends TBase<?,?>> T read(InputStream from, T tbase) throws IOException {
private static <T extends TBase<?,?>> T read(InputStream input, T tbase, BlockCipher.Decryptor decryptor, byte[] AAD) throws IOException {
InputStream from;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like to make variables like these final

})));
}

InputStream from;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like to make variables like these final

}

public static void readFileMetaData(InputStream from, final FileMetaDataConsumer consumer, boolean skipRowGroups) throws IOException {
readFileMetaData(from, consumer, skipRowGroups, (BlockCipher.Decryptor) null, (byte[]) null);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would omit the casting since it doesn't add any value.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The value is in compiler detection of buggy calls to wrong methods (wrong signature), causes by typeless nulls.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

caused

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please paste what kind of compiling error you met? Maybe we can set an option in the compiler?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the extra casts, and didn't get any compile error. I'm also curious what was the error message you got.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, it compiles, but the reason is to prevent compiler to cast this to an overloaded method with a different object type in place of the decryptor etc. A sort of precaution.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't imagine any scenario where this could happen. I think if there's an overloaded method with different parameters, and the compile can't figure out which method to call, then it would be a compilation error. In that case casting is the way to give hint to the compiler, that you'd like to call the method with BlockCipher.Decryptor and byte[] parameters, but in this case the method invocation is clear without casting too.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Ok, looks like we have two options: 1) leave the code as is - the cast helps code readers to understand what is nulled (in this case, that we call methods without encryption); also, it's harmless 2) remove the casts in this and other pull requests.
Please let me know your preference, I'm fine with either option.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally prefer more compact code, and looks like @Fokko thinks the same, so if you don't mind, then let's remove those casts, that are not necessary and don't add any value!

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, will do that.


public static void readFileMetaData(InputStream from, FileMetaDataConsumer consumer) throws IOException {
readFileMetaData(from, consumer, false);
readFileMetaData(from, consumer, (BlockCipher.Decryptor) null, (byte[]) null);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would omit the casting since it doesn't add any value.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as above.

import java.io.IOException;
import java.io.InputStream;

public interface BlockCipher{
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would split the Encryptor and Decryptor in separate files and put them in the BlockCipher package. WDYT?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm reluctant to do that. Extra package, with two small files; extra file..

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think keeping these interfaces in the same file is acceptable, they're not too big.

@ggershinsky
Copy link
Copy Markdown
Contributor Author

Hi Fokko, thanks for the review, much appreciated. Finally, I got some respite from the C++ PRs, now getting to the Java ones.
Regarding autoformat - can do, but please suggest a (Linux cli?) tool for that.

if (null == decryptor) {
from = input;
}
else {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: please move this line to the end of last line.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, but not yet addressed, @ggershinsky could you please apply Junjie's recommendation?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will do, sorry for missing this.

}

public static void readFileMetaData(InputStream from, final FileMetaDataConsumer consumer, boolean skipRowGroups) throws IOException {
readFileMetaData(from, consumer, skipRowGroups, (BlockCipher.Decryptor) null, (byte[]) null);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please paste what kind of compiling error you met? Maybe we can set an option in the compiler?

* Encrypts the plaintext.
*
* @param plaintext The plaintext starts at offset 0 of the input value, and fills up the entire byte array.
* @param AAD Encryption AAD (ignored in case of CTR cipher)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be nice to not use abbreviation in the parameter description, but resolve it.
Something like: @param AAD Additional Authenticated Data for the encryption algorithm (ignored in case of CTR cipher)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good

* The ciphertext starts at offset 4 and fills up the rest of the returned byte array.
* The ciphertext includes the nonce and (in case of GCM cipher) the tag, as detailed in the
* Parquet Modular Encryption specification.
* @throws IOException
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please document when does it throw IOException.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will do

* The ciphertext starts at offset 4 and fills up the rest of the input byte array.
* The ciphertext includes the nonce and (in case of GCM cipher) the tag, as detailed in the
* Parquet Modular Encryption specification.
* @param AAD Encryption AAD (ignored in case of CTR cipher)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as for Encryptor interface

/**
* Convenience decryption method that reads the length and ciphertext from the input stream.
*
* @param from
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please document parameters, return value and exception. Also, I'd recommend use method overloading, and name this method as decrypt too. Eventually the only change between this and the other decrypt method is in the input parameter: the 'from' in this case is an input stream instead of a byte array, but otherwise they do the same.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good

* Parquet Modular Encryption specification.
* @param AAD Encryption AAD (ignored in case of CTR cipher)
* @return plaintext The plaintext starts at offset 0 of the input value, and fills up the entire byte array.
* @throws IOException
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Undocumented throws.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will document

to.write(encryptedBuffer);
}
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: remove empty line here

@@ -0,0 +1,27 @@
# Licensed to the Apache Software Foundation (ASF) under one
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm afraid that in order for this to work, we'll need 3d8eae7 too. @ggershinsky should we rebase this branch onto master?

Copy link
Copy Markdown
Contributor Author

@ggershinsky ggershinsky Aug 8, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nandorKollar please ignore my previous text here:), now I see what you mean. Yes, rebasing this branch onto master sounds good to me.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nandorKollar can I ask you to rebase this branch? Or let me know if you prefer me to create a pull request for that.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ggershinsky done, let me retrigger Travis to see the test results.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems that Travis fails because the default JDK version is now 11, #665 fixes it.

@nandorKollar nandorKollar reopened this Aug 9, 2019
@nandorKollar
Copy link
Copy Markdown
Contributor

I think Travis fails, because the branch specific install script clones parquet-format inside the parquet-mr working folder (see this path /home/travis/build/apache/parquet-mr/parquet-format/pom.xml), and Rat check fails because parquet-format/blob/master/.github/PULL_REQUEST_TEMPLATE.md is not excluded (in format it is - that's why format builds are not failing, but in mr no exclude pattern matches this file).

I recommend modifying the encryption specific before_install script with cloning and building parquet-format in build directory (before clone, move to TRAVIS_BUILD_DIR cd $TRAVIS_BUILD_DIR), and at the end going back to the original directory. Something like this:

cd $TRAVIS_BUILD_DIR
git clone https://github.com/apache/parquet-format.git
mvn -f $TRAVIS_BUILD_DIR/apache/parquet-format/pom.xm install -DskipTests
cd -

@nandorKollar
Copy link
Copy Markdown
Contributor

Hmm okay, my bad, TRAVIS_BUILD_DIR is the cloned repository path as per https://docs.travis-ci.com/user/environment-variables/.
Then maybe this?

cd ../..
git clone -b encryption --single-branch https://github.com/apache/parquet-format.git
cd parquet-format
mvn install -DskipTests
cd $TRAVIS_BUILD_DIR

@nandorKollar
Copy link
Copy Markdown
Contributor

Ouch, hit log record limit. When building parquet-format, I recommend adding --batch-mode so that download progress report messages are not emitted.

@ggershinsky
Copy link
Copy Markdown
Contributor Author

yesss. thanks @nandorKollar for helping with this! I'll get working on addressing the comments.

@ggershinsky
Copy link
Copy Markdown
Contributor Author

Hi guys, I think I've addressed all comments, but let me know if I've missed anything.

@nandorKollar
Copy link
Copy Markdown
Contributor

@ggershinsky I think you missed some of the comments from @Fokko about final variables and formatting.

@ggershinsky
Copy link
Copy Markdown
Contributor Author

@nandorKollar I have made these variables final as per the recommendation from @Fokko. Have I missed some variables? I'll check again.
Regarding auto-formatting - I don't know how to do that, that's why I've asked for an advise before. Is there a tool for that? Or a simple "ctrl-i" in Eclipse will do?

@ggershinsky
Copy link
Copy Markdown
Contributor Author

@nandorKollar comments are addressed.
To format the code, I've used the Eclipse formatting tool.

Copy link
Copy Markdown
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Codewise LGTM

@nandorKollar nandorKollar merged commit 3f2d0e7 into apache:encryption Aug 27, 2019
shangxinli pushed a commit to shangxinli/parquet-mr that referenced this pull request Mar 1, 2020
shangxinli pushed a commit to shangxinli/parquet-mr that referenced this pull request Jul 15, 2020
gszadovszky pushed a commit that referenced this pull request Jul 29, 2020
* PARQUET-1228: Format Structures encryption (#613)
* PARQUET-1286: Crypto package (#614)
* PARQUET-1818: Fix bloom/encryption collision in format-structures (#771)
* PARQUET-1817: Crypto Properties Factory (#769)
* PARQUET-1229: Parquet MR encryption (#776)
* PARQUET-1807: Encryption: Interop and Function test suite for Java version (#782)
* PARQUET-1373: Encryption key tools (#615)

Co-authored-by: shangxinli <31421745+shangxinli@users.noreply.github.com>
Co-authored-by: Maya Anderson <mayaa@il.ibm.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants