PARQUET-1228: Format Structures encryption#613
PARQUET-1228: Format Structures encryption#613nandorKollar merged 10 commits intoapache:encryptionfrom
Conversation
|
To have a passing travis on your feature branch you may use the dev script feature introduced in PARQUET-1490. See https://github.com/apache/parquet-mr/blob/bloom-filter/dev/travis-before_install-bloom-filter.sh as an example. |
|
@gszadovszky Thanks, but this wont help, because the format pull request is not merged yet in that feature branch, |
Fokko
left a comment
There was a problem hiding this comment.
LGTM in general, can you do an autoformat? A lot of weird indenting and trailing whitespace.
| } | ||
| } | ||
| // Serialize and encrypt the structure | ||
| TMemoryBuffer thriftMemoryBuffer = new TMemoryBuffer(100); |
There was a problem hiding this comment.
This one implements Closeable, can we wrap a try-with-resources around this? And why 100 buffer size? Maybe move this to a constant, instead of having a magic variable.
There was a problem hiding this comment.
try-with-resources sounds good.
100 is the initial alloc. Usually enough for page headers etc, but will auto-grow for larger ones.
Will make it a constant.
|
|
||
| private static <T extends TBase<?,?>> T read(InputStream from, T tbase) throws IOException { | ||
| private static <T extends TBase<?,?>> T read(InputStream input, T tbase, BlockCipher.Decryptor decryptor, byte[] AAD) throws IOException { | ||
| InputStream from; |
There was a problem hiding this comment.
I like to make variables like these final
| }))); | ||
| } | ||
|
|
||
| InputStream from; |
There was a problem hiding this comment.
I like to make variables like these final
| } | ||
|
|
||
| public static void readFileMetaData(InputStream from, final FileMetaDataConsumer consumer, boolean skipRowGroups) throws IOException { | ||
| readFileMetaData(from, consumer, skipRowGroups, (BlockCipher.Decryptor) null, (byte[]) null); |
There was a problem hiding this comment.
I would omit the casting since it doesn't add any value.
There was a problem hiding this comment.
The value is in compiler detection of buggy calls to wrong methods (wrong signature), causes by typeless nulls.
There was a problem hiding this comment.
Could you please paste what kind of compiling error you met? Maybe we can set an option in the compiler?
There was a problem hiding this comment.
I removed the extra casts, and didn't get any compile error. I'm also curious what was the error message you got.
There was a problem hiding this comment.
Yep, it compiles, but the reason is to prevent compiler to cast this to an overloaded method with a different object type in place of the decryptor etc. A sort of precaution.
There was a problem hiding this comment.
I can't imagine any scenario where this could happen. I think if there's an overloaded method with different parameters, and the compile can't figure out which method to call, then it would be a compilation error. In that case casting is the way to give hint to the compiler, that you'd like to call the method with BlockCipher.Decryptor and byte[] parameters, but in this case the method invocation is clear without casting too.
There was a problem hiding this comment.
I see. Ok, looks like we have two options: 1) leave the code as is - the cast helps code readers to understand what is nulled (in this case, that we call methods without encryption); also, it's harmless 2) remove the casts in this and other pull requests.
Please let me know your preference, I'm fine with either option.
There was a problem hiding this comment.
I personally prefer more compact code, and looks like @Fokko thinks the same, so if you don't mind, then let's remove those casts, that are not necessary and don't add any value!
There was a problem hiding this comment.
Ok, will do that.
|
|
||
| public static void readFileMetaData(InputStream from, FileMetaDataConsumer consumer) throws IOException { | ||
| readFileMetaData(from, consumer, false); | ||
| readFileMetaData(from, consumer, (BlockCipher.Decryptor) null, (byte[]) null); |
There was a problem hiding this comment.
I would omit the casting since it doesn't add any value.
| import java.io.IOException; | ||
| import java.io.InputStream; | ||
|
|
||
| public interface BlockCipher{ |
There was a problem hiding this comment.
I would split the Encryptor and Decryptor in separate files and put them in the BlockCipher package. WDYT?
There was a problem hiding this comment.
I'm reluctant to do that. Extra package, with two small files; extra file..
There was a problem hiding this comment.
I think keeping these interfaces in the same file is acceptable, they're not too big.
|
Hi Fokko, thanks for the review, much appreciated. Finally, I got some respite from the C++ PRs, now getting to the Java ones. |
| if (null == decryptor) { | ||
| from = input; | ||
| } | ||
| else { |
There was a problem hiding this comment.
nit: please move this line to the end of last line.
There was a problem hiding this comment.
nit, but not yet addressed, @ggershinsky could you please apply Junjie's recommendation?
There was a problem hiding this comment.
will do, sorry for missing this.
| } | ||
|
|
||
| public static void readFileMetaData(InputStream from, final FileMetaDataConsumer consumer, boolean skipRowGroups) throws IOException { | ||
| readFileMetaData(from, consumer, skipRowGroups, (BlockCipher.Decryptor) null, (byte[]) null); |
There was a problem hiding this comment.
Could you please paste what kind of compiling error you met? Maybe we can set an option in the compiler?
| * Encrypts the plaintext. | ||
| * | ||
| * @param plaintext The plaintext starts at offset 0 of the input value, and fills up the entire byte array. | ||
| * @param AAD Encryption AAD (ignored in case of CTR cipher) |
There was a problem hiding this comment.
Would be nice to not use abbreviation in the parameter description, but resolve it.
Something like: @param AAD Additional Authenticated Data for the encryption algorithm (ignored in case of CTR cipher)
| * The ciphertext starts at offset 4 and fills up the rest of the returned byte array. | ||
| * The ciphertext includes the nonce and (in case of GCM cipher) the tag, as detailed in the | ||
| * Parquet Modular Encryption specification. | ||
| * @throws IOException |
There was a problem hiding this comment.
Please document when does it throw IOException.
| * The ciphertext starts at offset 4 and fills up the rest of the input byte array. | ||
| * The ciphertext includes the nonce and (in case of GCM cipher) the tag, as detailed in the | ||
| * Parquet Modular Encryption specification. | ||
| * @param AAD Encryption AAD (ignored in case of CTR cipher) |
There was a problem hiding this comment.
Same as for Encryptor interface
| /** | ||
| * Convenience decryption method that reads the length and ciphertext from the input stream. | ||
| * | ||
| * @param from |
There was a problem hiding this comment.
Please document parameters, return value and exception. Also, I'd recommend use method overloading, and name this method as decrypt too. Eventually the only change between this and the other decrypt method is in the input parameter: the 'from' in this case is an input stream instead of a byte array, but otherwise they do the same.
| * Parquet Modular Encryption specification. | ||
| * @param AAD Encryption AAD (ignored in case of CTR cipher) | ||
| * @return plaintext The plaintext starts at offset 0 of the input value, and fills up the entire byte array. | ||
| * @throws IOException |
| to.write(encryptedBuffer); | ||
| } | ||
| } | ||
|
|
There was a problem hiding this comment.
nit: remove empty line here
| @@ -0,0 +1,27 @@ | |||
| # Licensed to the Apache Software Foundation (ASF) under one | |||
There was a problem hiding this comment.
I'm afraid that in order for this to work, we'll need 3d8eae7 too. @ggershinsky should we rebase this branch onto master?
There was a problem hiding this comment.
@nandorKollar please ignore my previous text here:), now I see what you mean. Yes, rebasing this branch onto master sounds good to me.
There was a problem hiding this comment.
@nandorKollar can I ask you to rebase this branch? Or let me know if you prefer me to create a pull request for that.
There was a problem hiding this comment.
@ggershinsky done, let me retrigger Travis to see the test results.
There was a problem hiding this comment.
Seems that Travis fails because the default JDK version is now 11, #665 fixes it.
|
I think Travis fails, because the branch specific install script clones parquet-format inside the parquet-mr working folder (see this path /home/travis/build/apache/parquet-mr/parquet-format/pom.xml), and Rat check fails because parquet-format/blob/master/.github/PULL_REQUEST_TEMPLATE.md is not excluded (in format it is - that's why format builds are not failing, but in mr no exclude pattern matches this file). I recommend modifying the encryption specific before_install script with cloning and building parquet-format in build directory (before clone, move to TRAVIS_BUILD_DIR |
|
Hmm okay, my bad, TRAVIS_BUILD_DIR is the cloned repository path as per https://docs.travis-ci.com/user/environment-variables/. |
|
Ouch, hit log record limit. When building parquet-format, I recommend adding --batch-mode so that download progress report messages are not emitted. |
|
yesss. thanks @nandorKollar for helping with this! I'll get working on addressing the comments. |
|
Hi guys, I think I've addressed all comments, but let me know if I've missed anything. |
|
@ggershinsky I think you missed some of the comments from @Fokko about final variables and formatting. |
|
@nandorKollar I have made these variables final as per the recommendation from @Fokko. Have I missed some variables? I'll check again. |
|
@nandorKollar comments are addressed. |
* PARQUET-1228: Format Structures encryption (#613) * PARQUET-1286: Crypto package (#614) * PARQUET-1818: Fix bloom/encryption collision in format-structures (#771) * PARQUET-1817: Crypto Properties Factory (#769) * PARQUET-1229: Parquet MR encryption (#776) * PARQUET-1807: Encryption: Interop and Function test suite for Java version (#782) * PARQUET-1373: Encryption key tools (#615) Co-authored-by: shangxinli <31421745+shangxinli@users.noreply.github.com> Co-authored-by: Maya Anderson <mayaa@il.ibm.com>
No description provided.