PARQUET-1229: Parquet MR encryption#776
Conversation
|
@gszadovszky The encryption pull request is submitted, with changes in "hadoop" and "column" modules. Travis fails on "The job exceeded the maximum log length, and has been terminated". All tests up that point pass successfully (including the tests in hadoop and column modules). |
|
FYI: Created PARQUET-1832. I'll try to fix this issue ASAP. |
247a2a5 to
45b917f
Compare
|
Thank you |
…rsion. Depends on PR apache#776 for [PARQUET-1229] and on PR apache#12 in parquet-testing for [PARQUET-1807]. JIRA: https://issues.apache.org/jira/browse/PARQUET-1807 Add a test for writing and reading parquet in a number of encryption and decryption configurations. Add interop test that reads files from parquet-testing GitHub repository, that were written by parquet-cpp. This adds parquet-testing repo as a submodule. Run the following to populate the "submodules/parquet-testing/" folder: git submodule update --init --recursive
…rsion Depends on PR apache#776 for [PARQUET-1229] and on PR apache#12 in parquet-testing for [PARQUET-1807]. JIRA: https://issues.apache.org/jira/browse/PARQUET-1807 Add a test for writing and reading parquet in a number of encryption and decryption configurations. Add interop test that reads files from parquet-testing GitHub repository, that were written by parquet-cpp. This adds parquet-testing repo as a submodule. Run the following to populate the "submodules/parquet-testing/" folder: git submodule update --init --recursive
|
Is this ready to review? Since there is no comment yet, can you squash it to one single commit to make the review easier? |
|
preferably reviewed after the Travis fix is in. @gszadovszky @shangxinli can you apply #777 to the encryption branch. |
This can be reviewed as a single commit at |
fe7ab01 to
c6b0cde
Compare
|
the PR is ready for review |
It seems I still don't have permission to do so. @gszadovszky, I did 'git cherry-pick' and then git push to encryption branch', but got permission error. Do I have to create a PR for it? |
| } | ||
|
|
||
| private void addRowGroup(ParquetMetadata parquetMetadata, List<RowGroup> rowGroups, BlockMetaData block) { | ||
| private void addRowGroup(ParquetMetadata parquetMetadata, List<RowGroup> rowGroups, BlockMetaData block, |
There was a problem hiding this comment.
Should we keep original addRowGroup() intact and just add a new one with InternalFileEncryptor to isolate the change's impact? There is not much duplicate code if doing so and we can refactor the existing code with helper functions. In the majority use cases, they are non-encryption cases.
There was a problem hiding this comment.
encryption is isolated there, with if (null != fileEncryptor) and if (encryptMetaData) - similar to the if (columnIndexRef != null) and if (offsetIndexRef != null) in the same function.
There was a problem hiding this comment.
Not all the changes are isolated. Generally, adding 'if/else' will add diverge the code and add the complexity. One other thing is regression thinking. If fileEncryptor is null, which would be most of the case, then it just executes the existing method without change. It would be less error prone.
There was a problem hiding this comment.
Many functions can be run either with or without encryption. Duplicating them will result in hundreds or thousands of duplicate code lines. This will make code maintenance (changes/fixes) a headache. Instead, we isolate encryption with if switches, without duplicating the existing code.
The same goes for other recent new features (column indexes and bloom filters) - they are isolated with an if switch, instead of code duplication. See if (columnIndexRef != null) and if (offsetIndexRef != null) in this addRowGroup function.
There was a problem hiding this comment.
It is not a blocking comment and I am fine with it. But generally speaking, adding too much nested if/else diverges the code path and causes the complexity for reading. One way to avoid duplicating is to wrap them up in helper functions. I understand column indexes and bloom filters already did that but if we keep adding features like this, the code will become less and less readable.
| ParquetMetadata parquetMetadata = fromParquetMetadata(fileMetaData); | ||
|
|
||
| if (!encryptedFooter && null != fileDecryptor) { | ||
| if (!fileMetaData.isSetEncryption_algorithm()) { // Plaintext file |
There was a problem hiding this comment.
is there a requirement in the code formatting rules in this community to keep comments in separate lines?
There was a problem hiding this comment.
No. I see most of them on up line but a few on the same line. It is not a must.
There was a problem hiding this comment.
Actually, based on parquet-mr README:
Generally speaking, stick to the Sun Java Code Conventions
Based on the related section both should be fine.
@shangxinli, I've done a rebase on the branch |
|
@ggershinsky Do you have time to review the code? |
…rsion Depends on PR apache#776 for [PARQUET-1229] and on PR apache#12 in parquet-testing for [PARQUET-1807]. JIRA: https://issues.apache.org/jira/browse/PARQUET-1807 Add a test for writing and reading parquet in a number of encryption and decryption configurations. Add interop test that reads files from parquet-testing GitHub repository, that were written by parquet-cpp. This adds parquet-testing repo as a submodule. Run the following to populate the "submodules/parquet-testing/" folder: git submodule update --init --recursive
You probably meant @gszadovszky |
gszadovszky
left a comment
There was a problem hiding this comment.
I have a couple of comments but did not review all yet.
parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/OffsetIndex.java
Outdated
Show resolved
Hide resolved
parquet-hadoop/src/main/java/org/apache/parquet/crypto/AesCipher.java
Outdated
Show resolved
Hide resolved
parquet-format-structures/src/main/java/org/apache/parquet/format/BlockCipher.java
Outdated
Show resolved
Hide resolved
parquet-hadoop/src/main/java/org/apache/parquet/crypto/AesGcmDecryptor.java
Outdated
Show resolved
Hide resolved
parquet-hadoop/src/main/java/org/apache/parquet/crypto/AesGcmDecryptor.java
Outdated
Show resolved
Hide resolved
| ParquetMetadata parquetMetadata = fromParquetMetadata(fileMetaData); | ||
|
|
||
| if (!encryptedFooter && null != fileDecryptor) { | ||
| if (!fileMetaData.isSetEncryption_algorithm()) { // Plaintext file |
There was a problem hiding this comment.
Actually, based on parquet-mr README:
Generally speaking, stick to the Sun Java Code Conventions
Based on the related section both should be fine.
parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java
Show resolved
Hide resolved
parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java
Outdated
Show resolved
Hide resolved
gszadovszky
left a comment
There was a problem hiding this comment.
Second round. Still not finished. :(
parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java
Outdated
Show resolved
Hide resolved
parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java
Outdated
Show resolved
Hide resolved
parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java
Outdated
Show resolved
Hide resolved
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java
Outdated
Show resolved
Hide resolved
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java
Outdated
Show resolved
Hide resolved
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java
Outdated
Show resolved
Hide resolved
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java
Outdated
Show resolved
Hide resolved
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java
Outdated
Show resolved
Hide resolved
gszadovszky
left a comment
There was a problem hiding this comment.
Some other comments. This time, I've reached the end. :)
I know, this is not the whole feature yet but would like to ensure that at the end we will have proper test coverity. So, the question is do we have or planning to have tests to verify the working of semi-encrypted files (some columns and the footer are plaintext) with previous readers.
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java
Outdated
Show resolved
Hide resolved
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java
Outdated
Show resolved
Hide resolved
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java
Outdated
Show resolved
Hide resolved
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java
Outdated
Show resolved
Hide resolved
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java
Outdated
Show resolved
Hide resolved
parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestBloomEncryption.java
Show resolved
Hide resolved
parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestEncryption.java
Outdated
Show resolved
Hide resolved
parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestEncryption.java
Outdated
Show resolved
Hide resolved
parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestEncryption.java
Outdated
Show resolved
Hide resolved
parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestEncryption.java
Outdated
Show resolved
Hide resolved
gszadovszky
left a comment
There was a problem hiding this comment.
Now, we are in sync. I will try to keep it this way. ;)
parquet-hadoop/src/main/java/org/apache/parquet/crypto/AesCipher.java
Outdated
Show resolved
Hide resolved
We test this manually. We also have an automatic test that runs the current reader code on files that were previously created with other writers (such as parquet-cpp with encryption; these files are kept in parquet-testing repo and fetched from there). But I don't know how to automate running of previous readers on the files written by the current code - ? |
…FileWriter.java Co-authored-by: Gabor Szadovszky <gabor@apache.org>
…a/ColumnChunkMetaData.java Co-authored-by: Gabor Szadovszky <gabor@apache.org>
Appreciated! |
Good question. We should have the tooling to provide such tests but not sure if we already have it. I am fine with testing it manually for now just document it in details (e.g. adding to the git comment of the test PR). |
|
The current review round is addressed; the checks pass. |
…rsion Depends on PR apache#776 for [PARQUET-1229] and on PR apache#12 in parquet-testing for [PARQUET-1807]. JIRA: https://issues.apache.org/jira/browse/PARQUET-1807 Add a test for writing and reading parquet in a number of encryption and decryption configurations. Add interop test that reads files from parquet-testing GitHub repository, that were written by parquet-cpp. This adds parquet-testing repo as a submodule. Run the following to populate the "submodules/parquet-testing/" folder: git submodule update --init --recursive
* PARQUET-1228: Format Structures encryption (#613) * PARQUET-1286: Crypto package (#614) * PARQUET-1818: Fix bloom/encryption collision in format-structures (#771) * PARQUET-1817: Crypto Properties Factory (#769) * PARQUET-1229: Parquet MR encryption (#776) * PARQUET-1807: Encryption: Interop and Function test suite for Java version (#782) * PARQUET-1373: Encryption key tools (#615) Co-authored-by: shangxinli <31421745+shangxinli@users.noreply.github.com> Co-authored-by: Maya Anderson <mayaa@il.ibm.com>
Make sure you have checked all steps below.
Jira
Tests
Commits
Documentation