Skip to content

PARQUET-1229: Parquet MR encryption#776

Merged
gszadovszky merged 17 commits intoapache:encryptionfrom
ggershinsky:p1229-parquet-mr-encryption
May 29, 2020
Merged

PARQUET-1229: Parquet MR encryption#776
gszadovszky merged 17 commits intoapache:encryptionfrom
ggershinsky:p1229-parquet-mr-encryption

Conversation

@ggershinsky
Copy link
Copy Markdown
Contributor

Make sure you have checked all steps below.

Jira

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain Javadoc that explain what it does

@ggershinsky
Copy link
Copy Markdown
Contributor Author

@gszadovszky The encryption pull request is submitted, with changes in "hadoop" and "column" modules. Travis fails on "The job exceeded the maximum log length, and has been terminated". All tests up that point pass successfully (including the tests in hadoop and column modules).

@gszadovszky
Copy link
Copy Markdown
Contributor

FYI: Created PARQUET-1832. I'll try to fix this issue ASAP.

@ggershinsky ggershinsky force-pushed the p1229-parquet-mr-encryption branch from 247a2a5 to 45b917f Compare April 1, 2020 15:31
@ggershinsky
Copy link
Copy Markdown
Contributor Author

Thank you

andersonm-ibm added a commit to andersonm-ibm/parquet-mr that referenced this pull request Apr 6, 2020
…rsion.

Depends on PR apache#776 for [PARQUET-1229] and on PR apache#12 in parquet-testing for [PARQUET-1807].
JIRA: https://issues.apache.org/jira/browse/PARQUET-1807
Add a test for writing and reading parquet in a number of encryption and decryption configurations.
Add interop test that reads files from parquet-testing GitHub repository, that were written by parquet-cpp.
This adds parquet-testing repo as a submodule.
Run the following to populate the "submodules/parquet-testing/" folder:
   git submodule update --init --recursive
andersonm-ibm added a commit to andersonm-ibm/parquet-mr that referenced this pull request Apr 6, 2020
…rsion

Depends on PR apache#776 for [PARQUET-1229] and
on PR apache#12 in parquet-testing for [PARQUET-1807].
JIRA: https://issues.apache.org/jira/browse/PARQUET-1807
Add a test for writing and reading parquet in a number of encryption
and decryption configurations.
Add interop test that reads files from parquet-testing GitHub
repository, that were written by parquet-cpp.
This adds parquet-testing repo as a submodule.
Run the following to populate the "submodules/parquet-testing/" folder:
   git submodule update --init --recursive
@shangxinli
Copy link
Copy Markdown
Contributor

Is this ready to review? Since there is no comment yet, can you squash it to one single commit to make the review easier?

@ggershinsky
Copy link
Copy Markdown
Contributor Author

preferably reviewed after the Travis fix is in. @gszadovszky @shangxinli can you apply #777 to the encryption branch.

@ggershinsky
Copy link
Copy Markdown
Contributor Author

can you squash it to one single commit to make the review easier?

This can be reviewed as a single commit at
https://github.com/apache/parquet-mr/pull/776/files

@ggershinsky ggershinsky force-pushed the p1229-parquet-mr-encryption branch from fe7ab01 to c6b0cde Compare April 22, 2020 10:33
@ggershinsky
Copy link
Copy Markdown
Contributor Author

the PR is ready for review

@shangxinli
Copy link
Copy Markdown
Contributor

preferably reviewed after the Travis fix is in. @gszadovszky @shangxinli can you apply #777 to the encryption branch.

It seems I still don't have permission to do so. @gszadovszky, I did 'git cherry-pick' and then git push to encryption branch', but got permission error. Do I have to create a PR for it?

}

private void addRowGroup(ParquetMetadata parquetMetadata, List<RowGroup> rowGroups, BlockMetaData block) {
private void addRowGroup(ParquetMetadata parquetMetadata, List<RowGroup> rowGroups, BlockMetaData block,
Copy link
Copy Markdown
Contributor

@shangxinli shangxinli Apr 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we keep original addRowGroup() intact and just add a new one with InternalFileEncryptor to isolate the change's impact? There is not much duplicate code if doing so and we can refactor the existing code with helper functions. In the majority use cases, they are non-encryption cases.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

encryption is isolated there, with if (null != fileEncryptor) and if (encryptMetaData) - similar to the if (columnIndexRef != null) and if (offsetIndexRef != null) in the same function.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not all the changes are isolated. Generally, adding 'if/else' will add diverge the code and add the complexity. One other thing is regression thinking. If fileEncryptor is null, which would be most of the case, then it just executes the existing method without change. It would be less error prone.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Many functions can be run either with or without encryption. Duplicating them will result in hundreds or thousands of duplicate code lines. This will make code maintenance (changes/fixes) a headache. Instead, we isolate encryption with if switches, without duplicating the existing code.

The same goes for other recent new features (column indexes and bloom filters) - they are isolated with an if switch, instead of code duplication. See if (columnIndexRef != null) and if (offsetIndexRef != null) in this addRowGroup function.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not a blocking comment and I am fine with it. But generally speaking, adding too much nested if/else diverges the code path and causes the complexity for reading. One way to avoid duplicating is to wrap them up in helper functions. I understand column indexes and bloom filters already did that but if we keep adding features like this, the code will become less and less readable.

ParquetMetadata parquetMetadata = fromParquetMetadata(fileMetaData);

if (!encryptedFooter && null != fileDecryptor) {
if (!fileMetaData.isSetEncryption_algorithm()) { // Plaintext file
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move comments up

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a requirement in the code formatting rules in this community to keep comments in separate lines?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. I see most of them on up line but a few on the same line. It is not a must.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, based on parquet-mr README:

Generally speaking, stick to the Sun Java Code Conventions

Based on the related section both should be fine.

@gszadovszky
Copy link
Copy Markdown
Contributor

preferably reviewed after the Travis fix is in. @gszadovszky @shangxinli can you apply #777 to the encryption branch.

It seems I still don't have permission to do so. @gszadovszky, I did 'git cherry-pick' and then git push to encryption branch', but got permission error. Do I have to create a PR for it?

@shangxinli, I've done a rebase on the branch encryption. Maybe, that's why you were unable to push a commit because after the rebase it would not be a fast forward commit (would need a force-push).
(Because of the rebase the Travis fix is already in.)
If you have any problems with manual push to the repo, let's bring it offline. I'm happy to help.

@shangxinli
Copy link
Copy Markdown
Contributor

@ggershinsky Do you have time to review the code?

andersonm-ibm added a commit to andersonm-ibm/parquet-mr that referenced this pull request May 10, 2020
…rsion

Depends on PR apache#776 for [PARQUET-1229] and
on PR apache#12 in parquet-testing for [PARQUET-1807].
JIRA: https://issues.apache.org/jira/browse/PARQUET-1807
Add a test for writing and reading parquet in a number of encryption
and decryption configurations.
Add interop test that reads files from parquet-testing GitHub
repository, that were written by parquet-cpp.
This adds parquet-testing repo as a submodule.
Run the following to populate the "submodules/parquet-testing/" folder:
   git submodule update --init --recursive
@ggershinsky
Copy link
Copy Markdown
Contributor Author

@ggershinsky Do you have time to review the code?

You probably meant @gszadovszky

Copy link
Copy Markdown
Contributor

@gszadovszky gszadovszky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a couple of comments but did not review all yet.

ParquetMetadata parquetMetadata = fromParquetMetadata(fileMetaData);

if (!encryptedFooter && null != fileDecryptor) {
if (!fileMetaData.isSetEncryption_algorithm()) { // Plaintext file
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, based on parquet-mr README:

Generally speaking, stick to the Sun Java Code Conventions

Based on the related section both should be fine.

Copy link
Copy Markdown
Contributor

@gszadovszky gszadovszky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Second round. Still not finished. :(

Copy link
Copy Markdown
Contributor

@gszadovszky gszadovszky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some other comments. This time, I've reached the end. :)

I know, this is not the whole feature yet but would like to ensure that at the end we will have proper test coverity. So, the question is do we have or planning to have tests to verify the working of semi-encrypted files (some columns and the footer are plaintext) with previous readers.

Copy link
Copy Markdown
Contributor

@gszadovszky gszadovszky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now, we are in sync. I will try to keep it this way. ;)

@ggershinsky
Copy link
Copy Markdown
Contributor Author

I know, this is not the whole feature yet but would like to ensure that at the end we will have proper test coverity. So, the question is do we have or planning to have tests to verify the working of semi-encrypted files (some columns and the footer are plaintext) with previous readers.

We test this manually. We also have an automatic test that runs the current reader code on files that were previously created with other writers (such as parquet-cpp with encryption; these files are kept in parquet-testing repo and fetched from there). But I don't know how to automate running of previous readers on the files written by the current code - ?

ggershinsky and others added 2 commits May 21, 2020 17:53
…FileWriter.java

Co-authored-by: Gabor Szadovszky <gabor@apache.org>
…a/ColumnChunkMetaData.java

Co-authored-by: Gabor Szadovszky <gabor@apache.org>
@ggershinsky
Copy link
Copy Markdown
Contributor Author

Now, we are in sync. I will try to keep it this way. ;)

Appreciated!
Addressing today's comments will keep me busy for a while :), I'll get back next week.

@gszadovszky
Copy link
Copy Markdown
Contributor

I know, this is not the whole feature yet but would like to ensure that at the end we will have proper test coverity. So, the question is do we have or planning to have tests to verify the working of semi-encrypted files (some columns and the footer are plaintext) with previous readers.

We test this manually. We also have an automatic test that runs the current reader code on files that were previously created with other writers (such as parquet-cpp with encryption; these files are kept in parquet-testing repo and fetched from there). But I don't know how to automate running of previous readers on the files written by the current code - ?

Good question. We should have the tooling to provide such tests but not sure if we already have it. I am fine with testing it manually for now just document it in details (e.g. adding to the git comment of the test PR).

@ggershinsky
Copy link
Copy Markdown
Contributor Author

The current review round is addressed; the checks pass.

@gszadovszky gszadovszky merged commit 06b5372 into apache:encryption May 29, 2020
andersonm-ibm added a commit to andersonm-ibm/parquet-mr that referenced this pull request May 29, 2020
…rsion

Depends on PR apache#776 for [PARQUET-1229] and
on PR apache#12 in parquet-testing for [PARQUET-1807].
JIRA: https://issues.apache.org/jira/browse/PARQUET-1807
Add a test for writing and reading parquet in a number of encryption
and decryption configurations.
Add interop test that reads files from parquet-testing GitHub
repository, that were written by parquet-cpp.
This adds parquet-testing repo as a submodule.
Run the following to populate the "submodules/parquet-testing/" folder:
   git submodule update --init --recursive
shangxinli pushed a commit to shangxinli/parquet-mr that referenced this pull request Jul 15, 2020
gszadovszky pushed a commit that referenced this pull request Jul 28, 2020
gszadovszky pushed a commit that referenced this pull request Jul 29, 2020
* PARQUET-1228: Format Structures encryption (#613)
* PARQUET-1286: Crypto package (#614)
* PARQUET-1818: Fix bloom/encryption collision in format-structures (#771)
* PARQUET-1817: Crypto Properties Factory (#769)
* PARQUET-1229: Parquet MR encryption (#776)
* PARQUET-1807: Encryption: Interop and Function test suite for Java version (#782)
* PARQUET-1373: Encryption key tools (#615)

Co-authored-by: shangxinli <31421745+shangxinli@users.noreply.github.com>
Co-authored-by: Maya Anderson <mayaa@il.ibm.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants