Skip to content

PARQUET-1817: Crypto Properties Factory#769

Merged
gszadovszky merged 16 commits intoapache:encryptionfrom
shangxinli:parquet-1817
Mar 30, 2020
Merged

PARQUET-1817: Crypto Properties Factory#769
gszadovszky merged 16 commits intoapache:encryptionfrom
shangxinli:parquet-1817

Conversation

@shangxinli
Copy link
Copy Markdown
Contributor

Make sure you have checked all steps below.

Jira

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain Javadoc that explain what it does

@ggershinsky
Copy link
Copy Markdown
Contributor

Travis fails because the branch is broken (on parquet-format-structures - unhandled mix of encryption and bloom filters). A known problem, I'll send a fix for that today.

@ggershinsky
Copy link
Copy Markdown
Contributor

the branch is fixed by #771 . Once it is merged, your pr should pass CI checks after rebase.

@gszadovszky
Copy link
Copy Markdown
Contributor

#771 is merged, Travis restarted.

@ggershinsky
Copy link
Copy Markdown
Contributor

"[WARNING] Files with unapproved licenses:
/home/travis/build/apache/parquet-mr/parquet-hadoop/src/main/java/org/apache/parquet/crypto/CryptoPropertiesFactory.java"

@shangxinli
Copy link
Copy Markdown
Contributor Author

Fixed it. Thanks Gidon and Gabor!

@ggershinsky
Copy link
Copy Markdown
Contributor

@shangxinli @gszadovszky my pull request is ready, but now it depends on this one. I'll send it after 769 is merged.

@shangxinli
Copy link
Copy Markdown
Contributor Author

shangxinli commented Mar 19, 2020

@shangxinli @gszadovszky my pull request is ready, but now it depends on this one. I'll send it after 769 is merged.

The build is green now. Once @gszadovszky review the code, he or myself can merge it and you are ready to go.

Copy link
Copy Markdown
Contributor

@gszadovszky gszadovszky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume we will see some implementations of this class in the unit tests of the crypto feature. Just double check...

@shangxinli
Copy link
Copy Markdown
Contributor Author

I assume we will see some implementations of this class in the unit tests of the crypto feature. Just double check...

Yes.

* the unit test (TBD) for example
*
* @param fileHadoopConfig Configuration that is used to pass the needed information, e.g. KMS uri
* @param tempFilePath File path of the parquet file
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ggershinsky, can you make comments about what to be used by this field? Our implementation doesn't need it so far.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest also to add a comment to the interface itself, to remind about its purpose (this helps to understand the method parameters), something like "CryptoPropertiesFactory interface enables transparent activation of Parquet encryption. It's custom implementations produce encryption and decryption properties for each Parquet file, using the information available in Parquet file writers and readers: file path, file extended schema (in writers only) - and also Hadoop configuration properties that can pass custom parameters required by a crypto factory. A factory implementation can use or ignore any of these parameters."

As for the "tempFilePath" - "File path of the parquet file being written. Can be used for AAD prefix creation, key material management, etc. Implementations must not presume the path is permanent, as the file can be moved or renamed later"

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point


import java.io.IOException;

import static org.apache.parquet.Preconditions.checkNotNull;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this import is not needed

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed.

@ggershinsky
Copy link
Copy Markdown
Contributor

strange. All Travis checks are green,
https://travis-ci.org/github/apache/parquet-mr/builds/666364206?utm_source=github_status&utm_medium=notification
but the PR is stuck on yellow.

@chenjunjiedada
Copy link
Copy Markdown
Contributor

@ggershinsky , FYI, Iceberg has a PR that has the same status.

Copy link
Copy Markdown
Contributor

@gszadovszky gszadovszky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Last time I've forgot about one major thing. (Sorry for the late idea.)
parquet-mr library is usually used for either reading or writing at a time but not both. I think, it is not a good practice to expect both encryption and decryption properties while only one would be required.
What do you guys think?

@ggershinsky
Copy link
Copy Markdown
Contributor

Last time I've forgot about one major thing. (Sorry for the late idea.)
parquet-mr library is usually used for either reading or writing at a time but not both. I think, it is not a good practice to expect both encryption and decryption properties while only one would be required.
What do you guys think?

On one hand, its certainly possible to split this into two interfaces, to make it fit the parquet-mr practices. On the other hand, a person working with encryption in eg Spark shell might want to both write and read the files. He would define two Hadoop config params for that, "parquet.encryption.factory.class" and "parquet.decryption.factory.class", which is a bit cumbersome. Or maybe one parameter is enough - if the class implements both interfaces, and we'll load it as either encryption or decryption as needed (from writer or reader, respectively). If this works in Java (currently can't see why not), then could be the right thing to do.

shangxinli and others added 3 commits March 25, 2020 07:23
…ropertiesFactory.java

Co-Authored-By: Gabor Szadovszky <gabor@apache.org>
…ropertiesFactory.java

Co-Authored-By: Gabor Szadovszky <gabor@apache.org>
…ropertiesFactory.java

Co-Authored-By: Gabor Szadovszky <gabor@apache.org>
@gszadovszky
Copy link
Copy Markdown
Contributor

Last time I've forgot about one major thing. (Sorry for the late idea.)
parquet-mr library is usually used for either reading or writing at a time but not both. I think, it is not a good practice to expect both encryption and decryption properties while only one would be required.
What do you guys think?

On one hand, its certainly possible to split this into two interfaces, to make it fit the parquet-mr practices. On the other hand, a person working with encryption in eg Spark shell might want to both write and read the files. He would define two Hadoop config params for that, "parquet.encryption.factory.class" and "parquet.decryption.factory.class", which is a bit cumbersome. Or maybe one parameter is enough - if the class implements both interfaces, and we'll load it as either encryption or decryption as needed (from writer or reader, respectively). If this works in Java (currently can't see why not), then could be the right thing to do.

I like this idea. It should work just fine.

@shangxinli
Copy link
Copy Markdown
Contributor Author

Last time I've forgot about one major thing. (Sorry for the late idea.)
parquet-mr library is usually used for either reading or writing at a time but not both. I think, it is not a good practice to expect both encryption and decryption properties while only one would be required.
What do you guys think?

On one hand, its certainly possible to split this into two interfaces, to make it fit the parquet-mr practices. On the other hand, a person working with encryption in eg Spark shell might want to both write and read the files. He would define two Hadoop config params for that, "parquet.encryption.factory.class" and "parquet.decryption.factory.class", which is a bit cumbersome. Or maybe one parameter is enough - if the class implements both interfaces, and we'll load it as either encryption or decryption as needed (from writer or reader, respectively). If this works in Java (currently can't see why not), then could be the right thing to do.

I agree with Gidon. Aside from that, some ETL transform jobs need to decrypt raw data Parquet files and encrypt transformed Parquet files.

shangxinli and others added 4 commits March 25, 2020 07:57
…ropertiesFactory.java

Co-Authored-By: ggershinsky <ggershinsky@users.noreply.github.com>
…ropertiesFactory.java

Co-Authored-By: ggershinsky <ggershinsky@users.noreply.github.com>
@shangxinli
Copy link
Copy Markdown
Contributor Author

@gszadovszky, The latest commit needs to be changed and I think it would be better to add unit tests now than TBD. You can hold off for now and I will let you know when it is ready for review.

@shangxinli
Copy link
Copy Markdown
Contributor Author

@gszadovszky, I think the change is ready to review now.

shangxinli and others added 4 commits March 27, 2020 08:18
…ionPropertiesFactory.java

Co-Authored-By: Gabor Szadovszky <gabor@apache.org>
…CryptoRuntimeException.java

Co-Authored-By: Gabor Szadovszky <gabor@apache.org>
…ionPropertiesFactory.java

Co-Authored-By: Gabor Szadovszky <gabor@apache.org>
@gszadovszky gszadovszky merged commit bd876bd into apache:encryption Mar 30, 2020
shangxinli added a commit to shangxinli/parquet-mr that referenced this pull request Apr 20, 2020
@shangxinli shangxinli deleted the parquet-1817 branch June 7, 2020 21:51
shangxinli added a commit to shangxinli/parquet-mr that referenced this pull request Jul 15, 2020
gszadovszky pushed a commit that referenced this pull request Jul 29, 2020
* PARQUET-1228: Format Structures encryption (#613)
* PARQUET-1286: Crypto package (#614)
* PARQUET-1818: Fix bloom/encryption collision in format-structures (#771)
* PARQUET-1817: Crypto Properties Factory (#769)
* PARQUET-1229: Parquet MR encryption (#776)
* PARQUET-1807: Encryption: Interop and Function test suite for Java version (#782)
* PARQUET-1373: Encryption key tools (#615)

Co-authored-by: shangxinli <31421745+shangxinli@users.noreply.github.com>
Co-authored-by: Maya Anderson <mayaa@il.ibm.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants