PARQUET-1373: Encryption key tools#615
Conversation
(force-pushed 1ce6bcc to 8b306bf)
(force-pushed 48a109a to a681c00)
@ggershinsky, is it still in draft state?

The implementation is complete. With the recent addition of a unit test, the draft state can be switched off. I'll rebase it first to make sure the Travis checks pass.
(force-pushed 2fc82e0 to 3554037)
@gszadovszky The PR is complete and ready for review
gszadovszky
left a comment
Reviewed most of it but still have some left...
In general I would expect javadoc comments for public API classes/methods. It would also be nice to spell out the acronyms where used, even if they are well known in the crypto world.
There are a couple of concurrent code parts. We should somehow test that the code is thread-safe. I know it is not easy, but if you execute the related code paths with several parallel threads (e.g. encrypting/decrypting several files in multiple threads) we might find some issues.
Some thoughts that might not be tightly related to this PR, but to the encryption feature:
I am not familiar with encryption. It would be nice if someone with crypto experience could also sign off on this (or on the final PR to master), even if she/he is not a committer.
As far as I remember, there were a couple of ongoing PRs related to encryption. Please update/rebase them if they are required, or close them if not.
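One way to exercise the concurrent code paths as suggested: release many threads through the same path simultaneously and fail the test if any of them throws. A minimal sketch (class and method names are hypothetical, not part of this PR; the `Runnable` stands in for writing or reading one encrypted file):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Concurrency smoke test sketch: run the same code path from many
// threads at once, released together by a latch to maximize contention.
public class ParallelPathTest {
  public static int runConcurrently(Runnable codePath, int threads) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    CountDownLatch start = new CountDownLatch(1);
    AtomicInteger ok = new AtomicInteger();
    for (int i = 0; i < threads; i++) {
      pool.submit(() -> {
        try {
          start.await();        // all threads start together
          codePath.run();       // e.g. encrypt or decrypt one file
          ok.incrementAndGet();
        } catch (Exception e) {
          e.printStackTrace();  // a failed thread leaves ok short of the total
        }
      });
    }
    start.countDown();
    pool.shutdown();
    pool.awaitTermination(30, TimeUnit.SECONDS);
    return ok.get();
  }

  public static void main(String[] args) throws Exception {
    int done = runConcurrently(() -> { /* encrypt/decrypt one file here */ }, 8);
    if (done != 8) throw new AssertionError("expected 8 successful threads, got " + done);
    System.out.println("all " + done + " threads completed");
  }
}
```

A race in a shared cache or key manager typically shows up as an exception in one of the worker threads, which this harness surfaces as a reduced success count.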
Review threads on:
parquet-hadoop/src/main/java/org/apache/parquet/crypto/keytools/samples/VaultClient.java
parquet-hadoop/src/main/java/org/apache/parquet/crypto/keytools/FileKeyUnwrapper.java
parquet-hadoop/src/main/java/org/apache/parquet/crypto/keytools/KeyToolkit.java
...t-hadoop/src/main/java/org/apache/parquet/crypto/keytools/PropertiesDrivenCryptoFactory.java
We'll make the PR test run multiple threads, writing a number of files in parallel (same for reading).
…e reading, KMS instance ID doesn't have to be provided in properties, but can be read from the metadata. Add RemoteKmsClient abstract class to assist implementing KmsClients for remote KMSs that are accessed via URL. Make DoubleWrappedKeyManager inherit from WrappedKeyManager and make FileKeyManager an abstract class. Add a static factory method to FileKeyManager to initialize an appropriate KmsClient and key manager. KMS URL should be specified in properties either directly or in a list. KMS instance ID is either default, or should be specified in properties or read from footer metadata.
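The double-wrapping idea above can be illustrated with a small standalone sketch (not the parquet-mr code; all names here are hypothetical): the data encryption key (DEK) is wrapped locally with a key encryption key (KEK), and only the KEK is wrapped by the KMS master key, so a single KMS call can cover many files or columns.

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.security.SecureRandom;
import java.util.Arrays;

// Double-wrapping sketch using AES-GCM: DEK wrapped by KEK (locally),
// KEK wrapped by the master key (one remote KMS call in a real setup).
public class DoubleWrapSketch {
  static final SecureRandom RND = new SecureRandom();

  static byte[] wrap(SecretKey wrapping, byte[] keyBytes) throws Exception {
    byte[] iv = new byte[12];                     // fresh GCM nonce per wrap
    RND.nextBytes(iv);
    Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
    c.init(Cipher.ENCRYPT_MODE, wrapping, new GCMParameterSpec(128, iv));
    byte[] ct = c.doFinal(keyBytes);
    byte[] out = new byte[12 + ct.length];        // store iv || ciphertext
    System.arraycopy(iv, 0, out, 0, 12);
    System.arraycopy(ct, 0, out, 12, ct.length);
    return out;
  }

  static byte[] unwrap(SecretKey wrapping, byte[] wrapped) throws Exception {
    Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
    c.init(Cipher.DECRYPT_MODE, wrapping, new GCMParameterSpec(128, wrapped, 0, 12));
    return c.doFinal(wrapped, 12, wrapped.length - 12);
  }

  public static void main(String[] args) throws Exception {
    KeyGenerator gen = KeyGenerator.getInstance("AES");
    gen.init(128);
    SecretKey masterKey = gen.generateKey();      // stands in for the KMS master key
    SecretKey kek = gen.generateKey();
    byte[] dek = gen.generateKey().getEncoded();

    byte[] wrappedDek = wrap(kek, dek);                    // local wrap, no KMS call
    byte[] wrappedKek = wrap(masterKey, kek.getEncoded()); // one KMS call per KEK

    // Reader side: unwrap the KEK once via the KMS, then unwrap DEKs locally.
    SecretKey kek2 = new SecretKeySpec(unwrap(masterKey, wrappedKek), "AES");
    byte[] dek2 = unwrap(kek2, wrappedDek);
    if (!Arrays.equals(dek, dek2)) throw new AssertionError("round trip failed");
    System.out.println("DEK round trip ok");
  }
}
```

With many DEKs sharing one KEK, the expensive remote unwrap happens once and can be cached, which is the motivation for the KEK caches discussed later in this thread.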
…r token. KmsClient is per token, and the read/write KEK caches too. Add a default token value for InMemoryKMS, which has no tokens. Use ConcurrentHashMap for caches with computeIfAbsent. Add expiration to the caches - both time-based and on-demand. On expiration, delete the per-token entries from caches. Add a method for cache invalidation per token. Add abstract methods to be implemented by RemoteKmsClients.
instead of the higher-level ParquetCryptoRuntimeException. Change constant names to uppercase.
…s/FileKeyUnwrapper.java Co-authored-by: Gabor Szadovszky <gabor@apache.org>
…s/FileKeyUnwrapper.java Co-authored-by: Gabor Szadovszky <gabor@apache.org>
…s/FileKeyUnwrapper.java Co-authored-by: Gabor Szadovszky <gabor@apache.org>
…s/FileKeyWrapper.java Co-authored-by: Gabor Szadovszky <gabor@apache.org>
…s/KeyToolkit.java Co-authored-by: Gabor Szadovszky <gabor@apache.org>
(force-pushed 4404ff3 to 75d8976)
gszadovszky
left a comment
As for the VaultClient - one option is to move it also to the tests location, commenting out the HTTP calls so the okhttp3 dependency can be removed from the pom. The comments will explain the code and how to build it if needed. The other option is to remove VaultClient altogether from apache/parquet-mr (I can keep it e.g. in my fork, as a pointer if anybody's interested to see a sample). What do you think?
If you move VaultClient to test, okhttp3 will be a test dependency. I'm fine with that.
ConcurrentMap uses segment-level synchronization for write operations and allows synchronization-free read operations; this makes it faster than a HashMap with synchronized methods.
Yes, I know how ConcurrentHashMap works. What I wanted to say is that you are using explicit synchronization as well. As you already use a ConcurrentMap, you might implement these synchronized code parts by using the methods of ConcurrentMap. I've put in some examples that might work.
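The pattern under discussion can be sketched as follows (class and field names are hypothetical, not the PR's actual identifiers): instead of a synchronized check-then-create block around the map, `ConcurrentHashMap.computeIfAbsent` guarantees the per-token inner cache is created exactly once, even under contention.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: per-token KEK caches managed without explicit synchronization.
public class TokenCacheSketch {
  // outer map: access token -> that token's KEK cache
  static final ConcurrentMap<String, ConcurrentMap<String, byte[]>> KEK_CACHE_PER_TOKEN =
      new ConcurrentHashMap<>();
  static final AtomicInteger CREATIONS = new AtomicInteger();

  static ConcurrentMap<String, byte[]> cacheForToken(String token) {
    // the mapping function runs at most once per absent key, atomically
    return KEK_CACHE_PER_TOKEN.computeIfAbsent(token, t -> {
      CREATIONS.incrementAndGet();
      return new ConcurrentHashMap<>();
    });
  }

  static void invalidateToken(String token) {
    KEK_CACHE_PER_TOKEN.remove(token);  // on-demand, per-token invalidation
  }

  public static void main(String[] args) {
    ConcurrentMap<String, byte[]> a = cacheForToken("token1");
    ConcurrentMap<String, byte[]> b = cacheForToken("token1");
    if (a != b || CREATIONS.get() != 1) throw new AssertionError("cache created more than once");
    invalidateToken("token1");
    if (cacheForToken("token1") == a) throw new AssertionError("invalidation did not drop the cache");
    System.out.println("per-token cache created once and invalidated on demand");
  }
}
```

Note that the mapping function passed to `computeIfAbsent` should be short and side-effect free where possible, since it runs while the map bin is locked.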
Please, check why Travis fails.
Another point of view came up about handling sensitive data in memory. Java does not clean memory after garbage-collecting objects. This means that sensitive data must be manually cleaned after use, otherwise it might get compromised by another Java application in the same JVM, or even by another process after the JVM exits. For the same reason, String objects should never contain sensitive information, as the char[] behind the object might not get garbage collected after the String object itself gets dropped.
I did not find any particular bad practice in the code or any examples of the listed situations; I just wanted to highlight that we should think about this as well.
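The practice described above can be sketched in a few lines (illustrative only, names hypothetical): keep secrets in `byte[]` or `char[]` rather than `String`, and zero them in a `finally` block so the plaintext key does not linger on the heap after use.

```java
import java.util.Arrays;

// Sketch of best-effort key cleanup: wipe the byte[] once it is no longer needed.
public class KeyWipeSketch {
  public static boolean isWiped(byte[] key) {
    for (byte b : key) if (b != 0) return false;
    return true;
  }

  public static void useAndWipe(byte[] keyBytes) {
    try {
      // ... initialize a cipher with keyBytes here ...
    } finally {
      // best-effort cleanup; copies made inside libraries are out of our control
      Arrays.fill(keyBytes, (byte) 0);
    }
  }

  public static void main(String[] args) {
    byte[] key = {1, 2, 3, 4};
    useAndWipe(key);
    if (!isWiped(key)) throw new AssertionError("key not wiped");
    System.out.println("key zeroed after use");
  }
}
```

As the discussion below notes, this is only best-effort: any library that internally copies the key bytes into a `String` or its own buffers defeats the wipe, which is part of why the threat model falls back on trusting the host.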
Review threads on:
...t-hadoop/src/main/java/org/apache/parquet/crypto/keytools/PropertiesDrivenCryptoFactory.java
parquet-hadoop/src/main/java/org/apache/parquet/crypto/keytools/samples/VaultClient.java
parquet-hadoop/src/test/java/org/apache/parquet/crypto/TestPropertiesDrivenEncryption.java
Sounds good, and thank you for the examples! We've already applied your code (not pushed yet); it indeed allowed us to remove the explicit synchronization, making the cache implementation cleaner and faster.
Sorry, I should have mentioned that it will take a few more commits to fully address this round of comments (and fix the unit tests). Once all commits are in, I will squash them to simplify the review, and will post a comment here.
Yep, keeping secret data in Java strings is a notorious problem. I think the general consensus is not to rely on GC or explicit byte wiping, but to remember that these Java processes must run in a trusted environment anyway, simply because they work with confidential information, ranging from the encryption keys to the sensitive data itself. Micro-managing the memory that holds confidential information is always hard, and is basically impossible with Java. It goes beyond Parquet.

One example: the KMS client implementations send secret tokens and fetch explicit encryption keys using a custom HTTP library. There is no guarantee this library doesn't use strings (most likely, it does). Another example: the secret tokens are passed as a Hadoop property from Spark or another framework; this is likely to be implemented with strings. Moreover, the tokens are built in an access control system, then sent to a user, then to a Spark driver, then to Spark workers (or other framework components). There is no way to control this, except to rely on HTTPS for transport security, and on running framework drivers/workers in a trusted environment for memory security.

In other words, our threat model is simple. We don't trust the storage: encrypted Parquet files can be accessed by malicious parties, but they won't be able to read them. We do trust the framework hosts (where the JVM runs): if these are breached, the secret data can be stolen from any part of host memory / disk pages; not just the Parquet lib memory, but framework memory, HTTP libs, etc. Memory protection is a holy grail in this field, addressed by technologies like VMs, containers, hardware enclaves, etc. Parquet encryption is focused on data-in-storage protection; data-in-memory protection is covered by other technologies.
Thanks a lot for the explanation. Agreed that protecting sensitive data in memory cannot be handled by Parquet. I'll wait for your comment for the next review round.
(force-pushed ac17c55 to 94c17e4)
(force-pushed 94c17e4 to 055bb39)
@gszadovszky 055bb39 is the squash of all commits that address the latest review round
Review threads on:
parquet-hadoop/src/main/java/org/apache/parquet/crypto/keytools/KeyMaterial.java
...t-hadoop/src/main/java/org/apache/parquet/crypto/keytools/PropertiesDrivenCryptoFactory.java
@gszadovszky the latest review round is addressed by 0355f65
gszadovszky
left a comment
Thanks a lot for the proper documentation. One more thing that I've missed before, related to the JSONs.
Review thread on:
parquet-hadoop/src/main/java/org/apache/parquet/crypto/keytools/KeyMaterial.java
Close/reopen this PR to re-trigger Travis.
* comments * update key tools * double wrap for minimizing KMS calls * Add information about KMS instance ID to footer metadata. Then on file reading, KMS instance ID doesn't have to be provided in properties, but can be read from the metadata. Add RemoteKmsClient abstract class to assist implementing KmsClients for remote KMSs that are accessed via URL. Make DoubleWrappedKeyManager inherit from WrappedKeyManager and make FileKeyManager an abstract class. Add a static factory method to FileKeyManager to initialize an appropriate KmsClient and key manager. KMS URL should be specified in properties either directly or in a list. KMS instance ID is either default, or should be specified in properties or read from footer metadata. * major update - key rotation, crypto factory, etc * Change caches of EnvelopeKeyManager and EnvelopeKeyRetriever to be per token. KmsClient is per token, and the read/write KEK caches too. Add a default token value for InMemoryKMS, which has no tokens. Use ConcurrentHashMap for caches with computeIfAbsent. Add expiration to the caches - both time-based and on-demand. On expiration, delete the per-token entries from caches. Add a method for cache invalidation per token. Add abstract methods to be implemented by RemoteKmsClients. * add in-memory KMS * Change RemoteKmsClient exceptions to IOException instead of the higher-level ParquetCryptoRuntimeException. Change constant names to uppercase. * Add sample VaultClient. * interface changes * Add okHttp3 dependency for VaultClient sample. * wrapping changes * Use JSON serialization for key material.
* separate write and read path, update caching * improved refactoring * key rotation improvements * Add TestPropertiesDrivenEncryption * get and refresh token for all KMS clients * minor changes * Use ConcurrentHashMap for caches * caching and store updates * Rename some encryption/decryption configurations and make the test parameterized to test combinations of isKeyMaterialExternalStorage, isDoubleWrapping, isWrapLocally. Add RemoteKmsClient mock for remote wrapping. * add removeCacheEntriesForAllTokens * Make common method setCommonKMSProperties and extract classname strings from classes * Change TestPropertiesDrivenEncryption to accommodate latest API changes. * Remove StringUtils * address review comments * key material documentation * Boolean objects Co-authored-by: Maya Anderson <mayaa@il.ibm.com>
* PARQUET-1228: Format Structures encryption (#613) * PARQUET-1286: Crypto package (#614) * PARQUET-1818: Fix bloom/encryption collision in format-structures (#771) * PARQUET-1817: Crypto Properties Factory (#769) * PARQUET-1229: Parquet MR encryption (#776) * PARQUET-1807: Encryption: Interop and Function test suite for Java version (#782) * PARQUET-1373: Encryption key tools (#615) Co-authored-by: shangxinli <31421745+shangxinli@users.noreply.github.com> Co-authored-by: Maya Anderson <mayaa@il.ibm.com>
This PR is the C++ implementation of the Parquet key tools, based on [the Java implementation](apache/parquet-java#615) and [the design doc](https://docs.google.com/document/d/1boH6HPkG0ZhgxcaRkGk3QpZ8X_J91uXZwVGwYN45St4/edit?usp=sharing). The major parts of this PR are: * a higher level of encryption/decryption configuration, including key management configuration * KMS connection configuration * abstract classes KmsClient and KmsClientFactory * a PropertiesDrivenCryptoFactory class to convert the above configurations to the lower-level FileEncryptionProperties and FileDecryptionProperties * a unit test using InMemoryKms (a sample KmsClient). Compared to the Java version, this C++ pull request doesn't contain external storage of key material using the Hadoop file system (only storing key material internally in the Parquet file is supported for now). The reason is a lack of understanding of the Hadoop file system; it can be implemented later in another pull request. Thanks! Closes #8023 from thamht4190/arrow-9318-encryption-key-management Lead-authored-by: Ha Thi Tham <thamht01188@gmail.com> Co-authored-by: Antoine Pitrou <antoine@python.org> Signed-off-by: Antoine Pitrou <antoine@python.org>