Skip to content

PARQUET-1373: Encryption key tools#615

Merged
gszadovszky merged 36 commits intoapache:encryptionfrom
ggershinsky:p1373-encryption-key-tools
Jul 6, 2020
Merged

PARQUET-1373: Encryption key tools#615
gszadovszky merged 36 commits intoapache:encryptionfrom
ggershinsky:p1373-encryption-key-tools

Conversation

@ggershinsky
Copy link
Copy Markdown
Contributor

@ggershinsky ggershinsky commented Feb 11, 2019

No description provided.

@ggershinsky ggershinsky changed the title commit to encryption branch PARQUET-1373: Encryption key tools Feb 11, 2019
@ggershinsky ggershinsky force-pushed the p1373-encryption-key-tools branch from 48a109a to a681c00 Compare May 3, 2020 10:50
@gszadovszky
Copy link
Copy Markdown
Contributor

@ggershinsky, is it still in draft state?

@ggershinsky
Copy link
Copy Markdown
Contributor Author

the implementation is complete. with the recent addition of a unitest, the draft state can be switched off. I'll rebase it first, to make sure the Travis checks pass.

@ggershinsky ggershinsky force-pushed the p1373-encryption-key-tools branch from 2fc82e0 to 3554037 Compare June 4, 2020 08:14
@ggershinsky
Copy link
Copy Markdown
Contributor Author

@ggershinsky, is it still in draft state?

@gszadovszky The pr is completed, ready for a review

Copy link
Copy Markdown
Contributor

@gszadovszky gszadovszky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed the most of it but still have some left...

In general I would expect javadoc comments for API public classes/methods. Also, it would be nice to solve the acronyms where used even if they are well known in the crypto world.

There are a couple of concurrent code parts. We shall test somehow that the code is thread-safe. I know, it is not easy but if you execute the related code paths with several parallel threads (e.g. encrypting/decrypting several files in multiple threads) we might find some issues.

Some thoughts that might not tightly related to this PR but to the encryption feature:

I am not familiar with encryption. It would be nice if someone that has crypto experience could also sign this (or the final PR to master) even if she/he is not a committer.

As far as I remember, there were a couple of ongoing PRs related to encryption. Please, update/rebase them if they are required or close them if not.

@ggershinsky
Copy link
Copy Markdown
Contributor Author

There are a couple of concurrent code parts. We shall test somehow that the code is thread-safe. I know, it is not easy but if you execute the related code paths with several parallel threads (e.g. encrypting/decrypting several files in multiple threads) we might find some issues.

we'll make the PR test to run multiple threads, writing a number of files in parallel (same for reading).

Gidon Gershinsky and others added 20 commits June 14, 2020 12:30
…e reading, KMS instance ID doesn't have to be provided in properties, but can be read from the metadata.

Add RemoteKmsClient abstract class to assist implementing KMSClients for remote KMSs, that are accessed using URL.
Make DoubleWrappedKeyManager inherit from WrappedKeyManager and make FileKeyManager an abstract class. Add a static factory method to FileKeyManager to initialize an appropriate KSMClient and Key manager.
KMS URL should be specified in properties either directly or in a list. KMS instance ID is either default, or should be specified in properties or read from footer metadata.
…r token.

KmsClient is per token and read/write KEK caches too.
Add default token value for InMemoryKMS, which has no tokens.
Use concurrentHashMap for caches with computeIfAbsent.
Add expiration using to the caches - both time-based and on-demand.
On expiration delete the per-token entries from caches.
Add method for cache invalidation per token.

Add abstract methods to be implemented by RemoteKmsClients.
instead of the higher-level ParquetCryptoRuntimeException.
Change to constant names to uppercase.
ggershinsky and others added 5 commits June 14, 2020 12:30
…s/FileKeyUnwrapper.java

Co-authored-by: Gabor Szadovszky <gabor@apache.org>
…s/FileKeyUnwrapper.java

Co-authored-by: Gabor Szadovszky <gabor@apache.org>
…s/FileKeyUnwrapper.java

Co-authored-by: Gabor Szadovszky <gabor@apache.org>
…s/FileKeyWrapper.java

Co-authored-by: Gabor Szadovszky <gabor@apache.org>
…s/KeyToolkit.java

Co-authored-by: Gabor Szadovszky <gabor@apache.org>
@ggershinsky ggershinsky force-pushed the p1373-encryption-key-tools branch from 4404ff3 to 75d8976 Compare June 14, 2020 09:37
Copy link
Copy Markdown
Contributor

@gszadovszky gszadovszky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As for the VaultClient - one option is to move it also to the tests location, commenting out the HTTP calls so the okhttp3 dependency can be removed from the pom. The comments will explain the code and how to build it if needed. The other option is to remove VaultClient altogether from apache/parquet-mr (I can keep it eg in my fork, as a pointer if anybody's interested to see a sample). What do you think?

If you move VaultClient to test, okhttp3 will be a test dependency. I'm fine with that.


ConcurrentMap has a segment synchronization for write operations, and allows for synchronization-free read operations; this makes it faster than HasMap with synchronized methods.

Yes, I know how ConcurrentHashMap works. What I wanted to say that you are using synchronization as well. As you already use a ConcurrentMap you might implement these synchronized code parts by using the methods of ConcurrentMap. I've put some examples that might work.


Please, check why Travis fails.


Another point of view came up about handling sensitive data in memory. Java does not clean memory after garbage collecting objects. It means that sensitive data must be manually cleaned after used otherwise it might get compromised by another java application in the same jvm or even by another process after the jvm exists. Because of the same reason String objects shall never contain sensitive information as the char[] behind the object might not get garbage collected after the String object itself gets dropped.
I did not find any particular bad practice in the code or any examples of the listed situations just wanted to highlight that we shall think about this as well.

@ggershinsky
Copy link
Copy Markdown
Contributor Author

ggershinsky commented Jun 15, 2020

ConcurrentMap has a segment synchronization for write operations, and allows for synchronization-free read operations; this makes it faster than HasMap with synchronized methods.

Yes, I know how ConcurrentHashMap works. What I wanted to say that you are using synchronization as well. As you already use a ConcurrentMap you might implement these synchronized code parts by using the methods of ConcurrentMap. I've put some examples that might work.

Sounds good, and thank you for the examples! We've already applied your code (not pushed yet), it indeed allowed to remove the explicit synchronization, making the cache implementation cleaner and faster.

Please, check why Travis fails.

Sorry, should have mentioned that it will take a few more commits to fully address this round of the comments (and fix the unitests). Once all commits are in, I will squash them to simplify the review, and will post a comment here.

Another point of view came up about handling sensitive data in memory. Java does not clean memory after garbage collecting objects. It means that sensitive data must be manually cleaned after used otherwise it might get compromised by another java application in the same jvm or even by another process after the jvm exists. Because of the same reason String objects shall never contain sensitive information as the char[] behind the object might not get garbage collected after the String object itself gets dropped.
I did not find any particular bad practice in the code or any examples of the listed situations just wanted to highlight that we shall think about this as well.

Yep, keeping secret data in Java strings is a notorious problem. I think the general consensus is not to rely on gc or explicit byte wiping - but to remember that these Java processes must run in a trusted environment anyway, simply because they work with confidential information, ranging from the encryption keys to the sensitive data itself. Micro-managing the memory with confidential information is always hard, and is basically impossible with Java. It goes beyond Parquet. One example - the KMS Client implementations send secret tokens and fetch explicit encryption keys, using a custom HTTP library. There is no guarantee this library doesn't use strings (most likely, it does). Another example - the secret tokens are passed as a Hadoop property from Spark or another framework; this is likely to be implemented with strings. Moreover, the tokens are built in an access control system, then sent to a user, then sent to a Spark driver, then sent to Spark workers (or other framework components) - there is no way to control this, except to rely on HTTPS for the transport security, and on running framework drivers/workers in a trusted environment for the memory security.

In other words, our threat model is simple. We don't trust the storage - encrypted Parquet files can be accessed by malicious parties, but they won't be able to read them. We do trust the framework hosts (where the JVM runs) - if these are breached, the secret data can be stolen from any part of host memory / disc pages; not just the Parquet lib memory, but framework memory, HTTP libs, etc. Memory protection is a holy grail in this field, addressed by technologies like VMs, containers, hardware enclaves, etc, etc. Parquet encryption is focused on data-in-storage protection; data-in-memory protection is covered by other technologies.

@gszadovszky
Copy link
Copy Markdown
Contributor

Thanks a lot for explanation. Agreed on protecting sensitive data in memory cannot be handled by Parquet.

I'll wait for your comment for the next review round.

@ggershinsky ggershinsky force-pushed the p1373-encryption-key-tools branch from ac17c55 to 94c17e4 Compare June 25, 2020 13:29
@ggershinsky ggershinsky force-pushed the p1373-encryption-key-tools branch from 94c17e4 to 055bb39 Compare June 25, 2020 13:39
@ggershinsky
Copy link
Copy Markdown
Contributor Author

@gszadovszky the 055bb39 is the squash of all commits that address the latest review round

@ggershinsky
Copy link
Copy Markdown
Contributor Author

@gszadovszky the latest review round is addressed by 0355f65

Copy link
Copy Markdown
Contributor

@gszadovszky gszadovszky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for the proper documentation. One more thing that I've missed before related to the jsons.

@gszadovszky
Copy link
Copy Markdown
Contributor

Close/reopen this PR to re-trigger Travis.

@gszadovszky gszadovszky closed this Jul 1, 2020
@gszadovszky gszadovszky reopened this Jul 1, 2020
@gszadovszky gszadovszky merged commit 7b7099f into apache:encryption Jul 6, 2020
shangxinli pushed a commit to shangxinli/parquet-mr that referenced this pull request Jul 15, 2020
* comments
* update key tools
* double wrap for minimizing KMS calls
* Add information about KMS instance ID to footer metadata. Then on file reading, KMS instance ID doesn't have to be provided in properties, but can be read from the metadata.
Add RemoteKmsClient abstract class to assist implementing KMSClients for remote KMSs, that are accessed using URL.
Make DoubleWrappedKeyManager inherit from WrappedKeyManager and make FileKeyManager an abstract class. Add a static factory method to FileKeyManager to initialize an appropriate KSMClient and Key manager.
KMS URL should be specified in properties either directly or in a list. KMS instance ID is either default, or should be specified in properties or read from footer metadata.
* major update - key rotation, crypto factory, etc
* Change caches of EnvelopeKeyManager and EnvelopeKeyRetriever to be per token.
KmsClient is per token and read/write KEK caches too.
Add default token value for InMemoryKMS, which has no tokens.
Use concurrentHashMap for caches with computeIfAbsent.
Add expiration using to the caches - both time-based and on-demand.
On expiration delete the per-token entries from caches.
Add method for cache invalidation per token.
Add abstract methods to be implemented by RemoteKmsClients.
* add in-memory KMS
* Change RemoteKmsClient exceptions to IOException
instead of the higher-level ParquetCryptoRuntimeException.
Change to constant names to uppercase.
* Add sample VaultClient.
* interface changes
* Add okHttp3 dependency for VaultClient sample.
* wrapping changes
* Use JSON serialization for key material.
* separate write and read path, update caching
* improved refactoring
* key rotation improvements
* Add TestPropertiesDrivenEncryption
* get and resfresh token for all KMS clients
* minor changes
* Use ConcurrentHashMap for caches
* caching and store updates
* Rename some encryption/decryption configurations and make the test parameterized
to test combinations of isKeyMaterialExternalStorage, isDoubleWrapping, isWrapLocally.
Add RemoteKmsClient mock for remote wrapping.
* add removeCacheEntriesForAllTokens
* Make common method setCommonKMSProperties and extract classname strings from classes
* Change TestPropertiesDrivenEncryption to accomodate latest API changes.
* Remove StringUtils
* address review comments
* key material documentation
* Boolean objects

Co-authored-by: Maya Anderson <mayaa@il.ibm.com>
gszadovszky pushed a commit that referenced this pull request Jul 28, 2020
* comments
* update key tools
* double wrap for minimizing KMS calls
* Add information about KMS instance ID to footer metadata. Then on file reading, KMS instance ID doesn't have to be provided in properties, but can be read from the metadata.
Add RemoteKmsClient abstract class to assist implementing KMSClients for remote KMSs, that are accessed using URL.
Make DoubleWrappedKeyManager inherit from WrappedKeyManager and make FileKeyManager an abstract class. Add a static factory method to FileKeyManager to initialize an appropriate KSMClient and Key manager.
KMS URL should be specified in properties either directly or in a list. KMS instance ID is either default, or should be specified in properties or read from footer metadata.
* major update - key rotation, crypto factory, etc
* Change caches of EnvelopeKeyManager and EnvelopeKeyRetriever to be per token.
KmsClient is per token and read/write KEK caches too.
Add default token value for InMemoryKMS, which has no tokens.
Use concurrentHashMap for caches with computeIfAbsent.
Add expiration using to the caches - both time-based and on-demand.
On expiration delete the per-token entries from caches.
Add method for cache invalidation per token.
Add abstract methods to be implemented by RemoteKmsClients.
* add in-memory KMS
* Change RemoteKmsClient exceptions to IOException
instead of the higher-level ParquetCryptoRuntimeException.
Change to constant names to uppercase.
* Add sample VaultClient.
* interface changes
* Add okHttp3 dependency for VaultClient sample.
* wrapping changes
* Use JSON serialization for key material.
* separate write and read path, update caching
* improved refactoring
* key rotation improvements
* Add TestPropertiesDrivenEncryption
* get and resfresh token for all KMS clients
* minor changes
* Use ConcurrentHashMap for caches
* caching and store updates
* Rename some encryption/decryption configurations and make the test parameterized
to test combinations of isKeyMaterialExternalStorage, isDoubleWrapping, isWrapLocally.
Add RemoteKmsClient mock for remote wrapping.
* add removeCacheEntriesForAllTokens
* Make common method setCommonKMSProperties and extract classname strings from classes
* Change TestPropertiesDrivenEncryption to accomodate latest API changes.
* Remove StringUtils
* address review comments
* key material documentation
* Boolean objects

Co-authored-by: Maya Anderson <mayaa@il.ibm.com>
gszadovszky pushed a commit that referenced this pull request Jul 29, 2020
* PARQUET-1228: Format Structures encryption (#613)
* PARQUET-1286: Crypto package (#614)
* PARQUET-1818: Fix bloom/encryption collision in format-structures (#771)
* PARQUET-1817: Crypto Properties Factory (#769)
* PARQUET-1229: Parquet MR encryption (#776)
* PARQUET-1807: Encryption: Interop and Function test suite for Java version (#782)
* PARQUET-1373: Encryption key tools (#615)

Co-authored-by: shangxinli <31421745+shangxinli@users.noreply.github.com>
Co-authored-by: Maya Anderson <mayaa@il.ibm.com>
pitrou added a commit to apache/arrow that referenced this pull request Mar 17, 2021
This PR is C++ implementation for parquet key tool, based on [the Java implementation](apache/parquet-java#615) and [the design doc](https://docs.google.com/document/d/1boH6HPkG0ZhgxcaRkGk3QpZ8X_J91uXZwVGwYN45St4/edit?usp=sharing).

The major parts of this PR are:
* higher level of encryption/decryption configuration, including key management configuration.
* KMS connection configuration
* Abstract class KmsClient, KmsClientFactory
* PropertiesDrivenCryptoFactory class to convert these above configurations to the lower level FileEncryptionProperties, FileDecryptionProperties
* unit test using InMemoryKms (an sample of KmsClient).

Comparing to Java version, this C++ pull doesn't contain externally storing key material using hadoop file system (only storing key material internally in parquet file is supported for now). The reason is lack of understanding about Hadoop file system, can be implemented it later in another pull.

Thanks!

Closes #8023 from thamht4190/arrow-9318-encryption-key-management

Lead-authored-by: Ha Thi Tham <thamht01188@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants