PARQUET-1884: Encryption branch merge in master#800
Conversation
|
@ggershinsky, could you resolve the conflicts? |
…rsion (#782) Add a test for writing and reading parquet in a number of encryption and decryption configurations. Add interop test that reads files from parquet-testing GitHub repository, that were written by parquet-cpp. This adds parquet-testing repo as a submodule.
* comments * update key tools * double wrap for minimizing KMS calls * Add information about KMS instance ID to footer metadata. Then on file reading, KMS instance ID doesn't have to be provided in properties, but can be read from the metadata. Add RemoteKmsClient abstract class to assist implementing KMSClients for remote KMSs, that are accessed using URL. Make DoubleWrappedKeyManager inherit from WrappedKeyManager and make FileKeyManager an abstract class. Add a static factory method to FileKeyManager to initialize an appropriate KSMClient and Key manager. KMS URL should be specified in properties either directly or in a list. KMS instance ID is either default, or should be specified in properties or read from footer metadata. * major update - key rotation, crypto factory, etc * Change caches of EnvelopeKeyManager and EnvelopeKeyRetriever to be per token. KmsClient is per token and read/write KEK caches too. Add default token value for InMemoryKMS, which has no tokens. Use concurrentHashMap for caches with computeIfAbsent. Add expiration using to the caches - both time-based and on-demand. On expiration delete the per-token entries from caches. Add method for cache invalidation per token. Add abstract methods to be implemented by RemoteKmsClients. * add in-memory KMS * Change RemoteKmsClient exceptions to IOException instead of the higher-level ParquetCryptoRuntimeException. Change to constant names to uppercase. * Add sample VaultClient. * interface changes * Add okHttp3 dependency for VaultClient sample. * wrapping changes * Use JSON serialization for key material. * separate write and read path, update caching * improved refactoring * key rotation improvements * Add TestPropertiesDrivenEncryption * get and resfresh token for all KMS clients * minor changes * Use ConcurrentHashMap for caches * caching and store updates * Rename some encryption/decryption configurations and make the test parameterized to test combinations of isKeyMaterialExternalStorage, isDoubleWrapping, isWrapLocally. Add RemoteKmsClient mock for remote wrapping. * add removeCacheEntriesForAllTokens * Make common method setCommonKMSProperties and extract classname strings from classes * Change TestPropertiesDrivenEncryption to accomodate latest API changes. * Remove StringUtils * address review comments * key material documentation * Boolean objects Co-authored-by: Maya Anderson <mayaa@il.ibm.com>
gszadovszky
left a comment
There was a problem hiding this comment.
A have some minor findings.
One more thing that we failed to check before. It is the CRC calculation. (See TestDataPageV1Checksums and the thrift file for details.) How does the CRC work in case of encrypted pages? Is the definition in the thrift file still stand (CRC is calculated after compression but before encryption)?
At this point I am not requesting any additional implementation but we shall check at least that the CRC calculation also works fine with encryption.
| @@ -1,4 +1,5 @@ | |||
| language: java | |||
| jdk: openjdk8 | |||
There was a problem hiding this comment.
jdk is already listed, this one is not required
| --- | ||
|
|
||
| **Property:** `parquet.crypto.factory.class` | ||
| **Description:** Class implementing EncryptionPropertiesFactory. |
There was a problem hiding this comment.
Please, add double spaces to the end of the lines to enforce newlines. (Maybe it was my mistake during the rebase that they were removed.)
(You may check the result by reviewing the rendering of the file on the github UI.)
Per https://github.com/apache/parquet-mr/pull/800/files#diff-d007a18083a2431c30a5416f248e0a4bR169 and https://github.com/apache/parquet-mr/pull/800/files#diff-d007a18083a2431c30a5416f248e0a4bR183, CRC is calculated after compression and encryption of page bytes. Which makes sense, because these are the bytes that are written to the disk. The Thrift comment says CRC is calculated after compression (doesn't mention encryption), but we should have modified this comment. One more point - AEG_GCM mode verifies page integrity (in CPU hardware), so CRC is not needed in this encryption mode. |
Btw, this is the default encryption mode. In Java 9 and later, there is no reason to use encryption and CRC together. |
|
Currently CRC check is configurable by hadoop conf |
|
Per the sync discussion,
|
|
@gszadovszky TestDataPageV1Checksums didn't work out, because it uses an old parquet API, with deprecated methods we've decided not to extend for encryption. So instead, I've run the ColumnIndexFiltering test, that uses the valid parquet API. I've modified it to add encryption/decryption, and enablePageWriteChecksum()/usePageChecksumVerification(true) to this test, everything worked ok. Regarding the travis.yaml and hadoop/readme.md changes - should I send a pull request to the encryption branch, or to the master after this pr is merged? |
|
@ggershinsky, thanks a lot for testing it. Sorry, I've forgot that this one is not a simple PR that allows easy fixing of the findings. Please, create a JIRA about these two and fix it on master. |
Merges encryption feature