Skip to content

PARQUET-1884: Encryption branch merge in master#800

Merged
gszadovszky merged 7 commits intomasterfrom
encryption
Jul 29, 2020
Merged

PARQUET-1884: Encryption branch merge in master#800
gszadovszky merged 7 commits intomasterfrom
encryption

Conversation

@ggershinsky
Copy link
Copy Markdown
Contributor

Merges encryption feature

@gszadovszky
Copy link
Copy Markdown
Contributor

@ggershinsky, could you resolve the conflicts?

ggershinsky and others added 7 commits July 28, 2020 12:32
…rsion (#782)

Add a test for writing and reading parquet in a number of encryption
and decryption configurations.
Add interop test that reads files from parquet-testing GitHub
repository, that were written by parquet-cpp.
This adds parquet-testing repo as a submodule.
* comments
* update key tools
* double wrap for minimizing KMS calls
* Add information about KMS instance ID to footer metadata. Then on file reading, KMS instance ID doesn't have to be provided in properties, but can be read from the metadata.
Add RemoteKmsClient abstract class to assist implementing KMSClients for remote KMSs, that are accessed using URL.
Make DoubleWrappedKeyManager inherit from WrappedKeyManager and make FileKeyManager an abstract class. Add a static factory method to FileKeyManager to initialize an appropriate KSMClient and Key manager.
KMS URL should be specified in properties either directly or in a list. KMS instance ID is either default, or should be specified in properties or read from footer metadata.
* major update - key rotation, crypto factory, etc
* Change caches of EnvelopeKeyManager and EnvelopeKeyRetriever to be per token.
KmsClient is per token and read/write KEK caches too.
Add default token value for InMemoryKMS, which has no tokens.
Use concurrentHashMap for caches with computeIfAbsent.
Add expiration using to the caches - both time-based and on-demand.
On expiration delete the per-token entries from caches.
Add method for cache invalidation per token.
Add abstract methods to be implemented by RemoteKmsClients.
* add in-memory KMS
* Change RemoteKmsClient exceptions to IOException
instead of the higher-level ParquetCryptoRuntimeException.
Change to constant names to uppercase.
* Add sample VaultClient.
* interface changes
* Add okHttp3 dependency for VaultClient sample.
* wrapping changes
* Use JSON serialization for key material.
* separate write and read path, update caching
* improved refactoring
* key rotation improvements
* Add TestPropertiesDrivenEncryption
* get and resfresh token for all KMS clients
* minor changes
* Use ConcurrentHashMap for caches
* caching and store updates
* Rename some encryption/decryption configurations and make the test parameterized
to test combinations of isKeyMaterialExternalStorage, isDoubleWrapping, isWrapLocally.
Add RemoteKmsClient mock for remote wrapping.
* add removeCacheEntriesForAllTokens
* Make common method setCommonKMSProperties and extract classname strings from classes
* Change TestPropertiesDrivenEncryption to accomodate latest API changes.
* Remove StringUtils
* address review comments
* key material documentation
* Boolean objects

Co-authored-by: Maya Anderson <mayaa@il.ibm.com>
Copy link
Copy Markdown
Contributor

@gszadovszky gszadovszky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A have some minor findings.

One more thing that we failed to check before. It is the CRC calculation. (See TestDataPageV1Checksums and the thrift file for details.) How does the CRC work in case of encrypted pages? Is the definition in the thrift file still stand (CRC is calculated after compression but before encryption)?
At this point I am not requesting any additional implementation but we shall check at least that the CRC calculation also works fine with encryption.

@@ -1,4 +1,5 @@
language: java
jdk: openjdk8
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

jdk is already listed, this one is not required

---

**Property:** `parquet.crypto.factory.class`
**Description:** Class implementing EncryptionPropertiesFactory.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, add double spaces to the end of the lines to enforce newlines. (Maybe it was my mistake during the rebase that they were removed.)
(You may check the result by reviewing the rendering of the file on the github UI.)

@ggershinsky
Copy link
Copy Markdown
Contributor Author

A have some minor findings.

One more thing that we failed to check before. It is the CRC calculation. (See TestDataPageV1Checksums and the thrift file for details.) How does the CRC work in case of encrypted pages? Is the definition in the thrift file still stand (CRC is calculated after compression but before encryption)?
At this point I am not requesting any additional implementation but we shall check at least that the CRC calculation also works fine with encryption.

Per https://github.com/apache/parquet-mr/pull/800/files#diff-d007a18083a2431c30a5416f248e0a4bR169 and https://github.com/apache/parquet-mr/pull/800/files#diff-d007a18083a2431c30a5416f248e0a4bR183, CRC is calculated after compression and encryption of page bytes. Which makes sense, because these are the bytes that are written to the disk. The Thrift comment says CRC is calculated after compression (doesn't mention encryption), but we should have modified this comment. One more point - AEG_GCM mode verifies page integrity (in CPU hardware), so CRC is not needed in this encryption mode.

@ggershinsky
Copy link
Copy Markdown
Contributor Author

AEG_GCM mode verifies page integrity (in CPU hardware), so CRC is not needed in this encryption mode.

Btw, this is the default encryption mode. In Java 9 and later, there is no reason to use encryption and CRC together.

@gszadovszky
Copy link
Copy Markdown
Contributor

Currently CRC check is configurable by hadoop conf parquet.page.verify-checksum.enabled which is disabled by default. I accept there is no reason to use both encryption and crc together but as the user can configure that we shall ensure it works properly.

@ggershinsky
Copy link
Copy Markdown
Contributor Author

Per the sync discussion,

  • no reason to use both encryption and CRC even in Java 8 or older (GCM overhead is due to integrity verification; same issue as with CRC)
  • I'll run a quick test (a variation of the TestDataPageV1Checksums), to make sure Parquet works properly if a user turns both features on, as allowed by the API/config (no need to forbid this combination)

@ggershinsky
Copy link
Copy Markdown
Contributor Author

@gszadovszky TestDataPageV1Checksums didn't work out, because it uses an old parquet API, with deprecated methods we've decided not to extend for encryption. So instead, I've run the ColumnIndexFiltering test, that uses the valid parquet API. I've modified it to add encryption/decryption, and enablePageWriteChecksum()/usePageChecksumVerification(true) to this test, everything worked ok.

Regarding the travis.yaml and hadoop/readme.md changes - should I send a pull request to the encryption branch, or to the master after this pr is merged?

@gszadovszky
Copy link
Copy Markdown
Contributor

@ggershinsky, thanks a lot for testing it.

Sorry, I've forgot that this one is not a simple PR that allows easy fixing of the findings. Please, create a JIRA about these two and fix it on master.
I'm merging this PR.

@gszadovszky gszadovszky merged commit 65b95fb into master Jul 29, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants