Skip to content

GH-39444: [C++][Parquet] Fix crash in Modular Encryption#39623

Merged
pitrou merged 2 commits intoapache:mainfrom
tolleybot:issue-39444-dataset-encryption
Mar 13, 2024
Merged

GH-39444: [C++][Parquet] Fix crash in Modular Encryption#39623
pitrou merged 2 commits intoapache:mainfrom
tolleybot:issue-39444-dataset-encryption

Conversation

@tolleybot
Copy link
Copy Markdown
Contributor

@tolleybot tolleybot commented Jan 15, 2024

Rationale for this change:

This pull request addresses a critical issue (GH-39444) in the C++/Python components of Parquet, specifically a segmentation fault occurring when processing encrypted datasets over 2^15 rows. The fix involves modifications in cpp/src/parquet/encryption/internal_file_decryptor.cc, particularly in InternalFileDecryptor::GetColumnDecryptor. The caching of the Decryptor object was removed to resolve the multithreading issue causing the segmentation fault and encryption failures.

What changes are included in this PR?

  • Removal of Decryptor object caching in InternalFileDecryptor::GetColumnDecryptor.
  • Addition of two unit tests: large_row_parquet_encrypt_test.cc for C++ and an update to test_dataset_encryption.py with test_large_row_encryption_decryption for Python.

Are these changes tested?

Yes, the unit tests (large_row_parquet_encrypt_test.cc and test_large_row_encryption_decryption in test_dataset_encryption.py) have been added to ensure the reliability and effectiveness of these changes.

Are there any user-facing changes?

No significant user-facing changes, but the update significantly improves the backend stability and reliability of Parquet file handling. Calling DecryptionKeyRetriever::GetKey could be an expensive operation potentially involving network calls to key management servers.

@github-actions
Copy link
Copy Markdown

Thanks for opening a pull request!

If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

In the case of PARQUET issues on JIRA the title also supports:

PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

See also:

@kou kou changed the title Segmentation Fault in Modular Encryption GH-39444: [C++][Parquet] Fix Segmentation Fault in Modular Encryption Segmentation Fault in Modular Encryption Jan 16, 2024
@github-actions
Copy link
Copy Markdown

⚠️ GitHub issue #39444 has been automatically assigned in GitHub to PR creator.

@tolleybot
Copy link
Copy Markdown
Contributor Author

Looks like I am still getting a segmentation fault randomly in InternalFileDecryptor when WipeOut is called. Ill look into seeing if I can add a mutex around all_decryptors_

@tolleybot
Copy link
Copy Markdown
Contributor Author

It looks like all but the C++ / AMD64 Windows 2019 C++17 AVX2 build was completed without issue. I am not sure what the issue is with the C++ / AMD64 Windows 2019 C++17 AVX2, build but it seems unrelated.

Copy link
Copy Markdown
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the thorough investigation and quick fix!

cc @emkornfield @pitrou

@wgtmac wgtmac requested a review from mapleFU January 17, 2024 02:19
@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Jan 17, 2024
@wgtmac
Copy link
Copy Markdown
Member

wgtmac commented Jan 29, 2024

The C++ side looks good to me. I have also refactored the C++ tests.

Could you take a look as well? @jorisvandenbossche @westonpace @pitrou

@wgtmac wgtmac changed the title GH-39444: [C++][Parquet] Fix Segmentation Fault in Modular Encryption Segmentation Fault in Modular Encryption GH-39444: [C++][Parquet] Fix Segment Fault in Modular Encryption Feb 27, 2024
@github-actions github-actions bot added awaiting changes Awaiting changes awaiting change review Awaiting change review and removed awaiting committer review Awaiting committer review awaiting changes Awaiting changes labels Feb 28, 2024
@wgtmac
Copy link
Copy Markdown
Member

wgtmac commented Mar 12, 2024

@adamreeve @westonpace Do you want to take another pass?

Copy link
Copy Markdown
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wgtmac no additional concerns on my end

@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting change review Awaiting change review labels Mar 12, 2024
Copy link
Copy Markdown
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@pitrou
Copy link
Copy Markdown
Member

pitrou commented Mar 12, 2024

@github-actions crossbow submit -g python -g cpp

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this will work on builds where Parquet is disabled. Let's watch for CI.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok sounds good

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might not have an actual test build without parquet .. (but indeed in general we always import pyarrow.parquet within a try/except to let this run on versions without parquet support as well)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I thought we did...

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was just deducting that from the fact we got no failures in the builds you triggered.
But looking at our tasks.yml, it seems the python minimal builds are not included in the "python" group ...

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be why :-(

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Testing and potentially fixing in #40505 (and at once adding the minimal tests to the python group)

@github-actions

This comment was marked as outdated.

@pitrou
Copy link
Copy Markdown
Member

pitrou commented Mar 12, 2024

Question: the caching was probably there for a reason, is it ok to remove it? Would it be beneficial to reimplement it in a thread-safe way? (the PR description mentions network requests)

@pitrou
Copy link
Copy Markdown
Member

pitrou commented Mar 12, 2024

Hmm, I was assuming the Cython compatibility issues were fixed? @raulcd

@raulcd
Copy link
Copy Markdown
Member

raulcd commented Mar 12, 2024

Hmm, I was assuming the Cython compatibility issues were fixed? @raulcd

That was my understanding, isn't this the fix? #40386
I've added the above to 15.0.2.

@pitrou
Copy link
Copy Markdown
Member

pitrou commented Mar 12, 2024

Ah, that's because this PR needs rebasing. @tolleybot Can you please rebase/merge on git main?

…void issues in multithreaded environment. Removed unsed column_metatdata_map_ and column_data_map_ class attributes

Updates to include new test for python and c++ that test a large number of rows used for encryption and decryption of a dataset

adding header to unit test

adding a mutex to guard EVP_CIPHER_CTX in AesDecryptorImpl

update for formatting

removed warning concerning type conversion from double to int64

adding a mutex to guard EVP_CIPHER_CTX in AesDecryptorImpl.  My previous commit actually did the same for AesEncryptorImpl

Moved mutex to InternalFileDecryptor for protection of :vector<std::weak_ptr<encryption::AesDecryptor>> all_decryptors_.  This seems to have fixed the issue

Update cpp/src/parquet/encryption/internal_file_decryptor.cc

Co-authored-by: Gang Wu <ustcwg@gmail.com>

removed unneeded #include <mutex>

This commit includes the following changes to the InternalFileDecryptor class:

- Moved the `mutex_` declaration closer to `all_decryptors_`. This improves the readability of the code and makes it clear that `mutex_` is primarily used for protecting the `all_decryptors_` vector.

- Made `mutex_` mutable. This allows the mutex to be locked in `const` member functions

- Added a `GUARDED_BY(mutex_)` comment to `all_decryptors_`. This explicitly documents that access to `all_decryptors_` is protected by `mutex_`

Removed lock in InternalFileDecryptor::WipeOutDecryptionKey().
Made mutex mutable.

Update python/pyarrow/tests/test_dataset_encryption.py

Co-authored-by: Adam Reeve <adreeve@gmail.com>

Update python/pyarrow/tests/test_dataset_encryption.py

Co-authored-by: Adam Reeve <adreeve@gmail.com>

- adding a block to limit the scope of our lock from lines 147-152 in internal_file_decryptor.cc
- changed #define ROW_COUNT 32769 to use constexpr in large_row_parquet_encrypt_test.cc

simplify tests

included a lock in WipeOutDecryptionKeys

Update python/pyarrow/tests/test_dataset_encryption.py

Co-authored-by: Antoine Pitrou <pitrou@free.fr>

removing unneeded init
@tolleybot tolleybot force-pushed the issue-39444-dataset-encryption branch from 8677414 to c8f2cc4 Compare March 12, 2024 17:55
@pitrou
Copy link
Copy Markdown
Member

pitrou commented Mar 12, 2024

@github-actions crossbow submit -g python -g cpp

@github-actions
Copy link
Copy Markdown

Revision: c8f2cc4

Submitted crossbow builds: ursacomputing/crossbow @ actions-c569d19f76

Task Status
test-alpine-linux-cpp GitHub Actions
test-build-cpp-fuzz GitHub Actions
test-conda-cpp GitHub Actions
test-conda-cpp-valgrind Azure
test-conda-python-3.10 GitHub Actions
test-conda-python-3.10-cython2 GitHub Actions
test-conda-python-3.10-hdfs-2.9.2 GitHub Actions
test-conda-python-3.10-hdfs-3.2.1 GitHub Actions
test-conda-python-3.10-pandas-latest GitHub Actions
test-conda-python-3.10-pandas-nightly GitHub Actions
test-conda-python-3.10-spark-v3.5.0 GitHub Actions
test-conda-python-3.10-substrait GitHub Actions
test-conda-python-3.11 GitHub Actions
test-conda-python-3.11-dask-latest GitHub Actions
test-conda-python-3.11-dask-upstream_devel GitHub Actions
test-conda-python-3.11-hypothesis GitHub Actions
test-conda-python-3.11-pandas-upstream_devel GitHub Actions
test-conda-python-3.11-spark-master GitHub Actions
test-conda-python-3.12 GitHub Actions
test-conda-python-3.8 GitHub Actions
test-conda-python-3.8-pandas-1.0 GitHub Actions
test-conda-python-3.8-spark-v3.5.0 GitHub Actions
test-conda-python-3.9 GitHub Actions
test-conda-python-3.9-pandas-latest GitHub Actions
test-cuda-cpp GitHub Actions
test-cuda-python GitHub Actions
test-debian-11-cpp-amd64 GitHub Actions
test-debian-11-cpp-i386 GitHub Actions
test-debian-11-python-3-amd64 Azure
test-debian-11-python-3-i386 GitHub Actions
test-fedora-39-cpp GitHub Actions
test-fedora-39-python-3 Azure
test-ubuntu-20.04-cpp GitHub Actions
test-ubuntu-20.04-cpp-bundled GitHub Actions
test-ubuntu-20.04-cpp-minimal-with-formats GitHub Actions
test-ubuntu-20.04-cpp-thread-sanitizer GitHub Actions
test-ubuntu-20.04-python-3 Azure
test-ubuntu-22.04-cpp GitHub Actions
test-ubuntu-22.04-cpp-20 GitHub Actions
test-ubuntu-22.04-cpp-no-threading GitHub Actions
test-ubuntu-22.04-python-3 GitHub Actions
test-ubuntu-24.04-cpp GitHub Actions
test-ubuntu-24.04-cpp-gcc-14 GitHub Actions

Copy link
Copy Markdown
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot @tolleybot . I'm going to merge after the latest update.

@pitrou pitrou changed the title GH-39444: [C++][Parquet] Fix Segment Fault in Modular Encryption GH-39444: [C++][Parquet] Fix crash in Modular Encryption Mar 13, 2024
@pitrou pitrou merged commit dd6d728 into apache:main Mar 13, 2024
@pitrou pitrou removed the awaiting merge Awaiting merge label Mar 13, 2024
@pitrou
Copy link
Copy Markdown
Member

pitrou commented Mar 13, 2024

@raulcd Not a blocker, but if you ever need to do a RC3, this would be a good candidate for inclusion :-)

@conbench-apache-arrow
Copy link
Copy Markdown

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit dd6d728.

There were 8 benchmark results indicating a performance regression:

The full Conbench report has more details. It also includes information about 3 possible false positives for unstable benchmarks that are known to sometimes produce them.

jorisvandenbossche added a commit that referenced this pull request Mar 14, 2024
Small follow-up on #39623 fixing am import in the test.

But first testing here if it actually fails.

* GitHub Issue: #39444

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[C++][Parquet] Segmentation fault reading modular encrypted Parquet dataset over 2^15 rows

8 participants