[C++][Parquet] Segmentation fault reading modular encrypted Parquet dataset over 2^15 rows #39444
Description
Version: pyarrow==14.0.2
Platform: Linux 5.15.x x86_64 GNU/Linux
I have been trying out the new capability introduced in Arrow 14 which allows modular encryption of Parquet files to be used alongside the generic Dataset API, but began to experience segmentation faults in C++ Arrow code when working with real, partitioned datasets rather than toy examples.
The error is most often a segmentation fault, but I have seen all of these:
Segmentation fault
OSError: Failed decryption finalization
OSError: Couldn't set AAD
After some trial and error, generating larger and larger toy examples until the error was reliably hit, I discovered that the threshold is reached when the data initially written to the dataset on disk with modular encryption exceeds 2^15 rows. In testing, 2^15 rows never faulted, but 2^15 + 1 rows often fault (occasionally they do not).
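The trial-and-error narrowing described above amounts to a bisection over row counts. A generic sketch (the fails(n) predicate is a hypothetical stand-in for the actual write-then-read test, and the bisection assumes the failure is monotone in n, which the occasional non-fault at 2^15 + 1 shows is only approximately true in practice):

```python
def find_threshold(fails, lo, hi):
    """Smallest n in (lo, hi] for which fails(n) is True, assuming fails
    is monotone: False for all n <= some threshold, True above it.
    Invariant: fails(lo) is False and fails(hi) is True throughout."""
    while lo + 1 < hi:
        mid = (lo + hi) // 2
        if fails(mid):
            hi = mid
        else:
            lo = mid
    return hi

# With the observed behaviour -- no fault at 2**15 rows, a fault above
# that -- the threshold found is 2**15 + 1:
print(find_threshold(lambda n: n > 2**15, 0, 2**17))  # 32769
```

In practice fails(n) would write an n-row table with modular encryption and attempt to read it back via the Dataset API, so each probe is expensive; bisection keeps the number of probes logarithmic.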
The backtrace when triggering the fault or error always ends in the same function:
#9 0x00007f9b726625f2 in parquet::encryption::AesDecryptor::AesDecryptorImpl::GcmDecrypt(unsigned char const*, int, unsigned char const*, int, unsigned char const*, int, unsigned char*) [clone .cold] () from venv/lib/python3.10/site-packages/pyarrow/libparquet.so.1400
#10 0x00007f9b7270923e in parquet::Decryptor::Decrypt(unsigned char const*, int, unsigned char*) () from venv/lib/python3.10/site-packages/pyarrow/libparquet.so.1400
#11 0x00007f9b726f569b in void parquet::ThriftDeserializer::DeserializeMessage<parquet::format::ColumnMetaData>(unsigned char const*, unsigned int*, parquet::format::ColumnMetaData*, parquet::Decryptor*) () from venv/lib/python3.10/site-packages/pyarrow/libparquet.so.1400
#12 0x00007f9b726f8f34 in parquet::ColumnChunkMetaData::ColumnChunkMetaDataImpl::ColumnChunkMetaDataImpl(parquet::format::ColumnChunk const*, parquet::ColumnDescriptor const*, short, short, parquet::ReaderProperties const&, parquet::ApplicationVersion const*, std::shared_ptr<parquet::InternalFileDecryptor>) () from venv/lib/python3.10/site-packages/pyarrow/libparquet.so.1400
Corresponding to this source: https://github.com/apache/arrow/blob/main/cpp/src/parquet/encryption/encryption_internal.cc#L453
I found a previous fix in this area, whose associated issue describes the exact same symptoms: 88bccab
However, AesDecryptor::AesDecryptorImpl::GcmDecrypt() and AesDecryptor::AesDecryptorImpl::CtrDecrypt() still use the ctx_ member of type EVP_CIPHER_CTX from OpenSSL, which must not be used from multiple threads concurrently.
So I was suspicious that it could be a multi-threading issue during read or write.
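For context, the hazard being suspected here can be sketched in Python (a toy stand-in for illustration only; the real analogue in the C++ code is OpenSSL's EVP_CIPHER_CTX, and this lock-guard pattern is not Arrow's actual fix):

```python
import threading

class GuardedCipherCtx:
    """Toy stand-in for a cipher context that is not safe for
    concurrent use. A lock serializes access; without it, two threads
    could interleave updates to the context's internal state
    mid-operation and corrupt the output."""

    def __init__(self):
        self._lock = threading.Lock()
        self.calls = 0  # stands in for mutable internal cipher state

    def decrypt(self, ciphertext: bytes) -> bytes:
        with self._lock:
            self.calls += 1
            return ciphertext  # no-op "decryption" for illustration
```

Sharing one such context across scanner threads without the lock is the kind of misuse that produces intermittent corruption rather than a deterministic error, which would fit the mix of segfaults and decryption-finalization failures seen here.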
Whilst attempting to narrow down the cause (and whether the root cause occurs during writing or reading), I made the following observations:
- Writing 2^15 + 1 rows, deleting half the dataset, and then reading in the full dataset still encounters the error
- Writing 2^15 + 1 rows, and filtering to half the partitions in the dataset when reading still encounters the error
- Writing 2^15 + 1 rows, then decrypting each individual Parquet file in the dataset with modular encryption and concatenating them never failed; only reading them as a full dataset did
- Writing or reading (or both) with use_threads=False still encounters the error
The fact that the issue still occurs after the dataset on disk is halved and re-read suggested corruption during write, but the fact that every individual Parquet file remains independently readable suggests an issue with modular decryption during Dataset operations. That the issue persists with threading disabled was unexpected.
The error is reproducible with any random data, and with a KMS client that does no encryption at all and simply passes the keys around in (encoded) plaintext, eliminating our custom KMS client as a probable cause. It occurs whether or not the footer is plaintext, and whether or not envelope encryption is used.
Reproduction in Python here (no partitions needed at all, so it produces a single Parquet file, which can be read normally):
import base64
import numpy as np
import tempfile
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds
import pyarrow.parquet.encryption as pqe
class NoOpKmsClient(pqe.KmsClient):
    def __init__(self):
        super().__init__()

    def wrap_key(self, key_bytes: bytes, _: str) -> bytes:
        b = base64.b64encode(key_bytes)
        return b

    def unwrap_key(self, wrapped_key: bytes, _: str) -> bytes:
        b = base64.b64decode(wrapped_key)
        return b

row_count = pow(2, 15) + 1
table = pa.Table.from_arrays([pa.array(np.random.rand(row_count), type=pa.float32())], names=["foo"])

kms_config = pqe.KmsConnectionConfig()
crypto_factory = pqe.CryptoFactory(lambda _: NoOpKmsClient())
encryption_config = pqe.EncryptionConfiguration(
    footer_key="UNIMPORTANT_KEY",
    column_keys={"UNIMPORTANT_KEY": ["foo"]},
    double_wrapping=True,
    plaintext_footer=False,
    data_key_length_bits=128,
)
pqe_config = ds.ParquetEncryptionConfig(crypto_factory, kms_config, encryption_config)
pqd_config = ds.ParquetDecryptionConfig(crypto_factory, kms_config, pqe.DecryptionConfiguration())
scan_options = ds.ParquetFragmentScanOptions(decryption_config=pqd_config)
file_format = ds.ParquetFileFormat(default_fragment_scan_options=scan_options)
write_options = file_format.make_write_options(encryption_config=pqe_config)
file_decryption_properties = crypto_factory.file_decryption_properties(kms_config)

with tempfile.TemporaryDirectory() as tempdir:
    path = tempdir + "/test-dataset"
    ds.write_dataset(table, path, format=file_format, file_options=write_options)

    # Reading the single file directly succeeds.
    file_path = path + "/part-0.parquet"
    new_table = pq.ParquetFile(file_path, decryption_properties=file_decryption_properties).read()
    assert table == new_table

    # Reading via the Dataset API triggers the fault.
    dataset = ds.dataset(path, format=file_format)
    new_table = dataset.to_table()
    assert table == new_table

Any help here would be much appreciated: being restricted to 2^15 rows is a roadblock for our use of this feature.
Component(s)
Parquet, Python