GH-43070: [C++][Parquet] Check for valid ciphertext length to prevent segfault#43071
GH-43070: [C++][Parquet] Check for valid ciphertext length to prevent segfault#43071mapleFU merged 18 commits intoapache:mainfrom
Conversation
|
|
|
Ah it looks like all the Windows builds are failing as I'm using non-exported classes in the new tests. Would it make sense to add |
| ss << "Invalid ciphertext length " << ciphertext_len << ". Expected at least " | ||
| << length_buffer_length_ + kNonceLength + kGcmTagLength << "\n"; | ||
| throw ParquetException(ss.str()); | ||
| } |
There was a problem hiding this comment.
Looking at the existing code in these methods, several things stand out:
- we don't validate ciphertext length before fetching the 4 bytes encoding the actual length (a corrupt file could perhaps have a ciphertext length < 4?)
- use of raw C arrays instead of
std::array<uint8_t>for example - why is the
ciphertext_lenargument optional in this API? this looks fickle and error-prone. - the part that extracts and validates the actual length is duplicated in the two methods
I would suggest we take the opportunity and refactor this into a cleaner and less error-prone implementation. In particular, the GCM and CTR-specific methods should probably have a mandatory ciphertext length, and would not have to bother with reading the length bytes.
@ggershinsky @thamht4190 Do we have an explanation for the very odd choices here?
There was a problem hiding this comment.
Yeah I didn't dig in too deep to find why the ciphertext_len is optional, it would be nice if that could be mandatory. But if that's not possible we should at least be able to provide the size of the buffer that the ciphertext is being read from to ensure that ciphertext_len isn't greater than this.
There was a problem hiding this comment.
It does look like we don't always know the length of the ciphertext but sometimes just an upper bound that's used to allocate the buffer. I found an example of this when reading the bloom filter header:
arrow/cpp/src/parquet/bloom_filter.cc
Lines 116 to 121 in e615a30
I'm thinking that rather than having separate arguments like ciphertext_buffer_len (required) and ciphertext_expected_len (optional), it's probably fine to make ciphertext_len required and mean the size of the buffer, so we would validate that the actual length is <= this after accounting for the 4 byte length rather than enforcing an exact match. Does that seem reasonable? (I've gone ahead and made this change but am happy to adjust the approach)
|
Another potential weak point is this: arrow/cpp/src/parquet/thrift_internal.h Lines 415 to 417 in 1da71ba and of course this part where we totally ignore the physical buffer length, letting the arrow/cpp/src/parquet/thrift_internal.h Lines 419 to 420 in 1da71ba All in all, this warrants a refactor for sanity and robustness. |
|
@pitrou, I think I've addressed all of your comments now thank you
I haven't touched the |
|
@adamreeve I just want to let you know that I'm currently sick and may not be able to review this before the next week. Thanks for doing this! |
|
OK no problem, thanks for letting me know, and I hope you're feeling better soon. |
mapleFU
left a comment
There was a problem hiding this comment.
The style looks ok but I'm not familiar with encryption
|
|
||
| #include "arrow/io/file.h" | ||
| #include "arrow/testing/gtest_compat.h" | ||
| #include "arrow/util/config.h" |
There was a problem hiding this comment.
Yes this is needed so that ARROW_WITH_SNAPPY is defined, otherwise the tests below are always skipped. This is a bit unrelated to this change but I noticed this problem when running the tests locally, and I'd come across this problem before in #40327.
wgtmac
left a comment
There was a problem hiding this comment.
Generally LGTM. Thanks for the fix!
| return reinterpret_cast<const uint8_t*>(cbytes); | ||
| } | ||
|
|
||
| inline ::arrow::util::span<const uint8_t> str2span(const std::string& str) { |
There was a problem hiding this comment.
It seems pretty common, does it make sense to relocate it to arrow/util/span.h?
There was a problem hiding this comment.
You can already construct a span from a string but that creates a span<const char>. Converting to a uint8_t span might be more specific to the encryption use case so I'm not sure about this.
|
|
||
| int Decrypt(const uint8_t* ciphertext, int ciphertext_len, const uint8_t* key, | ||
| int key_len, const uint8_t* aad, int aad_len, uint8_t* plaintext); | ||
| int Decrypt(::arrow::util::span<const uint8_t> ciphertext, |
There was a problem hiding this comment.
Should we add using ::arrow::util::span to this source file to make the signatures shorter?
There was a problem hiding this comment.
Good point, I've done that now
| std::vector<uint8_t> ciphertext(expected_ciphertext_len, '\0'); | ||
|
|
||
| int ciphertext_length = | ||
| encryptor.Encrypt(str2bytes(plain_text_), static_cast<int>(plain_text_.size()), |
There was a problem hiding this comment.
We can refactor this to use span in the follow up changes.
There was a problem hiding this comment.
Yes I've just made #43142 for this, and I will follow up after this PR
| } else { | ||
| if (ciphertext_len == 0) { | ||
| throw ParquetException("Zero ciphertext length"); | ||
| if (ciphertext.size() > static_cast<size_t>(std::numeric_limits<int>::max())) { |
There was a problem hiding this comment.
| if (ciphertext.size() > static_cast<size_t>(std::numeric_limits<int>::max())) { | |
| if (ciphertext.size() > static_cast<size_t>(std::numeric_limits<int32_t>::max())) { |
| int aad_len, uint8_t* plaintext) { | ||
| int len; | ||
| int plaintext_len; | ||
| int AesDecryptor::PlaintextLength(int ciphertext_len) const { |
There was a problem hiding this comment.
Should we replace int with int32_t to be more portable? Same for functions below. Of course this can be a followup change.
There was a problem hiding this comment.
Yes it would make sense to do this as a follow up to avoid changing too much in this PR, I've made #43141 for this
| int AesDecryptor::AesDecryptorImpl::GetCiphertextLength( | ||
| ::arrow::util::span<const uint8_t> ciphertext) const { | ||
| if (length_buffer_length_ > 0) { | ||
| if (ciphertext.size() < static_cast<size_t>(length_buffer_length_)) { |
There was a problem hiding this comment.
It took me a while to understand this line. Perhaps it is better to explicitly use kBufferSizeLength here as line 484 to 486 have assumed this length is 4 bytes.
There was a problem hiding this comment.
Yes good point, that also confused me a bit. I've changed that and added a comment to hopefully make this more readable
| if (ciphertext.size() < static_cast<size_t>(length_buffer_length_)) { | ||
| std::stringstream ss; | ||
| ss << "Ciphertext buffer length " << ciphertext.size() | ||
| << " is insufficient to read the ciphertext length"; |
There was a problem hiding this comment.
nit: include the length (4 bytes) in the error message.
|
CI failures are unrelated. I will merge it shortly. |
|
@raulcd Please feel free to port it to maint-17.0.0 |
|
(Gang says he has api error when run merge script so I merged this, lol) |
|
Thanks both for jumping in on the review and thanks @adamreeve for the PR |
… segfault (#43071) ### Rationale for this change See #43070 ### What changes are included in this PR? Checks that the ciphertext length is at least enough to hold the length (if written), nonce and GCM tag for the GCM cipher type. Also enforces that the input ciphertext length parameter is provided (is > 0) and verifies that the ciphertext size read from the file isn't going to cause reads beyond the end of the ciphertext buffer. ### Are these changes tested? Yes I've added new unit tests for this. ### Are there any user-facing changes? No * GitHub Issue: #43070 Authored-by: Adam Reeve <adreeve@gmail.com> Signed-off-by: mwish <maplewish117@gmail.com>
|
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit eadeb74. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 27 possible false positives for unstable benchmarks that are known to sometimes produce them. |
| ss << "Negative plaintext length " << plaintext_len; | ||
| throw ParquetException(ss.str()); | ||
| } | ||
| return plaintext_len + ciphertext_size_delta_; |
There was a problem hiding this comment.
We should ideally check for signed addition overflow here.
There was a problem hiding this comment.
Good point. Is it ok to add this to the changes in #43195, or should I make a separate PR for these follow ups?
There was a problem hiding this comment.
IMHO you can add them to the aforementioned PR.
| if (ciphertext_len > 0 && | ||
| ciphertext_len != (written_ciphertext_len + length_buffer_length_)) { | ||
| throw ParquetException("Wrong ciphertext length"); | ||
| if (written_ciphertext_len < 0) { |
There was a problem hiding this comment.
I wonder if the compiler won't elide this, as ((ciphertext[3] & 0xff) << 24) becoming negative implies signed integer overflow which is undefined behavior. @felipecrv What do you think?
There was a problem hiding this comment.
This check is weird. written_ciphertext_len should be uint32_t, then checking that written_ciphertext_len is not greater than std::numeric_limits<int32_t>::max() is enough. In other words: the encoded value is never negative, but it can be too big.
There was a problem hiding this comment.
ciphertext[3] & 0xff is weird as well, since ciphertext[3] is a uint8, thus in [0,255] already.
Perhaps the proper check should simply be ciphertext[3] < 0x80u?
There was a problem hiding this comment.
Hmm yeah the & 0xffs are all redundant, maybe it ended up written this way to be similar to the code for writing the length. Rather than just checking the length isn't negative, we should also check that written_ciphertext_len + length_buffer_length_ doesn't overflow, so I think it would probably be simplest to read as uint32_t.
| uint8_t tag[kGcmTagLength]; | ||
| memset(tag, 0, kGcmTagLength); |
There was a problem hiding this comment.
Or more idiomatically:
std::array<uint8_t, kGcmTagLength> tag{};There was a problem hiding this comment.
I've changed this as well as the other uses of C style arrays in ff31c7b
| AllocateBuffer(decryptor->pool(), | ||
| static_cast<int64_t>(clen - decryptor->CiphertextSizeDelta()))); | ||
| const uint8_t* cipher_buf = buf; | ||
| auto decrypted_buffer = std::static_pointer_cast<ResizableBuffer>(AllocateBuffer( |
There was a problem hiding this comment.
Unrelated, but since we're not attempting to resize the buffer, the cast to ResizableBuffer should be superfluous?
There was a problem hiding this comment.
AllocateBuffer here is ::parquet::AllocateBuffer which returns a std::shared_ptr<ResizableBuffer>, rather than ::arrow::AllocateBuffer, so it looks like the cast would be unnecessary even if we were resizing it:
arrow/cpp/src/parquet/platform.cc
Lines 36 to 37 in 031497d
There was a problem hiding this comment.
Hmm, I see. We can try removing it at some point then, as the code currently looks weird.
There was a problem hiding this comment.
I've removed this as well as another occurrence of the same behaviour in f4a7721
…pan instead of raw pointers (#43195) ### Rationale for this change See #43142. This is a follow up to #43071 which refactored the Decryptor API and added extra checks to prevent segfaults. This PR makes similar changes to the Encryptor API for consistency and better maintainability. ### What changes are included in this PR? * Change `AesEncryptor::Encrypt` and `Encryptor::Encrypt` to use `arrow::util::span` instead of raw pointers * Replace the `AesEncryptor::CiphertextSizeDelta` method with a `CiphertextLength` method that checks for overflow and abstracts the size difference behaviour away from consumer code for improved readability. ### Are these changes tested? * This is mostly a refactoring of existing code so is covered by existing tests. ### Are there any user-facing changes? No * GitHub Issue: #43142 Lead-authored-by: Adam Reeve <adreeve@gmail.com> Co-authored-by: Antoine Pitrou <antoine@python.org> Signed-off-by: Antoine Pitrou <antoine@python.org>
Rationale for this change
See #43070
What changes are included in this PR?
Checks that the ciphertext length is at least enough to hold the length (if written), nonce and GCM tag for the GCM cipher type.
Also enforces that the input ciphertext length parameter is provided (is > 0) and verifies that the ciphertext size read from the file isn't going to cause reads beyond the end of the ciphertext buffer.
Are these changes tested?
Yes I've added new unit tests for this.
Are there any user-facing changes?
No