GH-39527: [C++][Parquet] Validate page sizes before truncating to int32 #39528
Conversation
GH-39527: [C++][Parquet] Validate page sizes before truncating to int32. Be defensive instead of writing invalid data.
emkornfield
left a comment
fix includes
Co-authored-by: Gang Wu <ustcwg@gmail.com>
mapleFU
left a comment
General LGTM!
```cpp
DataPageV1 over_compressed_limit(buffer, /*num_values=*/100, Encoding::BIT_PACKED,
                                 Encoding::BIT_PACKED, Encoding::BIT_PACKED,
                                 /*uncompressed_size=*/100);
EXPECT_THROW(pager->WriteDataPage(over_compressed_limit), ParquetException);
```
Should we use EXPECT_THAT to guarantee "overflows to INT32_MAX"?
done.
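The matcher-based form suggested above uses GoogleMock's `ThrowsMessage` and `HasSubstr`. The same check (the call must throw, and the message must contain a given substring) can be sketched in plain C++; `ThrowTooLarge` and `ThrowsWithSubstring` below are hypothetical stand-ins, not the actual test code:

```cpp
#include <exception>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a call such as pager->WriteDataPage(...)
// that throws when the compressed size does not fit in int32.
void ThrowTooLarge() {
  throw std::runtime_error("Compressed data page size overflows to INT32_MAX.");
}

// Returns true only if fn throws and the message contains substr,
// which is the property EXPECT_THAT(..., ThrowsMessage<...>(HasSubstr(...)))
// asserts in one step.
bool ThrowsWithSubstring(void (*fn)(), const std::string& substr) {
  try {
    fn();
  } catch (const std::exception& e) {
    return std::string(e.what()).find(substr) != std::string::npos;
  }
  return false;  // did not throw at all
}
```

Compared to a bare `EXPECT_THROW`, this also pins down *which* error was raised, so an unrelated `ParquetException` cannot make the test pass.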
cpp/src/parquet/column_writer.cc (Outdated)

```cpp
const uint8_t* output_data_buffer = compressed_data->data();
if (compressed_data->size() > std::numeric_limits<int32_t>::max()) {
  throw ParquetException("Compressed page size overflows to INT32_MAX.");
}
```
Should we also note the page type?
(Also, can we print the page size?)
done.
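The guard discussed in this thread, with the page type and offending size added to the message, can be sketched standalone. `CheckedInt32Size` is a hypothetical helper for illustration, not the actual column_writer.cc code:

```cpp
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <string>

// Hypothetical sketch of the defensive check: refuse to narrow a 64-bit
// buffer size to int32 when it would not fit, naming the page type and
// the offending size instead of silently truncating.
int32_t CheckedInt32Size(int64_t size, const std::string& page_type) {
  if (size > std::numeric_limits<int32_t>::max()) {
    throw std::runtime_error("Compressed " + page_type + " size " +
                             std::to_string(size) + " overflows INT32_MAX");
  }
  return static_cast<int32_t>(size);
}
```

This matters because the Parquet page header stores `compressed_page_size` as an int32; casting without a check would write a garbage length into the file.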
@emkornfield Could we move forward? I'll approve once the comment is fixed.

Yes, will try to update this week.

@mapleFU apologies for the delay, I think I've addressed your feedback.
mapleFU
left a comment
+1
```cpp
const uint8_t* output_data_buffer = compressed_data->data();
if (compressed_data->size() > std::numeric_limits<int32_t>::max()) {
  throw ParquetException(
```
Same here and below.
fixed.
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
pitrou
left a comment
LGTM except for one last suggestion
```cpp
            ThrowsMessage<ParquetException>(HasSubstr("overflows INT32_MAX")));
DictionaryPage dictionary_over_compressed_limit(buffer, /*num_values=*/100,
                                                Encoding::PLAIN);
EXPECT_THROW(pager->WriteDictionaryPage(dictionary_over_compressed_limit),
```
Sorry, didn't spot this, but perhaps also use EXPECT_THAT to check the exception message here?
I've updated this now
After merging your PR, Conbench analyzed the 5 benchmarking runs that have been run so far on merge-commit 5d7f661. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 51 possible false positives for unstable benchmarks that are known to sometimes produce them.
GH-39527: [C++][Parquet] Validate page sizes before truncating to int32 (apache#39528)

Be defensive instead of writing invalid data.

### Rationale for this change

Users can provide this API pages that are too large to validly write, and we silently truncate lengths before writing.

### What changes are included in this PR?

Add validations and throw an exception if sizes are too large (this was previously checked only if page indexes are being built).

### Are these changes tested?

Unit tested.

### Are there any user-facing changes?

This might start raising exceptions instead of writing out invalid Parquet files.

Closes apache#39527

**This PR contains a "Critical Fix".**

* Closes: apache#39527

Lead-authored-by: emkornfield <emkornfield@gmail.com>
Co-authored-by: Micah Kornfield <micahk@google.com>
Co-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Co-authored-by: Gang Wu <ustcwg@gmail.com>
Signed-off-by: mwish <maplewish117@gmail.com>