-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Make variant iterators safely infallible #7704
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| } | ||
|
|
||
| pub(crate) fn first_byte_from_slice(slice: &[u8]) -> Result<&u8, ArrowError> { | ||
| pub(crate) fn first_byte_from_slice(slice: &[u8]) -> Result<u8, ArrowError> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small ergonomic improvement
| pub(crate) fn try_binary_search_range_by<K, E, F>( | ||
| range: Range<usize>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Turns out we don't have a slice we can search over.
Ranges are anyway more flexible, because the key extraction can always index into a slice if desired.
| let mid = (left + right) / 2; | ||
| let key = key_extractor(&slice[mid])?; | ||
| while start < end { | ||
| let mid = start + (end - start) / 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original average calculation was vulnerable to integer overflow
parquet-variant/src/variant.rs
Outdated
| pub(crate) fn try_new(bytes: &[u8]) -> Result<Self, ArrowError> { | ||
| let header = first_byte_from_slice(bytes)?; | ||
|
|
||
| pub(crate) fn try_new(header: u8) -> Result<Self, ArrowError> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hoist the first_byte_from_slice call to caller, to avoid the kinds of misuse that had been creeping in.
parquet-variant/src/utils.rs
Outdated
| mut it: impl Iterator<Item = Result<T, E>>, | ||
| ) -> Result<(), E> { | ||
| // NOTE: It should really be `let None = ...`, but the compiler can't prove that. | ||
| let _ = it.find(Result::is_err).transpose()?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even tho this is a one-liner, it's arcane enough to be worth wrapping in a named+documented helper function.
| i, | ||
| ) | ||
| }) | ||
| .collect::<Result<Vec<_>, _>>()?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No more collect calls!
| num_elements: usize, | ||
| first_offset_byte: usize, | ||
| first_value_byte: usize, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, moved out of the header struct because not actually part of the variant value header
| .checked_add(value_bytes) | ||
| .ok_or_else(overflow)?; | ||
|
|
||
| // Verify that the last offset array entry is inside the value slice |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These checks are covered by the new iterator test
| let start_offset = field_offsets[index]; | ||
| let end_offset = field_offsets[index + 1]; | ||
| let value_bytes = slice_from_slice( | ||
| self.value, | ||
| self.header.values_start_byte + start_offset | ||
| ..self.header.values_start_byte + end_offset, | ||
| )?; | ||
| let variant = Variant::try_new_with_metadata(self.metadata, value_bytes)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method was highly redundant, so I rewrote it in terms of field_name and field methods.
parquet-variant/src/variant.rs
Outdated
| matches!(err, ArrowError::InvalidArgumentError(ref msg) | ||
| if msg.contains("Index 2 out of bounds for dictionary of length 2")), | ||
| "unexpected error: {err:?}" | ||
| matches!(err, Err(ArrowError::InvalidArgumentError(_))), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It wasn't obvious to me that the specific error string matters much -- we just want to verify that the bad access does not succeed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know you are just following the pattern in the existing codebase, but one pattern I like for checking errors is
let err = md.get_offset(3).unwrap_err(); // panic's if not an Err
assert!(matches!(err, Err(ArrowError::InvalidArgumentError(_))));There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually... I changed that. Will revert back.
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @scovich -- I think this looks like a great improvement
I left several comments / suggestions but I don't think any of them are required to merge this PR. When you are happy with this one just let me know and I'll merge it in
parquet-variant/src/variant.rs
Outdated
| // Verify that `iter` can safely `unwrap` the items produced by this iterator. | ||
| // | ||
| // This has the side effect of validating the offset array and proving that the string bytes | ||
| // are all in bounds. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might also help to explain what this is doing in more "end user visible" terms
SOmething like
| // Verify that `iter` can safely `unwrap` the items produced by this iterator. | |
| // | |
| // This has the side effect of validating the offset array and proving that the string bytes | |
| // are all in bounds. | |
| // Verify the contents of the dictionary: validate the offset array and proving that the string bytes | |
| // are all in bounds. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took a stab at it, PTAL.
| self.bytes | ||
| } | ||
|
|
||
| pub fn try_new(bytes: &'m [u8]) -> Result<Self, ArrowError> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be useful to document the validation / expectations a bit. on the VariantMetadata structure and this try_new function
Something like
| pub fn try_new(bytes: &'m [u8]) -> Result<Self, ArrowError> { | |
| /// Create a new `VariantMetadata` header from the provided metadata bytes | |
| /// | |
| /// # Validation | |
| /// This method validates that `bytes` contains valid metadata | |
| /// including checking that all offsets are in within bounds and | |
| /// point to valid utf8 strings | |
| pub fn try_new(bytes: &'m [u8]) -> Result<Self, ArrowError> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yipes... I just realized that the iterator-based validation in this PR does too much validation. It recursively descends into all child objects/arrays and validates all the way to leaf level. For deeply nested variant values, that could easily take on a quadratic flavor!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yipes... I just realized that the iterator-based validation in this PR does too much validation. It recursively descends into all child objects/arrays and validates all the way to leaf level. For deeply nested variant values, that could easily take on a quadratic flavor!
Maybe it is time for a "try_new_no_validation" or try_new_unchecked method that creates the variant without checking (and might panic on subsequent calls 🤔 ) -- then the parent could skip the validation when returning a Variant 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think we should proceed with this PR and address improving the validation as a follow on issue (perhaps with some benchmarks to guide us)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| pub fn get(&self, i: usize) -> Result<&'m str, ArrowError> { | ||
| let dictionary_keys_bytes = slice_from_slice(self.bytes, self.dictionary_key_start_byte..)?; | ||
| let byte_range = self.get_offset(i)?..self.get_offset(i + 1)?; | ||
| string_from_slice(dictionary_keys_bytes, byte_range) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given we have already validated the strings on construction, this could in theory be an unchecked conversion (no need to revalidate utf8)
This is a future potential optimization
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The checked iterator currently calls this method, so we'd have to split up the code to distinguish first-touch (checked) vs. later touches (unchecked)?
|
|
||
| /// Returns the number of key-value pairs in this object | ||
| pub(crate) fn num_elements(&self) -> usize { | ||
| pub fn len(&self) -> usize { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💯
parquet-variant/src/variant.rs
Outdated
| matches!(err, ArrowError::InvalidArgumentError(ref msg) | ||
| if msg.contains("Index 2 out of bounds for dictionary of length 2")), | ||
| "unexpected error: {err:?}" | ||
| matches!(err, Err(ArrowError::InvalidArgumentError(_))), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know you are just following the pattern in the existing codebase, but one pattern I like for checking errors is
let err = md.get_offset(3).unwrap_err(); // panic's if not an Err
assert!(matches!(err, Err(ArrowError::InvalidArgumentError(_))));| /// Returns true if the object contains no key-value pairs | ||
| pub fn is_empty(&self) -> bool { | ||
| self.len() == 0 | ||
| /// Returns an iterator of (name, value) pairs over the fields of this object. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like how this API now mirrors the rust collections iterator
|
FYI @mkarbo and @PinkCrow007 @superserious-dev |
|
I took the liberty of merging this PR up from main and fixing a logical conflict to get the CI to pass |
@alamb -- When attempting to recompile locally, I noticed that MSRV for parquet-variant is now 1.83? Is that intentional? |
It was necessary to get #7653 merged (specifically I think const / unwrapping a You can remove the MSRV override and run cargo msrv verifyto see the actual error |
Ah ok. Just didn't want an accidental MSRV bump to slip past, given that the rest of arrow-rs still seems to be at 1.81 |
|
I'm happy with the PR at this point, other than the validation cost that needs follow-up. Merge at will? |
|
Thanks @scovich |
Which issue does this PR close?
Rationale for this change
Infallible iteration is much easier to work with, vs. Iterator of Result or Result of Iterator. Iteration and validation are strongly correlated, because the iterator can only be infallible if the constructor previously validated everything the iterator depends on.
What changes are included in this PR?
In all three of
VariantMetadata,VariantList,andVariantObject:map(Result::unwrap)on the same fallible iterator, relying on the constructor to prove the unwrap is safe.iter()method.In addition:
VariantObjectmethods no longer materialize the whole offset+field arrayfirst_byte_from_slicenow returnsu8instead of&u8Are there any user-facing changes?
Visibility and signatures of some methods changed.