Skip to content

Conversation

@scovich
Copy link
Contributor

@scovich scovich commented Jun 18, 2025

Which issue does this PR close?

Rationale for this change

Infallible iteration is much easier to work with, vs. Iterator of Result or Result of Iterator. Iteration and validation are strongly correlated, because the iterator can only be infallible if the constructor previously validated everything the iterator depends on.

What changes are included in this PR?

In all three of VariantMetadata, VariantList, and VariantObject:

  • The header object is cleaned up to only consider actual header state. Other state is moved to the object itself.
  • Constructors fully validate the object by consuming a fallible iterator
  • The externally visible iterator does a map(Result::unwrap) on the same fallible iterator, relying on the constructor to prove the unwrap is safe.
  • The externally visible iterator is obtained by calling iter() method.

In addition:

  • VariantObject methods no longer materialize the whole offset+field array
  • Removed validation that is covered by the new iterator testing
  • A bunch of dead code removed, several methods renamed for clarity
  • first_byte_from_slice now returns u8 instead of &u8

Are there any user-facing changes?

Visibility and signatures of some methods changed.

@github-actions github-actions bot added the parquet Changes to the parquet crate label Jun 18, 2025
@scovich
Copy link
Contributor Author

scovich commented Jun 18, 2025

CC @alamb @mkarbo

}

pub(crate) fn first_byte_from_slice(slice: &[u8]) -> Result<&u8, ArrowError> {
pub(crate) fn first_byte_from_slice(slice: &[u8]) -> Result<u8, ArrowError> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small ergonomic improvement

Comment on lines +78 to +79
pub(crate) fn try_binary_search_range_by<K, E, F>(
range: Range<usize>,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turns out we don't have a slice we can search over.

Ranges are anyway more flexible, because the key extraction can always index into a slice if desired.

let mid = (left + right) / 2;
let key = key_extractor(&slice[mid])?;
while start < end {
let mid = start + (end - start) / 2;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original average calculation was vulnerable to integer overflow

pub(crate) fn try_new(bytes: &[u8]) -> Result<Self, ArrowError> {
let header = first_byte_from_slice(bytes)?;

pub(crate) fn try_new(header: u8) -> Result<Self, ArrowError> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hoist the first_byte_from_slice call to caller, to avoid the kinds of misuse that had been creeping in.

mut it: impl Iterator<Item = Result<T, E>>,
) -> Result<(), E> {
// NOTE: It should really be `let None = ...`, but the compiler can't prove that.
let _ = it.find(Result::is_err).transpose()?;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even tho this is a one-liner, it's arcane enough to be worth wrapping in a named+documented helper function.

i,
)
})
.collect::<Result<Vec<_>, _>>()?;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No more collect calls!

Comment on lines -497 to -499
num_elements: usize,
first_offset_byte: usize,
first_value_byte: usize,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, moved out of the header struct because not actually part of the variant value header

.checked_add(value_bytes)
.ok_or_else(overflow)?;

// Verify that the last offset array entry is inside the value slice
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These checks are covered by the new iterator test

Comment on lines -431 to -438
let start_offset = field_offsets[index];
let end_offset = field_offsets[index + 1];
let value_bytes = slice_from_slice(
self.value,
self.header.values_start_byte + start_offset
..self.header.values_start_byte + end_offset,
)?;
let variant = Variant::try_new_with_metadata(self.metadata, value_bytes)?;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method was highly redundant, so I rewrote it in terms of field_name and field methods.

matches!(err, ArrowError::InvalidArgumentError(ref msg)
if msg.contains("Index 2 out of bounds for dictionary of length 2")),
"unexpected error: {err:?}"
matches!(err, Err(ArrowError::InvalidArgumentError(_))),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It wasn't obvious to me that the specific error string matters much -- we just want to verify that the bad access does not succeed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know you are just following the pattern in the existing codebase, but one pattern I like for checking errors is

let err = md.get_offset(3).unwrap_err(); // panic's if not an Err
assert!(matches!(err, Err(ArrowError::InvalidArgumentError(_))));

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually... I changed that. Will revert back.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @scovich -- I think this looks like a great improvement

I left several comments / suggestions but I don't think any of them are required to merge this PR. When you are happy with this one just let me know and I'll merge it in

Comment on lines 178 to 181
// Verify that `iter` can safely `unwrap` the items produced by this iterator.
//
// This has the side effect of validating the offset array and proving that the string bytes
// are all in bounds.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might also help to explain what this is doing in more "end user visible" terms

SOmething like

Suggested change
// Verify that `iter` can safely `unwrap` the items produced by this iterator.
//
// This has the side effect of validating the offset array and proving that the string bytes
// are all in bounds.
// Verify the contents of the dictionary: validate the offset array and proving that the string bytes
// are all in bounds.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took a stab at it, PTAL.

self.bytes
}

pub fn try_new(bytes: &'m [u8]) -> Result<Self, ArrowError> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be useful to document the validation / expectations a bit. on the VariantMetadata structure and this try_new function

Something like

Suggested change
pub fn try_new(bytes: &'m [u8]) -> Result<Self, ArrowError> {
/// Create a new `VariantMetadata` header from the provided metadata bytes
///
/// # Validation
/// This method validates that `bytes` contains valid metadata
/// including checking that all offsets are in within bounds and
/// point to valid utf8 strings
pub fn try_new(bytes: &'m [u8]) -> Result<Self, ArrowError> {

Copy link
Contributor Author

@scovich scovich Jun 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yipes... I just realized that the iterator-based validation in this PR does too much validation. It recursively descends into all child objects/arrays and validates all the way to leaf level. For deeply nested variant values, that could easily take on a quadratic flavor!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yipes... I just realized that the iterator-based validation in this PR does too much validation. It recursively descends into all child objects/arrays and validates all the way to leaf level. For deeply nested variant values, that could easily take on a quadratic flavor!

Maybe it is time for a "try_new_no_validation" or try_new_unchecked method that creates the variant without checking (and might panic on subsequent calls 🤔 ) -- then the parent could skip the validation when returning a Variant 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think we should proceed with this PR and address improving the validation as a follow on issue (perhaps with some benchmarks to guide us)

Copy link
Contributor Author

@scovich scovich Jun 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pub fn get(&self, i: usize) -> Result<&'m str, ArrowError> {
let dictionary_keys_bytes = slice_from_slice(self.bytes, self.dictionary_key_start_byte..)?;
let byte_range = self.get_offset(i)?..self.get_offset(i + 1)?;
string_from_slice(dictionary_keys_bytes, byte_range)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given we have already validated the strings on construction, this could in theory be an unchecked conversion (no need to revalidate utf8)

This is a future potential optimization

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The checked iterator currently calls this method, so we'd have to split up the code to distinguish first-touch (checked) vs. later touches (unchecked)?


/// Returns the number of key-value pairs in this object
pub(crate) fn num_elements(&self) -> usize {
pub fn len(&self) -> usize {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯

matches!(err, ArrowError::InvalidArgumentError(ref msg)
if msg.contains("Index 2 out of bounds for dictionary of length 2")),
"unexpected error: {err:?}"
matches!(err, Err(ArrowError::InvalidArgumentError(_))),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know you are just following the pattern in the existing codebase, but one pattern I like for checking errors is

let err = md.get_offset(3).unwrap_err(); // panic's if not an Err
assert!(matches!(err, Err(ArrowError::InvalidArgumentError(_))));

/// Returns true if the object contains no key-value pairs
pub fn is_empty(&self) -> bool {
self.len() == 0
/// Returns an iterator of (name, value) pairs over the fields of this object.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like how this API now mirrors the rust collections iterator

@alamb
Copy link
Contributor

alamb commented Jun 18, 2025

FYI @mkarbo and @PinkCrow007 @superserious-dev

@alamb
Copy link
Contributor

alamb commented Jun 18, 2025

I took the liberty of merging this PR up from main and fixing a logical conflict to get the CI to pass

@scovich
Copy link
Contributor Author

scovich commented Jun 18, 2025

I took the liberty of merging this PR up from main and fixing a logical conflict to get the CI to pass

@alamb -- When attempting to recompile locally, I noticed that MSRV for parquet-variant is now 1.83? Is that intentional?

@alamb
Copy link
Contributor

alamb commented Jun 18, 2025

I took the liberty of merging this PR up from main and fixing a logical conflict to get the CI to pass

@alamb -- When attempting to recompile locally, I noticed that MSRV for parquet-variant is now 1.83? Is that intentional?

It was necessary to get #7653 merged (specifically I think const / unwrapping a None or something)

You can remove the MSRV override and run

cargo msrv verify

to see the actual error

@scovich
Copy link
Contributor Author

scovich commented Jun 18, 2025

I took the liberty of merging this PR up from main and fixing a logical conflict to get the CI to pass

@alamb -- When attempting to recompile locally, I noticed that MSRV for parquet-variant is now 1.83? Is that intentional?

It was necessary to get #7653 merged (specifically I think const / unwrapping a None or something)

Ah ok. Just didn't want an accidental MSRV bump to slip past, given that the rest of arrow-rs still seems to be at 1.81

@scovich
Copy link
Contributor Author

scovich commented Jun 18, 2025

I'm happy with the PR at this point, other than the validation cost that needs follow-up. Merge at will?

@alamb alamb merged commit 20c1c34 into apache:main Jun 19, 2025
12 checks passed
@alamb
Copy link
Contributor

alamb commented Jun 19, 2025

Thanks @scovich

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

parquet Changes to the parquet crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Variant] Improve API for iterating over values of a VariantList [Variant] Consider validating variants on creation (rather than read)

2 participants