[Variant] Speedup validation #7878
Conversation
Note: this is still a POC. There's some code movement to do, but I figured it would be best to leave the diff like this to make it more readable. But the core concept is fully fleshed out.
BTW I ran the test from this ticket locally. Previously (before this PR) it took over 20 seconds; with this PR, only 6.5 seconds 🤗 -- nice work
scovich left a comment
Very interesting ideas. Made a partial pass with some ideas for improved control flow.
parquet-variant/src/variant/list.rs
Outdated
Is this faster and/or simpler than try_into from slice to array?
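For context, here is a minimal, self-contained sketch of the slice-to-array `try_into` pattern the comment refers to (illustrative only, not the PR's actual code):

```rust
// Sketch: decode a little-endian u32 offset by converting a 4-byte slice into a
// fixed-size array via TryInto, then reading it with u32::from_le_bytes.
fn read_offset_le(bytes: &[u8]) -> Option<u32> {
    let arr: [u8; 4] = bytes.get(..4)?.try_into().ok()?;
    Some(u32::from_le_bytes(arr))
}
```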
parquet-variant/src/variant/list.rs
Outdated
Why not Iterator::is_sorted_by?
```rust
let offsets_monotonically_increasing = offset_chunks
    .map(...)
    .is_sorted_by(...);
```
(again below)
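For readers unfamiliar with the combinator, here is a small self-contained illustration of `Iterator::is_sorted_by` (stable since Rust 1.82) applied to little-endian offset chunks; the function and layout here are illustrative assumptions, not the crate's code:

```rust
// Decode 4-byte little-endian offsets and check they are strictly increasing,
// without materializing them into a Vec.
fn offsets_strictly_increasing(offset_bytes: &[u8]) -> bool {
    offset_bytes
        .chunks_exact(4)
        .map(|chunk| u32::from_le_bytes(chunk.try_into().unwrap()))
        .is_sorted_by(|a, b| a < b)
}

fn main() {
    let bytes: Vec<u8> = [1u32, 3, 7].iter().flat_map(|v| v.to_le_bytes()).collect();
    assert!(offsets_strictly_increasing(&bytes));
}
```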
Update: The code below uses the materialized offsets.
But for large objects, wouldn't it be cheaper (and simpler) to combine the checks instead of materializing the offset array? Most likely inside a helper method so the code can leverage ?:
```rust
let mut offsets = offset_chunks.map(|chunk| match chunk {
    ...
});
let Some(mut prev_offset) = offsets.next().transpose()? else {
    return Ok(());
};
for offset in offsets {
    let offset = offset?;
    if prev_offset >= offset {
        return Err(... not monotonic ...);
    }
    let value_bytes = slice_from_slice(value_buffer, prev_offset..offset)?;
    prev_offset = offset;
    let _ = Variant::try_new_with_metadata(self.metadata, value_bytes)?;
}
Ok(())
```
As above, no need to materialize the iterator if we combine the checks into a single loop?
(but also see comment below)
As above, I expect it will be straightforward to avoid materializing these iterators
(again below)
I agree -- it seems unnecessary to allocate and copy all the values into a Vec here
Hm happy to change, but I think in this case the extra allocation isn’t hurting perf very much and it makes the code a lot simpler to reason about because we can split up the following checks better.
I could inline calls to map_bytes_to_offsets everywhere we loop over field_ids, but that doesn’t seem much better than doing a single allocation
> I could inline calls to map_bytes_to_offsets everywhere we loop over field_ids, but that doesn’t seem much better than doing a single allocation
I only see one pass over field_ids? Two possible paths, but each is single-pass?
(the not-sorted-metadata case makes a second pass, but I'm pretty sure it's redundant and will never find a failure -- see other comment)
> I think in this case the extra allocation isn’t hurting perf very much

Seems like something we could (should) quantify? For a not-small dictionary I would expect materialization cost to be quite noticeable. Anything big enough to fall out of L1 cache would likely see a significant slowdown due to a wave of cache misses on every pass. And that includes not just the offset vec itself (easy enough to prefetch), but also the bytes of the strings and values we examine.

> and it makes the code a lot simpler to reason about
I don't think that's necessarily true for the metadata and list cases above (see other comments).
For this case, it would look something like:
```rust
let mut field_ids = map_bytes_to_offsets(field_id_buffer, self.header.field_id_size);
if let Some(mut field_id) = field_ids.next() {
    if self.metadata.is_sorted() {
        // Since the metadata dictionary has unique and sorted field names...
        for next_field_id in field_ids {
            if next_field_id <= field_id {
                return Err(... field names not sorted ...);
            }
            field_id = next_field_id;
        }
    } else {
        // The metadata dictionary can't guarantee uniqueness...
        ... very similar to the metadata sortedness check: scan + validate_fallible_iterator ...
        ... probably could even factor out a helper function that both sites can call ...
    }
    if field_id >= self.metadata.dictionary_size() {
        return Err(... field id is not valid ...);
    }
}
```
Hi @viirya, I would appreciate your thoughts on this PR as well
alamb left a comment
Thanks @friendlymatthew -- this is making progress
I am not quite sure about the benchmarks
Also, have you had a chance to incorporate @scovich 's comment in https://github.com/apache/arrow-rs/pull/7878/files#r2190963902 ?
this comment seems incorrect
Oof, that is what I get from lifting this code from test cases
I ran these benchmarks locally:
```shell
cargo bench --bench variant_validation
```
Here is some output:
```text
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 33.9s, or reduce sample count to 10.
Benchmarking bench_validate_large_object: Collecting 100 samples in estimated 33.859 s (100 iterations)
```
I think a non trivial amount of the time being measured by this benchmark is the creation of the JSON string...
See comment
Can we move the generation of the large object out of the benchmark, and only benchmark validation?
Arg -- that is embarrassing. I will push up a fix soon
I found these benchmarks somewhat confusing -- they are mostly testing json --> Variant conversion I think.
While those are useful benchmarks, they probably shouldn't be called variant_validation
So I suggest you split this benchmark up into two:
- `parquet-variant/benches/variant_validation` -- creates the `metadata` and `value` programmatically with `VariantBuilder` once and then benchmarks calling `Variant::try_new` on that (precreated) metadata/value
- `parquet-variant-json/benches/variant_json.rs` -- creates a JSON string once and then benchmarks how fast calling `json_to_variant` is
> I found these benchmarks somewhat confusing -- they are mostly testing json --> Variant conversion I think.
> While those are useful benchmarks, they probably shouldn't be called variant_validation
Hi, the following benchmarks use iter_batched which doesn't include the startup time (in this case, building up the variant). The benchmark runs should only involve the validation code.
Reading this issue, it seems like iter_batched_ref is a better function to use.
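To make the distinction concrete, here is a minimal, self-contained sketch of the Criterion pattern being discussed (the UTF-8 check is just a stand-in workload, not the crate's validation code): the setup closure runs outside the timed section, and only the routine closure is measured.

```rust
use criterion::{criterion_group, criterion_main, BatchSize, Criterion};

fn bench_validation_only(c: &mut Criterion) {
    c.bench_function("validate_utf8_stand_in", |b| {
        b.iter_batched_ref(
            || "a".repeat(1 << 20).into_bytes(),                   // setup: built per batch, not timed
            |bytes| std::str::from_utf8(bytes.as_slice()).is_ok(), // timed work only
            BatchSize::LargeInput,
        )
    });
}

criterion_group!(benches, bench_validation_only);
criterion_main!(benches);
```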
As for the json_to_variant, I could've built up the variant manually via the builder, but found the json example quite convenient. Since we moved json_to_variant to a separate crate, I had no choice but to move the validation benchmarks here.
But I agree, we should move this back to parquet-variant. It is a bit unfortunate, but we can manually build up the objects.
I wonder how much improvement the new crate adds?
One thing I thought was that if we create a Variant from JSON, we already know all the values were valid UTF-8, so it makes sense to skip validating it again before reading
What is the difference between the currently used str::from_utf8 and this simdutf8::basic::from_utf8?
Yeah this isn't the bottleneck so I'm fine with using str::from_utf8. I am a big fan of simd and am a sucker for potential vectorized operations
Parquet already has simdutf8 as an (optional) dependency, so I think it makes sense to use it if enabled.
Line 73 in ff3a2f2:
```toml
simdutf8 = { version = "0.1.5", optional = true, default-features = false }
```
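For illustration, a hedged sketch of how such an optional dependency is commonly wired in (the `simdutf8` feature name and the error flattening are assumptions for the example, not the crate's actual code):

```rust
// Choose the SIMD-accelerated validator when the (assumed) "simdutf8" cargo
// feature is enabled, and fall back to std otherwise.
fn validate_utf8(bytes: &[u8]) -> Result<&str, String> {
    #[cfg(feature = "simdutf8")]
    {
        // basic::from_utf8 skips computing an error offset, which is part of
        // why it is faster than std on valid input.
        simdutf8::basic::from_utf8(bytes).map_err(|e| e.to_string())
    }
    #[cfg(not(feature = "simdutf8"))]
    {
        std::str::from_utf8(bytes).map_err(|e| e.to_string())
    }
}
```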
I agree -- it seems unnecessary to allocate and copy all the values into a Vec here
I don't understand why this is faster than simply checking that all the field_ids are less than self.metadata.dictionary_size -- finding the max requires looking at all the items anyway
parquet-variant/src/variant/list.rs
Outdated
it seems like there is a lot of duplicated code that does about the same thing: check that a slice of offsets are all
- sorted (potentially)
- less than some max offset
- pointing into a valid sub variant

I wonder if it is possible to factor it all out into a function like:
```rust
fn validate_offsets(offset_buffer: &[u8], num_offsets: usize, offset_size: OffsetSize, max_valid_offset: usize) {
    ...
}
```
Or something 🤔
I plan on following up with a PR that removes this check since we can validate the monotonicity of offsets when accessing variants
Hmm, is this a new check? It seems the existing VariantMetadata's with_full_validation doesn't have this check?
Yeah, the current validation was written before we supported sorted dictionaries
nit:
```diff
-.map(|i| {
-    let field_range = offsets[i - 1]..offsets[i];
-    value_str.get(field_range)
-})
+.map(|i| value_str.get(offsets[i - 1]..offsets[i]))
```
scovich left a comment
I would prefer to not materialize id and offset arrays during validation -- there's no way it's free -- but maybe that can be a fast-follow PR?
Also -- some of the efficient iteration techniques the validation uses should probably be backported to the actual iterators?
parquet-variant/src/variant/list.rs
Outdated
We don't need this check -- the loop below does `offsets[i-1]..offsets[i]` for every `i` in `1..offsets.len()`, and any non-monotonic offset would cause slice_from_slice to return an Err like:
`Tried to extract byte(s) 42..25 from 100-byte buffer`
Is it worth making an extra pass over the offsets (which requires materializing them) just to have a slightly nicer error message?
The simplified version would be something like:
```rust
let mut offsets = map_bytes_to_offsets(offset_buffer, self.header.offset_size);
if let Some(mut start) = offsets.next() {
    for end in offsets {
        ... validate start..end ...
        start = end;
    }
}
```
Again, I'm not sure it's worth paying an extra pass for a monotonicity check, when the slicing operations that follow will naturally fail in the presence of non-monotonic offsets?
```rust
let mut offsets = map_bytes_to_offsets(offset_bytes, self.header.offset_size);
if let Some(first_offset) = offsets.next() {
    // Create an iterator over the strings. We must consume the iterator to validate it.
    let mut strings = offsets.scan(first_offset, |start, end| {
        let s = slice_from_slice(value_buffer, *start..end);
        *start = end;
        Some(s)
    });
    if self.header.is_sorted {
        // verify the strings are sorted and unique, in addition to being individually valid
        if let Some(mut a) = strings.next().transpose()? {
            for b in strings {
                let b = b?;
                if a >= b {
                    return Err(... dictionary values are not unique and ordered ...);
                }
                a = b;
            }
        }
    } else {
        // Just verify the strings are all individually valid
        validate_fallible_iterator(strings)?;
    }
}
```
This time, an iterator scan works well because we naturally unpack the errors later.
Note: The above does require making slice_from_slice fully generic:
```rust
pub(crate) fn slice_from_slice<T, I: SliceIndex<[T]> + Clone + Debug>(
    bytes: &[T],
    index: I,
) -> Result<&I::Output, ArrowError> {
    ...
}
```
parquet-variant/src/decoder.rs
Outdated
Needs a doc comment explaining what it does and the conditions under which it's panic-free?
rustc optimizes and specializes super aggressively -- the specialization you show is for a single-element iterator -- it doesn't even have a loop. But I agree that using chunks (rather than chunks_exact) means it shouldn't be able to panic.
Oof, the screenshot should've shown the codegen for chunks_exact.
chunks_exact would never panic. If the slice length we want to chunk over is not evenly divided by the chunk size, it will omit the remainder elements, which can be retrieved by calling remainder() on the iterator.
More formally, ChunksExact will yield array.len() // chunk_length chunks. ChunksExact::remainder will return a slice with array.len() % chunk_length elements.
chunks would risk a panic if the slice length is not evenly divided by the chunk size, since the remainder is directly appended as the last (shorter) slice in the iterator.
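A small self-contained illustration of the `chunks_exact` semantics described above: every yielded chunk has the full length (so fixed-size conversions can't panic), and leftover bytes are exposed separately via `remainder()`.

```rust
fn main() {
    let bytes = [1u8, 2, 3, 4, 5, 6, 7]; // 7 bytes, chunk size 3
    let mut it = bytes.chunks_exact(3);

    for chunk in &mut it {
        // Safe: chunks_exact only yields chunks of exactly 3 elements.
        let arr: [u8; 3] = chunk.try_into().unwrap();
        println!("{arr:?}");
    }
    // The trailing byte that didn't fill a full chunk:
    assert_eq!(it.remainder(), &[7]);
}
```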
clever!
We don't need this check. If any field id were out of bounds, the collect at L245 would have already failed.
Shallow validation already ensured the last offset is in bounds, correct?
does this not work for some reason?
```diff
-Ok::<_, ArrowError>(())
+Ok(())
```
(try_for_each does seem to have a pretty weird collect-like signature, but I would have hoped that the previous ? would have fixed the error type...)
Hmm, if there are values like ..., None, Some(a), None, Some(b)..., is it possible that an unordered case like a >= b cannot be detected?
One question: if is_sorted is false, does it mean the dictionary cannot guarantee either sortedness or uniqueness? If so, is it still a "dictionary"?
And if so, the name is_sorted is also confusing, as it doesn't convey only sortedness.
You're right, but all this naming comes from the variant spec 🤷
Hi @viirya, good question. From the specification:
> If `sorted_strings` is set to 1, strings in the dictionary must be unique and sorted in lexicographic order. If the value is set to 0, readers may not make any assumptions about string order or uniqueness.
One question: do field ids start from 0, and are all field ids contiguous?
Yes, per the spec:
> A field_id is an index into the dictionary in the metadata
Btw, the existing full validation of VariantObject doesn't check whether its field names are sorted -- was that check missing previously?
I think you're right. The old validation only ensured individual field names were valid, without comparing them for sortedness.
Field name sortedness is different from field id sortedness, right? We can still check whether field_ids is sorted or not. If it is sorted, we can just check the last field id?
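A minimal sketch of that idea (a hypothetical helper, not the crate's code; assumes a Rust version with slice::is_sorted): when the field ids are sorted, bounds-checking only the last one suffices.

```rust
// Field ids index into the metadata dictionary, so if they are sorted the
// largest id is the last one and a single comparison covers them all.
fn field_ids_in_bounds(field_ids: &[usize], dictionary_size: usize) -> bool {
    if field_ids.is_sorted() {
        field_ids.last().map_or(true, |&last| last < dictionary_size)
    } else {
        field_ids.iter().all(|&id| id < dictionary_size)
    }
}
```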
I have sort of lost track of the current state of this PR. Is it something we want to merge and then do follow-ons? If so, can we file some tickets to track those follow-ons? Or do we want to keep working on this PR? I would like to improve the benchmarks to more cleanly separate the validation benchmarks from JSON parsing benchmarks.
Hi, I would like to spend a bit more time on this. @scovich raised some good points about redundant checks which will remove the need to collect offsets. Here is my checklist for this PR:
Honestly, this one feels like a good fast-follow, rather than something that would block the merge? Unless it's super easy/fast to clean up. Perfection as enemy of good and all that. This PR is already a drastic improvement over the current validation. And as @viirya pointed out, this PR adds some validations that were just plain missing before. Also, as a separate PR we could hopefully even see benchmark results quantifying the impact of the change.
Sounds good. I'll get the benchmarks cleaned up and kick the rest over for another PR |
Looks like there is one final CI check to fix and then I can merge this PR: https://github.com/apache/arrow-rs/actions/runs/16209770665/job/45767524794?pr=7878
I can't push to your PR directly because I don't have permissions in the pydantic fork. If you make PRs from your own fork in the future, I can make changes directly, if you think that will help.
alamb left a comment
Thanks again @friendlymatthew -- I think this is an improvement and it gets us set up for more improvements (the benchmark, etc)
Thus I think we should merge it as soon as we can get the CI to pass
Thanks again @scovich @viirya and @Dandandan for all the help
Hi, I've been trying to figure out why CI is failing. Looking at the warnings, a lot of it is coming from the […]. The warnings related to […]. Fwiw, running […]
For example, the latest CI run displays errors like:
> error: could not document …
Looks to me like the latest nightly Rust got more strict, and for some reason we build docs with nightly in this crate. So to reproduce locally I bet you can do:
```shell
rustup toolchain install nightly
cargo +nightly doc --document-private-items --no-deps --workspace --all-features
```
So in other words it is not related to your PR, and CI would likely fail on main if we re-ran it.
@viirya fixed the docs CI failure in this PR ❤️ So I am merging this PR in to keep the code flowing
Thanks again. @friendlymatthew can you make sure any follow-on tasks we identified are properly tracked with tickets? Thank you again @scovich @viirya and @Dandandan -- this one took a bit but I think we are on the right track


Rationale for this change
- `test_json_to_variant_object_very_large` takes over 20s #7872

This PR contains algorithmic modifications to the validation logic and the associated benchmarks, specifically targeting complex object and list validation.
Previously, the approach involved iterating over each element and repeatedly fetching the same slice of the backing buffer, then slicing into that buffer again for each individual element. This led to redundant buffer access.
The new validation approach works in multiple passes that take advantage of the variant's memory layout. For example, dictionary field names are stored contiguously; instead of checking each field name for valid UTF-8 separately, we now validate the entire field name buffer in a single pass.
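As an illustration of the single-pass idea (a sketch under assumed names and layout, not the PR's actual code): validate the contiguous field-name buffer as UTF-8 once, then slice individual names out of the already-validated string.

```rust
use std::str;

// `offsets` is assumed to hold byte offsets into `names_buffer`, one per field
// name boundary, since the metadata stores the names contiguously.
fn validate_field_names(names_buffer: &[u8], offsets: &[usize]) -> Result<(), String> {
    // One UTF-8 check over the entire buffer...
    let all_names = str::from_utf8(names_buffer).map_err(|e| e.to_string())?;
    // ...then only cheap, boundary-checked slicing per field name.
    for pair in offsets.windows(2) {
        all_names
            .get(pair[0]..pair[1])
            .ok_or_else(|| "field name offset is not a char boundary".to_string())?;
    }
    Ok(())
}
```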
The benchmark cases were adapted from the `test_json_to_variant_object_very_large`, `test_json_to_variant_object_complex`, and `test_json_to_variant_array_nested_large` test cases.

Compared to #7871, we observe a significant improvement in performance:
@scovich @alamb