-
Notifications
You must be signed in to change notification settings - Fork 1.1k
[Variant] Support creating sorted dictionaries #7833
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
34fe046 to
ede73d2
Compare
parquet-variant/src/builder.rs
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there are no field names in the variant, we'll mark it as not sorted.
The spec leaves this pretty ambiguous..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It makes sense to me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems fine. Especially since sortedness only matters if there are enough field names that binary search beats linear search (which usually starts around 10-20 entries).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the other hand... why do the extra work to check for is_empty? The is_sorted will return a correct result either way.
ede73d2 to
48d58ee
Compare
I just merged that PR so we can probably rebase this one now |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @friendlymatthew -- this is looking good. I left some comments -- let me know
parquet-variant/src/builder.rs
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It makes sense to me
parquet-variant/src/builder.rs
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might help to give some more context here like
/// This method pre-populates the field name dictionary in the Variant Metadata
/// with the specified field names, in order.
///
/// You can use this to prepopulate a VariantBuilder with a sorted dictionary if
/// you know the field names before hand somehow. Sorted dictionaries can accelerate
/// field access when reading Variants.
I also think it would be great to add a note and a link to the documentation of VariantBuilder::build telling about this method
...
/// Note: the builder checks if the metadata dictionary is sorted and if it is
/// marks the metadata as sorted, which can accelerate reading Variant values.
/// Field names are added to the metadata dictionary in the order they are first
/// encoutered. See [`Self::with_field_names`] to pre-populate field names in
/// order if you know beforehand the fields that can appear in your objects
fn build(self) -> ...There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, thank you for your suggestion. I need to get better at my prose!
parquet-variant/src/builder.rs
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am don't think it is important to verify these internal details of the field_ids -- they key thing is to verify that the object is sorted if the keys are pre-added in sorted order
parquet-variant/src/builder.rs
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
likewise here, I don't think checking the field ids is important.
48d58ee to
2e1bc84
Compare
c026a95 to
dc2221b
Compare
scovich
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice addition! Some thoughts about how to make the API cleaner/safer.
parquet-variant/src/builder.rs
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
field_names?
Also, why impl Iterator instead of the more common impl IntoIterator, out of curiosity? Can also take items that impl Into<String>, for additional flexibility.
| fn from_field_names<'a>(field_name: impl Iterator<Item = &'a str>) -> Self { | |
| Self { | |
| field_names: IndexSet::from_iter(field_name.map(|f| f.to_string())), | |
| } | |
| fn from_field_names<'a>(field_names: impl IntoIterator<Item = impl Into<String>>) -> Self { | |
| let field_names = field_names.into_iter().map(Into::into).collect(); | |
| Self { field_names } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also -- would it make sense to provide an impl FromIterator for MetadataBuilder, instead of or in addition to this constructor? Then people could build a new instance in all the usual rustic ways, e.g.
let builder = MetadataBuilder::from_iter(["a", "b", "c"]);or
self.metadata_builder = ["a", "b", "c"].into_iter().collect();There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree switching to IntoIterator would be nicer as would implemeting Extend
If we don't do it in this PR, let's file a ticket to fix follow on PRs
parquet-variant/src/builder.rs
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
aside: On the off-chance somebody ever managed to insert more than 2**32-1 entries, this would silently produce wrong results. We should consider returning usize and require the caller to do the conversion, since they have a better chance of being able to handle the issue cleanly.
On the other hand, we have two potential sources of "protection" already:
- Most machines will run out of memory long before they hit u32 overflow, and that will panic. So it would be hard to trigger the bug in the first place.
num_field_namesreturnsusize. So we could make the builder'sfinishmethod fallible and then check for u32 overflow only once at the end. Downside is, then we did all the work to insert those extra field names even tho we're doomed to fail later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One option could be to change the code to panic in this unlikely case and deal with it when / if we have an actual issue
As you point out, I don't think this is related to this PR
parquet-variant/src/builder.rs
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an expensive check, especially for a larger dictionary. Since the builder is append-only (can't change or remove existing entries), we could incrementally maintain an is_sorted flag instead?
Although the total number of operations is the same, it should be cheaper because the bytes of a freshly inserted field name will still be in CPU cache (having just been initialized and/or hashed). So the string comparison will be a lot faster than if it takes a cache miss after not having been accessed for a long time.
Details
The builder would default to is_sorted=true, but switches to false if any newly inserted field name breaks ordering:
fn upsert_field_name(&mut self, field_name: &str) -> u32 {
let (id, inserted) = self.field_names.insert_full(field_name.to_string());
if inserted {
let n = self.field_names.len();
if n >= 2 && self.field_names[n-2] > self.field_names[n-1] {
self.is_sorted = false;
}
}
id as u32
}And then when initializing from an iterator, just insert each entry manually in a loop
let mut new_self = Self::new();
for field_name in field_names {
let _ = new_self.upsert_field_name(field_name);
}
new_selfActually, if we impl Extend for MetadataBuilder, then the initialization becomes even simpler:
let mut new_self = Self::new();
new_self.extend(field_names);
new_selfThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an expensive check, especially for a larger dictionary. Since the builder is append-only (can't change or remove existing entries), we could incrementally maintain an is_sorted flag instead?
I am not sure it is all that expensive
https://doc.rust-lang.org/core/iter/trait.Iterator.html#method.is_sorted
I think it simply checks all the elements pariwise (code)
I don't think it is any more/less expensive in theory to check the dictionary on write than to do the checks incrementally as the variant is written. In each case it needs to do N-1 comparisons where N is the number of dictionary entries
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah it comes down to CPU caches and memory bandwidth, for larger dictionaries and/or longer strings. Technically constant factors but not necessarily ones we should ignore.
parquet-variant/src/builder.rs
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would wipe out any previously added field names, which seems like an unnecessary footgun?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, similar to above -- would it make sense to impl Extend for MetadataBuilder and then extend here instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Replacing the metadata builder also invalidates any previously created objects which seems likely to be pretty bad
I agree we should change this to append to (not replace) existing fields
pub fn with_field_names<'a>(mut self, field_names: impl Iterator<Item = &'a str>) -> Self {
for field_name in field_names {
self.metadata_builder.upsert_field_name(field_name)
}
self
}
The Extend is an excellent idea as well, but it is less clear to me that Extend on a VariantBuilder should only extend fields -- I would expect Extend to take Variants and logically extend the in-progress variant 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, Extend (and FromIterator) suggestion was only for the underlying MetadataBuilder. Agree it would be weird to have one for field names only at variant builder level!
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @scovich and @friendlymatthew
from my perspective the only thing that needs to be fixed are:
- with_field_names should append rather than replace
@scovich 's idea of dding Extend is an excellent idea as well, but I think it could be done as a follow on PR as well
parquet-variant/src/builder.rs
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an expensive check, especially for a larger dictionary. Since the builder is append-only (can't change or remove existing entries), we could incrementally maintain an is_sorted flag instead?
I am not sure it is all that expensive
https://doc.rust-lang.org/core/iter/trait.Iterator.html#method.is_sorted
I think it simply checks all the elements pariwise (code)
I don't think it is any more/less expensive in theory to check the dictionary on write than to do the checks incrementally as the variant is written. In each case it needs to do N-1 comparisons where N is the number of dictionary entries
parquet-variant/src/builder.rs
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One option could be to change the code to panic in this unlikely case and deal with it when / if we have an actual issue
As you point out, I don't think this is related to this PR
parquet-variant/src/builder.rs
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Replacing the metadata builder also invalidates any previously created objects which seems likely to be pretty bad
I agree we should change this to append to (not replace) existing fields
pub fn with_field_names<'a>(mut self, field_names: impl Iterator<Item = &'a str>) -> Self {
for field_name in field_names {
self.metadata_builder.upsert_field_name(field_name)
}
self
}
The Extend is an excellent idea as well, but it is less clear to me that Extend on a VariantBuilder should only extend fields -- I would expect Extend to take Variants and logically extend the in-progress variant 🤔
parquet-variant/src/builder.rs
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree switching to IntoIterator would be nicer as would implemeting Extend
If we don't do it in this PR, let's file a ticket to fix follow on PRs
a088502 to
cd4f46a
Compare
parquet-variant/src/builder.rs
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noticed the variant_interop tests will fail if we mark empty field names as sorted, hence the additional branch.
parquet-variant/src/builder.rs
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some additional assertions that test the constructor methods
749d259 to
0b3f3a0
Compare
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @friendlymatthew -- I think this could be merged as is
It would be good to potentially avoid the duplication I pointed out, and maybe some more docs, but otherwise this looks great
parquet-variant/src/builder.rs
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is quite neat
parquet-variant/src/builder.rs
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is pretty neat.
I think we should add tests of some type for this impls
I personally suggest adding to the doc tests for MetadataBuilder, something like
let metadata = MetadataBuilder::from(["foo", "bar" "baz"])or add a specific test below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of curiosity, why AsRef<str> instead of Into<String>? The former forces a copy even when passing owned strings.
Tho IMO this is a messy part of rust... we really want something Cow-flavored that handles all of Deref + Borrow + AsRef + Into... today you have to pick just one, because adding a second causes ambiguity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is fair. I chose AsRef<str> since the upsert_field_name took a &str. Since upsert_field_name is a public method, I chose not to update the function header and work around this.
As for using Cows, I think this is a really good idea. There are some areas where I feel like we could really benefit from an owned string. For example:
scovich
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, one small fix
parquet-variant/src/builder.rs
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| self.is_sorted = | |
| n == 1 || self.is_sorted & (self.field_names[n - 2] < self.field_names[n - 1]); | |
| self.is_sorted = | |
| n == 1 || self.is_sorted && (self.field_names[n - 2] < self.field_names[n - 1]); |
(&& has short-circuit behavior and avoids the string comparison when we already broke sorting)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, a code comment might help here, there's a lot to unpack:
- An empty dictionary is unsorted (ambiguous in spec but required by interop tests)
- A single-entry dictionary is trivially sorted
- Otherwise, an already-sorted dictionary becomes unsorted if the new entry breaks order
7534aa6 to
e6e985d
Compare
e6e985d to
66a1262
Compare
| if new_entry { | ||
| let n = self.num_field_names(); | ||
|
|
||
| // Dictionary sort order tracking: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👨🍳 👌
| /// assert!(result.is_err()); | ||
| /// ``` | ||
| /// | ||
| /// # Example: Sorted dictionaries |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
love it
|
Thanks again @scovich and @friendlymatthew - I think this one is looking really nice FYI @zeroshade -- we implemented the solution from Go in Rust! |
note: this PR is based off of #7808Which issue does this PR close?
Rationale for this change
The Parquet variant supports creating objects with sorted dictionaries, where field names are sorted in lexicographical order. Sorting the dictionary and reassigning field IDs afterward can be computationally expensive. This PR offers an alternative: allowing users to specify the field names upfront, so that the correct field IDs can be assigned directly. (The correct field ids being correlated to the lexicographical sort order).
This PR introduces two new public methods to
VariantBuilder:with_field_names, a builder method that takes in aself, initializes the dictionary, and returns the modified builderadd_field_name, a method to add individual field names to the dictionary manually