-
Notifications
You must be signed in to change notification settings - Fork 1.1k
[Variant] Support nested lists and object lists #7740
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
3b539d0 to
0168c8c
Compare
|
@PinkCrow007 @superserious-dev and @mkarbo it would also be great to hear your thoughts as well if you have time to review this PR |
|
Hi @scovich, I'd be curious to get your thoughts on this PR as well! |
|
Yes indeed, @scovich your review would be very helpful. I plan to review this PR as well tomorrow |
| /// }; | ||
| /// | ||
| /// assert_eq!( | ||
| /// obj1.field_by_name("id").unwrap(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found this API really awkward too -- here is a suggestion on making it better
# Which issue does this PR close? - part of #6736 # Rationale for this change - While reviewing @friendlymatthew 's PR #7740 I found that the code to get the Variant object was awkward I think that an accessor is similar to the existing `as_null`, `as_i32,` etc APIs. # What changes are included in this PR? 1. Add Variant::as_object and Variant::as_list # Are there any user-facing changes? New API (and docs with tests)
|
FYI: #7755 got merged, so you can now use the respective methods. |
3625f37 to
e25d359
Compare
Thank you, this PR has been rebased |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @friendlymatthew -- I went through this PR and I thought it is nicely tested and structured. I had some comments but I don't think anything is required
parquet-variant/src/builder.rs
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a very nice pattern
parquet-variant/src/builder.rs
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A minor thought would be to move the construction of the metadata / dictionary into the field_metadata_dictionary itself
Something like
let metadata = self.field_metadata_dictionary.build()I am thinking the more encapsulated building the metadata dictionary is, the easier it will be to optimize it later (for example to avoid allocating and copying so many strings)
Doesn't have to be for this PR, I just noticed it during review
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like that! Could we do this in a subsequent PR? I have another PR that will follow this that does nested object and object with list building
parquet-variant/src/builder.rs
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
parent_buffet.offset() seems to just return the length.
I think this code is doing the equivalent of self.parent_buffer.reserve(header_size) to reserver enough space (I am not sure what make_room_for_header is doing)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can't replace make_room_for_header with Vec::reserve here, since make_room_for_header actually allocates header_size bytes worth of space, whereas reserve merely ensures capacity without changing the buffer's length.
We could do this for now,
self.parent_buffer
.0
.resize(self.parent_buffer.0.len() + header_size, 0);but I think the actual issue is there's room to simplify the whole buffer-copy slicing ceremony going on here. I have some thoughts that I'd like to push up in a separate PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A new PR sounds good -- there are other APIs like extend and push for Vecs that might simplify the code (and likely perform better too)
parquet-variant/src/builder.rs
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand how this can be writing to anything other than the end of parent_buffer and thus it seems like the code to handle other positions is unecesssary
I also see you just refactored this here rather than introducing it, so no need to change it in this PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See: #7740 (comment)
parquet-variant/src/builder.rs
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a follow on it would be really nice to avoid having to copy all the values back to the parent buffer.
One problem is that you don't know beforehand how many elements are in the list (so we don't know the offset size and therefore we don't know how many bytes to leave empty for the list length). However, I think we could do something fancy like assume it will be a short list (and leave 1 byte for the length) and then if we end up with more we could go back and copy over the values
We could also add a nice api like make_large() or something on the ListBuilder
I am just speculating, there is no need to do this here in this PR or without benchmarks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I found I already wrote this up:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking about this as well -- and yes, now that we can build large lists, it'll be worthy of profiling and benchmarking 👍
parquet-variant/src/builder.rs
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💯
e25d359 to
16bae15
Compare
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks again @friendlymatthew -- let's merge this one in and keep on iterating
scovich
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Belated review (I had it pending but forgot to post it, oops)
parquet-variant/src/builder.rs
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: perhaps worth factoring out an append_primitive_header helper?
| self.0 | |
| .push(primitive_header(VariantPrimitiveType::TimestampMicros)); | |
| self.append_primitive_header(VariantPrimitiveType::TimestampMicros); |
(similar for append_from_slice)
| #[derive(Default)] | ||
| struct FieldMetadataDictionary { | ||
| field_name_to_id: BTreeMap<String, u32>, | ||
| field_names: Vec<String>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What purpose does this vec of strings serve? Seems like the map is all we need?
(it looks like all the immediately obvious uses of field_names can be replaced by similar operations over the map instead)
|
|
||
| impl FieldMetadataDictionary { | ||
| /// Add field name to dictionary, return its ID | ||
| fn add_field_name(&mut self, field_name: &str) -> u32 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really "add"? More of an "upsert" or "ensure" ?
| use std::collections::btree_map::Entry; | ||
| match self.field_name_to_id.entry(field_name.to_string()) { | ||
| Entry::Occupied(entry) => *entry.get(), | ||
| Entry::Vacant(entry) => { | ||
| let id = self.field_names.len() as u32; | ||
| entry.insert(id); | ||
| self.field_names.push(field_name.to_string()); | ||
| id | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we get rid of field_names, then I believe this simplifies to just
| use std::collections::btree_map::Entry; | |
| match self.field_name_to_id.entry(field_name.to_string()) { | |
| Entry::Occupied(entry) => *entry.get(), | |
| Entry::Vacant(entry) => { | |
| let id = self.field_names.len() as u32; | |
| entry.insert(id); | |
| self.field_names.push(field_name.to_string()); | |
| id | |
| } | |
| } | |
| } | |
| *self | |
| .field_name_to_id | |
| .entry(field_name.to_string()) | |
| .or_insert(self.field_names.len() as u32) |
|
|
||
| /// Builder for [`Variant`] values | ||
| #[derive(Default)] | ||
| struct ValueBuffer(Vec<u8>); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding impl Deref and impl DerefMut to the underlying vec might simplify a lot of code -- if it's ok to expose it that way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, I was thinking about this and I'm a bit cautious to do so. Since the inner value is just a Vec<u8>, I thought it would be nice to encapsulate this with a well defined API
@friendlymatthew -- are you interested in making the changes as a new PR? Or maybe @scovich you would like to just make the suggestions yourself directly in a new PR? |
Hi, I'm happy to make the following changes. I'm going to push the nested object + object list building first and then will push up follow up PRs if that is alright. |
# Which issue does this PR close? - Closes #7696 - Related to #7740 # Rationale for this change This PR adds a feature to `Variant::ObjectBuilder` that enables constructing nested objects and objects with lists. # Are there any user-facing changes? Adds two public methods to the `ObjectBuilder` API, `new_list` and `new_object` --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Which issue does this PR close?
VariantBuilder#7696Rationale for this change
Heavily refactors
VariantBuilder,ObjectBuilder, andListBuilderto allow nested list and object building.Now,
ObjectBuilderandListBuilderhave their ownVariantBuffer, an intermediate buffer that gets written to upon every call toinsertorappend_value. Only whenfinishis called will the builder flush the intermediate buffer to its parent buffer.VariantBuilderwas split to holdVariantBufferandFieldMetadataDictionaryto better simplify and separate the nesting logic.