Skip to content

Conversation

@friendlymatthew
Copy link
Contributor

@friendlymatthew friendlymatthew commented Jun 22, 2025

Which issue does this PR close?

Rationale for this change

Heavily refactors VariantBuilder, ObjectBuilder, and ListBuilder to allow nested list and object building.

Now, ObjectBuilder and ListBuilder have their own VariantBuffer, an intermediate buffer that gets written to upon every call to insert or append_value. Only when finish is called will the builder flush the intermediate buffer to its parent buffer.

VariantBuilder was split to hold VariantBuffer and FieldMetadataDictionary to better simplify and separate the nesting logic.

@github-actions github-actions bot added the parquet Changes to the parquet crate label Jun 22, 2025
@friendlymatthew friendlymatthew force-pushed the friendlymatthew/new-builder branch 5 times, most recently from 3b539d0 to 0168c8c Compare June 23, 2025 12:00
@alamb
Copy link
Contributor

alamb commented Jun 23, 2025

@PinkCrow007 @superserious-dev and @mkarbo it would also be great to hear your thoughts as well if you have time to review this PR

@friendlymatthew
Copy link
Contributor Author

Hi @scovich, I'd be curious to get your thoughts on this PR as well!

@alamb
Copy link
Contributor

alamb commented Jun 23, 2025

Yes indeed, @scovich your review would be very helpful.

I plan to review this PR as well tomorrow

/// };
///
/// assert_eq!(
/// obj1.field_by_name("id").unwrap(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found this API really awkward too -- here is a suggestion on making it better

crepererum pushed a commit that referenced this pull request Jun 24, 2025
# Which issue does this PR close?

- part of #6736

# Rationale for this change

- While reviewing @friendlymatthew 's PR
#7740 I found that the code to
get the Variant object was awkward


I think that an accessor is similar to the existing `as_null`, `as_i32,`
etc APIs.

# What changes are included in this PR?

1. Add Variant::as_object and Variant::as_list

# Are there any user-facing changes?
New API (and docs with tests)
@crepererum
Copy link
Contributor

FYI: #7755 got merged, so you can now use the respective methods.

@friendlymatthew friendlymatthew force-pushed the friendlymatthew/new-builder branch 2 times, most recently from 3625f37 to e25d359 Compare June 24, 2025 11:57
@friendlymatthew
Copy link
Contributor Author

FYI: #7755 got merged, so you can now use the respective methods.

Thank you, this PR has been rebased

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @friendlymatthew -- I went through this PR and I thought it is nicely tested and structured. I had some comments but I don't think anything is required

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a very nice pattern

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A minor thought would be to move the construction of the metadata / dictionary into the field_metadata_dictionary itself

Something like

let metadata = self.field_metadata_dictionary.build()

I am thinking the more encapsulated building the metadata dictionary is, the easier it will be to optimize it later (for example to avoid allocating and copying so many strings)

Doesn't have to be for this PR, I just noticed it during review

Copy link
Contributor Author

@friendlymatthew friendlymatthew Jun 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that! Could we do this in a subsequent PR? I have another PR that will follow this that does nested object and object with list building

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

parent_buffet.offset() seems to just return the length.

I think this code is doing the equivalent of self.parent_buffer.reserve(header_size) to reserver enough space (I am not sure what make_room_for_header is doing)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't replace make_room_for_header with Vec::reserve here, since make_room_for_header actually allocates header_size bytes worth of space, whereas reserve merely ensures capacity without changing the buffer's length.

We could do this for now,

self.parent_buffer
    .0
    .resize(self.parent_buffer.0.len() + header_size, 0);

but I think the actual issue is there's room to simplify the whole buffer-copy slicing ceremony going on here. I have some thoughts that I'd like to push up in a separate PR

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A new PR sounds good -- there are other APIs like extend and push for Vecs that might simplify the code (and likely perform better too)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand how this can be writing to anything other than the end of parent_buffer and thus it seems like the code to handle other positions is unecesssary

I also see you just refactored this here rather than introducing it, so no need to change it in this PR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a follow on it would be really nice to avoid having to copy all the values back to the parent buffer.

One problem is that you don't know beforehand how many elements are in the list (so we don't know the offset size and therefore we don't know how many bytes to leave empty for the list length). However, I think we could do something fancy like assume it will be a short list (and leave 1 byte for the length) and then if we end up with more we could go back and copy over the values

We could also add a nice api like make_large() or something on the ListBuilder

I am just speculating, there is no need to do this here in this PR or without benchmarks

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about this as well -- and yes, now that we can build large lists, it'll be worthy of profiling and benchmarking 👍

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯

@friendlymatthew friendlymatthew force-pushed the friendlymatthew/new-builder branch from e25d359 to 16bae15 Compare June 24, 2025 18:23
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again @friendlymatthew -- let's merge this one in and keep on iterating

@alamb alamb merged commit 8d8541c into apache:main Jun 24, 2025
12 checks passed
Copy link
Contributor

@scovich scovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Belated review (I had it pending but forgot to post it, oops)

Comment on lines +131 to +130
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: perhaps worth factoring out an append_primitive_header helper?

Suggested change
self.0
.push(primitive_header(VariantPrimitiveType::TimestampMicros));
self.append_primitive_header(VariantPrimitiveType::TimestampMicros);

(similar for append_from_slice)

#[derive(Default)]
struct FieldMetadataDictionary {
field_name_to_id: BTreeMap<String, u32>,
field_names: Vec<String>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What purpose does this vec of strings serve? Seems like the map is all we need?
(it looks like all the immediately obvious uses of field_names can be replaced by similar operations over the map instead)


impl FieldMetadataDictionary {
/// Add field name to dictionary, return its ID
fn add_field_name(&mut self, field_name: &str) -> u32 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really "add"? More of an "upsert" or "ensure" ?

Comment on lines +224 to +234
use std::collections::btree_map::Entry;
match self.field_name_to_id.entry(field_name.to_string()) {
Entry::Occupied(entry) => *entry.get(),
Entry::Vacant(entry) => {
let id = self.field_names.len() as u32;
entry.insert(id);
self.field_names.push(field_name.to_string());
id
}
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we get rid of field_names, then I believe this simplifies to just

Suggested change
use std::collections::btree_map::Entry;
match self.field_name_to_id.entry(field_name.to_string()) {
Entry::Occupied(entry) => *entry.get(),
Entry::Vacant(entry) => {
let id = self.field_names.len() as u32;
entry.insert(id);
self.field_names.push(field_name.to_string());
id
}
}
}
*self
.field_name_to_id
.entry(field_name.to_string())
.or_insert(self.field_names.len() as u32)


/// Builder for [`Variant`] values
#[derive(Default)]
struct ValueBuffer(Vec<u8>);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding impl Deref and impl DerefMut to the underlying vec might simplify a lot of code -- if it's ok to expose it that way?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, I was thinking about this and I'm a bit cautious to do so. Since the inner value is just a Vec<u8>, I thought it would be nice to encapsulate this with a well defined API

@alamb
Copy link
Contributor

alamb commented Jun 25, 2025

Belated review (I had it pending but forgot to post it, oops)

@friendlymatthew -- are you interested in making the changes as a new PR?

Or maybe @scovich you would like to just make the suggestions yourself directly in a new PR?

@friendlymatthew
Copy link
Contributor Author

Belated review (I had it pending but forgot to post it, oops)

@friendlymatthew -- are you interested in making the changes as a new PR?

Or maybe @scovich you would like to just make the suggestions yourself directly in a new PR?

Hi, I'm happy to make the following changes. I'm going to push the nested object + object list building first and then will push up follow up PRs if that is alright.

alamb added a commit that referenced this pull request Jun 25, 2025
# Which issue does this PR close?

- Closes #7696
- Related to #7740

# Rationale for this change

This PR adds a feature to `Variant::ObjectBuilder` that enables
constructing nested objects and objects with lists.

# Are there any user-facing changes?

Adds two public methods to the `ObjectBuilder` API, `new_list` and
`new_object`

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

parquet Changes to the parquet crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Variant] Support Nested Data in VariantBuilder

4 participants