Skip to content

Conversation

@friendlymatthew
Copy link
Contributor

@friendlymatthew friendlymatthew commented Jun 30, 2025

note: this PR is based off of #7808

Which issue does this PR close?

Rationale for this change

The Parquet variant supports creating objects with sorted dictionaries, where field names are sorted in lexicographical order. Sorting the dictionary and reassigning field IDs afterward can be computationally expensive. This PR offers an alternative: allowing users to specify the field names upfront, so that the correct field IDs can be assigned directly. (The correct field ids being correlated to the lexicographical sort order).

This PR introduces two new public methods to VariantBuilder:

  • with_field_names, a builder method that takes in a self, initializes the dictionary, and returns the modified builder
  • add_field_name, a method to add individual field names to the dictionary manually

@friendlymatthew
Copy link
Contributor Author

cc @alamb @scovich

@github-actions github-actions bot added the parquet Changes to the parquet crate label Jun 30, 2025
@friendlymatthew friendlymatthew force-pushed the friendlymatthew/sorted-dictionary branch from 34fe046 to ede73d2 Compare June 30, 2025 23:08
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there are no field names in the variant, we'll mark it as not sorted.

The spec leaves this pretty ambiguous..

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes sense to me

Copy link
Contributor

@scovich scovich Jul 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems fine. Especially since sortedness only matters if there are enough field names that binary search beats linear search (which usually starts around 10-20 entries).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the other hand... why do the extra work to check for is_empty? The is_sorted will return a correct result either way.

@friendlymatthew friendlymatthew force-pushed the friendlymatthew/sorted-dictionary branch from ede73d2 to 48d58ee Compare June 30, 2025 23:11
@alamb
Copy link
Contributor

alamb commented Jun 30, 2025

note: this PR is based off of #7808

I just merged that PR so we can probably rebase this one now

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @friendlymatthew -- this is looking good. I left some comments -- let me know

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes sense to me

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might help to give some more context here like

/// This method pre-populates the field name dictionary in the Variant Metadata 
/// with the specified field names, in order. 
///
/// You can use this to prepopulate a VariantBuilder with a sorted dictionary if 
/// you know the field names before hand somehow. Sorted dictionaries can accelerate
/// field access when reading Variants. 

I also think it would be great to add a note and a link to the documentation of VariantBuilder::build telling about this method

...
/// Note: the builder checks if the metadata dictionary is sorted and if it is
/// marks the metadata as sorted, which can accelerate reading Variant values.
/// Field names are added to the metadata dictionary in the order they are first 
/// encoutered. See [`Self::with_field_names`] to pre-populate field names in 
/// order if you know beforehand the fields that can appear in your objects
fn build(self) -> ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, thank you for your suggestion. I need to get better at my prose!

Comment on lines 1417 to 1428
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am don't think it is important to verify these internal details of the field_ids -- they key thing is to verify that the object is sorted if the keys are pre-added in sorted order

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

likewise here, I don't think checking the field ids is important.

@friendlymatthew friendlymatthew force-pushed the friendlymatthew/sorted-dictionary branch from 48d58ee to 2e1bc84 Compare July 1, 2025 00:27
@alamb alamb marked this pull request as draft July 1, 2025 11:10
@alamb alamb marked this pull request as ready for review July 1, 2025 11:10
@friendlymatthew friendlymatthew marked this pull request as draft July 1, 2025 11:50
@friendlymatthew friendlymatthew force-pushed the friendlymatthew/sorted-dictionary branch 2 times, most recently from c026a95 to dc2221b Compare July 1, 2025 20:26
@friendlymatthew friendlymatthew marked this pull request as ready for review July 1, 2025 20:36
@friendlymatthew friendlymatthew requested a review from alamb July 1, 2025 21:04
Copy link
Contributor

@scovich scovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice addition! Some thoughts about how to make the API cleaner/safer.

Comment on lines 244 to 247
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

field_names?

Also, why impl Iterator instead of the more common impl IntoIterator, out of curiosity? Can also take items that impl Into<String>, for additional flexibility.

Suggested change
fn from_field_names<'a>(field_name: impl Iterator<Item = &'a str>) -> Self {
Self {
field_names: IndexSet::from_iter(field_name.map(|f| f.to_string())),
}
fn from_field_names<'a>(field_names: impl IntoIterator<Item = impl Into<String>>) -> Self {
let field_names = field_names.into_iter().map(Into::into).collect();
Self { field_names }

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also -- would it make sense to provide an impl FromIterator for MetadataBuilder, instead of or in addition to this constructor? Then people could build a new instance in all the usual rustic ways, e.g.

let builder = MetadataBuilder::from_iter(["a", "b", "c"]);

or

self.metadata_builder = ["a", "b", "c"].into_iter().collect();

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree switching to IntoIterator would be nicer as would implemeting Extend

If we don't do it in this PR, let's file a ticket to fix follow on PRs

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aside: On the off-chance somebody ever managed to insert more than 2**32-1 entries, this would silently produce wrong results. We should consider returning usize and require the caller to do the conversion, since they have a better chance of being able to handle the issue cleanly.

On the other hand, we have two potential sources of "protection" already:

  1. Most machines will run out of memory long before they hit u32 overflow, and that will panic. So it would be hard to trigger the bug in the first place.
  2. num_field_names returns usize. So we could make the builder's finish method fallible and then check for u32 overflow only once at the end. Downside is, then we did all the work to insert those extra field names even tho we're doomed to fail later.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One option could be to change the code to panic in this unlikely case and deal with it when / if we have an actual issue

As you point out, I don't think this is related to this PR

Copy link
Contributor

@scovich scovich Jul 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an expensive check, especially for a larger dictionary. Since the builder is append-only (can't change or remove existing entries), we could incrementally maintain an is_sorted flag instead?

Although the total number of operations is the same, it should be cheaper because the bytes of a freshly inserted field name will still be in CPU cache (having just been initialized and/or hashed). So the string comparison will be a lot faster than if it takes a cache miss after not having been accessed for a long time.

Details

The builder would default to is_sorted=true, but switches to false if any newly inserted field name breaks ordering:

fn upsert_field_name(&mut self, field_name: &str) -> u32 {
    let (id, inserted) = self.field_names.insert_full(field_name.to_string());
    if inserted {
        let n = self.field_names.len();
        if n >= 2 && self.field_names[n-2] > self.field_names[n-1] {
            self.is_sorted = false;
        }
    }
    id as u32
}

And then when initializing from an iterator, just insert each entry manually in a loop

let mut new_self = Self::new();
for field_name in field_names {
    let _ = new_self.upsert_field_name(field_name);
}
new_self

Actually, if we impl Extend for MetadataBuilder, then the initialization becomes even simpler:

let mut new_self = Self::new();
new_self.extend(field_names);
new_self

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an expensive check, especially for a larger dictionary. Since the builder is append-only (can't change or remove existing entries), we could incrementally maintain an is_sorted flag instead?

I am not sure it is all that expensive
https://doc.rust-lang.org/core/iter/trait.Iterator.html#method.is_sorted

I think it simply checks all the elements pariwise (code)

I don't think it is any more/less expensive in theory to check the dictionary on write than to do the checks incrementally as the variant is written. In each case it needs to do N-1 comparisons where N is the number of dictionary entries

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it comes down to CPU caches and memory bandwidth, for larger dictionaries and/or longer strings. Technically constant factors but not necessarily ones we should ignore.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would wipe out any previously added field names, which seems like an unnecessary footgun?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, similar to above -- would it make sense to impl Extend for MetadataBuilder and then extend here instead?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replacing the metadata builder also invalidates any previously created objects which seems likely to be pretty bad

I agree we should change this to append to (not replace) existing fields

    pub fn with_field_names<'a>(mut self, field_names: impl Iterator<Item = &'a str>) -> Self {
        for field_name in field_names {
          self.metadata_builder.upsert_field_name(field_name)
        }
        self
     }

The Extend is an excellent idea as well, but it is less clear to me that Extend on a VariantBuilder should only extend fields -- I would expect Extend to take Variants and logically extend the in-progress variant 🤔

Copy link
Contributor

@scovich scovich Jul 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, Extend (and FromIterator) suggestion was only for the underlying MetadataBuilder. Agree it would be weird to have one for field names only at variant builder level!

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @scovich and @friendlymatthew

from my perspective the only thing that needs to be fixed are:

  1. with_field_names should append rather than replace

@scovich 's idea of dding Extend is an excellent idea as well, but I think it could be done as a follow on PR as well

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an expensive check, especially for a larger dictionary. Since the builder is append-only (can't change or remove existing entries), we could incrementally maintain an is_sorted flag instead?

I am not sure it is all that expensive
https://doc.rust-lang.org/core/iter/trait.Iterator.html#method.is_sorted

I think it simply checks all the elements pariwise (code)

I don't think it is any more/less expensive in theory to check the dictionary on write than to do the checks incrementally as the variant is written. In each case it needs to do N-1 comparisons where N is the number of dictionary entries

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One option could be to change the code to panic in this unlikely case and deal with it when / if we have an actual issue

As you point out, I don't think this is related to this PR

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replacing the metadata builder also invalidates any previously created objects which seems likely to be pretty bad

I agree we should change this to append to (not replace) existing fields

    pub fn with_field_names<'a>(mut self, field_names: impl Iterator<Item = &'a str>) -> Self {
        for field_name in field_names {
          self.metadata_builder.upsert_field_name(field_name)
        }
        self
     }

The Extend is an excellent idea as well, but it is less clear to me that Extend on a VariantBuilder should only extend fields -- I would expect Extend to take Variants and logically extend the in-progress variant 🤔

Comment on lines 244 to 247
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree switching to IntoIterator would be nicer as would implemeting Extend

If we don't do it in this PR, let's file a ticket to fix follow on PRs

@friendlymatthew
Copy link
Contributor Author

friendlymatthew commented Jul 3, 2025

Hi, I think @scovich has really good points, I just need some time to get this over the line!

I've been a bit busy at work, but I'll be free in a couple of hours to work on this. I also plan on reviewing #7849

@friendlymatthew friendlymatthew force-pushed the friendlymatthew/sorted-dictionary branch 2 times, most recently from a088502 to cd4f46a Compare July 3, 2025 21:15
Comment on lines 250 to 259
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed the variant_interop tests will fail if we mark empty field names as sorted, hence the additional branch.

Comment on lines +1684 to +1770
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some additional assertions that test the constructor methods

@friendlymatthew friendlymatthew force-pushed the friendlymatthew/sorted-dictionary branch 2 times, most recently from 749d259 to 0b3f3a0 Compare July 4, 2025 22:47
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @friendlymatthew -- I think this could be merged as is

It would be good to potentially avoid the duplication I pointed out, and maybe some more docs, but otherwise this looks great

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is quite neat

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is pretty neat.

I think we should add tests of some type for this impls

I personally suggest adding to the doc tests for MetadataBuilder, something like

let metadata = MetadataBuilder::from(["foo", "bar" "baz"])

or add a specific test below

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity, why AsRef<str> instead of Into<String>? The former forces a copy even when passing owned strings.

Tho IMO this is a messy part of rust... we really want something Cow-flavored that handles all of Deref + Borrow + AsRef + Into... today you have to pick just one, because adding a second causes ambiguity.

Copy link
Contributor Author

@friendlymatthew friendlymatthew Jul 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is fair. I chose AsRef<str> since the upsert_field_name took a &str. Since upsert_field_name is a public method, I chose not to update the function header and work around this.

As for using Cows, I think this is a really good idea. There are some areas where I feel like we could really benefit from an owned string. For example:

Screenshot 2025-07-05 at 8 44 18 AM

@friendlymatthew friendlymatthew marked this pull request as draft July 5, 2025 12:23
Copy link
Contributor

@scovich scovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, one small fix

Comment on lines 253 to 254
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
self.is_sorted =
n == 1 || self.is_sorted & (self.field_names[n - 2] < self.field_names[n - 1]);
self.is_sorted =
n == 1 || self.is_sorted && (self.field_names[n - 2] < self.field_names[n - 1]);

(&& has short-circuit behavior and avoids the string comparison when we already broke sorting)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, a code comment might help here, there's a lot to unpack:

  • An empty dictionary is unsorted (ambiguous in spec but required by interop tests)
  • A single-entry dictionary is trivially sorted
  • Otherwise, an already-sorted dictionary becomes unsorted if the new entry breaks order

@friendlymatthew friendlymatthew marked this pull request as ready for review July 5, 2025 13:54
@friendlymatthew friendlymatthew force-pushed the friendlymatthew/sorted-dictionary branch from 7534aa6 to e6e985d Compare July 5, 2025 13:55
@friendlymatthew friendlymatthew force-pushed the friendlymatthew/sorted-dictionary branch from e6e985d to 66a1262 Compare July 5, 2025 13:55
@alamb alamb changed the title [Variant] Create sorted dictionaries [Variant] Support creating sorted dictionaries Jul 5, 2025
if new_entry {
let n = self.num_field_names();

// Dictionary sort order tracking:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👨‍🍳 👌

/// assert!(result.is_err());
/// ```
///
/// # Example: Sorted dictionaries
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

love it

@alamb
Copy link
Contributor

alamb commented Jul 5, 2025

Thanks again @scovich and @friendlymatthew - I think this one is looking really nice

FYI @zeroshade -- we implemented the solution from Go in Rust!

@alamb alamb merged commit aef3bdd into apache:main Jul 5, 2025
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

parquet Changes to the parquet crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Variant] Improve VariantBuilder when creating field name dictionaries / sorted dictionaries

3 participants