Skip to content

Conversation

@friendlymatthew
Copy link
Contributor

@friendlymatthew friendlymatthew commented Jun 27, 2025

Rationale for this change

This PR modifies the backing data structure for ObjectBuilder.fields and MetadataBuilder.field_name_to_id to speedup the builder code by 62x.

I started to profile our variant builder code and noticed ObjectBuilder::finish took a very long time to complete. The profile involved building up a VariantList with 20,000 VariantObjects, each with a unique field name.

The code on main took 628ms to run, with Object::finish consuming all of its runtime:

Screenshot 2025-06-27 at 1 45 22 PM

The profile shows iterating over the MetadataBuilder's BTreeMap and filtering by relative field ids was the bottleneck:

let field_ids_by_sorted_field_name = self
.metadata_builder
.field_name_to_id
.iter()
.filter_map(|(_, id)| self.fields.contains_key(id).then_some(*id))
.collect::<Vec<_>>();


This is very bad. We can improve by making the following observations.

  1. Field ids also serve as indices into MetadataBuilder.fields, so field name lookup time is O(1)
  2. Vec sorting is faster

By changing ObjectBuilder.fields to use a Vec, the code takes 10ms to run, with Object::finish taking 30% of the runtime.

Screenshot 2025-06-27 at 1 53 59 PM
To reproduce
cargo b --profile profiling
samply record ./target/profiling/object_list
// object_list.rs
use parquet_variant::VariantBuilder;

fn main() {
    let mut builder = VariantBuilder::new();

    let mut list_builder = builder.new_list();

    for i in 0..20_000 {
        let mut obj = list_builder.new_object();
        obj.insert(format!("{}", 20_000 - i).as_str(), i);
        obj.finish();
    }

    list_builder.finish();

    std::hint::black_box(builder.finish());
}

@github-actions github-actions bot added the parquet Changes to the parquet crate label Jun 27, 2025
@friendlymatthew
Copy link
Contributor Author

cc/ @scovich @alamb

Comment on lines 597 to 594
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, we have to do a linear pass when inserting a new (field_id, field_offset) tuple to overwrite duplicates (same field_id, different field_offset).

It is linear because there is no guarantee that the elements are ordered by field_id. That guarantee only lives in MetadataBuilder.fields.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is now outdated, upsert_field is O(1) by making use of a mapping from field_id to the index position in fields.

@alamb
Copy link
Contributor

alamb commented Jun 28, 2025

Thanks for this @friendlymatthew -- I think before we embark on optimizing performance we should probably have some benchmarks to define what we are optimizing

Here are some initial ones I would suggest I think are important:

  1. Create a lot of Variant objects with the same field names but different values

Aka build this variant over and over again :

{ 
  "user": "random",
  "age": 54,
  "comment": "this is a long string that will not fit in short string but is also random",
  "categories": ["red", "blue", "green"],
}
  1. Create a VariantList with 1 to 100 copies of the above objects

There are probably other cases too but that might be a good place to start

@friendlymatthew friendlymatthew force-pushed the friendlymatthew/make-object-builder-fast branch 4 times, most recently from 1f21100 to 4384c62 Compare June 28, 2025 13:40
@friendlymatthew friendlymatthew marked this pull request as draft June 28, 2025 15:14
@friendlymatthew friendlymatthew force-pushed the friendlymatthew/make-object-builder-fast branch 2 times, most recently from c796a1c to f570f59 Compare June 28, 2025 16:08
@friendlymatthew friendlymatthew marked this pull request as ready for review June 28, 2025 16:15
@friendlymatthew friendlymatthew force-pushed the friendlymatthew/make-object-builder-fast branch from f570f59 to cc9f139 Compare June 28, 2025 17:07
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you could store a &str here avoiding a to_string() (allocation) in the hot loop.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could use a hashmap with ahash as hasher, getting some more speed.

@friendlymatthew
Copy link
Contributor Author

Hi, I'm thinking about two ends of the spectrum: objects with a known schema (schema-on-write) and objects with a unknown schema (schema-on-read). Most variant use cases seem to fall somewhere in the middle, with data that's only partially homogenous.

My benchmarks cover all three scenarios. For each one, I create benchmarks that build both individual Variant objects and a list of Variant objects. I also included bench_object_field_names_reverse_order. This case builds an object with field names inserted in the reverse lexicographical order. I think this is interesting because we can test the sorting behavior in ObjectBuilder::finish.

I ran benchmarks locally (M4 silicon) and I observe a nice improvement in performance:

Screenshot 2025-06-28 at 10 19 51 PM Screenshot 2025-06-28 at 10 19 56 PM

bench_object_list_same_schemas seems to have a small regression, but it's not very much above noise and the much larger improvements to the other flows make me not too concerned about it.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @friendlymatthew -- this is looking quite great. The benchmarks I think are going to let us make this code really fast. I have some suggestions for the benchmarks

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to profile this to confirm, but I suspect a non trivial amount of the benchmark time isspent actually generating the random strings

To ensure we are only benchmarking the VariantBuilder, I suggest pre-generating a string table and then pick from them quasi-randomly -- something like

// make a table with 117 random strings (so it doesn't evenly divide into 25000)
let strings = (0..117).map(|_| random_string(&mut rng).collect::<Vec<_>>();
...
        b.iter(|| {
          let mut current_string = 0;
...
           object_builder.insert("name", &strings[current_string]);
           current_string = (current_string + 1) % strings.len();
..

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Likewise for the other bencmarks

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is a profile. There is evidence of a non trivial amount of time spent in random creation (and I think a bunch of the allocation / free shown is related to creating Strings as well)

Screenshot 2025-06-29 at 5 20 55 AM

Comment on lines 567 to 568
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might be able to use an IndexSet here to store the field_ids as well as preserving order

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should've recognized that! That's very exciting.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear I was thinking you could use a single hash set (instead of a map + vec)

    field_id_to_index: IndexSet<u32>, // (field_id, insert order corresponds to field offset)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, I'm a bit confused. An IndexSet<u32> is keeping track of the field ids in insert order. Where would we store the respective field offsets?

Copy link
Contributor

@scovich scovich Jun 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it work to do an IndexMap<u32, u32>, where the key is the field id and the value is the field offset?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @scovich, I just pushed up a change and it proves to be a nice improvement.

Screenshot 2025-06-30 at 2 42 06 PM Screenshot 2025-06-30 at 2 42 15 PM

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it makes sense to remove this test as it is basically testing the internal structure / invariants of the builders. I think it makes more sense to test the public APIs 👍

@friendlymatthew friendlymatthew force-pushed the friendlymatthew/make-object-builder-fast branch 2 times, most recently from 7d6a5d1 to 61c8400 Compare June 30, 2025 00:24
@friendlymatthew friendlymatthew force-pushed the friendlymatthew/make-object-builder-fast branch from 8d509e2 to b3324da Compare June 30, 2025 00:31
@friendlymatthew
Copy link
Contributor Author

Hi, I was thinking about reducing randomness in our benchmarks. I refactored the benches to use pseudo-random number generators. Plus, we precompute the string table before our runs.

Here are the updated benchmark results. We still observe a general improvement:

Screenshot 2025-06-29 at 9 27 23 PM Screenshot 2025-06-29 at 9 27 28 PM

Variant building with the same object schemas observe a regression, but once again, the much larger improvements to the other flow makes me not too concerned about it.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @friendlymatthew -- I think it looks great. There are a few comment fixes I think maybe we can fix them in a follow on pR

Let's get this one merged in to keep forward progress

"ended": u32,
"span_name": String,
"attributees": {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"attributees": {
"attributes": {

});
}

// Creates objects with a homogenous schema (same field names)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Creates objects with a homogenous schema (same field names)
// Creates objects with a partially homogenous schema (same field names)

@alamb
Copy link
Contributor

alamb commented Jun 30, 2025

BTW I also ran the benchmarks again locally and the profiling shows most of the time in the builders 👍

@alamb alamb merged commit 43f58b2 into apache:main Jun 30, 2025
13 checks passed
@alamb
Copy link
Contributor

alamb commented Jun 30, 2025

Variant building with the same object schemas observe a regression, but once again, the much larger improvements to the other flow makes me not too concerned about it.

I think we should simply optimize this as a follow on PR :)

});

let max_id = self.fields.keys().last().copied().unwrap_or(0) as usize;
let max_id = self.fields.iter().map(|(i, _)| *i).max().unwrap_or(0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What could cause the max id to be different from fields.len()-1?

Copy link
Contributor Author

@friendlymatthew friendlymatthew Jul 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have no guarantee that fields is sorted in order of field_id, only insertion order. You could imagine creating an object list where the first object is defined with field names a, b, but the second object is defined in the opposite order.

Here are some cases where max_id isn't necessarily the last element in fields. The examples below would still work by taking the last element, but you could imagine defining 255+ keys and an object has the field ids: [258, 1]. Basing the integer width off the last field id would not be ideal here.

  #[test]
  fn nested() {
      let mut variant = VariantBuilder::new();

      let mut obj1 = variant.new_object();

      obj1.insert("c", false);
      obj1.insert("b", false);

      {
          let mut inner_obj = obj1.new_object("a");
          inner_obj.insert("a", false);
          inner_obj.insert("b", false);

          // Here, the nested object's `field` ids are laid out as [2, 1]
          dbg!(inner_obj.fields.iter().map(|(&i, _)| i).collect::<Vec<_>>());
          inner_obj.finish();
      }

      obj1.finish();
  }

  #[test]
  fn predefined_field_names() {
      let mut variant = VariantBuilder::new().with_field_names(&["a", "b", "c", "d", "e"]);

      let mut obj1 = variant.new_object();

      obj1.insert("e", false);
      obj1.insert("a", false);

      // [4, 0]
      dbg!(obj1.fields.iter().map(|(&i, _)| i).collect::<Vec<_>>());

      obj1.finish();
  }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to avoid the linear scan, we can keep a field in ObjectBuilder that keeps track of the largest field id upon every insert.

alamb pushed a commit that referenced this pull request Jul 3, 2025
# Which issue does this PR close?

- Closes #7784

# Rationale for this change

This PR uncomments test cases that would panic or cause undesired
behavior. It also follows up from comments in #7808
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

parquet Changes to the parquet crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Variant] Avoid second copy of field name in MetadataBuilder

4 participants