[Variant] Speedup `ObjectBuilder` (62x faster) #7808

friendlymatthew · 2025-06-27T17:57:40Z

Part of this PR closes [Variant] Avoid second copy of field name in MetadataBuilder #7814

Rationale for this change

This PR modifies the backing data structure for ObjectBuilder.fields and MetadataBuilder.field_name_to_id to speedup the builder code by 62x.

I started to profile our variant builder code and noticed ObjectBuilder::finish took a very long time to complete. The profile involved building up a VariantList with 20,000 VariantObjects, each with a unique field name.

The code on main took 628ms to run, with Object::finish consuming all of its runtime:

The profile shows iterating over the MetadataBuilder's BTreeMap and filtering by relative field ids was the bottleneck:

arrow-rs/parquet-variant/src/builder.rs

Lines 652 to 657 in 2754ce5

    
           let field_ids_by_sorted_field_name = self 
        
               .metadata_builder 
        
               .field_name_to_id 
        
               .iter() 
        
               .filter_map(|(_, id)| self.fields.contains_key(id).then_some(*id)) 
        
               .collect::<Vec<_>>();

This is very bad. We can improve by making the following observations.

Field ids also serve as indices into MetadataBuilder.fields, so field name lookup time is O(1)
Vec sorting is faster

By changing ObjectBuilder.fields to use a Vec, the code takes 10ms to run, with Object::finish taking 30% of the runtime.

To reproduce

cargo b --profile profiling
samply record ./target/profiling/object_list

// object_list.rs
use parquet_variant::VariantBuilder;

fn main() {
    let mut builder = VariantBuilder::new();

    let mut list_builder = builder.new_list();

    for i in 0..20_000 {
        let mut obj = list_builder.new_object();
        obj.insert(format!("{}", 20_000 - i).as_str(), i);
        obj.finish();
    }

    list_builder.finish();

    std::hint::black_box(builder.finish());
}

friendlymatthew · 2025-06-27T18:01:29Z

cc/ @scovich @alamb

friendlymatthew · 2025-06-28T02:51:08Z

parquet-variant/src/builder.rs

Unfortunately, we have to do a linear pass when inserting a new (field_id, field_offset) tuple to overwrite duplicates (same field_id, different field_offset).

It is linear because there is no guarantee that the elements are ordered by field_id. That guarantee only lives in MetadataBuilder.fields.

This is now outdated, upsert_field is O(1) by making use of a mapping from field_id to the index position in fields.

alamb · 2025-06-28T10:30:40Z

Thanks for this @friendlymatthew -- I think before we embark on optimizing performance we should probably have some benchmarks to define what we are optimizing

Here are some initial ones I would suggest I think are important:

Create a lot of Variant objects with the same field names but different values

Aka build this variant over and over again :

{ 
  "user": "random",
  "age": 54,
  "comment": "this is a long string that will not fit in short string but is also random",
  "categories": ["red", "blue", "green"],
}

Create a VariantList with 1 to 100 copies of the above objects

There are probably other cases too but that might be a good place to start

Dandandan · 2025-06-28T17:10:11Z

parquet-variant/src/builder.rs

I think you could store a &str here avoiding a to_string() (allocation) in the hot loop.

Dandandan · 2025-06-28T17:11:11Z

parquet-variant/src/builder.rs

You could use a hashmap with ahash as hasher, getting some more speed.

friendlymatthew · 2025-06-29T02:29:11Z

Hi, I'm thinking about two ends of the spectrum: objects with a known schema (schema-on-write) and objects with a unknown schema (schema-on-read). Most variant use cases seem to fall somewhere in the middle, with data that's only partially homogenous.

My benchmarks cover all three scenarios. For each one, I create benchmarks that build both individual Variant objects and a list of Variant objects. I also included bench_object_field_names_reverse_order. This case builds an object with field names inserted in the reverse lexicographical order. I think this is interesting because we can test the sorting behavior in ObjectBuilder::finish.

I ran benchmarks locally (M4 silicon) and I observe a nice improvement in performance:

bench_object_list_same_schemas seems to have a small regression, but it's not very much above noise and the much larger improvements to the other flows make me not too concerned about it.

alamb

Thank you @friendlymatthew -- this is looking quite great. The benchmarks I think are going to let us make this code really fast. I have some suggestions for the benchmarks

alamb · 2025-06-29T09:14:34Z

parquet-variant/benches/builder.rs

I need to profile this to confirm, but I suspect a non trivial amount of the benchmark time isspent actually generating the random strings

To ensure we are only benchmarking the VariantBuilder, I suggest pre-generating a string table and then pick from them quasi-randomly -- something like

// make a table with 117 random strings (so it doesn't evenly divide into 25000) let strings = (0..117).map(|_| random_string(&mut rng).collect::<Vec<_>>(); ... b.iter(|| { let mut current_string = 0; ... object_builder.insert("name", &strings[current_string]); current_string = (current_string + 1) % strings.len(); ..

Likewise for the other bencmarks

Here is a profile. There is evidence of a non trivial amount of time spent in random creation (and I think a bunch of the allocation / free shown is related to creating Strings as well)

parquet-variant/benches/builder.rs

parquet-variant/src/builder.rs

alamb · 2025-06-29T09:26:27Z

parquet-variant/src/builder.rs

You might be able to use an IndexSet here to store the field_ids as well as preserving order

I should've recognized that! That's very exciting.

To be clear I was thinking you could use a single hash set (instead of a map + vec)

field_id_to_index: IndexSet<u32>, // (field_id, insert order corresponds to field offset)

Hm, I'm a bit confused. An IndexSet<u32> is keeping track of the field ids in insert order. Where would we store the respective field offsets?

Would it work to do an IndexMap<u32, u32>, where the key is the field id and the value is the field offset?

Thank you @scovich, I just pushed up a change and it proves to be a nice improvement.

alamb · 2025-06-29T09:27:29Z

parquet-variant/src/builder.rs

I think it makes sense to remove this test as it is basically testing the internal structure / invariants of the builders. I think it makes more sense to test the public APIs 👍

friendlymatthew · 2025-06-30T01:31:40Z

Hi, I was thinking about reducing randomness in our benchmarks. I refactored the benches to use pseudo-random number generators. Plus, we precompute the string table before our runs.

Here are the updated benchmark results. We still observe a general improvement:

Variant building with the same object schemas observe a regression, but once again, the much larger improvements to the other flow makes me not too concerned about it.

alamb

Thank you @friendlymatthew -- I think it looks great. There are a few comment fixes I think maybe we can fix them in a follow on pR

Let's get this one merged in to keep forward progress

alamb · 2025-06-30T23:28:21Z

parquet-variant/benches/variant_builder.rs

+        "ended": u32,
+        "span_name": String,
+
+        "attributees": {


Suggested change

"attributees": {

"attributes": {

alamb · 2025-06-30T23:29:11Z

parquet-variant/benches/variant_builder.rs

+    });
+}
+
+// Creates objects with a homogenous schema (same field names)


Suggested change

// Creates objects with a homogenous schema (same field names)

// Creates objects with a partially homogenous schema (same field names)

alamb · 2025-06-30T23:35:00Z

BTW I also ran the benchmarks again locally and the profiling shows most of the time in the builders 👍

alamb · 2025-06-30T23:35:50Z

Variant building with the same object schemas observe a regression, but once again, the much larger improvements to the other flow makes me not too concerned about it.

I think we should simply optimize this as a follow on PR :)

scovich · 2025-06-30T23:50:01Z

parquet-variant/src/builder.rs

+        });

-        let max_id = self.fields.keys().last().copied().unwrap_or(0) as usize;
+        let max_id = self.fields.iter().map(|(i, _)| *i).max().unwrap_or(0);


What could cause the max id to be different from fields.len()-1?

We have no guarantee that fields is sorted in order of field_id, only insertion order. You could imagine creating an object list where the first object is defined with field names a, b, but the second object is defined in the opposite order.

Here are some cases where max_id isn't necessarily the last element in fields. The examples below would still work by taking the last element, but you could imagine defining 255+ keys and an object has the field ids: [258, 1]. Basing the integer width off the last field id would not be ideal here.

#[test] fn nested() { let mut variant = VariantBuilder::new(); let mut obj1 = variant.new_object(); obj1.insert("c", false); obj1.insert("b", false); { let mut inner_obj = obj1.new_object("a"); inner_obj.insert("a", false); inner_obj.insert("b", false); // Here, the nested object's `field` ids are laid out as [2, 1] dbg!(inner_obj.fields.iter().map(|(&i, _)| i).collect::<Vec<_>>()); inner_obj.finish(); } obj1.finish(); } #[test] fn predefined_field_names() { let mut variant = VariantBuilder::new().with_field_names(&["a", "b", "c", "d", "e"]); let mut obj1 = variant.new_object(); obj1.insert("e", false); obj1.insert("a", false); // [4, 0] dbg!(obj1.fields.iter().map(|(&i, _)| i).collect::<Vec<_>>()); obj1.finish(); }

If we want to avoid the linear scan, we can keep a field in ObjectBuilder that keeps track of the largest field id upon every insert.

# Which issue does this PR close? - Closes #7784 # Rationale for this change This PR uncomments test cases that would panic or cause undesired behavior. It also follows up from comments in #7808

github-actions bot added the parquet Changes to the parquet crate label Jun 27, 2025

friendlymatthew commented Jun 28, 2025

View reviewed changes

friendlymatthew force-pushed the friendlymatthew/make-object-builder-fast branch 4 times, most recently from 1f21100 to 4384c62 Compare June 28, 2025 13:40

friendlymatthew marked this pull request as draft June 28, 2025 15:14

friendlymatthew force-pushed the friendlymatthew/make-object-builder-fast branch 2 times, most recently from c796a1c to f570f59 Compare June 28, 2025 16:08

friendlymatthew marked this pull request as ready for review June 28, 2025 16:15

friendlymatthew force-pushed the friendlymatthew/make-object-builder-fast branch from f570f59 to cc9f139 Compare June 28, 2025 17:07

Dandandan reviewed Jun 28, 2025

View reviewed changes

alamb reviewed Jun 29, 2025

View reviewed changes

friendlymatthew added 4 commits June 29, 2025 20:22

Make ObjectBuilder faster

14ea2dc

Add benches

ef8d91d

Use index set

cbb17a5

Add more benchmarks

d4b5bd8

friendlymatthew force-pushed the friendlymatthew/make-object-builder-fast branch 2 times, most recently from 7d6a5d1 to 61c8400 Compare June 30, 2025 00:24

Comments

b3324da

friendlymatthew force-pushed the friendlymatthew/make-object-builder-fast branch from 8d509e2 to b3324da Compare June 30, 2025 00:31

Use seedable range

8658805

Use index map for ObjectBuilder fields

cd10945

friendlymatthew requested a review from alamb June 30, 2025 22:02

friendlymatthew mentioned this pull request Jun 30, 2025

[Variant] Support creating sorted dictionaries #7833

Merged

alamb approved these changes Jun 30, 2025

View reviewed changes

alamb merged commit 43f58b2 into apache:main Jun 30, 2025
13 checks passed

scovich reviewed Jun 30, 2025

View reviewed changes

friendlymatthew mentioned this pull request Jul 2, 2025

[Variant] Follow up nits and uncomment test cases #7846

Merged

	let field_ids_by_sorted_field_name = self
	.metadata_builder
	.field_name_to_id
	.iter()
	.filter_map(\|(_, id)\| self.fields.contains_key(id).then_some(*id))
	.collect::<Vec<_>>();

	// Creates objects with a homogenous schema (same field names)
	// Creates objects with a partially homogenous schema (same field names)

[Variant] Speedup ObjectBuilder (62x faster) #7808

[Variant] Speedup ObjectBuilder (62x faster) #7808

Uh oh!

Conversation

friendlymatthew commented Jun 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Rationale for this change

Uh oh!

friendlymatthew commented Jun 27, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

alamb commented Jun 28, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

friendlymatthew commented Jun 29, 2025

Uh oh!

alamb left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

scovich Jun 30, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

friendlymatthew commented Jun 30, 2025

Uh oh!

alamb left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

alamb commented Jun 30, 2025

Uh oh!

Uh oh!

alamb commented Jun 30, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

friendlymatthew Jul 1, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

[Variant] Speedup `ObjectBuilder` (62x faster) #7808

[Variant] Speedup `ObjectBuilder` (62x faster) #7808

friendlymatthew commented Jun 27, 2025 •

edited

Loading

scovich Jun 30, 2025 •

edited

Loading

friendlymatthew Jul 1, 2025 •

edited

Loading