[Variant] Speedup ObjectBuilder (62x faster)
#7808
Conversation
parquet-variant/src/builder.rs
Outdated
Unfortunately, we have to do a linear pass when inserting a new (field_id, field_offset) tuple to overwrite duplicates (same field_id, different field_offset).
It is linear because there is no guarantee that the elements are ordered by field_id. That guarantee only lives in MetadataBuilder.fields.
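In other words (a generic sketch, not the PR's actual code): with only a `Vec` of `(field_id, field_offset)` pairs in insertion order, the upsert has to scan for a duplicate:

```rust
// Generic sketch: overwriting a duplicate field_id requires a linear scan,
// since the Vec is only guaranteed to be in insertion order.
fn upsert_field(fields: &mut Vec<(u32, u32)>, field_id: u32, field_offset: u32) {
    if let Some(entry) = fields.iter_mut().find(|(id, _)| *id == field_id) {
        entry.1 = field_offset; // same field_id: overwrite the offset
    } else {
        fields.push((field_id, field_offset)); // new field: append
    }
}
```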
This is now outdated; `upsert_field` is O(1), making use of a mapping from field_id to the index position in `fields`.
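A minimal sketch of that O(1) shape, assuming a side map from field_id to the entry's index in `fields` (the struct and field names here are illustrative, not the PR's):

```rust
use std::collections::HashMap;

// Sketch only: a side map from field_id to its position in `fields`
// replaces the linear scan.
struct ObjectFields {
    // (field_id, field_offset) in insertion order
    fields: Vec<(u32, u32)>,
    // field_id -> index of its entry in `fields`
    field_id_to_index: HashMap<u32, usize>,
}

impl ObjectFields {
    fn upsert_field(&mut self, field_id: u32, field_offset: u32) {
        match self.field_id_to_index.get(&field_id) {
            // duplicate field_id: overwrite the offset in place, O(1)
            Some(&idx) => self.fields[idx].1 = field_offset,
            // new field_id: append and remember its index
            None => {
                self.field_id_to_index.insert(field_id, self.fields.len());
                self.fields.push((field_id, field_offset));
            }
        }
    }
}
```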
Thanks for this @friendlymatthew -- I think before we embark on optimizing performance we should probably have some benchmarks to define what we are optimizing. Here are some initial ones I would suggest that I think are important:

Aka, build this variant over and over again:

```json
{
  "user": "random",
  "age": 54,
  "comment": "this is a long string that will not fit in short string but is also random",
  "categories": ["red", "blue", "green"]
}
```
There are probably other cases too, but that might be a good place to start.
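A rough sketch of what such a benchmark could look like with criterion, assuming the `VariantBuilder` / `ObjectBuilder` API shown elsewhere in this thread (`new_object`, `insert`, `finish`); the import path, the integer `insert` call, and the omission of the `categories` list are assumptions for illustration, not something this PR defines:

```rust
use criterion::{criterion_group, criterion_main, Criterion};
// Import path assumed; adjust to wherever VariantBuilder is exported from.
use parquet_variant::VariantBuilder;

fn bench_build_object(c: &mut Criterion) {
    c.bench_function("build_object", |b| {
        b.iter(|| {
            // Build the same small object over and over again.
            let mut variant = VariantBuilder::new();
            let mut obj = variant.new_object();
            obj.insert("user", "random");
            obj.insert("age", 54i8); // assumes integers are accepted as insert values
            obj.insert(
                "comment",
                "this is a long string that will not fit in short string but is also random",
            );
            // "categories" list omitted: the list-building API isn't shown in this thread.
            obj.finish();
        })
    });
}

criterion_group!(benches, bench_build_object);
criterion_main!(benches);
```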
parquet-variant/src/builder.rs
Outdated
I think you could store a `&str` here, avoiding a `to_string()` (allocation) in the hot loop.
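A generic illustration of the suggestion (the types below are made up for the example, not the actual builder structs): borrowing the field name as `&str` keeps the hot loop allocation-free.

```rust
// Owns its field name: every insert pays for a String allocation.
struct OwnedField {
    name: String,
}

// Borrows the field name for the builder's lifetime: no per-insert allocation.
struct BorrowedField<'a> {
    name: &'a str,
}

fn insert_owned(fields: &mut Vec<OwnedField>, name: &str) {
    fields.push(OwnedField { name: name.to_string() }); // allocates in the hot loop
}

fn insert_borrowed<'a>(fields: &mut Vec<BorrowedField<'a>>, name: &'a str) {
    fields.push(BorrowedField { name }); // copies just a pointer + length
}
```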
parquet-variant/src/builder.rs
Outdated
You could use a `HashMap` with `ahash` as the hasher to get some more speed.
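For example, assuming the `ahash` crate is added as a dependency, the default SipHash-based hasher can be swapped for `ahash::RandomState`:

```rust
use ahash::RandomState;
use std::collections::HashMap;

fn main() {
    // HashMap keyed by field name, hashed with ahash instead of the default
    // SipHash; ahash is typically faster for small string keys.
    let mut field_name_to_id: HashMap<&str, u32, RandomState> =
        HashMap::with_hasher(RandomState::new());
    field_name_to_id.insert("user", 0);
    assert_eq!(field_name_to_id.get("user"), Some(&0));
}
```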
alamb
left a comment
Thank you @friendlymatthew -- this is looking quite great. I think the benchmarks are going to let us make this code really fast. I have some suggestions for the benchmarks.
parquet-variant/benches/builder.rs
Outdated
I need to profile this to confirm, but I suspect a non-trivial amount of the benchmark time is spent actually generating the random strings.
To ensure we are only benchmarking the VariantBuilder, I suggest pre-generating a string table and then picking from it quasi-randomly -- something like:
```rust
// make a table with 117 random strings (so it doesn't evenly divide into 25000)
let strings: Vec<String> = (0..117).map(|_| random_string(&mut rng)).collect();
...
b.iter(|| {
    let mut current_string = 0;
    ...
    object_builder.insert("name", &strings[current_string]);
    current_string = (current_string + 1) % strings.len();
    ...
```
Likewise for the other benchmarks.
parquet-variant/src/builder.rs
Outdated
You might be able to use an IndexSet here to store the field_ids while also preserving order.
I should've recognized that! That's very exciting.
To be clear, I was thinking you could use a single hash set (instead of a map + vec):
```rust
field_id_to_index: IndexSet<u32>, // (field_id, insert order corresponds to field offset)
```
Hm, I'm a bit confused. An IndexSet<u32> is keeping track of the field ids in insert order. Where would we store the respective field offsets?
Would it work to do an IndexMap<u32, u32>, where the key is the field id and the value is the field offset?
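A minimal sketch of that shape using the `indexmap` crate (illustrative only): iteration follows insertion order, while lookups and overwrites by field id stay O(1).

```rust
use indexmap::IndexMap;

fn main() {
    // key = field id, value = field offset; iteration follows insertion order.
    let mut fields: IndexMap<u32, u32> = IndexMap::new();
    fields.insert(4, 10); // field id 4 seen first
    fields.insert(0, 20); // field id 0 seen second
    fields.insert(4, 30); // same field id: value overwritten, position kept

    assert_eq!(fields.get(&4), Some(&30)); // O(1) lookup by field id
    let ids: Vec<u32> = fields.keys().copied().collect();
    assert_eq!(ids, vec![4, 0]); // insertion order preserved
}
```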
parquet-variant/src/builder.rs
Outdated
I think it makes sense to remove this test as it is basically testing the internal structure / invariants of the builders. I think it makes more sense to test the public APIs 👍
alamb
left a comment
Thank you @friendlymatthew -- I think it looks great. There are a few comment fixes; I think maybe we can fix them in a follow-on PR.
Let's get this one merged in to keep forward progress.
| "ended": u32, | ||
| "span_name": String, | ||
| "attributees": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| "attributees": { | |
| "attributes": { |
```rust
// Creates objects with a homogenous schema (same field names)
```
Suggested change:
```suggestion
// Creates objects with a partially homogenous schema (same field names)
```
BTW I also ran the benchmarks again locally and the profiling shows most of the time in the builders 👍
I think we should simply optimize this as a follow-on PR :)
```diff
- let max_id = self.fields.keys().last().copied().unwrap_or(0) as usize;
+ let max_id = self.fields.iter().map(|(i, _)| *i).max().unwrap_or(0);
```
What could cause the max id to be different from fields.len()-1?
We have no guarantee that fields is sorted in order of field_id, only insertion order. You could imagine creating an object list where the first object is defined with field names a, b, but the second object is defined in the opposite order.
Here are some cases where max_id isn't necessarily the last element in fields. The examples below would still work by taking the last element, but you could imagine defining 255+ keys and an object that has the field ids [258, 1]. Basing the integer width off the last field id would not be ideal there.
```rust
#[test]
fn nested() {
    let mut variant = VariantBuilder::new();
    let mut obj1 = variant.new_object();
    obj1.insert("c", false);
    obj1.insert("b", false);
    {
        let mut inner_obj = obj1.new_object("a");
        inner_obj.insert("a", false);
        inner_obj.insert("b", false);
        // Here, the nested object's `field` ids are laid out as [2, 1]
        dbg!(inner_obj.fields.iter().map(|(&i, _)| i).collect::<Vec<_>>());
        inner_obj.finish();
    }
    obj1.finish();
}

#[test]
fn predefined_field_names() {
    let mut variant = VariantBuilder::new().with_field_names(&["a", "b", "c", "d", "e"]);
    let mut obj1 = variant.new_object();
    obj1.insert("e", false);
    obj1.insert("a", false);
    // [4, 0]
    dbg!(obj1.fields.iter().map(|(&i, _)| i).collect::<Vec<_>>());
    obj1.finish();
}
```
If we want to avoid the linear scan, we can keep a field in ObjectBuilder that keeps track of the largest field id upon every insert.
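A hypothetical sketch of that idea (not the actual ObjectBuilder code): keep a running maximum so finish-time sizing needs no scan over the fields.

```rust
// Hypothetical sketch: track the running maximum field id on every insert so
// the field-id width can be sized at finish time without a linear scan.
struct Fields {
    entries: Vec<(u32, u32)>, // (field_id, field_offset)
    max_field_id: u32,
}

impl Fields {
    fn new() -> Self {
        Self { entries: Vec::new(), max_field_id: 0 }
    }

    fn insert(&mut self, field_id: u32, field_offset: u32) {
        self.max_field_id = self.max_field_id.max(field_id);
        self.entries.push((field_id, field_offset));
    }

    fn max_id(&self) -> u32 {
        self.max_field_id // O(1) at finish time
    }
}
```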







Rationale for this change
This PR modifies the backing data structures for `ObjectBuilder.fields` and `MetadataBuilder.field_name_to_id` to speed up the builder code by 62x.

I started to profile our variant builder code and noticed `ObjectBuilder::finish` took a very long time to complete. The profile involved building up a `VariantList` with 20,000 `VariantObject`s, each with a unique field name.

The code on main took 628ms to run, with `Object::finish` consuming all of its runtime. The profile shows that iterating over the `MetadataBuilder`'s `BTreeMap` and filtering by relative field ids was the bottleneck:

arrow-rs/parquet-variant/src/builder.rs
Lines 652 to 657 in 2754ce5

This is very bad. We can improve by making the following observations:

- field names are already tracked in `MetadataBuilder.fields`, so field name lookup time is O(1)
- `Vec` sorting is faster

By changing `ObjectBuilder.fields` to use a `Vec`, the code takes 10ms to run, with `Object::finish` taking 30% of the runtime.

To reproduce