Add Parquet roundtrip benchmarks #8956
Conversation
@alamb I borrowed liberally from your parquet footer code 😉

Example run (details collapsed)
alamb left a comment
Thanks @etseidl -- this looks great
I had a few suggestions, but nothing I think is required to merge
I also ran some basic profiling on these benchmarks to see
samply record -- cargo bench --bench parquet_round_trip -- "write int64 dict"

And it looks like it is measuring what I would expect:
I have a good feeling that the encoder/decoder is about to get a lot faster...
```rust
/// Creates a [`PrimitiveArray`] of a given `size` and `null_density`
/// filling it with random numbers generated using the provided `seed`.
pub fn create_primitive_array_with_seed<T>(
```
Do we need a separate copy of these functions? Maybe we can reuse the existing functions in bench_utils.rs?
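For context, a std-only sketch of what a seeded generator like this does: a toy LCG stands in for the `rand` crate and a plain `Vec<Option<i64>>` for the arrow array, so none of the names below are arrow's actual implementation. The point is only that the same seed reproduces the same data across benchmark runs.

```rust
// Illustrative sketch only (not arrow's bench_util code): a seeded data
// generator whose output is fully determined by (size, null_density, seed).
fn create_i64_vec_with_seed(size: usize, null_density: f32, seed: u64) -> Vec<Option<i64>> {
    let mut state = seed;
    // Tiny LCG (Knuth's MMIX constants) in place of a real RNG crate.
    let mut next = move || {
        state = state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        state
    };
    (0..size)
        .map(|_| {
            // Use 24 high bits as a uniform value in [0, 1) to decide nullness.
            let r = next();
            if ((r >> 40) as f32 / (1u32 << 24) as f32) < null_density {
                None
            } else {
                Some(next() as i64)
            }
        })
        .collect()
}

fn main() {
    // Same seed => identical data, which is what stable benchmarks need.
    let a = create_i64_vec_with_seed(1000, 0.1, 42);
    let b = create_i64_vec_with_seed(1000, 0.1, 42);
    assert_eq!(a, b);
    let nulls = a.iter().filter(|v| v.is_none()).count();
    assert!(nulls > 0 && nulls < 1000);
}
```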
Ha, I just copied from https://github.com/alamb/parquet_footer_parsing/blob/main/src/datagen.rs 😅
I'll switch over 👍
(I totally copy/pasted from arrow for that benchmark of course -- it all came full circle!)
done in d9bd421
```rust
    .collect()
}
```

```rust
pub fn file_from_spec(spec: ParquetFileSpec, buf_size: Option<usize>) -> Bytes {
```
👍
```rust
}

fn read_write(c: &mut Criterion, spec: ParquetFileSpec, msg: &str) {
    let f = file_from_spec(spec, None);
```
Rather than passing in the buffer size, maybe the test could pass in the buffer directly (reusing it across calls) to avoid all output buffer allocations 🤔
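A minimal sketch of that reuse pattern (`write_into` is a hypothetical stand-in for the benchmark's real write path): clearing a `Vec<u8>` keeps its allocation, so after warm-up no iteration pays for growing the output buffer.

```rust
// Hypothetical stand-in for the benchmark's write step.
fn write_into(buf: &mut Vec<u8>, payload: &[u8]) {
    buf.extend_from_slice(payload);
}

fn main() {
    let mut buf: Vec<u8> = Vec::with_capacity(1024 * 1024);
    for _iter in 0..3 {
        // clear() resets the length but keeps the allocation, so only
        // the first pass can trigger a reallocation.
        buf.clear();
        write_into(&mut buf, b"encoded parquet bytes would go here");
        assert!(!buf.is_empty());
    }
    // Capacity survived across iterations.
    assert!(buf.capacity() >= 1024 * 1024);
}
```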
Yes, I was thinking that too after submitting. I'll try switching it up some.
Done in f622a3d
```rust
for rg in 0..spec.num_row_groups {
    let col_writers = row_group_factory.create_column_writers(rg).unwrap();

    let encoded_columns = encode_row_group(&schema, &spec, col_writers);
```
What is the reason to use this lower-level API (rather than just `writer.write` to write the whole batch)?
Using `writer.write` would likely be less code, and I think it might more closely mirror the API people actually use (though now that I write this, I am not sure I really know what people use).
😬 -- my own fault lol
Done in d783743
```rust
c.bench_function(&format!("read {msg}"), |b| {
    b.iter(|| {
        let record_reader = ParquetRecordBatchReaderBuilder::try_new(f.clone())
```
I had to double-check that `f` is `Bytes` here (and thus this clone is cheap)
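Why that clone is cheap: `bytes::Bytes` is reference-counted, so `clone()` copies a pointer and bumps a count rather than copying the file. A std-only analogy with `Arc<[u8]>` (not the actual `bytes` internals) shows the same property:

```rust
use std::sync::Arc;

fn main() {
    // bytes::Bytes behaves like this Arc-backed buffer: clone() bumps a
    // refcount instead of copying the underlying file bytes.
    let file: Arc<[u8]> = vec![0u8; 4 * 1024 * 1024].into();
    let clone = file.clone();

    // Both handles point at the same allocation, so cloning once per
    // bench iteration is effectively free.
    assert!(std::ptr::eq(file.as_ptr(), clone.as_ptr()));
    assert_eq!(Arc::strong_count(&file), 2);
}
```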
I'll change to a more meaningful name 😅
Done in f622a3d
Thanks @alamb. I was a bit surprised that the read and write times were so different.

Hopefully. One thing to hit first is the fixed-len binary encoder. That seems to do a lot of allocations/copies that we should be able to avoid.
```rust
// arrow::util::bench_util::create_fsb_array with a seed

/// Creates a random (but fixed-seeded) array of fixed size with a given null density and length
fn create_fsb_array_with_seed(
```
Should I just move this to bench_util.rs?
sure -- maybe a follow on PR
alamb left a comment
🚀
Thanks @etseidl
Which issue does this PR close?
Rationale for this change
Baseline for future improvements.
What changes are included in this PR?
Adds new benchmarks for reading and writing. Currently uses a fixed number of row groups, pages, and rows, and cycles through data types and encodings.
Are these changes tested?
N/A
Are there any user-facing changes?
No