Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Performance is a key differentiator for DataFusion. We often see third-party benchmarks comparing its performance to other systems, e.g. #5942.
However, in addition to comparing to different systems, we also need to compare the performance of DataFusion over time. I want an easier way to compare DataFusion performance with a proposed change -- ideally a single command to run and get a report that tells me "does this PR make DataFusion faster or slower". This most recently came up as part of #6034
DataFusion has several benchmark runners, but they have grown "organically": they are hard to use, require manually downloading datasets, and are not easy to run or reproduce (see discussions on #6034 (comment)).
Right now, it is cumbersome to do so -- I need to know how to create the appropriate datasets, build the runners, convert the dataset to parquet (potentially), run the benchmarks, and then build a report.
This is made more challenging by the fact that the runners need to be built in release mode, which is slow (several minutes per cycle).
Describe the solution you'd like
I want a documented methodology (ideally in a script) with three steps:
- Setup: creates or downloads the data files needed
- Run: runs the benchmarks and writes timing information into log files
- Compare: writes out a report comparing the runs
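To make the "Compare" step concrete, here is a minimal sketch of what such a report could compute. It is not an existing DataFusion tool: the function names `compare_runs` and `format_report`, the input shape (query name mapped to a list of elapsed times in milliseconds), and the 5% noise threshold are all assumptions chosen for illustration.

```python
from statistics import median


def compare_runs(baseline, candidate, threshold=0.05):
    """Compare per-query timings from two benchmark runs.

    baseline / candidate: dict mapping a query name to a list of elapsed
    times in milliseconds (hypothetical format; assumed to be positive).
    threshold: relative change below which a query counts as "no change".
    Returns a list of (query, baseline_median, candidate_median, verdict).
    """
    rows = []
    # Only compare queries that appear in both runs.
    for query in sorted(baseline.keys() & candidate.keys()):
        base = median(baseline[query])
        cand = median(candidate[query])
        change = (cand - base) / base
        if change > threshold:
            verdict = "slower"
        elif change < -threshold:
            verdict = "faster"
        else:
            verdict = "no change"
        rows.append((query, base, cand, verdict))
    return rows


def format_report(rows):
    """Render the comparison rows as a plain-text table."""
    lines = [f"{'query':<10}{'base (ms)':>12}{'new (ms)':>12}  verdict"]
    for query, base, cand, verdict in rows:
        lines.append(f"{query:<10}{base:>12.1f}{cand:>12.1f}  {verdict}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up timings, purely to show the shape of the input and output.
    main_run = {"q1": [100.0, 102.0, 98.0], "q2": [50.0, 51.0, 49.0]}
    pr_run = {"q1": [80.0, 82.0, 81.0], "q2": [50.5, 50.0, 51.0]}
    print(format_report(compare_runs(main_run, pr_run)))
```

Using the median over several iterations keeps a single noisy run from flipping the verdict; the threshold is a guess and would need tuning against real run-to-run variance.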
We currently have the tpch benchmark (links), and I have a janky script that can compare performance with the main branch: https://github.com/alamb/datafusion-benchmarking/blob/1f0beb5d32c39b6cc576e9846cddc40e692d181f/bench.sh
Describe alternatives you've considered
Additional context
This will likely result in cleaning up the runners in #5502