Easy DataFusion / DataFusion Benchmarking

**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
Performance is a key differentiator for DataFusion. We often see third-party benchmarks comparing its performance to other systems, e.g. https://github.com/apache/arrow-datafusion/issues/5942

However, in addition to comparing against other systems, we also need to compare the performance of DataFusion over time. I want an easier way to compare DataFusion performance with and without a proposed change -- ideally a single command that runs and produces a report telling me "does this PR make DataFusion faster or slower". This most recently came up as part of https://github.com/apache/arrow-datafusion/pull/6034

DataFusion has [several benchmark runners](https://github.com/apache/arrow-datafusion/tree/main/benchmarks) but they have grown "organically": they are hard to use, require manually downloading datasets, and are not easy to run or reproduce (see discussions on https://github.com/apache/arrow-datafusion/pull/6034#issuecomment-1521511462)

Right now, making that comparison is cumbersome -- I need to know how to create the appropriate datasets, build the runners, (potentially) convert the dataset to parquet, run the benchmarks, and then build a report.

This is made more challenging by the fact that the runners need to be built in release mode, which is slow (several minutes per cycle).

**Describe the solution you'd like**
I want a documented methodology (ideally in a script) that covers three steps:

1. Setup: creates / downloads / whatever the data files needed
2. Run: `<name> <optional arguments to restrict what benchmarks are run>`, which writes timing information into log files
3. Compare: writes out a report comparing the runs
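
To make the "Compare" step concrete, here is a minimal sketch of what such a report could look like. Everything in it is an assumption for illustration, not an existing DataFusion tool: the log format (a JSON object mapping a query name to a list of elapsed times in milliseconds), the function names, and the 5% noise threshold are all hypothetical.

```python
import json
import statistics


def load_timings(path):
    """Load {query_name: [elapsed_ms, ...]} from a JSON log file.

    The file layout is a hypothetical format chosen for this sketch.
    """
    with open(path) as f:
        return json.load(f)


def compare_runs(baseline, candidate, threshold=0.05):
    """Return one (query, base_ms, cand_ms, ratio, verdict) row per shared query.

    ratio is candidate / baseline of the median elapsed time, so a ratio
    below 1.0 means the candidate run is faster. Changes smaller than
    `threshold` (an assumed 5% by default) are reported as "no change".
    """
    rows = []
    # Only queries present in both runs can be compared.
    for query in sorted(baseline.keys() & candidate.keys()):
        base_ms = statistics.median(baseline[query])
        cand_ms = statistics.median(candidate[query])
        ratio = cand_ms / base_ms
        if ratio < 1.0 - threshold:
            verdict = "faster"
        elif ratio > 1.0 + threshold:
            verdict = "slower"
        else:
            verdict = "no change"
        rows.append((query, base_ms, cand_ms, ratio, verdict))
    return rows


def format_report(rows):
    """Render the comparison rows as a plain-text table."""
    lines = [f"{'query':<10}{'base (ms)':>12}{'cand (ms)':>12}{'ratio':>8}  verdict"]
    for query, base_ms, cand_ms, ratio, verdict in rows:
        lines.append(
            f"{query:<10}{base_ms:>12.1f}{cand_ms:>12.1f}{ratio:>8.2f}  {verdict}"
        )
    return "\n".join(lines)
```

With one log file from `main` and one from the PR branch, `print(format_report(compare_runs(load_timings(a), load_timings(b))))` would answer the "faster or slower" question per query. Using the median rather than the mean is a design choice to reduce the effect of a single noisy iteration.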



We currently have the tpch benchmark (links) and I have a janky script that can compare performance with the main branch: https://github.com/alamb/datafusion-benchmarking/blob/1f0beb5d32c39b6cc576e9846cddc40e692d181f/bench.sh



**Describe alternatives you've considered**

**Additional context**
This will likely result in cleaning up the runners in https://github.com/apache/arrow-datafusion/issues/5502


Easy DataFusion / DataFusion Benchmarking #6127
