Easy DataFusion / DataFusion Benchmarking #6127

@alamb

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Performance is a key differentiator for DataFusion. We often see third-party benchmarks comparing its performance to other systems, e.g. #5942.

However, in addition to comparing against other systems, we also need to compare the performance of DataFusion over time. I want an easier way to compare DataFusion performance with and without a proposed change -- ideally a single command that runs and produces a report telling me "does this PR make DataFusion faster or slower". This most recently came up as part of #6034.

DataFusion has several benchmark runners, but they have grown "organically": they are hard to use, require manually downloading datasets, and are not easy to run or reproduce (see discussions on #6034 (comment)).

Right now, it is cumbersome to do so -- I need to know how to create the appropriate datasets, build the runners, convert the dataset to parquet (potentially), run the benchmarks, and then build a report.

This is made more challenging by the fact that the runners need to be built in release mode, which is slow (several minutes per cycle).

Describe the solution you'd like
I want a documented methodology (ideally in a script) with three steps:

  1. Setup: creates / downloads / whatever the data files needed
  2. Run: runs the benchmarks and writes timing information into log files
  3. Compare: writes out a report comparing the runs
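
To make the "Compare" step concrete, here is a rough sketch of what the comparison could look like. Everything in it is an assumption for illustration, not an existing DataFusion tool: the JSON log format (query name mapped to a list of elapsed milliseconds), the function names `load_timings`, `compare` and `format_report`, and the 5% noise threshold are all hypothetical.

```python
import json
from pathlib import Path


def load_timings(path):
    """Read {query_name: [elapsed_ms, ...]} from a JSON file.

    The file format is hypothetical; the real runners may log differently.
    """
    return json.loads(Path(path).read_text())


def compare(baseline, candidate, threshold=0.05):
    """Return one (query, base_ms, cand_ms, ratio, verdict) row per shared query.

    Uses the minimum elapsed time per query. A change within `threshold`
    (5% by default, an arbitrary choice) is reported as "no change".
    """
    rows = []
    # Only queries present in both runs can be compared.
    for query in sorted(baseline.keys() & candidate.keys()):
        base_ms = min(baseline[query])
        cand_ms = min(candidate[query])
        ratio = cand_ms / base_ms
        if ratio < 1 - threshold:
            verdict = "faster"
        elif ratio > 1 + threshold:
            verdict = "slower"
        else:
            verdict = "no change"
        rows.append((query, base_ms, cand_ms, ratio, verdict))
    return rows


def format_report(rows):
    """Render the comparison rows as a plain-text table."""
    lines = [f"{'query':<10}{'base ms':>10}{'new ms':>10}{'ratio':>8}  verdict"]
    for query, base_ms, cand_ms, ratio, verdict in rows:
        lines.append(
            f"{query:<10}{base_ms:>10.1f}{cand_ms:>10.1f}{ratio:>8.2f}  {verdict}"
        )
    return "\n".join(lines)
```

With something like this, the last step would be `print(format_report(compare(load_timings(main_log), load_timings(branch_log))))`, where `main_log` and `branch_log` are whatever files the "Run" step produced for each branch.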

We currently have the tpch benchmark (links), and I have a janky script that can compare performance with the main branch: https://github.com/alamb/datafusion-benchmarking/blob/1f0beb5d32c39b6cc576e9846cddc40e692d181f/bench.sh

Describe alternatives you've considered

Additional context
This will likely result in cleaning up the runners in #5502.
