Skip to content

Conversation

@geoffreyclaude
Copy link
Contributor

@geoffreyclaude geoffreyclaude commented Apr 3, 2025

Which issue does this PR close?

Rationale for this change

Currently, the benchmarks folder in DataFusion does not include dedicated benchmarks for TopK queries (i.e., queries formatted as SELECT ... ORDER BY column LIMIT n).

With ongoing work to optimize these queries, having dedicated benchmarks is valuable for measuring progress.

What changes are included in this PR?

Sorted TPCH Support

  • A new --sort flag has been added to tpch/convert.rs to output the TPCH tables sorted by their first (key) column. Although the generator outputs CSV files already sorted by the first column, the sorted order was not stored in the converted files.
  • A new --sorted flag has been added to both sort_tpch.rs and tpch/run.rs. When enabled, it injects the file sort order into the ListingOptions, allowing DataFusion optimizations to take advantage of pre-sorted input. This is necessary because DataFusion does not currently read the "sortedness" from Parquet files.

TopK Benchmark Extension

  • In sort_tpch.rs, an optional --limit n flag has been introduced. When provided, it appends a LIMIT n clause to the SQL query, effectively converting a standard sort query into a TopK query.
  • By combining the --limit n and --sorted flags, it is now possible to test TopK queries on pre-sorted inputs.

Are these changes tested?

To benchmark Top10 over sorted inputs:

> cd benchmarks
> cargo run --release --bin tpch -- convert --input ./data/tpch_sf1 --output ./data/tpch_sf1_sorted --format parquet --sort
> cargo run --release --bin dfbench -- sort-tpch --sorted --limit 10 --iterations 5 --path ./data/tpch_sf1_sorted -o /tmp/top10_sorted_tpch.json

Are there any user-facing changes?

No, only developer-facing benchmark changes:

  • New command-line options have been added to the benchmarks: --limit to append a LIMIT clause, --sorted to indicate pre-sorted input data, and --sort to generate pre-sorted input data.
  • These options are opt-in and do not affect the default behavior of the benchmarks unless explicitly specified.

csv = csv.repartition(Partitioning::RoundRobinBatch(partitions))?
}
let csv = if self.sort {
csv.sort_by(vec![col(key_column_name)])?
Copy link
Contributor Author

@geoffreyclaude geoffreyclaude Apr 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the sort_by means there is a single output partition, which translates to a single output Parquet file. Enforcing multiple output files per table while maintaining the ordering information needs something a bit more complex than the simple original ctx.write_parquet(...)...

SELECT l_shipmode, l_comment, l_partkey
FROM lineitem
ORDER BY l_shipmode;
ORDER BY l_shipmode
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed trailing ; to allow appending LIMIT n

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @geoffreyclaude -- would it be possible to add some documentation about this benchmark in the readme?

https://github.com/apache/datafusion/tree/main/benchmarks#sort-tpch

@geoffreyclaude
Copy link
Contributor Author

Thanks @geoffreyclaude -- would it be possible to add some documentation about this benchmark in the readme?

https://github.com/apache/datafusion/tree/main/benchmarks#sort-tpch

@alamb: done in commit doc: Document TopK benchmark options: --sorted and --limit

@alamb alamb merged commit a7e71a7 into apache:main Apr 4, 2025
27 checks passed
@alamb
Copy link
Contributor

alamb commented Apr 4, 2025

Thanks @geoffreyclaude and @Dandandan

nirnayroy pushed a commit to nirnayroy/datafusion that referenced this pull request May 2, 2025
apache#15560)

* perf: Add TopK benchmarks as variation over the `sort_tpch` benchmarks

* doc: Document TopK benchmark options: `--sorted` and `--limit`
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Extend benchmarking to "TopK" queries

3 participants