-
Notifications
You must be signed in to change notification settings - Fork 1.9k
perf: Add TopK benchmarks as variation over the sort_tpch benchmarks
#15560
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
e1aacd2 to
23136cc
Compare
| csv = csv.repartition(Partitioning::RoundRobinBatch(partitions))? | ||
| } | ||
| let csv = if self.sort { | ||
| csv.sort_by(vec![col(key_column_name)])? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the sort_by means there is a single output partition, which translates to a single output Parquet file. Enforcing multiple output files per table while maintaining the ordering information needs something a bit more complex than the simple original ctx.write_parquet(...)...
| SELECT l_shipmode, l_comment, l_partkey | ||
| FROM lineitem | ||
| ORDER BY l_shipmode; | ||
| ORDER BY l_shipmode |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed trailing ; to allow appending LIMIT n
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @geoffreyclaude -- would it be possible to add some documentation about this benchmark in the readme?
https://github.com/apache/datafusion/tree/main/benchmarks#sort-tpch
23136cc to
a9d0c58
Compare
@alamb: done in commit |
|
Thanks @geoffreyclaude and @Dandandan |
apache#15560) * perf: Add TopK benchmarks as variation over the `sort_tpch` benchmarks * doc: Document TopK benchmark options: `--sorted` and `--limit`
Which issue does this PR close?
Rationale for this change
Currently, the benchmarks folder in DataFusion does not include dedicated benchmarks for TopK queries (i.e., queries formatted as
SELECT ... ORDER BY column LIMIT n).With ongoing work to optimize these queries, having dedicated benchmarks is valuable for measuring progress.
What changes are included in this PR?
Sorted TPCH Support
--sortflag has been added totpch/convert.rsto output the TPCH tables sorted by their first (key) column. Although the generator outputs CSV files already sorted by the first column, the sorted order was not stored in the converted files.--sortedflag has been added to bothsort_tpch.rsandtpch/run.rs. When enabled, it injects the file sort order into theListingOptions, allowing DataFusion optimizations to take advantage of pre-sorted input. This is necessary because DataFusion does not currently read the "sortedness" from Parquet files.TopK Benchmark Extension
sort_tpch.rs, an optional--limit nflag has been introduced. When provided, it appends aLIMIT nclause to the SQL query, effectively converting a standard sort query into a TopK query.--limit nand--sortedflags, it is now possible to test TopK queries on pre-sorted inputs.Are these changes tested?
To benchmark Top10 over sorted inputs:
Are there any user-facing changes?
No, only developer-facing benchmark changes:
--limitto append a LIMIT clause,--sortedto indicate pre-sorted input data, and--sortto generate pre-sorted input data.