Top-N: Improve performance with large heaps, and correctly call Reduce #14900
Merged
Mytherin merged 11 commits into duckdb:main on Nov 19, 2024
Conversation
github-actions bot pushed a commit to duckdb/duckdb-r that referenced this pull request on Dec 24, 2024:
Top-N: Improve performance with large heaps, and correctly call Reduce (duckdb/duckdb#14900); python: use PyUnicode_FromStringAndSize() (duckdb/duckdb#14895)
Follow-up from #14424
Fixes #14896
Large Heaps
When using a large OFFSET, the heap we generate in the Top-N operator is large. The previous implementation took some shortcuts that made it work poorly with large heaps, resulting in performance degradations compared to the implementation prior to #14424:

- TopNHeap::Combine would scan and re-insert the values of the heaps, instead of directly merging the heaps.
- TopNHeap::Sink would fully scan the heap at every iteration.

This PR fixes both issues.
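The difference in the Combine step can be illustrated with a small Python sketch (hypothetical and simplified; the actual operator is C++ and merges heaps of sort keys): re-inserting every element of one heap into the other costs O(log n) per element, while a direct merge that concatenates the backing arrays and restores the heap property is a single O(n) pass.

```python
import heapq

def combine_by_reinsert(heap_a, heap_b):
    # Old behaviour: scan the other heap and re-insert its
    # values one at a time, O(log n) per element.
    merged = list(heap_a)
    heapq.heapify(merged)
    for value in heap_b:
        heapq.heappush(merged, value)
    return merged

def combine_by_merge(heap_a, heap_b):
    # Direct merge: concatenate the backing arrays and restore
    # the heap property in one O(n) pass.
    merged = list(heap_a) + list(heap_b)
    heapq.heapify(merged)
    return merged

a = [9, 1, 5]
b = [3, 8, 2]
heapq.heapify(a)
heapq.heapify(b)
# Both approaches produce a heap over the same multiset of values.
assert sorted(combine_by_reinsert(a, b)) == sorted(combine_by_merge(a, b))
```

The constant factors matter too: the direct merge touches each element once and never rebalances, which is why it pays off precisely when the heaps are large.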
AddSmallHeap

The previous heap insertion happened in two stages: first we insert only the sort keys into the heap, and only afterwards do we append the payload data for the rows that actually remain in the heap.
The main reason for this two-stage approach is to deal with many conflicts within the same data chunk. For example, consider a Top-N query over lineitem that keeps the 5 highest values of l_orderkey.

Since lineitem is sorted by l_orderkey, every chunk we stream into the operator will be full of the new highest values. Without the two-stage approach, we would append all rows to the payload data, only to then discard most of those rows again. Using the two-stage approach, we append at most 5 rows (the heap size) to the payload at each iteration, leading to lower memory usage and fewer calls to Reduce.

AddLargeHeap
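The two-stage idea can be sketched in Python (hypothetical names and a simplified model; the real code is C++): stage one pushes only the sort keys into a bounded min-heap, stage two appends payload solely for the rows of the current chunk that survived, so payload growth per chunk is capped at the heap size even when every incoming row beats the current heap minimum.

```python
import heapq

HEAP_SIZE = 5  # mirrors the heap size of 5 in the example above

def sink_two_stage(chunks):
    """Return the total number of payload rows appended across all chunks.

    Stage 1: push only (key, row_id) into the bounded min-heap.
    Stage 2: append payload just for this chunk's surviving rows.
    """
    heap = []
    payload_appends = 0
    for chunk in chunks:
        survivors = set()
        for row_id, key in chunk:
            if len(heap) < HEAP_SIZE:
                heapq.heappush(heap, (key, row_id))
                survivors.add(row_id)
            elif key > heap[0][0]:
                _, evicted = heapq.heappushpop(heap, (key, row_id))
                survivors.add(row_id)
                survivors.discard(evicted)  # evicted row may be from an older chunk
        payload_appends += len(survivors)  # at most HEAP_SIZE per chunk
    return payload_appends

# Ascending keys, as in the sorted-lineitem scenario: every row beats the
# current heap minimum, yet payload appends stay capped per chunk.
chunks = [[(i, i) for i in range(c * 100, (c + 1) * 100)] for c in range(10)]
assert sink_two_stage(chunks) == 10 * HEAP_SIZE  # 50 appends instead of 1000
```

Appending payload for every row that momentarily enters the heap would have cost 1,000 appends here; the two-stage variant does 50.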
This PR adds a new, simpler, single-stage insertion where we insert the sort keys and immediately insert the payload data. We switch to this approach for heaps of >= 100 rows. By immediately inserting the payload data we don't need to scan the heap during the sink phase, which yields significant speed-ups when dealing with larger heaps.

Benchmarks
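A sketch of the single-stage variant (again hypothetical Python; DuckDB's implementation is C++): the key and payload are pushed as one unit, so an eviction drops the payload together with the key and the sink never has to rescan the heap to work out which payload rows to keep.

```python
import heapq

def sink_single_stage(chunks, heap_size):
    # Push (key, payload) as one unit; eviction discards the payload
    # along with the key, so no separate heap scan is needed.
    heap = []
    for chunk in chunks:
        for key, payload in chunk:
            if len(heap) < heap_size:
                heapq.heappush(heap, (key, payload))
            elif key > heap[0][0]:
                heapq.heappushpop(heap, (key, payload))
    # Highest keys first, as a Top-N result would be emitted.
    return sorted(heap, reverse=True)

rows = [(1, "a"), (7, "b"), (3, "c"), (9, "d"), (5, "e")]
assert sink_single_stage([rows], heap_size=3) == [(9, "d"), (7, "b"), (5, "e")]
```

The trade-off against the two-stage approach is extra payload copies on conflict-heavy input, which is why the cutover only happens for heaps of >= 100 rows, where avoiding the per-chunk heap scan dominates.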
Below is an example of a Top-N with a large heap that is sped up significantly by this approach: