
Top-N: Improve performance with large heaps, and correctly call Reduce #14900

Merged
Mytherin merged 11 commits into duckdb:main from Mytherin:topnfix
Nov 19, 2024

Conversation

@Mytherin
Collaborator

Follow-up from #14424

Fixes #14896

Large Heaps

When using a large offset, the heap we generate in the Top-N operator is large. The implementation introduced in #14424 took some shortcuts that made it work poorly with large heaps, resulting in performance regressions compared to the implementation it replaced:

  • TopNHeap::Combine would scan and re-insert the values in the heaps, instead of directly merging heaps
  • TopNHeap::Sink would fully scan the heap at every iteration

This PR fixes these issues.
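As a rough illustration of the merge fix, direct heap-to-heap merging can be sketched as below. This is a minimal standalone sketch with integer keys, not DuckDB's actual `TopNHeap::Combine` (which operates on sort keys and row payloads); the name `BoundedTopN` is hypothetical.

```cpp
#include <algorithm>
#include <queue>
#include <vector>

// Hypothetical sketch: a bounded Top-N structure that keeps the N smallest
// keys in a max-heap, merged directly heap-to-heap instead of scanning and
// re-inserting every value through the full sink path.
struct BoundedTopN {
	size_t limit;
	// max-heap: the largest of the kept keys sits on top, ready for eviction
	std::priority_queue<int> heap;

	void Insert(int key) {
		if (heap.size() < limit) {
			heap.push(key);
		} else if (key < heap.top()) {
			// the new key beats the current boundary: evict the worst key
			heap.pop();
			heap.push(key);
		}
	}

	// Combine step: drain the other heap and insert its entries directly;
	// entries that cannot beat our boundary are dropped without re-sorting.
	void Merge(BoundedTopN &other) {
		while (!other.heap.empty()) {
			Insert(other.heap.top());
			other.heap.pop();
		}
	}

	// Extract the kept keys in ascending order.
	std::vector<int> SortedResult() {
		std::vector<int> result;
		while (!heap.empty()) {
			result.push_back(heap.top());
			heap.pop();
		}
		std::sort(result.begin(), result.end());
		return result;
	}
};
```

Because each thread-local heap is already bounded to the Top-N size, merging this way touches at most N entries per source heap rather than re-scanning all materialized rows.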

AddSmallHeap/AddLargeHeap

The previous heap insertion happened in two stages:

  • Insert the sort keys one-by-one into the heap
  • Scan the heap to figure out which final sort keys still remain inside the heap
  • Append the payload for the corresponding final sort keys

The main reason for this two-stage approach is to deal with many conflicts within the same data chunk. For example, consider the query:

SELECT * FROM lineitem ORDER BY l_orderkey DESC LIMIT 5;

Since lineitem is sorted by l_orderkey, every chunk we stream into the operator will be full of new highest values. Without the two-stage approach, we would append all rows to the payload data, only to then discard most of them again. Using the two-stage approach we append at most 5 rows (the heap size) to the payload per chunk, leading to lower memory usage and fewer calls to Reduce.
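The two-stage insertion described above can be sketched as follows. This is an assumed simplification (integer keys, string payloads, a `SmallHeapTopN` name invented for illustration), not DuckDB's actual `TopNHeap` code:

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical sketch of two-stage insertion for small heaps. Stage 1
// inserts only the sort keys; stage 2 scans the heap to find which rows of
// the current chunk survived and appends just their payload, so a chunk
// full of "new best" rows appends at most `limit` payload rows.
struct SmallHeapTopN {
	struct Entry {
		int key;
		size_t chunk_row;   // row index within the chunk currently being sunk
		size_t payload_idx; // index into `payload` once the row is materialized
		bool materialized;  // false while only the sort key has been inserted
	};
	size_t limit;
	std::vector<Entry> heap; // max-heap on key (worst kept row on top)
	std::vector<std::string> payload;

	static bool Less(const Entry &a, const Entry &b) {
		return a.key < b.key;
	}

	void SinkChunk(const std::vector<int> &keys, const std::vector<std::string> &rows) {
		// Stage 1: insert the sort keys one by one, without touching payload.
		for (size_t i = 0; i < keys.size(); i++) {
			if (heap.size() < limit) {
				heap.push_back({keys[i], i, 0, false});
				std::push_heap(heap.begin(), heap.end(), Less);
			} else if (keys[i] < heap.front().key) {
				// evict the current worst key and insert the new one
				std::pop_heap(heap.begin(), heap.end(), Less);
				heap.back() = {keys[i], i, 0, false};
				std::push_heap(heap.begin(), heap.end(), Less);
			}
		}
		// Stage 2: scan the heap; any entry still un-materialized survived
		// this chunk, so append its payload now (at most `limit` appends).
		for (auto &entry : heap) {
			if (!entry.materialized) {
				entry.payload_idx = payload.size();
				payload.push_back(rows[entry.chunk_row]);
				entry.materialized = true;
			}
		}
	}
};
```

Note that payload rows evicted in later chunks linger until a compaction pass (Reduce, in the PR's terminology) rewrites the payload; the two-stage scheme only bounds how much new payload each chunk can add.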

AddLargeHeap

This PR adds a new, simpler, single-stage insertion where we insert the sort keys and immediately insert the payload data. We switch to this approach for heaps >= 100 rows. By immediately inserting the payload data we don't need to scan the heap during the sink phase, which yields significant speed-ups when dealing with larger heaps.
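The single-stage path might look roughly like the sketch below (again an assumed simplification with invented names such as `LargeHeapTopN`; the 100-row cutoff mirrors the threshold described in the PR but the constant here is illustrative):

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical sketch of single-stage insertion for large heaps. Each
// accepted row has its payload appended immediately, so the sink never
// needs to rescan the heap; the trade-off is that rows evicted later in
// the stream leave stale payload behind until a Reduce-style compaction.
struct LargeHeapTopN {
	struct Entry {
		int key;
		size_t payload_idx;
	};
	// Illustrative threshold: the PR switches to this path for heaps >= 100 rows.
	static constexpr size_t LARGE_HEAP_THRESHOLD = 100;

	size_t limit;
	std::vector<Entry> heap; // max-heap on key (worst kept row on top)
	std::vector<std::string> payload;

	static bool Less(const Entry &a, const Entry &b) {
		return a.key < b.key;
	}

	void Insert(int key, const std::string &row) {
		if (heap.size() >= limit && key >= heap.front().key) {
			return; // cannot enter the Top-N: skip the payload entirely
		}
		// Single stage: append the payload immediately alongside the sort key.
		heap.push_back({key, payload.size()});
		payload.push_back(row);
		std::push_heap(heap.begin(), heap.end(), Less);
		if (heap.size() > limit) {
			std::pop_heap(heap.begin(), heap.end(), Less);
			heap.pop_back(); // evicted payload remains until compaction
		}
	}
};
```

For a large heap the boundary key changes rarely relative to the chunk size, so most rows are rejected by the cheap `key >= heap.front().key` check and the extra payload appends that motivated the two-stage scheme stay small in practice.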

Benchmarks

Below is an example of a Top-N with a large heap that is sped up significantly by this approach:

select l_orderkey
from lineitem
where l_linestatus = 'O'
order by l_quantity limit 100 offset 1000000;
| v1.1 | main | new |
|-------|------|-------|
| 0.86s | 6.9s | 0.77s |

@Mytherin Mytherin mentioned this pull request Nov 19, 2024
@duckdb-draftbot duckdb-draftbot marked this pull request as draft November 19, 2024 14:14
@Mytherin Mytherin marked this pull request as ready for review November 19, 2024 14:16
@Mytherin Mytherin merged commit 8e2c944 into duckdb:main Nov 19, 2024
@Mytherin Mytherin deleted the topnfix branch December 8, 2024 06:51
github-actions bot pushed a commit to duckdb/duckdb-r that referenced this pull request Dec 24, 2024
Top-N: Improve performance with large heaps, and correctly call Reduce (duckdb/duckdb#14900)
python: use PyUnicode_FromStringAndSize() (duckdb/duckdb#14895)


Development

Successfully merging this pull request may close these issues.

A confusion about Top-N operator
