
Top-N: Improve performance with large heaps, and correctly call Reduce #14900

Merged
Mytherin merged 11 commits into duckdb:main from Mytherin:topnfix
Nov 19, 2024

Conversation

@Mytherin
Collaborator

Follow-up from #14424

Fixes #14896

Large Heaps

When using a large offset, the heap we generate in the Top-N operator is large. The implementation introduced in #14424 took some shortcuts that made it work poorly with large heaps, resulting in performance regressions compared to the implementation it replaced:

  • TopNHeap::Combine would scan and re-insert the values in the heaps, instead of directly merging heaps
  • TopNHeap::Sink would fully scan the heap at every iteration

This PR fixes these issues.
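As a rough illustration of the merge fix, direct heap-to-heap merging can be sketched as below. This is a minimal standalone sketch with integer keys, not DuckDB's actual `TopNHeap::Combine` (which operates on sort keys and row payloads); the name `BoundedTopN` is hypothetical.

```cpp
#include <algorithm>
#include <queue>
#include <vector>

// Hypothetical sketch: a bounded Top-N structure that keeps the N smallest
// keys in a max-heap, merged directly heap-to-heap instead of scanning and
// re-inserting every value through the full sink path.
struct BoundedTopN {
	size_t limit;
	// max-heap: the largest of the kept keys sits on top, ready for eviction
	std::priority_queue<int> heap;

	void Insert(int key) {
		if (heap.size() < limit) {
			heap.push(key);
		} else if (key < heap.top()) {
			// the new key beats the current boundary: evict the worst key
			heap.pop();
			heap.push(key);
		}
	}

	// Combine step: drain the other heap and insert its entries directly;
	// entries that cannot beat our boundary are dropped without re-sorting.
	void Merge(BoundedTopN &other) {
		while (!other.heap.empty()) {
			Insert(other.heap.top());
			other.heap.pop();
		}
	}

	// Extract the kept keys in ascending order.
	std::vector<int> SortedResult() {
		std::vector<int> result;
		while (!heap.empty()) {
			result.push_back(heap.top());
			heap.pop();
		}
		std::sort(result.begin(), result.end());
		return result;
	}
};
```

Because each thread-local heap is already bounded to the Top-N size, merging this way touches at most N entries per source heap rather than re-scanning all materialized rows.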

AddSmallHeap/AddLargeHeap

The previous heap insertion happened in two stages:

  • Insert the sort keys one-by-one into the heap
  • Scan the heap to figure out which final sort keys still remain inside the heap
  • Append the payload for the corresponding final sort keys

The main reason for this two-stage approach is to deal with many conflicts within the same data chunk. For example, consider the query:

SELECT * FROM lineitem ORDER BY l_orderkey DESC LIMIT 5;

Since lineitem is sorted by l_orderkey, every chunk we stream into the operator will be full of new highest values. Without the two-stage approach, we would append all rows to the payload data, only to then discard most of them again. Using the two-stage approach we append at most 5 rows (the heap size) to the payload per chunk, leading to lower memory usage and fewer calls to Reduce.
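The two-stage insertion described above can be sketched as follows. This is an assumed simplification (integer keys, string payloads, a `SmallHeapTopN` name invented for illustration), not DuckDB's actual `TopNHeap` code:

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical sketch of two-stage insertion for small heaps. Stage 1
// inserts only the sort keys; stage 2 scans the heap to find which rows of
// the current chunk survived and appends just their payload, so a chunk
// full of "new best" rows appends at most `limit` payload rows.
struct SmallHeapTopN {
	struct Entry {
		int key;
		size_t chunk_row;   // row index within the chunk currently being sunk
		size_t payload_idx; // index into `payload` once the row is materialized
		bool materialized;  // false while only the sort key has been inserted
	};
	size_t limit;
	std::vector<Entry> heap; // max-heap on key (worst kept row on top)
	std::vector<std::string> payload;

	static bool Less(const Entry &a, const Entry &b) {
		return a.key < b.key;
	}

	void SinkChunk(const std::vector<int> &keys, const std::vector<std::string> &rows) {
		// Stage 1: insert the sort keys one by one, without touching payload.
		for (size_t i = 0; i < keys.size(); i++) {
			if (heap.size() < limit) {
				heap.push_back({keys[i], i, 0, false});
				std::push_heap(heap.begin(), heap.end(), Less);
			} else if (keys[i] < heap.front().key) {
				// evict the current worst key and insert the new one
				std::pop_heap(heap.begin(), heap.end(), Less);
				heap.back() = {keys[i], i, 0, false};
				std::push_heap(heap.begin(), heap.end(), Less);
			}
		}
		// Stage 2: scan the heap; any entry still un-materialized survived
		// this chunk, so append its payload now (at most `limit` appends).
		for (auto &entry : heap) {
			if (!entry.materialized) {
				entry.payload_idx = payload.size();
				payload.push_back(rows[entry.chunk_row]);
				entry.materialized = true;
			}
		}
	}
};
```

Note that payload rows evicted in later chunks linger until a compaction pass (Reduce, in the PR's terminology) rewrites the payload; the two-stage scheme only bounds how much new payload each chunk can add.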

AddLargeHeap

This PR adds a new, simpler, single-stage insertion where we insert the sort keys and immediately insert the payload data. We switch to this approach for heaps >= 100 rows. By immediately inserting the payload data we don't need to scan the heap during the sink phase, which yields significant speed-ups when dealing with larger heaps.
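The single-stage path might look roughly like the sketch below (again an assumed simplification with invented names such as `LargeHeapTopN`; the 100-row cutoff mirrors the threshold described in the PR but the constant here is illustrative):

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical sketch of single-stage insertion for large heaps. Each
// accepted row has its payload appended immediately, so the sink never
// needs to rescan the heap; the trade-off is that rows evicted later in
// the stream leave stale payload behind until a Reduce-style compaction.
struct LargeHeapTopN {
	struct Entry {
		int key;
		size_t payload_idx;
	};
	// Illustrative threshold: the PR switches to this path for heaps >= 100 rows.
	static constexpr size_t LARGE_HEAP_THRESHOLD = 100;

	size_t limit;
	std::vector<Entry> heap; // max-heap on key (worst kept row on top)
	std::vector<std::string> payload;

	static bool Less(const Entry &a, const Entry &b) {
		return a.key < b.key;
	}

	void Insert(int key, const std::string &row) {
		if (heap.size() >= limit && key >= heap.front().key) {
			return; // cannot enter the Top-N: skip the payload entirely
		}
		// Single stage: append the payload immediately alongside the sort key.
		heap.push_back({key, payload.size()});
		payload.push_back(row);
		std::push_heap(heap.begin(), heap.end(), Less);
		if (heap.size() > limit) {
			std::pop_heap(heap.begin(), heap.end(), Less);
			heap.pop_back(); // evicted payload remains until compaction
		}
	}
};
```

For a large heap the boundary key changes rarely relative to the chunk size, so most rows are rejected by the cheap `key >= heap.front().key` check and the extra payload appends that motivated the two-stage scheme stay small in practice.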

Benchmarks

Below is an example of a Top-N with a large heap that is sped up significantly by this approach:

select l_orderkey
from lineitem
where l_linestatus = 'O'
order by l_quantity limit 100 offset 1000000;
| v1.1 | main | new |
|-------|------|-------|
| 0.86s | 6.9s | 0.77s |

@Mytherin Mytherin mentioned this pull request Nov 19, 2024
@duckdb-draftbot duckdb-draftbot marked this pull request as draft November 19, 2024 14:14
@Mytherin Mytherin marked this pull request as ready for review November 19, 2024 14:16
@Mytherin Mytherin merged commit 8e2c944 into duckdb:main Nov 19, 2024
@Mytherin Mytherin deleted the topnfix branch December 8, 2024 06:51
github-actions bot pushed a commit to duckdb/duckdb-r that referenced this pull request Dec 24, 2024
Top-N: Improve performance with large heaps, and correctly call Reduce (duckdb/duckdb#14900)
python: use PyUnicode_FromStringAndSize() (duckdb/duckdb#14895)


Development

Successfully merging this pull request may close these issues.

A confusion about Top-N operator
