LIST aggregate performance improvements #6995

Merged
Mytherin merged 15 commits into duckdb:master from Mytherin:listimprovements
Apr 7, 2023

Conversation


Mytherin (Collaborator) commented Apr 6, 2023

This PR improves the performance of the LIST aggregate in several ways:

  • Use an ArenaAllocator that is stored in the aggregate/perfect hash table to do the allocation of the list segments
  • Avoid storing the AllocatedData in a vector in the aggregate state of the list - instead use the provided allocator in the destructor to de-allocate the segments
  • Avoid allocating the root LinkedList and instead store this as part of the hash table
  • Avoid storing the LogicalType in the aggregate state of the list - instead use the type from the bind data

This results in far fewer allocations, particularly when there are many small lists. It also improves data locality, since the segments are packed into large allocation blocks. Here are some performance numbers for the two included benchmarks:

Benchmark                                            Old     New
Many Small Lists (150K lists, 1-7 rows per list)     3.5s    0.06s
Few Large Lists (3 lists, 140K-300K rows per list)   0.03s   0.02s
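To illustrate why arena allocation helps the many-small-lists case, here is a toy bump-pointer arena. This is hypothetical sketch code, not DuckDB's actual ArenaAllocator: many small segment allocations become pointer bumps within a few large blocks, and everything is reclaimed at once when the owning hash table is destroyed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Toy bump-pointer arena (hypothetical; DuckDB's ArenaAllocator differs in
// detail). Allocations are slices of large blocks rather than individual
// heap allocations, and all blocks are freed together with the arena.
class Arena {
public:
	explicit Arena(size_t block_size = 4096) : block_size(block_size) {
	}

	void *Allocate(size_t size) {
		if (blocks.empty() || offset + size > capacity) {
			// Start a new block; oversized requests get a dedicated block.
			capacity = std::max(block_size, size);
			blocks.emplace_back(new uint8_t[capacity]);
			offset = 0;
		}
		void *result = blocks.back().get() + offset;
		offset += size;
		return result;
	}

	size_t BlockCount() const {
		return blocks.size();
	}

private:
	size_t block_size;
	size_t capacity = 0;
	size_t offset = 0;
	std::vector<std::unique_ptr<uint8_t[]>> blocks;
};
```

With this scheme, a thousand 16-byte list segments touch only a handful of contiguous blocks instead of costing a thousand separate heap allocations.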

These changes require a few modifications to the aggregate hash tables and the row layout, particularly to pass the allocator from the hash table down to the aggregate functions:

  • The destroy function now also takes an AggregateFunctionInput
  • The RowOperations methods take a RowOperationsState, which contains an Allocator
  • The AggregateHashTable and PerfectAggregateHashTable are extended with an ArenaAllocator object
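A minimal sketch of this plumbing, with names modeled on the PR description but heavily simplified signatures (hypothetical code; the real DuckDB interfaces are more involved):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical sketch: the hash table owns the arena, and small state structs
// carry a reference to it down into the aggregate callbacks.
struct ArenaAllocator {
	std::vector<std::unique_ptr<uint8_t[]>> allocations;
	void *Allocate(size_t size) {
		allocations.emplace_back(new uint8_t[size]);
		return allocations.back().get();
	}
};

// Passed into the RowOperations methods; carries the hash table's allocator.
struct RowOperationsState {
	ArenaAllocator &allocator;
};

// Handed to aggregate callbacks, including the destroy function, so segment
// memory is managed by the owning hash table rather than the aggregate state.
struct AggregateFunctionInput {
	ArenaAllocator &allocator;
};

// A list state no longer stores AllocatedData or a LogicalType; it only keeps
// a pointer into arena-owned memory.
struct ListState {
	void *segment = nullptr;
	void Update(AggregateFunctionInput &input, size_t bytes) {
		segment = input.allocator.Allocate(bytes);
	}
};
```

The key design point is that the aggregate state shrinks to a bare pointer, while lifetime management moves to the hash table that owns the arena.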

For the scan of the PhysicalHashAggregate we also launch multiple threads sooner: instead of launching one thread per row group's worth of data, we launch one thread per vector. Hash tables are scanned per vector, and when lists are involved, having multiple threads process the output of a hash table already pays off at low row counts.
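The effect of per-vector scheduling can be seen with simple task-count arithmetic. The constants below match DuckDB's common defaults (2048-tuple vectors, 122880-row row groups) but are assumptions here, not values taken from this PR:

```cpp
#include <cassert>
#include <cstddef>

// Illustrative scan-task sizing (assumed constants, see lead-in).
constexpr size_t VECTOR_SIZE = 2048;
constexpr size_t ROW_GROUP_SIZE = 122880;

size_t TaskCount(size_t row_count, size_t rows_per_task) {
	// Ceiling division: one task per chunk of rows_per_task rows.
	return (row_count + rows_per_task - 1) / rows_per_task;
}
```

For the 150K-row benchmark above, per-row-group scheduling yields only 2 scan tasks, while per-vector scheduling yields 74, which is enough to keep many threads busy.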


Mytherin commented Apr 6, 2023

CC @Tmonster: this PR makes the following query on the H2OAI benchmark a faster substitute for the current window-function formulation of Advanced Group By Q3:

create table ans as SELECT id6, v3 AS largest2_v3 FROM (SELECT id6, v3, row_number() OVER (PARTITION BY id6 ORDER BY v3 DESC) AS order_v3 FROM x_group WHERE v3 IS NOT NULL) sub_query WHERE order_v3 <= 2;
-- 0.4s
create table ans as SELECT id6, unnest(list_sort(list(v3), 'desc')[1:2]) AS v3 FROM (SELECT id6, v3 FROM x_group WHERE v3 IS NOT NULL) AS subq GROUP BY id6;
-- 0.32s

lnkuiper (Collaborator) left a comment

Looks good to me! I definitely will have merge conflicts with this, but nothing too serious. I wrote down some thoughts:

vector<ExpressionType> predicates;

//! The arena allocator used by the aggregates for their internal state
ArenaAllocator aggregate_allocator;
Review comment (Collaborator):

What if we have many distinct groups and have to use partitioning? Won't the ArenaAllocator of the initial (unpartitioned) HT still be the owner of the allocated data?

Although I assume the receiving aggregate state becomes responsible for the data of the giving aggregate state when they are combined, e.g., by copying it. For the list aggregate function we use a linked list, so in theory we could do a zero-copy combine.

Mytherin (Author) replied:

The Combine currently always copies the data over and keeps the original intact, which allows this to work as-is. A zero-copy combine would require moving over the allocators as well, but I will leave that for a future PR.
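The difference between the current copying combine and a hypothetical zero-copy combine can be sketched with a toy linked-list state (illustrative code, not DuckDB's implementation):

```cpp
#include <cassert>
#include <cstddef>

// Toy linked-list aggregate state. A copying combine leaves the source intact,
// which is required when the same state is combined repeatedly (as in the
// windowing segment tree). The zero-copy variant below splices in O(1) but
// empties the source, so allocator ownership would have to move with it.
struct Node {
	int value;
	Node *next = nullptr;
};

struct LinkedList {
	Node *head = nullptr;
	Node *tail = nullptr;
	size_t count = 0;

	void Append(Node *node) {
		if (head) {
			tail->next = node;
		} else {
			head = node;
		}
		tail = node;
		count++;
	}
};

// Splice source into target in O(1); source is left empty afterwards.
void CombineZeroCopy(LinkedList &source, LinkedList &target) {
	if (!source.head) {
		return;
	}
	if (target.head) {
		target.tail->next = source.head;
	} else {
		target.head = source.head;
	}
	target.tail = source.tail;
	target.count += source.count;
	source.head = source.tail = nullptr;
	source.count = 0;
}
```

The splice itself is trivial; the hard part flagged in this thread is that the target must also take over the arena that owns the source's nodes.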

Reply (Contributor):

You are only allowed to take ownership of the incoming state during combine if you know you are not being called by the windowing logic (the segment tree code uses combine on elements of the tree). One notable (and relevant) example is the sorted aggregate wrapper, which is only used by regular aggregation. This will probably not change if we add ORDER BY argument clauses to windowed aggregates as I expect we will use Adrian's stuff instead.

lnkuiper mentioned this pull request Apr 7, 2023
taniabogatsch (Contributor) left a comment

This is pretty cool! I added some minor comments, mostly related to clarifying things for me.

I am also linking #3788 here since this PR addresses one of the two missing points I raised there.

-	: function(std::move(function)), bind_data(bind_data), child_count(child_count), payload_size(payload_size),
-	  aggr_type(aggr_type), return_type(return_type), filter(filter) {
+	: function(std::move(function)),
+	  bind_data_wrapper(bind_data ? make_shared<FunctionDataWrapper>(bind_data->Copy()) : nullptr),
Review comment (Contributor):

Nit: this might be a bit tough to read.

Mytherin merged commit 30d444e into duckdb:master Apr 7, 2023
Mytherin deleted the listimprovements branch April 24, 2023