LIST aggregate performance improvements#6995
Conversation
…nto an intermediate vector and then appending
…LinkedList into references instead of pointers
…d in in destructor to de-allocate
…te/perfect aggregate HT
|
CC @Tmonster this PR makes the following query on the H2OAI benchmark faster as a substitute for the current window function for Advanced Group By Q3: SELECT id6, unnest(list_sort(list(v3), 'desc')[1:2]) AS v3 FROM (SELECT id6, v3 FROM x_group WHERE v3 IS NOT NULL) AS subq GROUP BY id6;create table ans as SELECT id6, v3 AS largest2_v3 FROM (SELECT id6, v3, row_number() OVER (PARTITION BY id6 ORDER BY v3 DESC) AS order_v3 FROM x_group WHERE v3 IS NOT NULL) sub_query WHERE order_v3 <= 2;
-- 0.4s
create table ans as SELECT id6, unnest(list_sort(list(v3), 'desc')[1:2]) AS v3 FROM (SELECT id6, v3 FROM x_group WHERE v3 IS NOT NULL) AS subq GROUP BY id6;
-- 0.32s |
…implify memory management of aggregate hash tables
lnkuiper
left a comment
There was a problem hiding this comment.
Looks good to me! I definitely will have merge conflicts with this, but nothing too serious. I wrote down some thoughts:
| vector<ExpressionType> predicates; | ||
|
|
||
| //! The arena allocator used by the aggregates for their internal state | ||
| ArenaAllocator aggregate_allocator; |
There was a problem hiding this comment.
What if we have many distinct groups and have to use partitioning? Won't the ArenaAllocator of the initial (unpartitioned) HT still be the owner of the allocated data?
Although I assume the receiving aggregate state is responsible for ownership of the data from the giving aggregate state when they are combined by, e.g., copying it. For the list aggregate function we use a linked list, so in theory, we could do a zero-copy combine.
There was a problem hiding this comment.
The Combine currently always copies over the data and keeps the original data in-tact, which allows this to work as-is. For a zero-copy combine we would need to move over the allocators as well, but I will leave that for a future PR.
There was a problem hiding this comment.
You are only allowed to take ownership of the incoming state during combine if you know you are not being called by the windowing logic (the segment tree code uses combine on elements of the tree). One notable (and relevant) example is the sorted aggregate wrapper, which is only used by regular aggregation. This will probably not change if we add ORDER BY argument clauses to windowed aggregates as I expect we will use Adrian's stuff instead.
taniabogatsch
left a comment
There was a problem hiding this comment.
This is pretty cool! I added some minor comments, mostly related to clarifying things for me.
I am also linking #3788 here since this PR addresses one of the two missing points I raised there.
| : function(std::move(function)), bind_data(bind_data), child_count(child_count), payload_size(payload_size), | ||
| aggr_type(aggr_type), return_type(return_type), filter(filter) { | ||
| : function(std::move(function)), | ||
| bind_data_wrapper(bind_data ? make_shared<FunctionDataWrapper>(bind_data->Copy()) : nullptr), |
There was a problem hiding this comment.
Nit: this might be a bit tough to read.
This PR improves the performance of the
LISTaggregate in several ways:ArenaAllocatorthat is stored in the aggregate/perfect hash table to do the allocation of the list segmentsAllocatedDatain a vector in the aggregate state of the list - instead use the provided allocator in the destructor to de-allocate the segmentsLinkedListand instead store this as part of the hash tableLogicalTypein the aggregate state of the list - instead use the type from the bind dataThis results in many fewer allocations made, particularly when there are many small lists. They also improve data locality as a result of using large heaps for the allocations. Here are some performance numbers for the two included benchmarks:
These changes require a few modifications to the aggregate hash tables and row layout, particularly to pass in the allocator from the hash table to the actual hash function:
AggregateFunctionInputRowOperationsmethods take aRowOperationsState, which contains an AllocatorAggregateHashTableandPerfectAggregateHashTableare extended with anArenaAllocatorobjectFor the scan of the
PhysicalHashAggregatewe also launch multiple threads sooner - instead of only launching one thread per row group size we launch one thread per vector. Hash tables are scanned per-vector, and multiple threads processing the output of a hash table becomes beneficial already with low row counts when lists are involved.