Skip to content

Conversation

@lidavidm
Copy link
Member

No description provided.

@github-actions
Copy link

@lidavidm
Copy link
Member Author

lidavidm commented Jul 27, 2021

This is just to see how the hash aggregate kernels perform compared to the dedicated scalar aggregation kernels in the case that there is only one group.

I only tested MinMax and Count. I didn't test Sum/Mean since the scalar kernels use pairwise summation and the hash aggregate kernels use naive summation.

Unfortunately, it's rather terrible. For count:

Details
---------------------------------------------------------------------------------------------------
Benchmark                                         Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------------------------------
CountKernelBenchInt64/1048576/2                3067 ns         3067 ns       457331 bytes_per_second=318.4G/s null_percent=50 size=1048.58k
CountKernelBenchInt64Aggregate/1048576/2     412162 ns       412138 ns         3238 bytes_per_second=2.3695G/s null_percent=50 size=1048.58k

At 2 orders of magnitude slower, the hash aggregate kernel isn't anywhere near the dedicated scalar one. The scalar kernel essentially just calls CountSetBits, while the hash aggregate kernel must use VisitSetBitRuns and index into a length-1 array of counts. Also, a good amount of time (~10% of the runtime according to perf) is spent just allocating and filling an array of group IDs to use at the start.

For min_max the story is not so clear. The hash aggregate kernel actually wins for floats, but loses badly (not as badly as with Count) for integers.

Details
----------------------------------------------------------------------------------------------------
Benchmark                                          Time             CPU   Iterations UserCounters...
----------------------------------------------------------------------------------------------------
MinMaxKernelFloat/1048576/10000                  981 us          981 us          638 bytes_per_second=1018.91M/s null_percent=0.01 size=1048.58k
MinMaxKernelFloat/1048576/100                   1008 us         1008 us          723 bytes_per_second=992.523M/s null_percent=1 size=1048.58k
MinMaxKernelFloat/1048576/10                    1062 us         1062 us          561 bytes_per_second=941.703M/s null_percent=10 size=1048.58k
MinMaxKernelFloat/1048576/2                     1424 us         1424 us          456 bytes_per_second=702.401M/s null_percent=50 size=1048.58k
MinMaxKernelFloat/1048576/1                     6.92 us         6.92 us       105816 bytes_per_second=141.155G/s null_percent=100 size=1048.58k
MinMaxKernelFloat/1048576/0                      900 us          900 us          815 bytes_per_second=1111.18M/s null_percent=0 size=1048.58k
MinMaxKernelFloatAggregate/1048576/10000         667 us          667 us         1103 bytes_per_second=1.46325G/s null_percent=0.01 size=1048.58k
MinMaxKernelFloatAggregate/1048576/100           654 us          654 us          924 bytes_per_second=1.49389G/s null_percent=1 size=1048.58k
MinMaxKernelFloatAggregate/1048576/10            765 us          765 us          965 bytes_per_second=1.27599G/s null_percent=10 size=1048.58k
MinMaxKernelFloatAggregate/1048576/2            1267 us         1267 us          585 bytes_per_second=789.267M/s null_percent=50 size=1048.58k
MinMaxKernelFloatAggregate/1048576/1             421 us          421 us         1693 bytes_per_second=2.32129G/s null_percent=100 size=1048.58k
MinMaxKernelFloatAggregate/1048576/0             668 us          668 us         1107 bytes_per_second=1.46147G/s null_percent=0 size=1048.58k
MinMaxKernelDouble/1048576/10000                 420 us          420 us         1712 bytes_per_second=2.32776G/s null_percent=0.01 size=1048.58k
MinMaxKernelDouble/1048576/100                   465 us          465 us         1412 bytes_per_second=2.10164G/s null_percent=1 size=1048.58k
MinMaxKernelDouble/1048576/10                    592 us          592 us         1168 bytes_per_second=1.64947G/s null_percent=10 size=1048.58k
MinMaxKernelDouble/1048576/2                     730 us          730 us         1008 bytes_per_second=1.33826G/s null_percent=50 size=1048.58k
MinMaxKernelDouble/1048576/1                    4.10 us         4.10 us       177426 bytes_per_second=238.21G/s null_percent=100 size=1048.58k
MinMaxKernelDouble/1048576/0                     540 us          540 us         1000 bytes_per_second=1.80829G/s null_percent=0 size=1048.58k
MinMaxKernelDoubleAggregate/1048576/10000        342 us          342 us         2106 bytes_per_second=2.85799G/s null_percent=0.01 size=1048.58k
MinMaxKernelDoubleAggregate/1048576/100          346 us          346 us         2136 bytes_per_second=2.82629G/s null_percent=1 size=1048.58k
MinMaxKernelDoubleAggregate/1048576/10           385 us          385 us         1911 bytes_per_second=2.53959G/s null_percent=10 size=1048.58k
MinMaxKernelDoubleAggregate/1048576/2            631 us          631 us         1163 bytes_per_second=1.54829G/s null_percent=50 size=1048.58k
MinMaxKernelDoubleAggregate/1048576/1            218 us          218 us         3413 bytes_per_second=4.48758G/s null_percent=100 size=1048.58k
MinMaxKernelDoubleAggregate/1048576/0            334 us          334 us         2193 bytes_per_second=2.92247G/s null_percent=0 size=1048.58k
MinMaxKernelInt8/1048576/10000                   571 us          571 us         1293 bytes_per_second=1.71088G/s null_percent=0.01 size=1048.58k
MinMaxKernelInt8/1048576/100                     986 us          986 us          742 bytes_per_second=1014.35M/s null_percent=1 size=1048.58k
MinMaxKernelInt8/1048576/10                     1818 us         1818 us          402 bytes_per_second=550.013M/s null_percent=10 size=1048.58k
MinMaxKernelInt8/1048576/2                      4039 us         4039 us          182 bytes_per_second=247.588M/s null_percent=50 size=1048.58k
MinMaxKernelInt8/1048576/1                      22.9 us         22.9 us        31922 bytes_per_second=42.701G/s null_percent=100 size=1048.58k
MinMaxKernelInt8/1048576/0                       546 us          546 us         1368 bytes_per_second=1.78943G/s null_percent=0 size=1048.58k
MinMaxKernelInt8Aggregate/1048576/10000         2241 us         2241 us          325 bytes_per_second=446.241M/s null_percent=0.01 size=1048.58k
MinMaxKernelInt8Aggregate/1048576/100           2497 us         2497 us          299 bytes_per_second=400.494M/s null_percent=1 size=1048.58k
MinMaxKernelInt8Aggregate/1048576/10            3144 us         3143 us          237 bytes_per_second=318.129M/s null_percent=10 size=1048.58k
MinMaxKernelInt8Aggregate/1048576/2             5107 us         5107 us          100 bytes_per_second=195.815M/s null_percent=50 size=1048.58k
MinMaxKernelInt8Aggregate/1048576/1             1581 us         1581 us          386 bytes_per_second=632.705M/s null_percent=100 size=1048.58k
MinMaxKernelInt8Aggregate/1048576/0             2093 us         2093 us          274 bytes_per_second=477.747M/s null_percent=0 size=1048.58k
MinMaxKernelInt16/1048576/10000                  274 us          274 us         2329 bytes_per_second=3.56594G/s null_percent=0.01 size=1048.58k
MinMaxKernelInt16/1048576/100                    459 us          459 us         1429 bytes_per_second=2.12904G/s null_percent=1 size=1048.58k
MinMaxKernelInt16/1048576/10                     842 us          842 us          698 bytes_per_second=1.1593G/s null_percent=10 size=1048.58k
MinMaxKernelInt16/1048576/2                     1997 us         1997 us          370 bytes_per_second=500.784M/s null_percent=50 size=1048.58k
MinMaxKernelInt16/1048576/1                     12.1 us         12.1 us        60436 bytes_per_second=80.6769G/s null_percent=100 size=1048.58k
MinMaxKernelInt16/1048576/0                      271 us          271 us         2713 bytes_per_second=3.60785G/s null_percent=0 size=1048.58k
MinMaxKernelInt16Aggregate/1048576/10000        1226 us         1226 us          615 bytes_per_second=815.951M/s null_percent=0.01 size=1048.58k
MinMaxKernelInt16Aggregate/1048576/100          1326 us         1326 us          564 bytes_per_second=753.999M/s null_percent=1 size=1048.58k
MinMaxKernelInt16Aggregate/1048576/10           1608 us         1608 us          462 bytes_per_second=622M/s null_percent=10 size=1048.58k
MinMaxKernelInt16Aggregate/1048576/2            2700 us         2700 us          275 bytes_per_second=370.316M/s null_percent=50 size=1048.58k
MinMaxKernelInt16Aggregate/1048576/1             867 us          866 us          868 bytes_per_second=1.12704G/s null_percent=100 size=1048.58k
MinMaxKernelInt16Aggregate/1048576/0            1190 us         1190 us          620 bytes_per_second=840.448M/s null_percent=0 size=1048.58k
MinMaxKernelInt32/1048576/10000                  139 us          139 us         4389 bytes_per_second=7.03596G/s null_percent=0.01 size=1048.58k
MinMaxKernelInt32/1048576/100                    238 us          238 us         2483 bytes_per_second=4.10169G/s null_percent=1 size=1048.58k
MinMaxKernelInt32/1048576/10                     515 us          515 us         1000 bytes_per_second=1.89702G/s null_percent=10 size=1048.58k
MinMaxKernelInt32/1048576/2                     1021 us         1021 us          722 bytes_per_second=979.116M/s null_percent=50 size=1048.58k
MinMaxKernelInt32/1048576/1                     6.55 us         6.55 us       109640 bytes_per_second=149.127G/s null_percent=100 size=1048.58k
MinMaxKernelInt32/1048576/0                      132 us          132 us         4723 bytes_per_second=7.4224G/s null_percent=0 size=1048.58k
MinMaxKernelInt32Aggregate/1048576/10000         631 us          631 us         1171 bytes_per_second=1.54789G/s null_percent=0.01 size=1048.58k
MinMaxKernelInt32Aggregate/1048576/100           703 us          703 us         1096 bytes_per_second=1.38954G/s null_percent=1 size=1048.58k
MinMaxKernelInt32Aggregate/1048576/10            808 us          808 us          911 bytes_per_second=1.20892G/s null_percent=10 size=1048.58k
MinMaxKernelInt32Aggregate/1048576/2            1304 us         1304 us          564 bytes_per_second=766.908M/s null_percent=50 size=1048.58k
MinMaxKernelInt32Aggregate/1048576/1             420 us          420 us         1758 bytes_per_second=2.32421G/s null_percent=100 size=1048.58k
MinMaxKernelInt32Aggregate/1048576/0             624 us          624 us         1183 bytes_per_second=1.56476G/s null_percent=0 size=1048.58k
MinMaxKernelInt64/1048576/10000                 73.8 us         73.8 us         9720 bytes_per_second=13.2297G/s null_percent=0.01 size=1048.58k
MinMaxKernelInt64/1048576/100                    113 us          113 us         5369 bytes_per_second=8.65305G/s null_percent=1 size=1048.58k
MinMaxKernelInt64/1048576/10                     206 us          206 us         3134 bytes_per_second=4.74316G/s null_percent=10 size=1048.58k
MinMaxKernelInt64/1048576/2                      516 us          516 us         1000 bytes_per_second=1.89125G/s null_percent=50 size=1048.58k
MinMaxKernelInt64/1048576/1                     3.89 us         3.89 us       187314 bytes_per_second=250.863G/s null_percent=100 size=1048.58k
MinMaxKernelInt64/1048576/0                     71.5 us         71.5 us        10264 bytes_per_second=13.6533G/s null_percent=0 size=1048.58k
MinMaxKernelInt64Aggregate/1048576/10000         305 us          305 us         2414 bytes_per_second=3.19811G/s null_percent=0.01 size=1048.58k
MinMaxKernelInt64Aggregate/1048576/100           334 us          334 us         2210 bytes_per_second=2.92146G/s null_percent=1 size=1048.58k
MinMaxKernelInt64Aggregate/1048576/10            407 us          407 us         1832 bytes_per_second=2.40181G/s null_percent=10 size=1048.58k
MinMaxKernelInt64Aggregate/1048576/2             654 us          654 us         1117 bytes_per_second=1.49329G/s null_percent=50 size=1048.58k
MinMaxKernelInt64Aggregate/1048576/1             217 us          217 us         3416 bytes_per_second=4.50848G/s null_percent=100 size=1048.58k
MinMaxKernelInt64Aggregate/1048576/0             302 us          302 us         2430 bytes_per_second=3.23243G/s null_percent=0 size=1048.58k

For integers, the scalar kernel has a SIMD variant that gets leveraged. For doubles, it's unclear to me why the scalar kernel is slower; they both boil down to just min/max calls, though the scalar kernel uses fmin/fmax and the hash kernel uses std::min/std::max.

@lidavidm
Copy link
Member Author

Ah, and for min_max, the scalar kernel becomes much faster if it calls std::max instead of std::fmax:

----------------------------------------------------------------------------------------------------
Benchmark                                          Time             CPU   Iterations UserCounters...
----------------------------------------------------------------------------------------------------
MinMaxKernelDouble/1048576/10000                 192 us          192 us         7333 bytes_per_second=5.09496G/s null_percent=0.01 size=1048.58k
MinMaxKernelDoubleAggregate/1048576/10000        347 us          347 us         3754 bytes_per_second=2.81357G/s null_percent=0.01 size=1048.58k
MinMaxKernelInt64/1048576/10000                 74.8 us         74.8 us        19546 bytes_per_second=13.0498G/s null_percent=0.01 size=1048.58k
MinMaxKernelInt64Aggregate/1048576/10000         301 us          301 us         4941 bytes_per_second=3.24722G/s null_percent=0.01 size=1048.58k

So in all cases the hash aggregate kernel is considerably slower.

@pitrou
Copy link
Member

pitrou commented Jul 27, 2021

std::max seems to have different semantics than std::fmax for NaNs.

@lidavidm
Copy link
Member Author

Ah indeed. Testing it with std::fmax in both kernels, the hash aggregate kernel is slower in all cases:

----------------------------------------------------------------------------------------------------
Benchmark                                          Time             CPU   Iterations UserCounters...
----------------------------------------------------------------------------------------------------
MinMaxKernelDouble/1048576/10000                 495 us          495 us         2897 bytes_per_second=1.97278G/s null_percent=0.01 size=1048.58k
MinMaxKernelDoubleAggregate/1048576/10000        645 us          645 us         2160 bytes_per_second=1.5149G/s null_percent=0.01 size=1048.58k
MinMaxKernelInt64/1048576/10000                 80.1 us         80.1 us        16304 bytes_per_second=12.1949G/s null_percent=0.01 size=1048.58k
MinMaxKernelInt64Aggregate/1048576/10000         718 us          718 us         1897 bytes_per_second=1.35927G/s null_percent=0.01 size=1048.58k

@bkietz
Copy link
Member

bkietz commented Jul 27, 2021

Well that's disappointing. I'll rewrite the JIRA some- we could still consolidate to a single compute::Function and decide between scalar and grouped aggregation in Dispatch*

@lidavidm lidavidm closed this Jul 27, 2021
@bkietz
Copy link
Member

bkietz commented Jul 27, 2021

Thanks for looking into this, @lidavidm

@lidavidm
Copy link
Member Author

It would have been nice to maintain half as many aggregate kernels. As you say, we could consolidate the implementations into a single logical function (we'd have to consolidate HashAggregateFunction and ScalarAggregateFunction, though?).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants