Conversation

@frankdjx frankdjx commented May 25, 2020

  1. Expand the SumKernel benchmark to more types (Float, Double, Int8, Int16, Int32, Int64).
  2. Unroll the dense part of the aggregate kernel to speculatively add the results in parallel.

Signed-off-by: Frank Du <frank.du@intel.com>

frankdjx commented May 25, 2020

We found that the SumKernel benchmark's dense (null percent 0) results are relatively low compared to the sparse part for the float and double types. Below are the results before unrolling the loop.

  • SumKernelFloat/32768/0 9.12 us 9.11 us 76809 bytes_per_second=3.35021G/s null_percent=0
  • SumKernelFloat/32768/1 4.38 us 4.38 us 159861 bytes_per_second=6.97134G/s null_percent=1
  • SumKernelDouble/32768/0 4.99 us 4.99 us 140190 bytes_per_second=6.11937G/s null_percent=0
  • SumKernelDouble/32768/1 2.14 us 2.14 us 327897 bytes_per_second=14.246G/s null_percent=1

With the unroll change, the dense SumKernel benchmark gets a 3.7x improvement for float and a 2.6x speedup for double.

  • SumKernelFloat/32768/0 2.46 us 2.46 us 285185 bytes_per_second=12.4269G/s null_percent=0
  • SumKernelDouble/32768/0 1.90 us 1.90 us 370921 bytes_per_second=16.0846G/s null_percent=0

It could reach even higher performance using intrinsics; I'd like to work on that at a later point.


wesm commented May 25, 2020

@jianxind thank you for looking into this! I had suspected that there was room for improvement in this algorithm. You're certainly free to implement kernels requiring intrinsics and put them in aggregate_basic_$SIMD_VERSION.cc, and then we can utilize the CpuInfo from ExecContext during kernel dispatch to select a supported kernel.

@wesm wesm changed the title ARROW-8772: [C++] Unrooled aggregate dense for better speculative execution ARROW-8772: [C++] Unrolled aggregate dense for better speculative execution May 25, 2020
@wesm wesm requested a review from fsaintjacques May 25, 2020 13:52

wesm commented May 25, 2020

This shows a small benefit on ARM64 architecture (ThunderX1):

before:

-----------------------------------------------------------------------------
Benchmark                   Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------
SumKernel/32768/0        19.5 us         19.5 us        35883 bytes_per_second=1.56853G/s null_percent=0 size=32.768k
SumKernel/32768/1        19.2 us         19.2 us        36172 bytes_per_second=1.58548G/s null_percent=1 size=32.768k
SumKernel/32768/10       21.7 us         21.7 us        32230 bytes_per_second=1.40723G/s null_percent=10 size=32.768k
SumKernel/32768/50       22.4 us         22.4 us        31172 bytes_per_second=1.36246G/s null_percent=50 size=32.768k

after (see the Int64 benchmarks):

-----------------------------------------------------------------------------------
Benchmark                         Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------
SumKernelFloat/32768/0         25.4 us         25.4 us        27590 bytes_per_second=1.20019G/s null_percent=0 size=32.768k
SumKernelFloat/32768/1         41.2 us         41.2 us        16973 bytes_per_second=758.715M/s null_percent=1 size=32.768k
SumKernelFloat/32768/10        49.0 us         49.0 us        14285 bytes_per_second=638.254M/s null_percent=10 size=32.768k
SumKernelFloat/32768/50        51.5 us         51.5 us        13625 bytes_per_second=606.657M/s null_percent=50 size=32.768k
SumKernelDouble/32768/0        18.5 us         18.5 us        37953 bytes_per_second=1.65038G/s null_percent=0 size=32.768k
SumKernelDouble/32768/1        26.2 us         26.2 us        26852 bytes_per_second=1.16631G/s null_percent=1 size=32.768k
SumKernelDouble/32768/10       27.3 us         27.3 us        25685 bytes_per_second=1.11778G/s null_percent=10 size=32.768k
SumKernelDouble/32768/50       27.7 us         27.7 us        25298 bytes_per_second=1.10219G/s null_percent=50 size=32.768k
SumKernelInt8/32768/0          64.4 us         64.4 us        10857 bytes_per_second=485.21M/s null_percent=0 size=32.768k
SumKernelInt8/32768/1          69.8 us         69.8 us        10025 bytes_per_second=448.008M/s null_percent=1 size=32.768k
SumKernelInt8/32768/10         92.7 us         92.7 us         7554 bytes_per_second=337.084M/s null_percent=10 size=32.768k
SumKernelInt8/32768/50          100 us          100 us         6974 bytes_per_second=311.424M/s null_percent=50 size=32.768k
SumKernelInt16/32768/0         41.0 us         41.0 us        17076 bytes_per_second=762.469M/s null_percent=0 size=32.768k
SumKernelInt16/32768/1         42.5 us         42.5 us        16513 bytes_per_second=735.553M/s null_percent=1 size=32.768k
SumKernelInt16/32768/10        53.7 us         53.7 us        12977 bytes_per_second=581.673M/s null_percent=10 size=32.768k
SumKernelInt16/32768/50        57.7 us         57.7 us        12103 bytes_per_second=541.264M/s null_percent=50 size=32.768k
SumKernelInt32/32768/0         26.0 us         26.0 us        26859 bytes_per_second=1.17154G/s null_percent=0 size=32.768k
SumKernelInt32/32768/1         26.5 us         26.5 us        26555 bytes_per_second=1.15362G/s null_percent=1 size=32.768k
SumKernelInt32/32768/10        32.3 us         32.3 us        21653 bytes_per_second=967.29M/s null_percent=10 size=32.768k
SumKernelInt32/32768/50        34.2 us         34.2 us        20513 bytes_per_second=914.616M/s null_percent=50 size=32.768k
SumKernelInt64/32768/0         18.0 us         18.0 us        39059 bytes_per_second=1.69949G/s null_percent=0 size=32.768k
SumKernelInt64/32768/1         19.3 us         19.3 us        36220 bytes_per_second=1.58067G/s null_percent=1 size=32.768k
SumKernelInt64/32768/10        21.8 us         21.8 us        32171 bytes_per_second=1.40112G/s null_percent=10 size=32.768k
SumKernelInt64/32768/50        22.5 us         22.5 us        31062 bytes_per_second=1.35471G/s null_percent=50 size=32.768k

@wesm wesm left a comment

+1

@wesm wesm closed this in ac74b3b May 26, 2020
@frankdjx

Indeed, the dense performance issue only happens for the float and double types; the compiler did a good job for all the int types. That is why I added all data types to the SumKernel benchmark.

@jianxind thank you for looking into this! I had suspected that there was room for improvement in this algorithm. You're certainly free to implement kernels requiring intrinsics and put them in aggregate_basic_$SIMD_VERSION.cc, and then we can utilize the CpuInfo from ExecContext during kernel dispatch to select a supported kernel.

Sure, I will work on the SIMD opportunity then, thanks.

@frankdjx frankdjx deleted the SumKernelBenchmark branch May 26, 2020 00:59
pprudhvi pushed a commit to pprudhvi/arrow that referenced this pull request May 26, 2020
…cution

1. Expand the SumKernel benchmark to more types (Float, Double, Int8, Int16, Int32, Int64).
2. Unroll the dense part of the aggregate kernel to speculatively add the results in parallel.

Signed-off-by: Frank Du <frank.du@intel.com>

Closes apache#7267 from jianxind/SumKernelBenchmark

Authored-by: Frank Du <frank.du@intel.com>
Signed-off-by: Wes McKinney <wesm+git@apache.org>
@fsaintjacques

I just wanted to point out that this varies greatly depending on the toolchain; e.g. LLVM is usually more aggressive at unrolling than GCC. Since most of our users (Python, R) will have libarrow built with GCC, I think it's fine.
