Conversation

@frankdjx frankdjx commented May 25, 2020

  1. Expand the SumKernel benchmark to more types (Float, Double, Int8, Int16, Int32, Int64).
  2. Unroll the dense part of the aggregate kernel to speculatively add the results in parallel.

Signed-off-by: Frank Du <frank.du@intel.com>

frankdjx commented May 25, 2020

We found that the SumKernel benchmark's dense (null percent 0) results are relatively low compared to the sparse part for the float and double types. Below are the results before unrolling the loop.

  • SumKernelFloat/32768/0 9.12 us 9.11 us 76809 bytes_per_second=3.35021G/s null_percent=0
  • SumKernelFloat/32768/1 4.38 us 4.38 us 159861 bytes_per_second=6.97134G/s null_percent=1
  • SumKernelDouble/32768/0 4.99 us 4.99 us 140190 bytes_per_second=6.11937G/s null_percent=0
  • SumKernelDouble/32768/1 2.14 us 2.14 us 327897 bytes_per_second=14.246G/s null_percent=1

With the unroll change, the dense SumKernel benchmark gets a 3.7x improvement for float and a 2.6x speedup for double.

  • SumKernelFloat/32768/0 2.46 us 2.46 us 285185 bytes_per_second=12.4269G/s null_percent=0
  • SumKernelDouble/32768/0 1.90 us 1.90 us 370921 bytes_per_second=16.0846G/s null_percent=0

It could reach even higher performance using intrinsics; I'd like to work on that at a later point.


wesm commented May 25, 2020

@jianxind thank you for looking into this! I had suspected that there was room for improvement in this algorithm. You're certainly free to implement kernels requiring intrinsics and put them in aggregate_basic_$SIMD_VERSION.cc, and then we can utilize the CpuInfo from ExecContext during kernel dispatch to select a supported kernel.

@wesm wesm changed the title ARROW-8772: [C++] Unrooled aggregate dense for better speculative execution ARROW-8772: [C++] Unrolled aggregate dense for better speculative execution May 25, 2020
@wesm wesm requested a review from fsaintjacques May 25, 2020 13:52

wesm commented May 25, 2020

This shows a small benefit on ARM64 architecture (ThunderX1):

before:

-----------------------------------------------------------------------------
Benchmark                   Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------
SumKernel/32768/0        19.5 us         19.5 us        35883 bytes_per_second=1.56853G/s null_percent=0 size=32.768k
SumKernel/32768/1        19.2 us         19.2 us        36172 bytes_per_second=1.58548G/s null_percent=1 size=32.768k
SumKernel/32768/10       21.7 us         21.7 us        32230 bytes_per_second=1.40723G/s null_percent=10 size=32.768k
SumKernel/32768/50       22.4 us         22.4 us        31172 bytes_per_second=1.36246G/s null_percent=50 size=32.768k

after (see the Int64 benchmarks):

-----------------------------------------------------------------------------------
Benchmark                         Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------
SumKernelFloat/32768/0         25.4 us         25.4 us        27590 bytes_per_second=1.20019G/s null_percent=0 size=32.768k
SumKernelFloat/32768/1         41.2 us         41.2 us        16973 bytes_per_second=758.715M/s null_percent=1 size=32.768k
SumKernelFloat/32768/10        49.0 us         49.0 us        14285 bytes_per_second=638.254M/s null_percent=10 size=32.768k
SumKernelFloat/32768/50        51.5 us         51.5 us        13625 bytes_per_second=606.657M/s null_percent=50 size=32.768k
SumKernelDouble/32768/0        18.5 us         18.5 us        37953 bytes_per_second=1.65038G/s null_percent=0 size=32.768k
SumKernelDouble/32768/1        26.2 us         26.2 us        26852 bytes_per_second=1.16631G/s null_percent=1 size=32.768k
SumKernelDouble/32768/10       27.3 us         27.3 us        25685 bytes_per_second=1.11778G/s null_percent=10 size=32.768k
SumKernelDouble/32768/50       27.7 us         27.7 us        25298 bytes_per_second=1.10219G/s null_percent=50 size=32.768k
SumKernelInt8/32768/0          64.4 us         64.4 us        10857 bytes_per_second=485.21M/s null_percent=0 size=32.768k
SumKernelInt8/32768/1          69.8 us         69.8 us        10025 bytes_per_second=448.008M/s null_percent=1 size=32.768k
SumKernelInt8/32768/10         92.7 us         92.7 us         7554 bytes_per_second=337.084M/s null_percent=10 size=32.768k
SumKernelInt8/32768/50          100 us          100 us         6974 bytes_per_second=311.424M/s null_percent=50 size=32.768k
SumKernelInt16/32768/0         41.0 us         41.0 us        17076 bytes_per_second=762.469M/s null_percent=0 size=32.768k
SumKernelInt16/32768/1         42.5 us         42.5 us        16513 bytes_per_second=735.553M/s null_percent=1 size=32.768k
SumKernelInt16/32768/10        53.7 us         53.7 us        12977 bytes_per_second=581.673M/s null_percent=10 size=32.768k
SumKernelInt16/32768/50        57.7 us         57.7 us        12103 bytes_per_second=541.264M/s null_percent=50 size=32.768k
SumKernelInt32/32768/0         26.0 us         26.0 us        26859 bytes_per_second=1.17154G/s null_percent=0 size=32.768k
SumKernelInt32/32768/1         26.5 us         26.5 us        26555 bytes_per_second=1.15362G/s null_percent=1 size=32.768k
SumKernelInt32/32768/10        32.3 us         32.3 us        21653 bytes_per_second=967.29M/s null_percent=10 size=32.768k
SumKernelInt32/32768/50        34.2 us         34.2 us        20513 bytes_per_second=914.616M/s null_percent=50 size=32.768k
SumKernelInt64/32768/0         18.0 us         18.0 us        39059 bytes_per_second=1.69949G/s null_percent=0 size=32.768k
SumKernelInt64/32768/1         19.3 us         19.3 us        36220 bytes_per_second=1.58067G/s null_percent=1 size=32.768k
SumKernelInt64/32768/10        21.8 us         21.8 us        32171 bytes_per_second=1.40112G/s null_percent=10 size=32.768k
SumKernelInt64/32768/50        22.5 us         22.5 us        31062 bytes_per_second=1.35471G/s null_percent=50 size=32.768k

@wesm wesm left a comment

+1

@wesm wesm closed this in ac74b3b May 26, 2020
@frankdjx

Indeed, the dense performance issue only happens for the float and double types; the compiler did a good job for all the int types. That is why I added all data types to the SumKernel benchmark.

@jianxind thank you for looking into this! I had suspected that there was room for improvement in this algorithm. You're certainly free to implement kernels requiring intrinsics and put them in aggregate_basic_$SIMD_VERSION.cc, and then we can utilize the CpuInfo from ExecContext during kernel dispatch to select a supported kernel.

Sure, I will work on the SIMD opportunity then, thanks.

@frankdjx frankdjx deleted the SumKernelBenchmark branch May 26, 2020 00:59
pprudhvi pushed a commit to pprudhvi/arrow that referenced this pull request May 26, 2020
…cution

1. Expand the SumKernel benchmark to more types (Float, Double, Int8, Int16, Int32, Int64).
2. Unroll the dense part of the aggregate kernel to speculatively add the results in parallel.

Signed-off-by: Frank Du <frank.du@intel.com>

Closes apache#7267 from jianxind/SumKernelBenchmark

Authored-by: Frank Du <frank.du@intel.com>
Signed-off-by: Wes McKinney <wesm+git@apache.org>
@fsaintjacques

I just wanted to point out that this varies greatly depending on the toolchain; e.g. LLVM is usually more aggressive at unrolling than GCC. Since most of our users (Python, R) will have libarrow built with GCC, I think it's fine.
