ARROW-8772: [C++] Unrolled aggregate dense for better speculative execution #7267
Conversation
1. Expand the SumKernel benchmark to more types (Float, Double, Int8, Int16, Int32, Int64). 2. Unroll the aggregate kernel dense part to speculatively add the results in parallel. Signed-off-by: Frank Du <frank.du@intel.com>
We found that the SumKernel benchmark dense (null percent 0) results are relatively low compared to the sparse part for the float and double types. Below is the result before unrolling the loop.
With the unroll change, the dense SumKernel benchmark gets a 3.7x improvement for float and a 2.6x speedup for double.
It could reach even higher performance with intrinsics; I'd like to work on that at a later point.
@jianxind thank you for looking into this! I had suspected that there was room for improvement in this algorithm. You're certainly free to implement kernels requiring intrinsics and put them in aggregate_basic_$SIMD_VERSION.cc, and then we can utilize the CpuInfo from
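The runtime-dispatch idea mentioned here (selecting a SIMD kernel based on detected CPU features) could look roughly like the sketch below. This is a hedged illustration, not Arrow's actual CpuInfo API; the function names are hypothetical, and the feature probe uses the real GCC/Clang builtin `__builtin_cpu_supports`, available on x86 targets.

```cpp
#include <cstddef>

// Hypothetical dispatch sketch (not Arrow's code). A scalar baseline
// kernel is always available; a SIMD kernel would live in a file like
// aggregate_basic_avx2.cc and be selected at runtime.
double SumScalar(const double* v, size_t n) {
  double s = 0;
  for (size_t i = 0; i < n; ++i) s += v[i];
  return s;
}

double SumAvx2Stub(const double* v, size_t n) {
  // Stand-in for a real AVX2 kernel; falls back to scalar here so the
  // sketch stays self-contained.
  return SumScalar(v, n);
}

using SumFn = double (*)(const double*, size_t);

// Pick the best available implementation once, based on a CPU probe.
SumFn PickSumKernel() {
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
  if (__builtin_cpu_supports("avx2")) return SumAvx2Stub;
#endif
  return SumScalar;
}
```

The key design point is that the probe runs once (e.g. cached in a function pointer), so per-batch kernel calls pay no detection cost.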
This shows a small benefit on the ARM64 architecture (ThunderX1); see the Int64 benchmarks (before/after results omitted).
wesm left a comment
+1
Indeed, the dense performance issue only happens for the float and double types; the compiler did a good job for all int types, which is why I added all the data types to the SumKernel benchmark.
Sure, I will work on the SIMD opportunity then, thanks.
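For the float case the follow-up SIMD work mentioned here might resemble the sketch below. This is a hedged example under the assumption that AVX is available, not the eventual implementation; it uses the real Intel intrinsics `_mm256_setzero_ps`, `_mm256_loadu_ps`, `_mm256_add_ps`, and `_mm256_store_ps`, with a plain-loop fallback for non-AVX builds.

```cpp
#include <cstddef>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Illustrative AVX float sum (function name is hypothetical). With AVX,
// eight floats are added per instruction; otherwise fall back to a
// scalar loop so the sketch compiles everywhere.
float SumFloats(const float* v, size_t n) {
#if defined(__AVX__)
  __m256 acc = _mm256_setzero_ps();
  size_t i = 0;
  for (; i + 8 <= n; i += 8) {
    acc = _mm256_add_ps(acc, _mm256_loadu_ps(v + i));
  }
  // Horizontal reduction of the 8 vector lanes.
  alignas(32) float lanes[8];
  _mm256_store_ps(lanes, acc);
  float sum = 0;
  for (float x : lanes) sum += x;
  // Scalar tail for the remaining elements.
  for (; i < n; ++i) sum += v[i];
  return sum;
#else
  float sum = 0;
  for (size_t i = 0; i < n; ++i) sum += v[i];
  return sum;
#endif
}
```

As with the unrolled scalar version, the vector sum reassociates additions, so bit-exact agreement with a sequential sum is not guaranteed for arbitrary float data.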
…cution 1. Expand the SumKernel benchmark to more types (Float, Double, Int8, Int16, Int32, Int64). 2. Unroll the aggregate kernel dense part to speculatively add the results in parallel. Signed-off-by: Frank Du <frank.du@intel.com> Closes apache#7267 from jianxind/SumKernelBenchmark Authored-by: Frank Du <frank.du@intel.com> Signed-off-by: Wes McKinney <wesm+git@apache.org>
I just wanted to point out that this varies greatly depending on the toolchain; e.g., LLVM is usually more aggressive at unrolling than GCC. Since most of our users (Python, R) will have libarrow built with GCC, I think it's fine.