SIMD: Replace SVML/ASM of tanh(f32, f64) with universal intrinsics#20363
SIMD: Replace SVML/ASM of tanh(f32, f64) with universal intrinsics#20363mattip merged 4 commits intonumpy:mainfrom
Conversation
419c3f7 to
54a28e8
Compare
7e7913d to
99c3216
Compare
|
no errors yay |
|
Cool!. At some point this should get a benchmark. There are ULP tests in test_umath_accuracy, so if that is passing the code should be at least as accurate as SVML. |
|
Nice, will review it over the next few days. The |
I still need to introduce other two universal intrinsics (npyv_all_##sfx, npyv_any_##sfx) to speed up boolean tests across the vector elements to speed up falling back to
I just converted the x86/avx512 instructions of SMVL into C intrinsics, I did little tweaks but still almost the same implementation.
I thought |
My bad, it is enabled for |
8e9f073 to
9ed5995
Compare
take it back, no performance changes
done |
|
github still doesn't offer an option to change the pr branch, no way to replace |
|
I am a bit surprised that the benchmarks improve AVX512 performance. Isn't it using SVML? |
|
@mattip, depending on the compiler, the latest versions of clang and GCC may even perform better, the current benchmark only shows improvements on |
seiko2plus
left a comment
There was a problem hiding this comment.
Doubts and perceptions.
| @@ -0,0 +1,395 @@ | |||
| /*@targets | |||
| ** $maxopt baseline | |||
| ** (avx2 fma3) AVX512_SKX | |||
There was a problem hiding this comment.
| ** (avx2 fma3) AVX512_SKX | |
| ** (avx2 fma3) avx512f avx512_skx |
adding target for avx512f is important for xeon phi but that is going to increase the binary size, alternatives:
- remove
avx512_skxand keepavx512fmay lead to lose ~(5:10)% approximated(didn't test it) - keep it as-is, users who target
xeon phishould build NumPy with at least--cpu-baseline=avx512fotherwise they will get(AVX2, FMA3)kernel.
There was a problem hiding this comment.
Can we document/warn that the build might be suboptimal?
There was a problem hiding this comment.
only xeon phi chips are affected. all newer chips from intel with avx512 support skx features. if intel is no longer care about xeon phi then why we does? we can update the doc to notify xeon phi users of this approach and recommend them to raise the ceiling of the baseline features to avx512f feature at least. we already have this notification within the build options doc. see quick start. I will leave it as-is till we see what kind AVX512 features AMD is going to support into their new chips.
r-devulap
left a comment
There was a problem hiding this comment.
I did verify that this change produces the exact same output for platforms with AVX-512. I would suggest getting rid of the C callout for special cases (large numbers, INF and NAN). Apart from the callout portion, everything else looks good.
To bring the benefits of performance for all platforms not just for avx512 on linux without performance/accuracy regression, actually the other way around, better performance and after all maintainable code.
9ed5995 to
70226f7
Compare
done for single-precision. |
Once its fixed for double precision, this PR will be good to merge. |
70226f7 to
eba2d29
Compare
…sics
instead of fallback to C
eba2d29 to
d99bf0e
Compare
|
@r-devulap, it's now for both single & double. I have revisited the assembly code again and I just realized that there's no special handling when |
|
@mattip, @r-devulap ping |
|
Thanks @seiko2plus |
|
I just went on a little goose chase to find out to the
|
|
@rgommers, #21955 didn't reuse the universal intrinsic implementation to handle half-precision over single, instead, it brings SVML objects back. However, @r-devulap should have removed only So you can choose between filtering out |
…ementation See comments on gh-20363
|
Thanks for the context, very helpful. Reducing binary size is important. I commented out |
… small drift rate Numpy 1.23 reimplemented tanh function [0,1], which changed the result of positive skew result for small drift rate. [0] numpy/numpy#20363 [1] numpy/numpy@75edab9 Signed-off-by: Jan Vesely <jan.vesely@rutgers.edu>
… small drift rate Numpy 1.23 reimplemented tanh function [0,1], which changed the result of positive skew result for small drift rate. [0] numpy/numpy#20363 [1] numpy/numpy@75edab9 Signed-off-by: Jan Vesely <jan.vesely@rutgers.edu>
Numpy 1.23 reimplemented tanh function [0,1], which changed the result of positive skew result for small drift rate. This gives a 3rd possible set of results for DriftDiffusionAnalytical SmallDriftRate tests. Print numpy information before running the tests to provide more information about optimizations used by numpy. [0] numpy/numpy#20363 [1] numpy/numpy@75edab9
Replace SVML/ASM of tanh for both single and double precision with universal intrinsics
To bring the benefits of performance for all platforms
not just for
avx512on Linux without performance/accuracy regression,actually the other way around, better performance and
after all maintainable code.
The original code can be found in:
Benchmarks
X86
CPU
OS
Linux ip-172-31-32-40 5.11.0-1020-aws #21~20.04.2-Ubuntu SMP Fri Oct 1 13:03:59 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux Python 3.8.10 gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0Benchmark
SVML/AVX512_SKX(before) vs AVX512_SKX(after)
unset NPY_DISABLE_CPU_FEATURES python3 runtests.py -n --bench-compare parent/main tanh -- --sort namebefore after ratio [a1813504] [b5bd8620] <svml2npyv/tanh~3> <svml2npyv/tanh> - 365±0.5μs 218±0.9μs 0.60 bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 2, 1, 'd') - 82.1±0.3μs 70.9±0.5μs 0.86 bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 2, 1, 'f') - 375±0.6μs 238±0.2μs 0.63 bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 2, 2, 'd') - 379±20μs 248±20μs 0.65 bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 2, 4, 'd') - 366±0.5μs 228±3μs 0.62 bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 4, 1, 'd') - 88.2±0.5μs 83.4±0.3μs 0.94 bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 4, 1, 'f') - 378±0.8μs 254±2μs 0.67 bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 4, 2, 'd') - 382±2μs 272±1μs 0.71 bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 4, 4, 'd')AVX2_FMA3
before after ratio [a1813504] [b5bd8620] <svml2npyv/tanh~3> <svml2npyv/tanh> - 1.96±0ms 857±1μs 0.44 bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 1, 1, 'd') - 1.86±0.01ms 250±2μs 0.13 bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 1, 1, 'f') - 1.97±0.05ms 873±30μs 0.44 bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 1, 2, 'd') - 1.86±0.01ms 283±2μs 0.15 bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 1, 2, 'f') - 2.14±0.09ms 962±50μs 0.45 bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 1, 4, 'd') - 1.93±0.04ms 296±8μs 0.15 bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 1, 4, 'f') - 1.96±0ms 884±1μs 0.45 bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 2, 1, 'd') - 1.86±0ms 312±0.9μs 0.17 bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 2, 1, 'f') - 1.97±0ms 885±1μs 0.45 bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 2, 2, 'd') - 1.86±0.02ms 329±4μs 0.18 bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 2, 2, 'f') - 1.97±0.1ms 888±50μs 0.45 bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 2, 4, 'd') - 1.86±0ms 329±0.9μs 0.18 bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 2, 4, 'f') - 1.97±0ms 888±0.9μs 0.45 bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 4, 1, 'd') - 1.86±0ms 315±0.7μs 0.17 bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 4, 1, 'f') - 1.97±0ms 889±2μs 0.45 bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 4, 2, 'd') - 1.86±0ms 331±0.3μs 0.18 bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 4, 2, 'f') - 1.97±0ms 890±3μs 0.45 bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 4, 4, 'd') - 1.87±0.04ms 330±7μs 0.18 bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 4, 4, 'f')Power little-endian
CPU
OS
Linux e517009a912a 4.19.0-2-powerpc64le #1 SMP Debian 4.19.16-1 (2019-01-17) ppc64le ppc64le ppc64le GNU/Linux gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0Benchmark
VSX2
AArch64
CPU
OS
Linux ip-172-31-44-172 5.11.0-1020-aws #21~20.04.2-Ubuntu SMP Fri Oct 1 13:01:34 UTC 2021 aarch64 aarch64 aarch64 GNU/Linux gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0Benchmark
ASIMD
before after ratio [a1813504] [869eae28] <svml2npyv/tanh~3> <svml2npyv/tanh> - 2.69±0.01ms 1.39±0ms 0.52 bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 1, 1, 'd') - 2.71±0.01ms 533±2μs 0.20 bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 1, 1, 'f') - 2.69±0.05ms 1.42±0.02ms 0.53 bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 1, 2, 'd') - 2.70±0.01ms 568±5μs 0.21 bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 1, 2, 'f') - 2.85±0.09ms 1.50±0.04ms 0.52 bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 1, 4, 'd') - 2.78±0.04ms 585±10μs 0.21 bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 1, 4, 'f') - 2.70±0ms 1.44±0ms 0.53 bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 2, 1, 'd') - 2.70±0.01ms 560±3μs 0.21 bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 2, 1, 'f') - 2.69±0ms 1.47±0ms 0.55 bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 2, 2, 'd') - 2.70±0.02ms 598±7μs 0.22 bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 2, 2, 'f') - 2.69±0.09ms 1.47±0.05ms 0.55 bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 2, 4, 'd') - 2.70±0ms 600±4μs 0.22 bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 2, 4, 'f') - 2.70±0.01ms 1.45±0ms 0.54 bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 4, 1, 'd') - 2.70±0.01ms 593±4μs 0.22 bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 4, 1, 'f') - 2.69±0.01ms 1.47±0ms 0.55 bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 4, 2, 'd') - 2.71±0ms 628±4μs 0.23 bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 4, 2, 'f') - 2.69±0.01ms 1.47±0ms 0.55 bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 4, 4, 'd') - 2.72±0.04ms 630±10μs 0.23 bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 4, 4, 'f')Binary size(striped)