SIMD: Replace raw SIMD of sin/cos with NPYV(universal intrinsics) #17587
ping @mattip
 ** $maxopt $werror baseline
 ** $maxopt baseline
Remove treating warnings as errors once CI passes the tests.
CI is passing
Done. I temporarily used this policy during development to detect any warnings.
Nice speedups. Is this for 32-bit float only or also for 64-bit? Edit: 32-bit only.
The new code improves the performance of non-contiguous memory access for the output array without any performance regression. On PPC64LE performance improved by a factor of 2.0-3.0, and on aarch64 by a factor of 1.5-2.0.
This test should not be exclusive to AVX. This patch also extends the unary test to cover different sets of output strides.
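To illustrate what covering different output strides means, here is a minimal pure-Python sketch. It is not NumPy's test code or its SIMD loop: `unary_loop` and `check_strides` are hypothetical names, and the scalar loop only models how a unary inner loop reads every `stride_in`-th element and writes every `stride_out`-th element of a float32 buffer.

```python
import math
from array import array


def unary_loop(func, src, dst, n, stride_in, stride_out):
    """Scalar model of a unary inner loop over strided buffers (hypothetical)."""
    for i in range(n):
        dst[i * stride_out] = func(src[i * stride_in])


def check_strides(func, n=16, strides=(1, 2, 4)):
    """Compare each (stride_in, stride_out) pair against a contiguous run."""
    values = [0.1 * i for i in range(n)]

    # Reference result: contiguous float32 input and output.
    expected = array("f", [0.0] * n)
    unary_loop(func, array("f", values), expected, n, 1, 1)

    for s_in in strides:
        for s_out in strides:
            src = array("f", [0.0] * (n * s_in))
            dst = array("f", [-1.0] * (n * s_out))  # -1.0 marks untouched slots
            for i, v in enumerate(values):
                src[i * s_in] = v
            unary_loop(func, src, dst, n, s_in, s_out)

            # Strided results must match the contiguous reference exactly.
            assert list(dst[::s_out]) == list(expected)
            # Slots between output elements must not be written.
            assert all(dst[j] == -1.0 for j in range(len(dst)) if j % s_out)
    return True


if __name__ == "__main__":
    print(check_strides(math.sin), check_strides(math.cos))
```

The sentinel check is the part that matters for output strides: a vectorized store that ignores the stride would overwrite the gaps between output elements.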
@mattip, I just replaced the raw SIMD code for f32 with NPYV.
Thanks @seiko2plus |
Merge after #17790, #17789
TODO:
Performance tests(ASV)
Args
X86
I had to rely on my local machine because I wasn't able to get stable ratios using AWS.
See the standalone benchmark for AVX512F.
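As a reading aid for the ASV results in this section, the ratio column is the "after" time divided by the "before" time, so values below 1.0 mean the branch is faster. A small sketch using the first `cos` row reported for AVX2 & FMA3 (259us before, 55.1us after); the helper name `asv_ratio` is made up for illustration:

```python
def asv_ratio(before_us, after_us):
    """Ratio as shown by ASV comparisons: after / before (lower is better)."""
    return after_us / before_us


# First 'cos' row from the AVX2 & FMA3 results: 259us -> 55.1us.
before_us, after_us = 259.0, 55.1
ratio = asv_ratio(before_us, after_us)
speedup = before_us / after_us

print(f"ratio={ratio:.2f} speedup={speedup:.1f}x")
# prints: ratio=0.21 speedup=4.7x
```

A ratio of 0.18-0.22, as in those rows, corresponds to roughly a 4.5x-5.5x speedup.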
CPU
OS
Benchmark
AVX2 & FMA3 - Changed only
| before [098a3b41] `<master>` | after [a0322ee9] `<to_npyv_sincos_f32>` | ratio | benchmark |
|---|---|---|---|
| 259~3us | 55.1~0.2us | 0.21 | bench_ufunc_strides.Unary.time_ufunc('cos', 1, 2, 'f') |
| 260~4us | 56.2~0.2us | 0.22 | bench_ufunc_strides.Unary.time_ufunc('cos', 1, 4, 'f') |
| 334~0.8us | 60.4~0.07us | 0.18 | bench_ufunc_strides.Unary.time_ufunc('cos', 2, 2, 'f') |
| 335~0.9us | 61.5~0.2us | 0.18 | bench_ufunc_strides.Unary.time_ufunc('cos', 2, 4, 'f') |
| 337~0.4us | 62.1~0.2us | 0.18 | bench_ufunc_strides.Unary.time_ufunc('cos', 4, 2, 'f') |
| 339~2us | 61.2~0.6us | 0.18 | bench_ufunc_strides.Unary.time_ufunc('cos', 4, 4, 'f') |
| 266~10us | 54.9~0.2us | 0.21 | bench_ufunc_strides.Unary.time_ufunc('sin', 1, 2, 'f') |
| 270~20us | 55.6~0.2us | 0.21 | bench_ufunc_strides.Unary.time_ufunc('sin', 1, 4, 'f') |
| 331~3us | 60.3~0.1us | 0.18 | bench_ufunc_strides.Unary.time_ufunc('sin', 2, 2, 'f') |
| 332~2us | 61.0~0.3us | 0.18 | bench_ufunc_strides.Unary.time_ufunc('sin', 2, 4, 'f') |
| 336~1us | 61.7~0.3us | 0.18 | bench_ufunc_strides.Unary.time_ufunc('sin', 4, 2, 'f') |
| 335~0.2us | 61.5~0.4us | 0.18 | bench_ufunc_strides.Unary.time_ufunc('sin', 4, 4, 'f') |

Power little-endian
CPU
OS
Benchmark
VSX2(ISA >= 2.07) - Changed only
Performance tests(standalone #15987)
Args used within #15987
Note:
`--msleep 1` forces the running thread to sleep 1 millisecond before collecting each sample, to revert any frequency reduction, since throttling seems to affect wall time when `AVX512F` is enabled.
X86
CPU
OS
Benchmark
AVX512F - Contiguous only
metric: gmean, units: ms
1.13 1.07
AVX512F
metric: gmean, units: ms
[Results table: values not recoverable from the page extraction]
AVX2 & FMA3 - Contiguous only
metric: gmean, units: ms
AVX2 & FMA3
metric: gmean, units: ms
[Results table: values not recoverable from the page extraction]
ARM8 64-bit
CPU
OS
Benchmark
ASIMD - Contiguous only
metric: gmean, units: ms
1.93 2.0 2.11 1.97 2.03 2.09
ASIMD
metric: gmean, units: ms
[Results table: values not recoverable from the page extraction]
Power little-endian
CPU
OS
Benchmark
VSX2(ISA >= 2.07) - Contiguous only
metric: gmean, units: ms
2.94 2.99 3.03 3.16 3.13 3.2
VSX2(ISA >= 2.07)
metric: gmean, units: ms
[Results table: values not recoverable from the page extraction]