core: vectorize countNonZero64f #15685

Merged
opencv-pushbot merged 1 commit into opencv:3.4 from pmur:cnz64f-simd
Oct 18, 2019

Contributor

@pmur pmur commented Oct 10, 2019

Improves performance a bit: 2.2x on P9 (POWER9) and 2-3x on Coffee Lake x86-64.

This pull request changes

Update countNonZero64f to use SIMD as available on the target platform.

force_builders=Linux AVX2,Custom
buildworker:Custom=linux-3
build_image:Custom=ubuntu:18.04
CPU_BASELINE:Custom=AVX512_SKX
disable_ipp=ON


for (i = 0; i < len0; i += step)
{
    sum1 += v_reinterpret_as_s64(vx_load(&src[i]) == zero);
Contributor

@ChipKerchner ChipKerchner Oct 11, 2019

Can't these loads and comparisons be done with s64 instead? Comparing to zero is the same and should be faster.

Member

@alalek alalek Oct 11, 2019

Floating-point numbers are tricky.
There are two possible zeros, +0 and -0, so reinterpreting the values as int64/uint64 is not valid.
Details: https://en.wikipedia.org/wiki/Signed_zero

BTW, the same bug is in the countNonZero32f() optimization.
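
To make the signed-zero point concrete, here is a small standalone C++ sketch. It is not code from this PR; the helper names (`bits_of`, `is_zero_fp`, `is_zero_int_naive`, `check_signed_zero`) are hypothetical, and it assumes `double` is a 64-bit IEEE 754 binary64 value.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Raw bit pattern of a double (assumes a 64-bit IEEE 754 binary64 layout).
static std::uint64_t bits_of(double v)
{
    static_assert(sizeof(std::uint64_t) == sizeof(double), "expects a 64-bit double");
    std::uint64_t u;
    std::memcpy(&u, &v, sizeof u);
    return u;
}

// Floating-point zero test: true for both +0.0 and -0.0.
static bool is_zero_fp(double v) { return v == 0.0; }

// Naive integer zero test on the raw bits: misses -0.0, whose sign bit is set.
static bool is_zero_int_naive(double v) { return bits_of(v) == 0; }

// The expectations, written as assertions rather than printed output.
static void check_signed_zero()
{
    assert(is_zero_fp(0.0) && is_zero_int_naive(0.0));
    assert(is_zero_fp(-0.0));          // floating-point compare treats -0.0 as zero
    assert(!is_zero_int_naive(-0.0));  // integer compare on the bits does not
}
```

With this sketch, an element holding -0.0 would be counted as non-zero by the naive integer test but as zero by the floating-point test.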

Contributor

How about using a 63-bit mask on top of that?

v_zero_mask = 0x7fffffffffffffff;

((vx_load(&src[i]) & v_zero_mask) == zero)
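
A scalar C++ sketch of the masked check suggested above, for illustration only. The mask value comes from the comment; the function names (`is_zero_masked`, `count_nonzero_masked`, `check_masked`) are hypothetical, and the code assumes a 64-bit IEEE 754 `double`. It is not the vectorized code in this PR.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Integer zero test that clears the sign bit first, so +0.0 and -0.0 both match.
static bool is_zero_masked(double v)
{
    const std::uint64_t zero_mask = 0x7fffffffffffffffULL; // every bit except the sign
    std::uint64_t u;
    std::memcpy(&u, &v, sizeof u);
    return (u & zero_mask) == 0;
}

// Count the elements that are not zero according to the masked integer test.
static int count_nonzero_masked(const std::vector<double>& src)
{
    int nz = 0;
    for (double v : src)
        nz += is_zero_masked(v) ? 0 : 1;
    return nz;
}

// Small usage check: both zeros are skipped, the other two values are counted.
static void check_masked()
{
    assert(count_nonzero_masked({0.0, -0.0, 1.0, -2.5}) == 2);
}
```

The extra AND is the price of doing the check in the integer domain; whether that would actually beat the floating-point compare is not measured here.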

Member

Right!
Mask should work (on known platforms).

Contributor Author

This is a subtle IEEE nuance. Assuming a compliant IEEE 754-2008 implementation, comparisons ignore the sign of zero (section 5.11).

Member

Yes, the floating-point comparison works correctly,
but it is not equivalent to an integer-based zero check:

comparisons be done with s64 instead

The f32 optimization is fine (v_reinterpret is applied after the comparison).
The check in the current implementation is fine too (because it is a floating-point check).
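
The "compare first, reinterpret afterwards" pattern discussed in this thread can be sketched in scalar C++ as follows. This is an illustrative stand-in, not the PR's code: the names are hypothetical, it assumes that a true comparison lane is an all-ones mask (which reads as -1 when viewed as a signed 64-bit integer), and it assumes the final count is obtained by adding the accumulated sum to the element count, since the reviewed snippet does not show that step.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Scalar stand-in for one SIMD lane of (value == zero):
// all bits set (-1 as int64) when the floating-point compare is true, else 0.
static std::int64_t lane_mask_eq_zero(double v)
{
    return (v == 0.0) ? std::int64_t(-1) : std::int64_t(0);
}

// Count non-zero elements by summing the comparison masks.
static int count_nonzero_via_masks(const std::vector<double>& src)
{
    std::int64_t sum = 0; // ends up as -(number of zero elements)
    for (double v : src)
        sum += lane_mask_eq_zero(v);
    return static_cast<int>(static_cast<std::int64_t>(src.size()) + sum);
}

// Usage check: +0.0 and -0.0 are both treated as zero by the floating-point compare.
static void check_via_masks()
{
    assert(count_nonzero_via_masks({0.0, -0.0, 1.0, -2.5, 0.0}) == 2);
}
```

Because the comparison happens in the floating-point domain, the signed-zero issue never arises; the reinterpretation only changes how the resulting mask is accumulated.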

@terfendail
Contributor

Performance for SSE2 baseline

| Performance test | Reference time | PR time | Speedup |
|---|---|---|---|
| countNonZero::Size_MatType::(127x61, 8UC1) | 0.000 | 0.000 | 0.94 |
| countNonZero::Size_MatType::(127x61, 8SC1) | 0.000 | 0.000 | 0.97 |
| countNonZero::Size_MatType::(127x61, 16UC1) | 0.000 | 0.000 | 1.00 |
| countNonZero::Size_MatType::(127x61, 16SC1) | 0.000 | 0.000 | 1.00 |
| countNonZero::Size_MatType::(127x61, 32SC1) | 0.001 | 0.001 | 1.01 |
| countNonZero::Size_MatType::(127x61, 32FC1) | 0.001 | 0.001 | 1.00 |
| countNonZero::Size_MatType::(127x61, 64FC1) | 0.004 | 0.001 | 2.66 |
| countNonZero::Size_MatType::(640x480, 8UC1) | 0.011 | 0.011 | 1.01 |
| countNonZero::Size_MatType::(640x480, 8SC1) | 0.011 | 0.011 | 1.01 |
| countNonZero::Size_MatType::(640x480, 16UC1) | 0.015 | 0.015 | 1.02 |
| countNonZero::Size_MatType::(640x480, 16SC1) | 0.015 | 0.015 | 1.00 |
| countNonZero::Size_MatType::(640x480, 32SC1) | 0.038 | 0.035 | 1.10 |
| countNonZero::Size_MatType::(640x480, 32FC1) | 0.038 | 0.035 | 1.09 |
| countNonZero::Size_MatType::(640x480, 64FC1) | 0.134 | 0.088 | 1.52 |
| countNonZero::Size_MatType::(1280x720, 8UC1) | 0.034 | 0.033 | 1.01 |
| countNonZero::Size_MatType::(1280x720, 8SC1) | 0.034 | 0.033 | 1.02 |
| countNonZero::Size_MatType::(1280x720, 16UC1) | 0.071 | 0.064 | 1.11 |
| countNonZero::Size_MatType::(1280x720, 16SC1) | 0.072 | 0.065 | 1.11 |
| countNonZero::Size_MatType::(1280x720, 32SC1) | 0.140 | 0.133 | 1.06 |
| countNonZero::Size_MatType::(1280x720, 32FC1) | 0.140 | 0.131 | 1.07 |
| countNonZero::Size_MatType::(1280x720, 64FC1) | 0.405 | 0.278 | 1.46 |
| countNonZero::Size_MatType::(1920x1080, 8UC1) | 0.089 | 0.079 | 1.13 |
| countNonZero::Size_MatType::(1920x1080, 8SC1) | 0.089 | 0.079 | 1.12 |
| countNonZero::Size_MatType::(1920x1080, 16UC1) | 0.160 | 0.149 | 1.07 |
| countNonZero::Size_MatType::(1920x1080, 16SC1) | 0.160 | 0.151 | 1.07 |
| countNonZero::Size_MatType::(1920x1080, 32SC1) | 0.333 | 0.317 | 1.05 |
| countNonZero::Size_MatType::(1920x1080, 32FC1) | 0.337 | 0.321 | 1.05 |
| countNonZero::Size_MatType::(1920x1080, 64FC1) | 1.240 | 0.878 | 1.41 |

Performance for SSE3 baseline

| Performance test | Reference time | PR time | Speedup |
|---|---|---|---|
| countNonZero::Size_MatType::(127x61, 8UC1) | 0.000 | 0.000 | 1.00 |
| countNonZero::Size_MatType::(127x61, 8SC1) | 0.000 | 0.000 | 0.98 |
| countNonZero::Size_MatType::(127x61, 16UC1) | 0.000 | 0.000 | 1.00 |
| countNonZero::Size_MatType::(127x61, 16SC1) | 0.000 | 0.000 | 1.00 |
| countNonZero::Size_MatType::(127x61, 32SC1) | 0.001 | 0.001 | 1.02 |
| countNonZero::Size_MatType::(127x61, 32FC1) | 0.001 | 0.001 | 0.98 |
| countNonZero::Size_MatType::(127x61, 64FC1) | 0.003 | 0.001 | 2.63 |
| countNonZero::Size_MatType::(640x480, 8UC1) | 0.012 | 0.012 | 1.01 |
| countNonZero::Size_MatType::(640x480, 8SC1) | 0.012 | 0.012 | 1.00 |
| countNonZero::Size_MatType::(640x480, 16UC1) | 0.015 | 0.015 | 0.98 |
| countNonZero::Size_MatType::(640x480, 16SC1) | 0.015 | 0.015 | 1.00 |
| countNonZero::Size_MatType::(640x480, 32SC1) | 0.034 | 0.038 | 0.88 |
| countNonZero::Size_MatType::(640x480, 32FC1) | 0.032 | 0.038 | 0.84 |
| countNonZero::Size_MatType::(640x480, 64FC1) | 0.130 | 0.094 | 1.39 |
| countNonZero::Size_MatType::(1280x720, 8UC1) | 0.037 | 0.037 | 1.00 |
| countNonZero::Size_MatType::(1280x720, 8SC1) | 0.037 | 0.037 | 0.99 |
| countNonZero::Size_MatType::(1280x720, 16UC1) | 0.067 | 0.070 | 0.95 |
| countNonZero::Size_MatType::(1280x720, 16SC1) | 0.067 | 0.070 | 0.95 |
| countNonZero::Size_MatType::(1280x720, 32SC1) | 0.135 | 0.142 | 0.95 |
| countNonZero::Size_MatType::(1280x720, 32FC1) | 0.136 | 0.141 | 0.96 |
| countNonZero::Size_MatType::(1280x720, 64FC1) | 0.404 | 0.299 | 1.35 |
| countNonZero::Size_MatType::(1920x1080, 8UC1) | 0.083 | 0.098 | 0.84 |
| countNonZero::Size_MatType::(1920x1080, 8SC1) | 0.085 | 0.098 | 0.86 |
| countNonZero::Size_MatType::(1920x1080, 16UC1) | 0.156 | 0.164 | 0.95 |
| countNonZero::Size_MatType::(1920x1080, 16SC1) | 0.154 | 0.163 | 0.94 |
| countNonZero::Size_MatType::(1920x1080, 32SC1) | 0.328 | 0.349 | 0.94 |
| countNonZero::Size_MatType::(1920x1080, 32FC1) | 0.325 | 0.348 | 0.93 |
| countNonZero::Size_MatType::(1920x1080, 64FC1) | 1.254 | 0.931 | 1.35 |

Performance for SSE4_2 baseline

| Performance test | Reference time | PR time | Speedup |
|---|---|---|---|
| countNonZero::Size_MatType::(127x61, 8UC1) | 0.000 | 0.000 | 0.97 |
| countNonZero::Size_MatType::(127x61, 8SC1) | 0.000 | 0.000 | 0.99 |
| countNonZero::Size_MatType::(127x61, 16UC1) | 0.000 | 0.000 | 1.00 |
| countNonZero::Size_MatType::(127x61, 16SC1) | 0.000 | 0.000 | 1.00 |
| countNonZero::Size_MatType::(127x61, 32SC1) | 0.001 | 0.001 | 1.02 |
| countNonZero::Size_MatType::(127x61, 32FC1) | 0.001 | 0.001 | 1.00 |
| countNonZero::Size_MatType::(127x61, 64FC1) | 0.002 | 0.001 | 1.87 |
| countNonZero::Size_MatType::(640x480, 8UC1) | 0.012 | 0.012 | 1.05 |
| countNonZero::Size_MatType::(640x480, 8SC1) | 0.011 | 0.012 | 0.95 |
| countNonZero::Size_MatType::(640x480, 16UC1) | 0.014 | 0.015 | 0.95 |
| countNonZero::Size_MatType::(640x480, 16SC1) | 0.015 | 0.014 | 1.03 |
| countNonZero::Size_MatType::(640x480, 32SC1) | 0.033 | 0.033 | 1.00 |
| countNonZero::Size_MatType::(640x480, 32FC1) | 0.033 | 0.034 | 0.97 |
| countNonZero::Size_MatType::(640x480, 64FC1) | 0.100 | 0.088 | 1.13 |
| countNonZero::Size_MatType::(1280x720, 8UC1) | 0.038 | 0.038 | 1.00 |
| countNonZero::Size_MatType::(1280x720, 8SC1) | 0.037 | 0.038 | 0.98 |
| countNonZero::Size_MatType::(1280x720, 16UC1) | 0.066 | 0.065 | 1.02 |
| countNonZero::Size_MatType::(1280x720, 16SC1) | 0.066 | 0.065 | 1.02 |
| countNonZero::Size_MatType::(1280x720, 32SC1) | 0.130 | 0.133 | 0.97 |
| countNonZero::Size_MatType::(1280x720, 32FC1) | 0.130 | 0.133 | 0.98 |
| countNonZero::Size_MatType::(1280x720, 64FC1) | 0.306 | 0.278 | 1.10 |
| countNonZero::Size_MatType::(1920x1080, 8UC1) | 0.086 | 0.086 | 1.00 |
| countNonZero::Size_MatType::(1920x1080, 8SC1) | 0.084 | 0.085 | 0.98 |
| countNonZero::Size_MatType::(1920x1080, 16UC1) | 0.150 | 0.149 | 1.00 |
| countNonZero::Size_MatType::(1920x1080, 16SC1) | 0.149 | 0.149 | 1.00 |
| countNonZero::Size_MatType::(1920x1080, 32SC1) | 0.316 | 0.318 | 1.00 |
| countNonZero::Size_MatType::(1920x1080, 32FC1) | 0.317 | 0.317 | 1.00 |
| countNonZero::Size_MatType::(1920x1080, 64FC1) | 1.054 | 0.880 | 1.20 |

Performance for AVX2 baseline

| Performance test | Reference time | PR time | Speedup |
|---|---|---|---|
| countNonZero::Size_MatType::(127x61, 8UC1) | 0.000 | 0.000 | 0.99 |
| countNonZero::Size_MatType::(127x61, 8SC1) | 0.000 | 0.000 | 1.02 |
| countNonZero::Size_MatType::(127x61, 16UC1) | 0.000 | 0.000 | 1.03 |
| countNonZero::Size_MatType::(127x61, 16SC1) | 0.000 | 0.000 | 1.03 |
| countNonZero::Size_MatType::(127x61, 32SC1) | 0.001 | 0.001 | 1.00 |
| countNonZero::Size_MatType::(127x61, 32FC1) | 0.001 | 0.001 | 1.02 |
| countNonZero::Size_MatType::(127x61, 64FC1) | 0.003 | 0.001 | 3.08 |
| countNonZero::Size_MatType::(640x480, 8UC1) | 0.007 | 0.007 | 1.01 |
| countNonZero::Size_MatType::(640x480, 8SC1) | 0.007 | 0.007 | 1.01 |
| countNonZero::Size_MatType::(640x480, 16UC1) | 0.012 | 0.011 | 1.04 |
| countNonZero::Size_MatType::(640x480, 16SC1) | 0.012 | 0.011 | 1.02 |
| countNonZero::Size_MatType::(640x480, 32SC1) | 0.033 | 0.036 | 0.91 |
| countNonZero::Size_MatType::(640x480, 32FC1) | 0.033 | 0.037 | 0.89 |
| countNonZero::Size_MatType::(640x480, 64FC1) | 0.123 | 0.091 | 1.35 |
| countNonZero::Size_MatType::(1280x720, 8UC1) | 0.022 | 0.022 | 0.99 |
| countNonZero::Size_MatType::(1280x720, 8SC1) | 0.022 | 0.022 | 0.99 |
| countNonZero::Size_MatType::(1280x720, 16UC1) | 0.066 | 0.069 | 0.95 |
| countNonZero::Size_MatType::(1280x720, 16SC1) | 0.065 | 0.070 | 0.94 |
| countNonZero::Size_MatType::(1280x720, 32SC1) | 0.131 | 0.138 | 0.95 |
| countNonZero::Size_MatType::(1280x720, 32FC1) | 0.132 | 0.141 | 0.94 |
| countNonZero::Size_MatType::(1280x720, 64FC1) | 0.384 | 0.289 | 1.33 |
| countNonZero::Size_MatType::(1920x1080, 8UC1) | 0.074 | 0.079 | 0.93 |
| countNonZero::Size_MatType::(1920x1080, 8SC1) | 0.073 | 0.079 | 0.93 |
| countNonZero::Size_MatType::(1920x1080, 16UC1) | 0.148 | 0.160 | 0.93 |
| countNonZero::Size_MatType::(1920x1080, 16SC1) | 0.148 | 0.158 | 0.93 |
| countNonZero::Size_MatType::(1920x1080, 32SC1) | 0.317 | 0.339 | 0.94 |
| countNonZero::Size_MatType::(1920x1080, 32FC1) | 0.322 | 0.340 | 0.95 |
| countNonZero::Size_MatType::(1920x1080, 64FC1) | 1.199 | 0.908 | 1.32 |

opencv-pushbot pushed a commit that referenced this pull request Oct 18, 2019
@opencv-pushbot opencv-pushbot merged commit ec91a3d into opencv:3.4 Oct 18, 2019
@alalek alalek mentioned this pull request Oct 24, 2019