core: vectorize countNonZero64f #15685
for( i = 0; i < len0; i += step )
{
    sum1 += v_reinterpret_as_s64(vx_load(&src[i]) == zero);
Can't these loads and comparisons be done with s64 instead? Comparing to zero is the same and should be faster.
Floating-point numbers are tricky.
There are two possible zeros, +0 and -0, so interpreting the bits as int64/uint64 and comparing against zero is not valid.
Details: https://en.wikipedia.org/wiki/Signed_zero
BTW, the same bug is in the countNonZero32f() optimization.
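To make the signed-zero point concrete, here is a minimal scalar sketch, assuming `double` is an IEEE 754 binary64 value. The helper names (`bits_of`, `is_zero_fp`, `is_zero_int`) are hypothetical and are not part of OpenCV; they only illustrate why a raw integer compare misses -0.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: return the raw bit pattern of a double.
// memcpy is used instead of a pointer cast to avoid undefined behaviour.
static std::uint64_t bits_of(double v)
{
    std::uint64_t u;
    std::memcpy(&u, &v, sizeof u);
    return u;
}

// Floating-point zero test: true for both +0.0 and -0.0.
static bool is_zero_fp(double v) { return v == 0.0; }

// Naive integer zero test: true only when every bit is clear, i.e. +0.0.
// -0.0 has the sign bit set (assuming IEEE 754 binary64), so it is missed.
static bool is_zero_int(double v) { return bits_of(v) == 0; }
```

With these definitions, `is_zero_fp(-0.0)` holds while `is_zero_int(-0.0)` does not, so an integer-based count would treat -0 as a nonzero element.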
How about using a 63-bit mask on top of that?
v_zero_mask = 0x7fffffffffffffff;
((vx_load(&src[i]) & v_zero_mask) == zero)
Right!
Mask should work (on known platforms).
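The mask idea can be sketched in scalar form as follows. This is only an illustration under the assumption that `double` is IEEE 754 binary64 with the sign in the top bit; `is_zero_masked` is a hypothetical name, and the PR itself works on SIMD registers rather than single values.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: integer zero test with the sign bit masked off.
// Clearing bit 63 maps both +0.0 and -0.0 to an all-zero pattern, while
// every other value (including subnormals and NaN) keeps a nonzero bit.
static bool is_zero_masked(double v)
{
    const std::uint64_t abs_mask = 0x7fffffffffffffffULL;  // low 63 bits
    std::uint64_t u;
    std::memcpy(&u, &v, sizeof u);  // raw bits, no pointer-cast aliasing
    return (u & abs_mask) == 0;
}
```

The extra AND is the cost of staying in the integer domain; whether that is faster than a floating-point compare depends on the target, which matches the "on known platforms" caveat above.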
This is a subtle IEEE nuance. Assuming a compliant IEEE 754-2008 implementation, floating-point comparisons ignore the sign of zero (section 5.11).
Yes, the floating-point comparison works well,
but it is not equivalent to the integer-based zero check suggested earlier ("comparisons be done with s64 instead").
The f32 optimization is fine (v_reinterpret is applied after the comparison).
The check in the current implementation is fine too (because it is a floating-point comparison).
Improves performance a bit: 2.2x on P9 and 2-3x on Coffee Lake x86-64.
Performance for SSE2 baseline
Performance for SSE3 baseline
Performance for SSE4_2 baseline
Performance for AVX2 baseline
This pull request changes
Update countNonZero64f to use SIMD as available on the target platform.
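As a rough illustration of the counting trick in the reviewed loop, the following scalar sketch emulates what one SIMD lane does. It assumes, based only on the diff line quoted in the review, that each comparison result is reinterpreted as a signed 64-bit lane (all ones, i.e. -1, where the element equals zero) and accumulated. The function name `count_nonzero_64f_sketch` and the final `len + sum` step are assumptions for illustration, not the actual OpenCV code.

```cpp
#include <cstdint>

// Hypothetical scalar stand-in for the vectorized loop.
static int count_nonzero_64f_sketch(const double* src, int len)
{
    std::int64_t sum = 0;  // ends up as -(number of zero elements)
    for (int i = 0; i < len; ++i)
    {
        // Floating-point compare, so +0.0 and -0.0 both match.
        // A SIMD compare would yield an all-ones lane, which reads as -1
        // once reinterpreted as a signed 64-bit integer.
        const std::int64_t mask = (src[i] == 0.0) ? -1 : 0;
        sum += mask;
    }
    // Each zero contributed -1, so the nonzero count is len + sum.
    return len + static_cast<int>(sum);
}
```

For the input `{0.0, -0.0, 1.5, -2.0, 0.0}` this returns 2. NaN elements compare unequal to zero and are therefore counted as nonzero.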