MAINT: Optimize numpy.count_nonzero for int types using SIMD operations #18183
mattip merged 14 commits into numpy:master
@Qiyu8, I have replaced the MIN/MAX macros, placed the NPY_SIMD guard in the proper place, merged the count_nonzero_int16/32/64 functions into a single function, and added benchmarks for the 4 int types.
    vsum64 = npyv_add_u64(vsum64, vt);
}

npy_uint64 sums[npyv_nlanes_u64];
You can use the new acceleration intrinsics once #18200 is merged.
@Qiyu8 , I have replaced the manual sums with horizontal SIMD sums.
Well done, the replaced part looks good to me. Now you need to focus on fixing the CI failures and providing ASV benchmark results.
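For readers unfamiliar with the term, the "horizontal sum" discussed in this thread can be illustrated in plain NumPy. This is only a sketch of the idea, not the C code from the PR: `count_nonzero_lanes` and its `nlanes` parameter are hypothetical names, and the per-lane accumulators are assumed to be 64 bits wide so that overflow is not a concern here.

```python
import numpy as np

def count_nonzero_lanes(a, nlanes=4):
    """Emulate lane-wise counting followed by a single horizontal sum.

    Illustration only: real SIMD code would use vector registers,
    not a Python loop over rows.
    """
    flat = np.ascontiguousarray(a).ravel()
    nblocks = flat.size // nlanes
    main = nblocks * nlanes
    # Each row stands in for one vector load of `nlanes` elements.
    blocks = flat[:main].reshape(nblocks, nlanes)
    # One accumulator per lane (assumed 64-bit, so it cannot overflow here).
    lane_sums = np.zeros(nlanes, dtype=np.uint64)
    for block in blocks:
        lane_sums += (block != 0).astype(np.uint64)
    # Horizontal sum: collapse the per-lane counts into one scalar.
    total = int(lane_sums.sum())
    # Scalar tail for elements that do not fill a whole vector.
    total += int(np.count_nonzero(flat[main:]))
    return total
```

The point of the horizontal step is that the lanes are combined once at the end, rather than after every vector operation.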
|
Sorry to hijack this thread but on a related topic on nonzero(), is there a reason why calling nonzero on a 1D array is orders of magnitude faster than a multi-dimensional array? For example, calling it on a Boolean array of shape (1000000,) is taking ~40 µs, while it takes 1400 µs for an array of shape (1000,1000). Both arrays are identical in values and only differ in shape. Any idea what's the significant overhead cost here? |
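A small script along these lines could be used to reproduce the comparison described above. It is a sketch under assumptions: `compare_nonzero_shapes` is a hypothetical helper, the array contents are random rather than the ones used in the comment, and the timings will vary by machine and NumPy version, so no expected numbers are given.

```python
import timeit
import numpy as np

def compare_nonzero_shapes(n=1000, seed=0, number=5, repeat=3):
    """Time np.nonzero on the same boolean data viewed as 1-D and 2-D."""
    rng = np.random.default_rng(seed)
    flat = rng.random(n * n) > 0.5   # shape (n*n,)
    square = flat.reshape(n, n)      # same values, shape (n, n)

    # Best-of-`repeat` timing, in seconds per call.
    t_flat = min(timeit.repeat(lambda: np.nonzero(flat),
                               number=number, repeat=repeat)) / number
    t_square = min(timeit.repeat(lambda: np.nonzero(square),
                                 number=number, repeat=repeat)) / number

    # Sanity check: both calls must describe the same positions.
    idx_flat = np.nonzero(flat)[0]
    rows, cols = np.nonzero(square)
    same = np.array_equal(idx_flat, rows * n + cols)
    return t_flat, t_square, same

if __name__ == "__main__":
    t1, t2, same = compare_nonzero_shapes()
    print(f"1-D: {t1 * 1e6:.1f} us, 2-D: {t2 * 1e6:.1f} us, same indices: {same}")
```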
|
@gnool , without further investigation, I am speculating the overhead is coming from the use of an iterator and calls to |
|
I make heavy use of |
|
@seiko2plus , @Qiyu8 I have pushed updates. |
|
@tylerjereddy , can you point me to some of your use cases of |
|
@touqir14 Finding the indices where an extremely large array of bools is Maybe something like:
# test_array is a huge 1D array of np.float64
relevant_indices = np.nonzero(test_array > 0.5)
|
Yes, a speedup in this case is possible. I will push a commit implementing the optimization for special cases later today or tomorrow. @tylerjereddy |
|
@Qiyu8 , @seiko2plus , I also want to add optimizations to |
let's keep this pull-request only for |
|
@seiko2plus , the overflow possibilities have been taken care of. Please see my last commit to verify. Looks like distinguishing dtypes using |
|
@seiko2plus , are we all good now? If so, please merge this PR. |
|
@touqir14, I made some changes to improve readability and reduce the amount of code; they shouldn't affect performance.
Yes, I think it's good. I'd prefer to wait one more day to give others a chance to look at the code.
|
It would be nice to see a report of the benchmark changes before/after this PR to make sure we have not accidentally slowed down any cases (non-contiguous? F-order?).
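One way to exercise the layouts mentioned here is sketched below. The helper names `layout_cases` and `check_layouts` are hypothetical and not part of NumPy or its benchmark suite; the sketch only checks that `np.count_nonzero` agrees with a plain comparison-and-sum on each layout, and the same arrays could be passed to a timer to compare speed before and after a change.

```python
import numpy as np

def layout_cases(size=10_000, dtype=np.int32):
    """Build the same kind of data in several memory layouts."""
    base = (np.arange(2 * size).reshape(2, size) % 3).astype(dtype)
    return {
        "C-contiguous": base,
        "F-order": np.asfortranarray(base),
        "strided": base[:, ::2],   # non-contiguous view
        "transposed": base.T,      # another non-C-contiguous view
    }

def check_layouts(size=10_000, dtype=np.int32):
    """Return {name: (count_nonzero result, reference result)}."""
    results = {}
    for name, arr in layout_cases(size, dtype).items():
        reference = int((arr != 0).sum())
        results[name] = (int(np.count_nonzero(arr)), reference)
    return results

if __name__ == "__main__":
    for name, (got, ref) in check_layouts().items():
        print(f"{name}: count_nonzero={got}, reference={ref}")
```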
Performance has increased for all supported arches; check the following benchmarks:
Power9/GCC 9.2.1 (baseline VSX2)
python runtests.py -j8 --bench-compare master CountNonzero -- --sort name --cpu-affinity 1,5
before after ratio (numaxes size dtype)
[7a18e4ac] [85e2ce98]
<master> <count_nonzero>
2.85±0μs 2.35±0μs 0.82 bench_core.CountNonzero.time_count_nonzero(1, 100, <class 'numpy.int16'>)
2.76±0.01μs 2.38±0μs 0.86 bench_core.CountNonzero.time_count_nonzero(1, 100, <class 'numpy.int32'>)
2.77±0μs 2.40±0.01μs 0.87 bench_core.CountNonzero.time_count_nonzero(1, 100, <class 'numpy.int64'>)
2.84±0.01μs 2.36±0.02μs 0.83 bench_core.CountNonzero.time_count_nonzero(1, 100, <class 'numpy.int8'>)
78.4±0.2μs 4.53±0μs 0.06 bench_core.CountNonzero.time_count_nonzero(1, 10000, <class 'numpy.int16'>)
71.8±0.02μs 5.79±0.01μs 0.08 bench_core.CountNonzero.time_count_nonzero(1, 10000, <class 'numpy.int32'>)
72.7±0.1μs 8.63±0.01μs 0.12 bench_core.CountNonzero.time_count_nonzero(1, 10000, <class 'numpy.int64'>)
78.3±0.02μs 3.67±0.03μs 0.05 bench_core.CountNonzero.time_count_nonzero(1, 10000, <class 'numpy.int8'>)
7.58±0ms 174±0.2μs 0.02 bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'numpy.int16'>)
6.93±0.01ms 292±0.3μs 0.04 bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'numpy.int32'>)
7.00±0ms 585±0.3μs 0.08 bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'numpy.int64'>)
7.95±0.2ms 90.6±0.05μs 0.01 bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'numpy.int8'>)
3.60±0μs 2.37±0μs 0.66 bench_core.CountNonzero.time_count_nonzero(2, 100, <class 'numpy.int16'>)
3.47±0.01μs 2.41±0μs 0.69 bench_core.CountNonzero.time_count_nonzero(2, 100, <class 'numpy.int32'>)
3.50±0μs 2.48±0μs 0.71 bench_core.CountNonzero.time_count_nonzero(2, 100, <class 'numpy.int64'>)
3.60±0.01μs 2.38±0.01μs 0.66 bench_core.CountNonzero.time_count_nonzero(2, 100, <class 'numpy.int8'>)
154±0.1μs 6.26±0μs 0.04 bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'numpy.int16'>)
141±0.02μs 8.66±0.03μs 0.06 bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'numpy.int32'>)
142±0.04μs 14.4±0.01μs 0.10 bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'numpy.int64'>)
154±0.05μs 4.55±0.01μs 0.03 bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'numpy.int8'>)
15.2±0.03ms 343±0.2μs 0.02 bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'numpy.int16'>)
13.9±0.02ms 580±0.4μs 0.04 bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'numpy.int32'>)
14.0±0.01ms 1.27±0.01ms 0.09 bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'numpy.int64'>)
15.3±0.1ms 177±0.1μs 0.01 bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'numpy.int8'>)
4.38±0.01μs 2.39±0.01μs 0.54 bench_core.CountNonzero.time_count_nonzero(3, 100, <class 'numpy.int16'>)
4.18±0.02μs 2.46±0.02μs 0.59 bench_core.CountNonzero.time_count_nonzero(3, 100, <class 'numpy.int32'>)
4.21±0.01μs 2.57±0.01μs 0.61 bench_core.CountNonzero.time_count_nonzero(3, 100, <class 'numpy.int64'>)
4.37±0μs 2.37±0μs 0.54 bench_core.CountNonzero.time_count_nonzero(3, 100, <class 'numpy.int8'>)
230±0.2μs 7.99±0.01μs 0.03 bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'numpy.int16'>)
210±0.08μs 11.6±0.1μs 0.06 bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'numpy.int32'>)
212±0.02μs 20.3±0.05μs 0.10 bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'numpy.int64'>)
230±0.3μs 5.45±0.01μs 0.02 bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'numpy.int8'>)
22.9±0.5ms 513±0.4μs 0.02 bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'numpy.int16'>)
20.8±0.01ms 909±4μs 0.04 bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'numpy.int32'>)
21.0±0.01ms 2.09±0.03ms 0.10 bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'numpy.int64'>)
22.8±0.03ms 263±0.1μs 0.01 bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'numpy.int8'>)
i7-8550U[low-power]/GCC 8.4.0 (baseline AVX2)
python runtests.py -j8 --cpu-baseline="avx2" --bench-compare master CountNonzero -- --sort name --cpu-affinity 1,5
before after ratio (numaxes size dtype)
[7a18e4ac] [85e2ce98]
<master> <count_nonzero>
31.7±0.01μs 4.90±0.04μs 0.15 bench_core.CountNonzero.time_count_nonzero(1, 10000, <class 'numpy.int16'>)
31.8±0.06μs 6.04±0.1μs 0.19 bench_core.CountNonzero.time_count_nonzero(1, 10000, <class 'numpy.int32'>)
32.0±0.2μs 8.26±0.09μs 0.26 bench_core.CountNonzero.time_count_nonzero(1, 10000, <class 'numpy.int64'>)
31.7±0.1μs 4.41±0.03μs 0.14 bench_core.CountNonzero.time_count_nonzero(1, 10000, <class 'numpy.int8'>)
2.86±0.01ms 121±1μs 0.04 bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'numpy.int16'>)
2.89±0.01ms 231±2μs 0.08 bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'numpy.int32'>)
3.04±0.02ms 565±10μs 0.19 bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'numpy.int64'>)
2.82±0.01ms 81.3±0.08μs 0.03 bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'numpy.int8'>)
3.55±0.01μs 3.22±0.01μs 0.91 bench_core.CountNonzero.time_count_nonzero(2, 100, <class 'numpy.int16'>)
3.58±0.05μs 3.22±0.01μs 0.90 bench_core.CountNonzero.time_count_nonzero(2, 100, <class 'numpy.int32'>)
3.55±0.01μs 3.29±0.01μs 0.93 bench_core.CountNonzero.time_count_nonzero(2, 100, <class 'numpy.int64'>)
3.58±0.03μs 3.19±0.02μs 0.89 bench_core.CountNonzero.time_count_nonzero(2, 100, <class 'numpy.int8'>)
59.9±0.04μs 6.01±0.02μs 0.10 bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'numpy.int16'>)
60.0±0.07μs 8.29±0.2μs 0.14 bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'numpy.int32'>)
60.8±0.03μs 13.0±0.06μs 0.21 bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'numpy.int64'>)
59.6±0.03μs 5.26±0.02μs 0.09 bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'numpy.int8'>)
5.73±0.01ms 236±5μs 0.04 bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'numpy.int16'>)
5.83±0.02ms 555±6μs 0.10 bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'numpy.int32'>)
6.13±0.01ms 1.39±0.02ms 0.23 bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'numpy.int64'>)
5.68±0.01ms 156±3μs 0.03 bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'numpy.int8'>)
3.83±0.03μs 3.23±0.02μs 0.84 bench_core.CountNonzero.time_count_nonzero(3, 100, <class 'numpy.int16'>)
3.82±0.01μs 3.25±0.01μs 0.85 bench_core.CountNonzero.time_count_nonzero(3, 100, <class 'numpy.int32'>)
3.84±0μs 3.32±0.01μs 0.86 bench_core.CountNonzero.time_count_nonzero(3, 100, <class 'numpy.int64'>)
3.82±0.02μs 3.23±0.02μs 0.85 bench_core.CountNonzero.time_count_nonzero(3, 100, <class 'numpy.int8'>)
87.8±0.04μs 7.29±0.03μs 0.08 bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'numpy.int16'>)
88.3±0.04μs 10.6±0.08μs 0.12 bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'numpy.int32'>)
89.5±0.3μs 18.0±0.1μs 0.20 bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'numpy.int64'>)
87.7±0.03μs 6.08±0.02μs 0.07 bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'numpy.int8'>)
8.59±0.02ms 374±10μs 0.04 bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'numpy.int16'>)
8.80±0.02ms 1.02±0.03ms 0.12 bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'numpy.int32'>)
9.19±0.03ms 2.05±0.01ms 0.22 bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'numpy.int64'>)
8.49±0.03ms 229±2μs 0.03 bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'numpy.int8'>)
i7-8550U[low-power]/GCC 8.4.0 (baseline SSE3)
python runtests.py -j8 --bench-compare master CountNonzero -- --sort name --cpu-affinity 1,5
before after ratio (numaxes size dtype)
[7a18e4ac] [85e2ce98]
<master> <count_nonzero>
7.04±0.05μs 6.54±0.01μs 0.93 bench_core.CountNonzero.time_count_nonzero(1, 100, <class 'object'>)
37.2±0.01μs 5.63±0.1μs 0.15 bench_core.CountNonzero.time_count_nonzero(1, 10000, <class 'numpy.int16'>)
37.5±0.06μs 8.50±0.2μs 0.23 bench_core.CountNonzero.time_count_nonzero(1, 10000, <class 'numpy.int32'>)
37.8±0.1μs 12.9±0.03μs 0.34 bench_core.CountNonzero.time_count_nonzero(1, 10000, <class 'numpy.int64'>)
37.1±0.02μs 5.00±0.2μs 0.13 bench_core.CountNonzero.time_count_nonzero(1, 10000, <class 'numpy.int8'>)
408±0.9μs 357±3μs 0.88 bench_core.CountNonzero.time_count_nonzero(1, 10000, <class 'object'>)
3.41±0.01ms 197±8μs 0.06 bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'numpy.int16'>)
3.47±0.03ms 393±8μs 0.11 bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'numpy.int32'>)
3.62±0.01ms 1.07±0.04ms 0.29 bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'numpy.int64'>)
3.37±0ms 128±0.9μs 0.04 bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'numpy.int8'>)
40.7±0.08ms 35.3±0.05ms 0.87 bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'object'>)
3.66±0.02μs 3.32±0.04μs 0.91 bench_core.CountNonzero.time_count_nonzero(2, 100, <class 'numpy.int16'>)
3.68±0.05μs 3.44±0.03μs 0.94 bench_core.CountNonzero.time_count_nonzero(2, 100, <class 'numpy.int64'>)
3.66±0.03μs 3.26±0.02μs 0.89 bench_core.CountNonzero.time_count_nonzero(2, 100, <class 'numpy.int8'>)
11.1±0.02μs 10.1±0.06μs 0.91 bench_core.CountNonzero.time_count_nonzero(2, 100, <class 'object'>)
71.1±0.04μs 8.01±0.2μs 0.11 bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'numpy.int16'>)
71.2±0.05μs 11.6±0.6μs 0.16 bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'numpy.int32'>)
71.9±0.1μs 22.4±0.1μs 0.31 bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'numpy.int64'>)
70.9±0.3μs 6.30±0.09μs 0.09 bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'numpy.int8'>)
813±3μs 710±8μs 0.87 bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'object'>)
6.83±0.01ms 376±1μs 0.06 bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'numpy.int16'>)
6.97±0.01ms 889±40μs 0.13 bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'numpy.int32'>)
7.23±0.01ms 2.32±0.01ms 0.32 bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'numpy.int64'>)
6.80±0.01ms 259±20μs 0.04 bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'numpy.int8'>)
81.4±0.2ms 70.7±0.09ms 0.87 bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'object'>)
4.02±0.07μs 3.39±0.02μs 0.84 bench_core.CountNonzero.time_count_nonzero(3, 100, <class 'numpy.int16'>)
4.02±0.02μs 3.31±0.02μs 0.82 bench_core.CountNonzero.time_count_nonzero(3, 100, <class 'numpy.int32'>)
4.00±0.02μs 3.51±0.01μs 0.88 bench_core.CountNonzero.time_count_nonzero(3, 100, <class 'numpy.int64'>)
3.99±0.01μs 3.27±0.02μs 0.82 bench_core.CountNonzero.time_count_nonzero(3, 100, <class 'numpy.int8'>)
15.2±0.1μs 13.6±0.05μs 0.90 bench_core.CountNonzero.time_count_nonzero(3, 100, <class 'object'>)
105±0.2μs 9.56±0.02μs 0.09 bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'numpy.int16'>)
105±0.02μs 15.6±0.6μs 0.15 bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'numpy.int32'>)
106±0.1μs 32.0±0.2μs 0.30 bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'numpy.int64'>)
104±0.5μs 7.68±0.07μs 0.07 bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'numpy.int8'>)
1.22±0ms 1.06±0.01ms 0.87 bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'object'>)
10.3±0.03ms 595±7μs 0.06 bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'numpy.int16'>)
10.4±0.1ms 1.37±0.02ms 0.13 bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'numpy.int32'>)
10.8±0.02ms 3.43±0.08ms 0.32 bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'numpy.int64'>)
10.1±0.02ms 390±10μs 0.04 bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'numpy.int8'>)
122±0.3ms 106±0.4ms 0.87 bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'object'>)
Cortex-A53/GCC 9.3.0 (baseline ASIMD)
python runtests.py -j8 --bench-compare master CountNonzero -- --sort name
before after ratio (numaxes size dtype)
[7a18e4ac] [85e2ce98]
<master> <count_nonzero>
2.55±0.1μs 2.35±0.01μs 0.92 bench_core.CountNonzero.time_count_nonzero(1, 100, <class 'numpy.int32'>)
10.5±9μs 4.35±0.02μs 0.41 bench_core.CountNonzero.time_count_nonzero(1, 100, <class 'str'>)
49.5±1μs 5.58±0.04μs 0.11 bench_core.CountNonzero.time_count_nonzero(1, 10000, <class 'numpy.int16'>)
33.4±0.4μs 8.12±0.04μs 0.24 bench_core.CountNonzero.time_count_nonzero(1, 10000, <class 'numpy.int32'>)
33.9±0.3μs 12.5±0.07μs 0.37 bench_core.CountNonzero.time_count_nonzero(1, 10000, <class 'numpy.int64'>)
48.5±2μs 4.28±0.02μs 0.09 bench_core.CountNonzero.time_count_nonzero(1, 10000, <class 'numpy.int8'>)
4.69±0.02ms 258±2μs 0.06 bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'numpy.int16'>)
3.06±0.01ms 501±4μs 0.16 bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'numpy.int32'>)
3.08±0.01ms 970±9μs 0.32 bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'numpy.int64'>)
4.67±0.02ms 135±3μs 0.03 bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'numpy.int8'>)
5.15±3μs 2.38±0.02μs 0.46 bench_core.CountNonzero.time_count_nonzero(2, 100, <class 'bool'>)
6.44±4μs 2.41±0.01μs 0.37 bench_core.CountNonzero.time_count_nonzero(2, 100, <class 'numpy.int32'>)
2.85±0.07μs 2.54±0.02μs 0.89 bench_core.CountNonzero.time_count_nonzero(2, 100, <class 'numpy.int64'>)
3.12±0.03μs 2.39±0.03μs 0.76 bench_core.CountNonzero.time_count_nonzero(2, 100, <class 'numpy.int8'>)
96.1±0.03μs 8.15±0.02μs 0.08 bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'numpy.int16'>)
63.9±0.4μs 12.5±0.03μs 0.20 bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'numpy.int32'>)
64.1±0.09μs 21.0±0.04μs 0.33 bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'numpy.int64'>)
95.7±0.05μs 5.85±0.04μs 0.06 bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'numpy.int8'>)
9.34±0.05ms 499±2μs 0.05 bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'numpy.int16'>)
6.10±0.05ms 972±10μs 0.16 bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'numpy.int32'>)
6.12±0.02ms 1.89±0.01ms 0.31 bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'numpy.int64'>)
9.32±0.03ms 281±2μs 0.03 bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'numpy.int8'>)
3.68±0.07μs 2.38±0μs 0.65 bench_core.CountNonzero.time_count_nonzero(3, 100, <class 'numpy.int16'>)
3.18±0.06μs 2.48±0.01μs 0.78 bench_core.CountNonzero.time_count_nonzero(3, 100, <class 'numpy.int32'>)
3.23±0.07μs 2.68±0.02μs 0.83 bench_core.CountNonzero.time_count_nonzero(3, 100, <class 'numpy.int64'>)
3.60±0.06μs 2.40±0.03μs 0.67 bench_core.CountNonzero.time_count_nonzero(3, 100, <class 'numpy.int8'>)
143±0.2μs 10.4±0.1μs 0.07 bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'numpy.int16'>)
94.2±2μs 16.8±0.1μs 0.18 bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'numpy.int32'>)
94.4±0.4μs 29.6±0.08μs 0.31 bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'numpy.int64'>)
142±0.2μs 7.33±0.04μs 0.05 bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'numpy.int8'>)
14.0±2ms 735±5μs 0.05 bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'numpy.int16'>)
9.10±0.02ms 1.42±0.01ms 0.16 bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'numpy.int32'>)
9.13±0.06ms 2.81±0.02ms 0.31 bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'numpy.int64'>)
14.0±0.02ms 408±6μs 0.03 bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'numpy.int8'>)
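In the tables above, the ratio column is the "after" time divided by the "before" time, so smaller is better. For context, a benchmark with the parameter layout shown in the output (numaxes, size, dtype) might look roughly like the following ASV-style class. This is a reconstruction from the parameter names only: `CountNonzeroSketch` is a hypothetical name, and the data-generation scheme in `setup` is an assumption rather than a copy of the class in NumPy's benchmark suite.

```python
import numpy as np

class CountNonzeroSketch:
    # ASV runs every combination of these parameters.
    param_names = ["numaxes", "size", "dtype"]
    params = [
        [1, 2, 3],
        [100, 10000, 1000000],
        [np.int8, np.int16, np.int32, np.int64],
    ]

    def setup(self, numaxes, size, dtype):
        # Assumed data: a (numaxes, size) array with a mix of zeros and nonzeros.
        x = np.arange(numaxes * size).reshape(numaxes, size)
        self.x = (x % 3).astype(dtype)

    def time_count_nonzero(self, numaxes, size, dtype):
        # Methods prefixed with `time_` are timed by ASV.
        np.count_nonzero(self.x)
```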
Thanks @touqir14 |
As pointed out in this issue, numpy.count_nonzero, numpy.nonzero, and numpy.flatnonzero are rather slow and could use some optimization. This PR optimizes numpy.count_nonzero for signed and unsigned 8-bit, 16-bit, 32-bit, and 64-bit integers using SIMD operations. This in turn speeds up numpy.flatnonzero, numpy.nonzero, and several other functions that depend on numpy.count_nonzero.
Below, I have given benchmarks to showcase the speed improvements for the integer types with AVX2.
I have added a few test cases for each of the integer types. Let me know if more are required.
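As a quick illustration of the functions named in this description, the snippet below calls them on the same small input for each of the signed and unsigned integer widths. `nonzero_summary` is a hypothetical helper written for this example; it shows only the public Python API and says nothing about which internal code path a given build takes.

```python
import numpy as np

def nonzero_summary(values, dtype):
    """Return the count and indices of nonzero entries for one dtype."""
    a = np.asarray(values, dtype=dtype)
    count = int(np.count_nonzero(a))   # number of nonzero entries
    flat_idx = np.flatnonzero(a)       # indices into the flattened array
    idx = np.nonzero(a)                # tuple of index arrays, one per axis
    return count, flat_idx, idx

if __name__ == "__main__":
    data = [0, 3, 0, 7, 1, 0]
    for dtype in (np.int8, np.uint8, np.int16, np.uint16,
                  np.int32, np.uint32, np.int64, np.uint64):
        count, flat_idx, _ = nonzero_summary(data, dtype)
        print(np.dtype(dtype).name, count, flat_idx.tolist())
```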