ENH, SIMD: Dispatch for unsigned floor division by ganesh-k13 · Pull Request #18075 · numpy/numpy

ganesh-k13 · 2020-12-27T10:47:14Z

Dispatch for unsigned floor division

Hi @seiko2plus / @Qiyu8 , I am attempting to add fast integer division using universal intrinsics. I wanted your opinion on my approach.

I have marked much of the hardcoded parts with // XXX(everything is for 16bitand only signed). Now I am able to understand the dispatch mechanism to an extent and hit the code paths through dispatch(not through below diff, I hardcoded few run_binary_simd_* for that). But there are few things that do not work with integer types like load and store.

~~We can add stuff in memory.h, but wanted your opinion on the load-store part. Also in general am I on the right path?~~

cc: @seberg

numpy/core/src/umath/loops_arithm_fp.dispatch.c.src

numpy/core/src/common/simd/sse/arithmetic.h

seiko2plus · 2020-12-27T12:31:03Z

@ganesh-k13,

We can add stuff in memory.h, but wanted your opinion on the load-store part. Also in general am I on the right path?

There's no need add any new intrinsics for memory operations, please could you share your SIMD kernel code?

EDIT: sorry I saw it, my bad.

numpy/core/src/umath/loops_arithm_fp.dispatch.c.src

ganesh-k13 · 2020-12-27T18:20:20Z

Thanks @seiko2plus for all the info, I have addressed a few of them. Will do the rest in a while.

ganesh-k13 · 2020-12-30T06:20:42Z

@seiko2plus , any idea why Agner did not go for 64 bit? Should we also just do normal built-in division for 64?

numpy/core/src/common/simd/sse/arithmetic.h

numpy/core/src/common/simd/sse/sse.h

seiko2plus · 2020-12-30T09:10:47Z

any idea why Agner did not go for 64 bit?

I guess not worth it. unroll, specialize a loop under opt level 3 would be enough.

ganesh-k13 · 2021-01-07T15:06:48Z

I have implemented the SSE versions, but I notice UT failing locally for float. How float?

>>> import numpy as np
>>> np.__version__
'0.3.0+24365.g2a4437284'
>>> fone = np.array(1.0, dtype=np.float32)
>>> fzer = np.array(0.0, dtype=np.float32)
>>> np.floor_divide(fone, fzer)
<stdin>:1: RuntimeWarning: divide by zero encountered in floor_divide
inf
>>>

>>> import numpy as np
>>> np.__version__
'1.19.2'
>>> fone = np.array(1.0, dtype=np.float32)
>>> fzer = np.array(0.0, dtype=np.float32)
>>> np.floor_divide(fone, fzer)
<stdin>:1: RuntimeWarning: invalid value encountered in floor_divide
nan
>>>

I gdb'd into the code, specifically FLOAT_divide, that in turn goes into loops_arithm_fp.dispatch.c.src, which I have not touched and don't see any error being set either. Any pointers here will be very helpful before I proceed to AVX

seiko2plus · 2021-01-10T13:28:31Z

@ganesh-k13,

I gdb'd into the code, specifically FLOAT_divide, that in turn goes into loops_arithm_fp.dispatch.c.src

Nothing to do with loops_arithm_fp.dispatch.c.src, FLOAT_divide only performs true divide.
However, this issue has been fixed by #16161 since if non-zero divided by zero should return inf.

seiko2plus · 2021-01-16T21:12:01Z

@ganesh-k13, I created a new pr #18178 that adds fast integer division intrinsics for all SIMD extensions, it should be merged before this pr.

ganesh-k13 · 2021-01-17T05:31:04Z

Thanks, @seiko2plus , I'll rebase once that's merged and add the dispatches 👍

mattip · 2021-03-08T07:25:51Z

gh-18178 is merged

ganesh-k13 · 2021-03-08T08:55:09Z

Thanks, Matti, I have rebased to use the latest code, I'll fix errors and port it to signed as well.

ganesh-k13 · 2021-03-12T14:54:47Z

The UT passes except for one case, here is a trace. Any pointers will be helpful:

(Pdb) p a                                                                                      
masked_array(data=[0, 1, 2, 3, 4, 5, 6, 7, 8, --],                                             
             mask=[False, False, False, False, False, False, False, False,                     
                   False,  True],                                                                                                                                                             
       fill_value=999999,                                                                      
            dtype=uint8)
(Pdb) z                                                                                        
masked_array(data=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],                                              
             mask=False,                                                                       
       fill_value=999999,                                                                      
            dtype=uint8)                                                                       
(Pdb) z//a                                     
masked_array(data=[--, 0, 0, 0, 0, 0, 0, 0, 0, --],                                            
             mask=[ True, False, False, False, False, False, False, False,                     
                   False,  True],                                                              
       fill_value=999999,                                                                      
            dtype=uint8)                                                                       
(Pdb) z//=a                                                                                    
(Pdb) p z                                                                                      
masked_array(data=[--, 1, 1, 1, 1, 1, 1, 1, 1, --],                                            
             mask=[ True, False, False, False, False, False, False, False,                     
                   False,  True],                                                              
       fill_value=999999,                                                                      
            dtype=uint8)

For masked array, // and //= acts differently, anything special needs to be handled here?

seberg · 2021-03-12T15:29:32Z

I would be surprised if this is limited to masked arrays, they shouldn't really do anything special aside from making a copy and filling some value into masked parts.
Trye the same for z._data and a._data? If inplace breaks, my first thought is that there is a problem accounting for src == dst (or in ufunc terms args[0] == args[2])? const char *src might give the wrong idea in that sense, but probably there is more to it.

ganesh-k13 · 2021-03-12T18:17:23Z

Oh I see. There must be a gap in the test plan as other cases are passing for some reason. Yeah, the const in src might be the culprit, will see what can be done. Thanks for the info

numpy/core/src/umath/loops_arithmetic.dispatch.c.src

benchmarks/benchmarks/bench_ufunc.py

ganesh-k13 · 2021-03-22T15:26:22Z

New Bench, deleting old.

Dispatch:

 
########### EXT COMPILER OPTIMIZATION ###########
Platform      : 
  Architecture: x64
  Compiler    : gcc

CPU baseline  : 
  Requested   : 'min'
  Enabled     : SSE SSE2 SSE3
  Flags       : -msse -msse2 -msse3
  Extra checks: none

CPU dispatch  : 
  Requested   : 'max -xop -fma4'
  Enabled     : SSSE3 SSE41 POPCNT SSE42 AVX F16C FMA3 AVX2 AVX512F AVX512CD AVX512_KNL AVX512_KNM AVX512_SKX AVX512_CLX AVX512_CNL AVX512_ICL
  Generated   : 
              : 
  SSE41       : SSE SSE2 SSE3 SSSE3
  Flags       : -msse -msse2 -msse3 -mssse3 -msse4.1
  Extra checks: none
  Detect      : SSE SSE2 SSE3 SSSE3 SSE41
              : numpy/core/src/umath/_umath_tests.dispatch.c
              : numpy/core/src/umath/loops_arithmetic.dispatch.c
              : 
  SSE42       : SSE SSE2 SSE3 SSSE3 SSE41 POPCNT
  Flags       : -msse -msse2 -msse3 -mssse3 -msse4.1 -mpopcnt -msse4.2
  Extra checks: none
  Detect      : SSE SSE2 SSE3 SSSE3 SSE41 POPCNT SSE42
              : numpy/core/src/_simd/_simd.dispatch.c
              : 
  AVX2        : SSE SSE2 SSE3 SSSE3 SSE41 POPCNT SSE42 AVX F16C
  Flags       : -msse -msse2 -msse3 -mssse3 -msse4.1 -mpopcnt -msse4.2 -mavx -mf16c -mavx2
  Extra checks: none
  Detect      : AVX F16C AVX2
              : numpy/core/src/umath/_umath_tests.dispatch.c
              : numpy/core/src/umath/loops_arithm_fp.dispatch.c
              : numpy/core/src/umath/loops_arithmetic.dispatch.c
              : 
  (FMA3 AVX2) : SSE SSE2 SSE3 SSSE3 SSE41 POPCNT SSE42 AVX F16C
  Flags       : -msse -msse2 -msse3 -mssse3 -msse4.1 -mpopcnt -msse4.2 -mavx -mf16c -mfma -mavx2
  Extra checks: none
  Detect      : AVX F16C FMA3 AVX2
              : numpy/core/src/_simd/_simd.dispatch.c
              : numpy/core/src/umath/loops_exponent_log.dispatch.c
              : numpy/core/src/umath/loops_trigonometric.dispatch.c
              : 
  AVX512F     : SSE SSE2 SSE3 SSSE3 SSE41 POPCNT SSE42 AVX F16C FMA3 AVX2
  Flags       : -msse -msse2 -msse3 -mssse3 -msse4.1 -mpopcnt -msse4.2 -mavx -mf16c -mfma -mavx2 -mavx512f
  Extra checks: AVX512F_REDUCE
  Detect      : AVX512F
              : numpy/core/src/_simd/_simd.dispatch.c
              : numpy/core/src/umath/loops_arithm_fp.dispatch.c
              : numpy/core/src/umath/loops_arithmetic.dispatch.c
              : numpy/core/src/umath/loops_exponent_log.dispatch.c
              : numpy/core/src/umath/loops_trigonometric.dispatch.c
              : 
  AVX512_SKX  : SSE SSE2 SSE3 SSSE3 SSE41 POPCNT SSE42 AVX F16C FMA3 AVX2 AVX512F AVX512CD
  Flags       : -msse -msse2 -msse3 -mssse3 -msse4.1 -mpopcnt -msse4.2 -mavx -mf16c -mfma -mavx2 -mavx512f -mavx512cd -mavx512vl -mavx512bw -mavx512dq
  Extra checks: AVX512BW_MASK AVX512DQ_MASK
  Detect      : AVX512_SKX
              : numpy/core/src/_simd/_simd.dispatch.c
              : numpy/core/src/umath/loops_arithmetic.dispatch.c
              : numpy/core/src/umath/loops_exponent_log.dispatch.c

       before           after         ratio                                                                                                                                                   
     [2ccc7942]       [c78d9a0b]                                                                                                                                                              
     <main>           <enh_simd_npyv_floor_div>                                                                                                                                               
-      33.1±0.5μs       30.6±0.2μs     0.93  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, 8)                                                           
-       103±0.2μs      8.36±0.06μs     0.08  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint64'>, 43)                                                         
-       106±0.4μs      8.35±0.04μs     0.08  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint64'>, 8)                                                          
-      64.6±0.2μs      4.19±0.04μs     0.06  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint32'>, 43)                                                         
-      67.3±0.2μs      4.21±0.02μs     0.06  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint32'>, 8)                                                          
-      34.0±0.4μs      2.06±0.01μs     0.06  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint8'>, 43)                                                          
-      36.4±0.4μs      2.08±0.01μs     0.06  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint8'>, 8)                                                           
-      45.3±0.1μs      2.49±0.05μs     0.05  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint16'>, 43)                                                         
-      48.0±0.6μs      2.48±0.03μs     0.05  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint16'>, 8)                                                          
                                                                                                                                                                                              
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.                                                                                                                                                   
PERFORMANCE INCREASED.

ganesh-k13 · 2021-03-31T10:48:46Z

Hey @seiko2plus , any more tests/changes needed?

seiko2plus

LGTM, just one thing left.

numpy/core/src/umath/loops_arithmetic.dispatch.c.src

seiko2plus · 2021-04-05T22:41:55Z

@ganesh-k13, My apologies for the delayed response, here another benchmark that covers other archs.

X86

CPU

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  2
Core(s) per socket:  4
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               142
Model name:          Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
Stepping:            10
CPU MHz:             1800.344
CPU max MHz:         4000.0000
CPU min MHz:         400.0000
BogoMIPS:            3984.00
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            8192K
NUMA node0 CPU(s):   0-7
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx

OS

Linux seiko-pc 5.8.0-48-generic #54-Ubuntu SMP Fri Mar 19 14:25:20 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
gcc (Ubuntu 10.2.0-13ubuntu1) 10.2.0

Benchmark

AVX2

python runtests.py --bench-compare parent/main time_floor_divide_int

     before           after         ratio
     [623bc1fa]       [a2c5af9c]
                      <enh_simd_npyv_floor_div>
-      22.3±0.4μs      3.31±0.07μs     0.15  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint32'>, 43)
-      22.5±0.4μs      3.28±0.05μs     0.15  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint32'>, 8)
-        70.9±1μs      7.58±0.07μs     0.11  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint64'>, 43)
-        74.1±3μs       7.72±0.2μs     0.10  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint64'>, 8)
-      24.8±0.4μs      2.10±0.03μs     0.08  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint16'>, 43)
-      21.8±0.5μs      1.80±0.04μs     0.08  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint8'>, 43)
-      25.5±0.6μs      2.05±0.06μs     0.08  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint16'>, 8)
-      22.2±0.2μs      1.77±0.05μs     0.08  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint8'>, 8)

SSE41

export NPY_DISABLE_CPU_FEATURES="AVX2"
python runtests.py --bench-compare parent/main time_floor_divide_int

     before           after         ratio
     [623bc1fa]       [a2c5af9c]
                      <enh_simd_npyv_floor_div>
-      22.9±0.4μs       4.50±0.2μs     0.20  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint32'>, 43)
-        23.7±1μs      4.38±0.06μs     0.18  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint32'>, 8)
-        70.5±1μs       11.6±0.3μs     0.16  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint64'>, 43)
-        77.8±3μs       11.5±0.2μs     0.15  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint64'>, 8)
-      22.5±0.2μs       2.05±0.2μs     0.09  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint8'>, 43)
-      25.2±0.4μs      2.29±0.05μs     0.09  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint16'>, 8)
-      22.3±0.8μs      1.98±0.04μs     0.09  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint8'>, 8)
-      25.2±0.6μs      2.19±0.03μs     0.09  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint16'>, 43)

SSE3

export NPY_DISABLE_CPU_FEATURES="SSE41 AVX2"
python runtests.py --bench-compare parent/main time_floor_divide_int

     before           after         ratio
     [623bc1fa]       [a2c5af9c]
                      <enh_simd_npyv_floor_div>
-      22.8±0.5μs       4.82±0.1μs     0.21  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint32'>, 8)
-        22.9±1μs       4.78±0.1μs     0.21  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint32'>, 43)
-        72.8±3μs       12.2±0.6μs     0.17  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint64'>, 8)
-        71.7±1μs       12.0±0.5μs     0.17  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint64'>, 43)
-        25.7±1μs       2.43±0.1μs     0.09  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint16'>, 43)
-      25.2±0.3μs      2.38±0.05μs     0.09  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint16'>, 8)
-      22.6±0.5μs      2.11±0.04μs     0.09  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint8'>, 8)
-      22.5±0.4μs      2.10±0.03μs     0.09  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint8'>, 43)

Power little-endian

CPU

Architecture:                    ppc64le
Byte Order:                      Little Endian
CPU(s):                          8
On-line CPU(s) list:             0-7
Thread(s) per core:              1
Core(s) per socket:              1
Socket(s):                       8
NUMA node(s):                    1
Model:                           2.2 (pvr 004e 1202)
Model name:                      POWER9 (architected), altivec supported
L1d cache:                       256 KiB
L1i cache:                       256 KiB
NUMA node0 CPU(s):               0-7
Vulnerability L1tf:              Not affected
Vulnerability Meltdown:          Mitigation; RFI Flush
Vulnerability Spec store bypass: Mitigation; Kernel entry/exit barrier (eieio)
Vulnerability Spectre v1:        Mitigation; __user pointer sanitization
Vulnerability Spectre v2:        Vulnerable

processor   : 7
cpu     : POWER9 (architected), altivec supported
clock       : 2200.000000MHz
revision    : 2.2 (pvr 004e 1202)

timebase    : 512000000
platform    : pSeries
model       : IBM pSeries (emulated by qemu)
machine     : CHRP IBM pSeries (emulated by qemu)
MMU     : Radix

OS

Linux 8b2db3b0dfac 4.19.0-2-powerpc64le
gcc version 9.2.1 20191008 (Ubuntu 9.2.1-9ubuntu2)

Benchmark

VSX2

python runtests.py --bench-compare parent/main time_floor_divide_int

       before           after         ratio
     [623bc1fa]       [46b8cfc3]
                      <enh_simd_npyv_floor_div>
-     16.9±0.02μs      10.1±0.01μs     0.59  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint32'>, 8)
-     16.3±0.03μs      7.25±0.01μs     0.44  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint16'>, 43)
-     16.4±0.05μs      7.24±0.01μs     0.44  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint16'>, 8)
-     25.9±0.03μs      10.1±0.01μs     0.39  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint32'>, 43)
-     16.6±0.03μs      5.56±0.01μs     0.33  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint8'>, 8)
-     16.6±0.03μs      5.55±0.02μs     0.33  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint8'>, 43)

AArch64

CPU

Architecture:        aarch64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  1
Core(s) per socket:  4
Socket(s):           2
Vendor ID:           ARM
Model:               4
Model name:          Cortex-A53
Stepping:            r0p4
CPU max MHz:         2314.0000
CPU min MHz:         403.0000
BogoMIPS:            52.00
Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid

OS

Linux localhost 4.14.113-seiko_fastboot #30 SMP PREEMPT Wed Dec 30 12:28:43 IST 2020 aarch64 aarch64 aarch64 GNU/Linux
gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0

Benchmark

NEON

python runtests.py --bench-compare parent/main time_floor_divide_int

       before           after         ratio
     [036f6c68]       [f3eb831d]
                      <enh_simd_npyv_floor_div>
-     28.9±0.08μs      19.9±0.04μs     0.69  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint32'>, 8)
-     28.6±0.08μs      12.2±0.03μs     0.43  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint16'>, 8)
-     52.3±0.05μs      20.0±0.04μs     0.38  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint32'>, 43)
-     34.6±0.08μs      12.1±0.03μs     0.35  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint16'>, 43)
-     28.0±0.06μs      8.05±0.04μs     0.29  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint8'>, 8)
-     28.0±0.05μs      8.04±0.05μs     0.29  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint8'>, 43)

Co-authored-by: Sayed Adel <seiko@imavr.com>

ganesh-k13 · 2021-04-06T03:35:52Z

Thanks a lot @seiko2plus!

ganesh-k13 · 2021-04-06T03:43:29Z

Also can I raise a PR for signed types now, @seiko2plus ? I'll purge libdivide where needed.

mattip · 2021-04-06T05:39:01Z

Thanks @ganesh-k13

seiko2plus · 2021-04-08T15:56:22Z

@ganesh-k13, sure you can, thank you!

seberg · 2021-04-11T15:24:06Z

The change the change in the loops caused a small regression for:

import operator
import numpy as np

result = operator.__floordiv__(np.array([True, False, False]), True)

conveniently found by pandas in pandas-dev/pandas#40874

I think all that is needed is to ensure the loop order is not change in the code generation chunk (unsigned loops before signed ones). And a test to make sure we are the ones who notice it next time would be nice ;). I can create a PR, but if you beat me to it, even better :).

EDIT: The current and new loops are as follows (So I got it wrong above, it is signed before unsigned – of identical precision):

Details

In [1]: np.floor_divide.types
Out[1]: 
['bb->b',
 'BB->B',
 'hh->h',
 'HH->H',
 'ii->i',
 'II->I',
 'll->l',
 'LL->L',
 'qq->q',
 'QQ->Q',
 'ee->e',
 'ff->f',
 'dd->d',
 'gg->g',
 'FF->F',
 'DD->D',
 'GG->G',
 'mq->m',
 'md->m',
 'mm->q',
 'OO->O']

New loops:

In [1]: np.floor_divide.types
Out[1]: 
['BB->B',
 'HH->H',
 'II->I',
 'LL->L',
 'QQ->Q',
 'bb->b',
 'hh->h',
 'ii->i',
 'll->l',
 'qq->q',
 'ee->e',
 'ff->f',
 'dd->d',
 'gg->g',
 'FF->F',
 'DD->D',
 'GG->G',
 'mq->m',
 'md->m',
 'mm->q',
 'OO->O']

ganesh-k13 · 2021-04-11T17:13:10Z

Hey Sebastian, sorry for the regression. I have made the change locally and able to fix with your RCA. I am adding tests and will raise a PR shortly.

BUG: Regression #18075 | Fixing Ufunc TD generation order

github-actions bot added the 01 - Enhancement label Dec 27, 2020

ganesh-k13 marked this pull request as draft December 27, 2020 10:48

ganesh-k13 changed the title ~~ENH,SIMD: [DRAFT] Added fast divide for SSE~~ WIP, ENH, SIMD: [DRAFT] Added fast divide for SSE Dec 27, 2020

github-actions bot added the 25 - WIP label Dec 27, 2020

seiko2plus reviewed Dec 27, 2020

View reviewed changes

numpy/core/src/umath/loops_arithm_fp.dispatch.c.src Outdated Show resolved Hide resolved

seiko2plus reviewed Dec 27, 2020

View reviewed changes

numpy/core/src/common/simd/sse/arithmetic.h Outdated Show resolved Hide resolved

seiko2plus reviewed Dec 27, 2020

View reviewed changes

numpy/core/src/umath/loops_arithm_fp.dispatch.c.src Outdated Show resolved Hide resolved

ganesh-k13 force-pushed the enh_simd_npyv_floor_div branch from 79fada9 to 607fd92 Compare December 27, 2020 18:19

ganesh-k13 commented Dec 30, 2020

View reviewed changes

numpy/core/src/common/simd/sse/arithmetic.h Outdated Show resolved Hide resolved

seiko2plus reviewed Dec 30, 2020

View reviewed changes

numpy/core/src/common/simd/sse/sse.h Outdated Show resolved Hide resolved

seiko2plus added the component: SIMD Issues in SIMD (fast instruction sets) code or machinery label Jan 10, 2021

seiko2plus self-assigned this Jan 10, 2021

seiko2plus mentioned this pull request Jan 16, 2021

SIMD: add fast integer division intrinsics for all supported platforms #18178

Merged

2 tasks

Base automatically changed from master to main March 4, 2021 02:05

ganesh-k13 force-pushed the enh_simd_npyv_floor_div branch from 2a44372 to dba8572 Compare March 8, 2021 08:53

ganesh-k13 force-pushed the enh_simd_npyv_floor_div branch from 798458c to 1f3a0f5 Compare March 10, 2021 14:35

seiko2plus requested changes Mar 12, 2021

View reviewed changes

SIMD, MAINT: Refined kernel and inner ufunc functions

bbb1436

ganesh-k13 force-pushed the enh_simd_npyv_floor_div branch from 1a4eb1b to 812a9aa Compare March 20, 2021 11:29

TST: Division tests for unsigned ints

c78d9a0

ganesh-k13 force-pushed the enh_simd_npyv_floor_div branch from 812a9aa to 0717ae1 Compare March 20, 2021 11:32

seiko2plus reviewed Mar 22, 2021

View reviewed changes

benchmarks/benchmarks/bench_ufunc.py Outdated Show resolved Hide resolved

benchmarks/benchmarks/bench_ufunc.py Outdated Show resolved Hide resolved

ganesh-k13 force-pushed the enh_simd_npyv_floor_div branch from 0717ae1 to 7d37f7e Compare March 22, 2021 15:23

BENCH: Benchmarks for unsigned ints (numpy#18075)

a2c5af9

ganesh-k13 force-pushed the enh_simd_npyv_floor_div branch from 7d37f7e to a2c5af9 Compare March 22, 2021 15:28

seiko2plus approved these changes Apr 5, 2021

View reviewed changes

numpy/core/src/umath/loops_arithmetic.dispatch.c.src Outdated Show resolved Hide resolved

SIMD: Use scalar division for Armv7, Aarch64, and IBM/Power

4d2e484

Co-authored-by: Sayed Adel <seiko@imavr.com>

mattip merged commit 914407d into numpy:main Apr 6, 2021

ganesh-k13 added a commit to ganesh-k13/numpy that referenced this pull request Apr 11, 2021

TST: Test for ufunc orders. Reg (numpy#18075)

468ff05

charris added a commit that referenced this pull request Apr 12, 2021

Merge pull request #18751 from ganesh-k13/reg_18075

7a81299

BUG: Regression #18075 | Fixing Ufunc TD generation order

ganesh-k13 mentioned this pull request Apr 13, 2021

ENH, SIMD: Replace libdivide functions of signed integer division with universal intrinsics #18766

Merged

seiko2plus mentioned this pull request May 19, 2021

BUG: Unexpected result in uint8 division on x86 #19045

Closed

Uh oh!

Conversation

ganesh-k13 commented Dec 27, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Dispatch for unsigned floor division

Uh oh!

Uh oh!

Uh oh!

seiko2plus commented Dec 27, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

ganesh-k13 commented Dec 27, 2020

Uh oh!

ganesh-k13 commented Dec 30, 2020

Uh oh!

Uh oh!

Uh oh!

seiko2plus commented Dec 30, 2020

Uh oh!

ganesh-k13 commented Jan 7, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

seiko2plus commented Jan 10, 2021

Uh oh!

seiko2plus commented Jan 16, 2021

Uh oh!

ganesh-k13 commented Jan 17, 2021

Uh oh!

mattip commented Mar 8, 2021

Uh oh!

ganesh-k13 commented Mar 8, 2021

Uh oh!

ganesh-k13 commented Mar 12, 2021

Uh oh!

seberg commented Mar 12, 2021

Uh oh!

ganesh-k13 commented Mar 12, 2021

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ganesh-k13 commented Mar 22, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ganesh-k13 commented Mar 31, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

seiko2plus left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

seiko2plus commented Apr 5, 2021

X86

Benchmark

Power little-endian

Benchmark

AArch64

Benchmark

Uh oh!

ganesh-k13 commented Apr 6, 2021

Uh oh!

ganesh-k13 commented Apr 6, 2021

Uh oh!

mattip commented Apr 6, 2021

Uh oh!

seiko2plus commented Apr 8, 2021

Uh oh!

seberg commented Apr 11, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ganesh-k13 commented Dec 27, 2020 •

edited

Loading

seiko2plus commented Dec 27, 2020 •

edited

Loading

ganesh-k13 commented Jan 7, 2021 •

edited

Loading

ganesh-k13 commented Mar 22, 2021 •

edited

Loading

ganesh-k13 commented Mar 31, 2021 •

edited

Loading

seberg commented Apr 11, 2021 •

edited

Loading