ENH, SIMD: Dispatch for unsigned floor division #18075
There's no need to add any new intrinsics for memory operations. Could you please share your SIMD kernel code? EDIT: Sorry, I saw it, my bad. |
force-pushed from 79fada9 to 607fd92
|
Thanks @seiko2plus for all the info, I have addressed a few of them. Will do the rest in a while. |
|
@seiko2plus, any idea why Agner did not go for 64-bit? Should we also just use the normal built-in division for 64-bit? |
I guess it's not worth it. Unrolling and specializing a loop under optimization level 3 would be enough. |
|
I have implemented the SSE versions, but I notice the unit tests failing locally for float. How is float affected?
>>> import numpy as np
>>> np.__version__
'0.3.0+24365.g2a4437284'
>>> fone = np.array(1.0, dtype=np.float32)
>>> fzer = np.array(0.0, dtype=np.float32)
>>> np.floor_divide(fone, fzer)
<stdin>:1: RuntimeWarning: divide by zero encountered in floor_divide
inf
>>>
>>> import numpy as np
>>> np.__version__
'1.19.2'
>>> fone = np.array(1.0, dtype=np.float32)
>>> fzer = np.array(0.0, dtype=np.float32)
>>> np.floor_divide(fone, fzer)
<stdin>:1: RuntimeWarning: invalid value encountered in floor_divide
nan
>>>
I gdb'd into the code, specifically |
Nothing to do with |
|
@ganesh-k13, I created a new PR, #18178, that adds fast integer division intrinsics for all SIMD extensions; it should be merged before this PR. |
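For readers new to the technique: fast integer division replaces division by a loop-invariant divisor with one high multiply and two shifts, using constants computed once per divisor. Below is a minimal scalar sketch in Python, assuming the classic Granlund-Montgomery scheme for unsigned integers. The function names `precompute_divisor` and `divide` are hypothetical, and this is not the code from #18178.

```python
def precompute_divisor(d, bits=32):
    """Compute (multiplier, shift1, shift2) for an unsigned divisor d.

    Hypothetical helper: follows the Granlund-Montgomery scheme for
    division by an invariant unsigned integer.
    """
    if not 0 < d < (1 << bits):
        raise ValueError("divisor must be in [1, 2**bits)")
    l = (d - 1).bit_length()                      # ceil(log2(d))
    m = ((1 << bits) * ((1 << l) - d)) // d + 1   # fits in `bits` bits
    sh1 = min(l, 1)
    sh2 = max(l - 1, 0)
    return m, sh1, sh2


def divide(n, params, bits=32):
    """Return n // d using only a multiply, adds and shifts."""
    m, sh1, sh2 = params
    t1 = (m * n) >> bits                          # high half of the product
    return (t1 + ((n - t1) >> sh1)) >> sh2


params_43 = precompute_divisor(43)
print(divide(1000, params_43))  # 23, same as 1000 // 43
```

The constants depend only on the divisor, so a loop that divides a whole array by one scalar pays for the precomputation once and then runs multiply-and-shift per element, which vectorizes well.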
|
Thanks, @seiko2plus, I'll rebase once that's merged and add the dispatches 👍 |
|
gh-18178 is merged |
force-pushed from 2a44372 to dba8572
|
Thanks, Matti. I have rebased onto the latest code; I'll fix the errors and port it to signed types as well. |
force-pushed from 798458c to 1f3a0f5
|
The unit tests pass except for one case; here is a trace. Any pointers would be helpful. For masked arrays, |
|
I would be surprised if this is limited to masked arrays, they shouldn't really do anything special aside from making a copy and filling some value into masked parts. |
|
Oh, I see. There must be a gap in the test plan, as other cases are passing for some reason. Yeah, the const in src might be the culprit; I will see what can be done. Thanks for the info. |
force-pushed from 1a4eb1b to 812a9aa
force-pushed from 812a9aa to 0717ae1
force-pushed from 0717ae1 to 7d37f7e
|
New benchmark; deleting the old one. Dispatch:
########### EXT COMPILER OPTIMIZATION ###########
Platform :
Architecture: x64
Compiler : gcc
CPU baseline :
Requested : 'min'
Enabled : SSE SSE2 SSE3
Flags : -msse -msse2 -msse3
Extra checks: none
CPU dispatch :
Requested : 'max -xop -fma4'
Enabled : SSSE3 SSE41 POPCNT SSE42 AVX F16C FMA3 AVX2 AVX512F AVX512CD AVX512_KNL AVX512_KNM AVX512_SKX AVX512_CLX AVX512_CNL AVX512_ICL
Generated :
:
SSE41 : SSE SSE2 SSE3 SSSE3
Flags : -msse -msse2 -msse3 -mssse3 -msse4.1
Extra checks: none
Detect : SSE SSE2 SSE3 SSSE3 SSE41
: numpy/core/src/umath/_umath_tests.dispatch.c
: numpy/core/src/umath/loops_arithmetic.dispatch.c
:
SSE42 : SSE SSE2 SSE3 SSSE3 SSE41 POPCNT
Flags : -msse -msse2 -msse3 -mssse3 -msse4.1 -mpopcnt -msse4.2
Extra checks: none
Detect : SSE SSE2 SSE3 SSSE3 SSE41 POPCNT SSE42
: numpy/core/src/_simd/_simd.dispatch.c
:
AVX2 : SSE SSE2 SSE3 SSSE3 SSE41 POPCNT SSE42 AVX F16C
Flags : -msse -msse2 -msse3 -mssse3 -msse4.1 -mpopcnt -msse4.2 -mavx -mf16c -mavx2
Extra checks: none
Detect : AVX F16C AVX2
: numpy/core/src/umath/_umath_tests.dispatch.c
: numpy/core/src/umath/loops_arithm_fp.dispatch.c
: numpy/core/src/umath/loops_arithmetic.dispatch.c
:
(FMA3 AVX2) : SSE SSE2 SSE3 SSSE3 SSE41 POPCNT SSE42 AVX F16C
Flags : -msse -msse2 -msse3 -mssse3 -msse4.1 -mpopcnt -msse4.2 -mavx -mf16c -mfma -mavx2
Extra checks: none
Detect : AVX F16C FMA3 AVX2
: numpy/core/src/_simd/_simd.dispatch.c
: numpy/core/src/umath/loops_exponent_log.dispatch.c
: numpy/core/src/umath/loops_trigonometric.dispatch.c
:
AVX512F : SSE SSE2 SSE3 SSSE3 SSE41 POPCNT SSE42 AVX F16C FMA3 AVX2
Flags : -msse -msse2 -msse3 -mssse3 -msse4.1 -mpopcnt -msse4.2 -mavx -mf16c -mfma -mavx2 -mavx512f
Extra checks: AVX512F_REDUCE
Detect : AVX512F
: numpy/core/src/_simd/_simd.dispatch.c
: numpy/core/src/umath/loops_arithm_fp.dispatch.c
: numpy/core/src/umath/loops_arithmetic.dispatch.c
: numpy/core/src/umath/loops_exponent_log.dispatch.c
: numpy/core/src/umath/loops_trigonometric.dispatch.c
:
AVX512_SKX : SSE SSE2 SSE3 SSSE3 SSE41 POPCNT SSE42 AVX F16C FMA3 AVX2 AVX512F AVX512CD
Flags : -msse -msse2 -msse3 -mssse3 -msse4.1 -mpopcnt -msse4.2 -mavx -mf16c -mfma -mavx2 -mavx512f -mavx512cd -mavx512vl -mavx512bw -mavx512dq
Extra checks: AVX512BW_MASK AVX512DQ_MASK
Detect : AVX512_SKX
: numpy/core/src/_simd/_simd.dispatch.c
: numpy/core/src/umath/loops_arithmetic.dispatch.c
: numpy/core/src/umath/loops_exponent_log.dispatch.c
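The log above lists, per dispatch target, the CPU features to detect and the source files compiled for that target. As a rough mental model (a simplified sketch, not NumPy's actual dispatcher), the runtime tries the compiled targets of a file from most to least capable and falls back to the baseline build. The function `pick_target`, the `TARGETS` list and the feature set are illustrative assumptions; the feature names for `loops_arithmetic.dispatch.c` are copied from the "Detect" lines.

```python
# Targets compiled for loops_arithmetic.dispatch.c, highest priority first.
# Each entry is (target name, features that must be detected at runtime).
TARGETS = [
    ("AVX512_SKX", {"AVX512_SKX"}),
    ("AVX512F", {"AVX512F"}),
    ("AVX2", {"AVX", "F16C", "AVX2"}),
    ("SSE41", {"SSE", "SSE2", "SSE3", "SSSE3", "SSE41"}),
]

# Assumed feature set of a CPU with AVX2 but without AVX-512.
CPU = {"SSE", "SSE2", "SSE3", "SSSE3", "SSE41", "POPCNT", "SSE42",
       "AVX", "F16C", "FMA3", "AVX2"}


def pick_target(targets, cpu_features, disabled=()):
    """Return the first target whose required features are all usable."""
    usable = set(cpu_features) - set(disabled)
    for name, required in targets:
        if required <= usable:
            return name
    return "baseline"  # built with the baseline flags (SSE SSE2 SSE3 here)


print(pick_target(TARGETS, CPU))                              # AVX2
print(pick_target(TARGETS, CPU, disabled={"AVX2"}))           # SSE41
print(pick_target(TARGETS, CPU, disabled={"SSE41", "AVX2"}))  # baseline
```

The `disabled` argument stands in for the effect of listing features in the `NPY_DISABLE_CPU_FEATURES` environment variable, which forces a lower target to be selected.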
|
force-pushed from 7d37f7e to a2c5af9
|
Hey @seiko2plus, any more tests/changes needed? |
seiko2plus
left a comment
LGTM, just one thing left.
|
@ganesh-k13, my apologies for the delayed response. Here is another benchmark that covers other architectures.
X86
CPU
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 142
Model name: Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
Stepping: 10
CPU MHz: 1800.344
CPU max MHz: 4000.0000
CPU min MHz: 400.0000
BogoMIPS: 3984.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 8192K
NUMA node0 CPU(s): 0-7
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx
OS
Linux seiko-pc 5.8.0-48-generic #54-Ubuntu SMP Fri Mar 19 14:25:20 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
gcc (Ubuntu 10.2.0-13ubuntu1) 10.2.0
Benchmark
AVX2
python runtests.py --bench-compare parent/main time_floor_divide_int
before after ratio
[623bc1fa] [a2c5af9c]
<enh_simd_npyv_floor_div>
- 22.3±0.4μs 3.31±0.07μs 0.15 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint32'>, 43)
- 22.5±0.4μs 3.28±0.05μs 0.15 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint32'>, 8)
- 70.9±1μs 7.58±0.07μs 0.11 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint64'>, 43)
- 74.1±3μs 7.72±0.2μs 0.10 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint64'>, 8)
- 24.8±0.4μs 2.10±0.03μs 0.08 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint16'>, 43)
- 21.8±0.5μs 1.80±0.04μs 0.08 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint8'>, 43)
- 25.5±0.6μs 2.05±0.06μs 0.08 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint16'>, 8)
- 22.2±0.2μs 1.77±0.05μs 0.08 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint8'>, 8)
SSE41
export NPY_DISABLE_CPU_FEATURES="AVX2"
python runtests.py --bench-compare parent/main time_floor_divide_int
before after ratio
[623bc1fa] [a2c5af9c]
<enh_simd_npyv_floor_div>
- 22.9±0.4μs 4.50±0.2μs 0.20 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint32'>, 43)
- 23.7±1μs 4.38±0.06μs 0.18 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint32'>, 8)
- 70.5±1μs 11.6±0.3μs 0.16 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint64'>, 43)
- 77.8±3μs 11.5±0.2μs 0.15 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint64'>, 8)
- 22.5±0.2μs 2.05±0.2μs 0.09 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint8'>, 43)
- 25.2±0.4μs 2.29±0.05μs 0.09 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint16'>, 8)
- 22.3±0.8μs 1.98±0.04μs 0.09 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint8'>, 8)
- 25.2±0.6μs 2.19±0.03μs 0.09 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint16'>, 43)
SSE3
export NPY_DISABLE_CPU_FEATURES="SSE41 AVX2"
python runtests.py --bench-compare parent/main time_floor_divide_int
before after ratio
[623bc1fa] [a2c5af9c]
<enh_simd_npyv_floor_div>
- 22.8±0.5μs 4.82±0.1μs 0.21 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint32'>, 8)
- 22.9±1μs 4.78±0.1μs 0.21 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint32'>, 43)
- 72.8±3μs 12.2±0.6μs 0.17 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint64'>, 8)
- 71.7±1μs 12.0±0.5μs 0.17 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint64'>, 43)
- 25.7±1μs 2.43±0.1μs 0.09 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint16'>, 43)
- 25.2±0.3μs 2.38±0.05μs 0.09 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint16'>, 8)
- 22.6±0.5μs 2.11±0.04μs 0.09 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint8'>, 8)
- 22.5±0.4μs 2.10±0.03μs 0.09 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint8'>, 43)
Power little-endian
CPU
OS
Linux 8b2db3b0dfac 4.19.0-2-powerpc64le
gcc version 9.2.1 20191008 (Ubuntu 9.2.1-9ubuntu2)
Benchmark
VSX2
python runtests.py --bench-compare parent/main time_floor_divide_int
before after ratio
[623bc1fa] [46b8cfc3]
<enh_simd_npyv_floor_div>
- 16.9±0.02μs 10.1±0.01μs 0.59 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint32'>, 8)
- 16.3±0.03μs 7.25±0.01μs 0.44 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint16'>, 43)
- 16.4±0.05μs 7.24±0.01μs 0.44 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint16'>, 8)
- 25.9±0.03μs 10.1±0.01μs 0.39 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint32'>, 43)
- 16.6±0.03μs 5.56±0.01μs 0.33 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint8'>, 8)
- 16.6±0.03μs 5.55±0.02μs 0.33 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint8'>, 43)
AArch64
CPU
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 2
Vendor ID: ARM
Model: 4
Model name: Cortex-A53
Stepping: r0p4
CPU max MHz: 2314.0000
CPU min MHz: 403.0000
BogoMIPS: 52.00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid
OS
Linux localhost 4.14.113-seiko_fastboot #30 SMP PREEMPT Wed Dec 30 12:28:43 IST 2020 aarch64 aarch64 aarch64 GNU/Linux
gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Benchmark
NEON
python runtests.py --bench-compare parent/main time_floor_divide_int
before after ratio
[036f6c68] [f3eb831d]
<enh_simd_npyv_floor_div>
- 28.9±0.08μs 19.9±0.04μs 0.69 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint32'>, 8)
- 28.6±0.08μs 12.2±0.03μs 0.43 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint16'>, 8)
- 52.3±0.05μs 20.0±0.04μs 0.38 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint32'>, 43)
- 34.6±0.08μs 12.1±0.03μs 0.35 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint16'>, 43)
- 28.0±0.06μs 8.05±0.04μs 0.29 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint8'>, 8)
- 28.0±0.05μs 8.04±0.05μs 0.29 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint8'>, 43) |
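A note on reading these tables: the ratio column is the "after" time divided by the "before" time, so smaller is better and its reciprocal is the speedup. The small sketch below checks this against two rows of the AVX2 table. The helper name `ratio_and_speedup` is hypothetical and not part of the benchmark tooling.

```python
def ratio_and_speedup(before_us, after_us):
    """Return (after/before ratio, before/after speedup) for two timings."""
    return after_us / before_us, before_us / after_us


# uint32, divisor 43, AVX2: 22.3 us before, 3.31 us after
ratio, speedup = ratio_and_speedup(22.3, 3.31)
print(round(ratio, 2), round(speedup, 1))  # 0.15 6.7

# uint8, divisor 43, AVX2: 21.8 us before, 1.80 us after
ratio, speedup = ratio_and_speedup(21.8, 1.80)
print(round(ratio, 2), round(speedup, 1))  # 0.08 12.1
```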
Co-authored-by: Sayed Adel <seiko@imavr.com>
|
Thanks a lot @seiko2plus! |
|
Also, can I raise a PR for signed types now, @seiko2plus? I'll purge libdivide where needed. |
|
Thanks @ganesh-k13 |
|
@ganesh-k13, sure you can, thank you! |
|
The change in the loops caused a small regression for: conveniently found by pandas in pandas-dev/pandas#40874. I think all that is needed is to ensure the loop order is not changed in the code generation chunk (unsigned loops before signed ones). And a test to make sure we are the ones who notice it next time would be nice ;). I can create a PR, but if you beat me to it, even better :). EDIT: The current and new loops are as follows (So I got it wrong above, it is signed before unsigned – of identical precision): New loops: |
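To illustrate why loop order matters here: when a ufunc searches its registered loops in order and takes the first one that every input can be safely cast to, swapping two loops of identical precision changes which loop (and hence which result type) is chosen. The sketch below is a deliberately simplified model with hypothetical names (`can_cast_safely`, `resolve_loop`) and a toy casting rule. It is not NumPy's type resolver and does not reproduce the actual loop lists from this PR.

```python
def can_cast_safely(src, dst):
    """Toy safe-casting rule for (kind, bits) integer types."""
    src_kind, src_bits = src
    dst_kind, dst_bits = dst
    if src_kind == dst_kind:
        return src_bits <= dst_bits
    if src_kind == "u" and dst_kind == "i":
        return src_bits < dst_bits   # unsigned fits in a wider signed type
    return False                     # signed never fits safely in unsigned


def resolve_loop(input_type, loops):
    """Return the first registered loop the input can be safely cast to."""
    for loop in loops:
        if can_cast_safely(input_type, loop):
            return loop
    raise TypeError("no matching loop")


uint32 = ("u", 32)
signed_first = [("i", 64), ("u", 64)]
unsigned_first = [("u", 64), ("i", 64)]

print(resolve_loop(uint32, signed_first))    # ('i', 64)
print(resolve_loop(uint32, unsigned_first))  # ('u', 64)
```

Because both 64-bit loops accept the input, only the registration order decides the outcome, which is why a regression test should pin the order.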
|
Hey Sebastian, sorry for the regression. I have made the change locally and was able to fix it using your root-cause analysis. I am adding tests and will raise a PR shortly. |
BUG: Regression #18075 | Fixing Ufunc TD generation order
Dispatch for unsigned floor division
Hi @seiko2plus / @Qiyu8 , I am attempting to add fast integer division using universal intrinsics. I wanted your opinion on my approach.
I have marked much of the hardcoded parts with // XXX (everything is for 16bit and only signed). Now I am able to understand the dispatch mechanism to an extent and hit the code paths through dispatch (not through the diff below; I hardcoded a few run_binary_simd_* for that). But there are a few things that do not work with integer types, like load and store.
We can add stuff in memory.h, but I wanted your opinion on the load-store part. Also, in general, am I on the right path?
cc: @seberg