ENH: Added libdivide for floor divide#17727
Conversation
|
It seems this PR conflates two optimizations:
I would be interested to see if the second optimization alone is enough to convince the compiler to optimize the division, as was mentioned in the original issue. |
|
Hmm, that is true, maybe we should start off with some simple optimizations (see if we can remove the As far as I can tell, libdivide also supports AVX2 and AVX512 (if defines are set), which we could use but will require looking into the universal intrinsics probably. |
|
I used the PEP definition of
Unfortunately, no improvement or not anything visible, attaching my results: Please see the code changes I did in the latest commit, I expected at least a 20-30% improvement, is there an error in my logic? |
|
According to the license of libdivide, we can use the zlib variant, which allows us to distribute and modify the source. We need to be careful to document any future modifications for Universal SIMD and should give attribution in our LICENSES_bundled.txt |
|
How can we set special build environment variables for the CI to pickup, modify the description? I want |
|
Under which build conditions would it be better not to use |
|
Will do these later this week:
|
|
If the header and license is not included in the tarball when doing Edit: |
|
The files are getting included in the tar ball @mattip : |
|
Currently, I am trying to break down libdivide header by following: #13516 (comment) I hope this is the right approach to support multi-platform SIMD |
|
I think it is fine to leave universal simd for another PR. Other reviewers: should this hit the mailing list/get a release note? |
|
Oh, sure @mattip , it seems like theres more we can do with libdivide itself, different algos we can choose from, etc. Right now it's the most generic one, aimed to work with compatibility. Will raise future PRs with experimentations. |
d3d932e to
72dcc04
Compare
|
Since this doesn't change anything user-facing, I don't think we need to hit the mailing list or even add release note. It might be nice to make a release note to summarize some speed improved also due to SIMD (maybe we should add a performance category to the release notes, since performance changes always seem to interest everyone even though they are rarely important from a compatibility perspective). We should double check that zlib is fine for inclusion, so that this has definitely no affect on the NumPy license. |
I want some such section in the release notes and would be grateful is someone else provides it. |
numpy/core/src/umath/loops.c.src
Outdated
| npy_set_floatstatus_divbyzero(); | ||
| *((@type@ *)op1) = 0; | ||
| } | ||
| else if (((in1 > 0) != (in2 > 0)) && (in1 % in2 != 0)) { |
There was a problem hiding this comment.
Can libdivide compute in1 % in2 for you? It seems a bit silly to use libdivide only to then perform a remainder calculation without it.
There was a problem hiding this comment.
I honestly think we can avoid this % and rewrite it as postprocessing?
There was a problem hiding this comment.
So I still don't know how removing the % gave no performance boost :). The compiler is magically optimizing something. #17727 (comment)
There was a problem hiding this comment.
Oh, interesting. @ganesh-k13 two things: First make sure you are dividing a positive by a negative number (or vice versa), otherwise this is not hit at all. Second, was the timing difference with libdivide? I guess it might be the compiler is smart enough to optimize the modulo away, but I would be surprised if it is smart enough when libdivide is being used?
There was a problem hiding this comment.
They seem to have not done it yet: ridiculousfish/libdivide#9
There was a problem hiding this comment.
All this changes is subtract one for rounding purproses, Now unless there is some edge case again, I think you can just do without the subtract, and then move the if to later, so that if (res < 0) && (res * in2 != in1) { res -= 1}?
There was a problem hiding this comment.
You were right about not hitting the case, @seberg , seems like in the profile script I forgot to invert the signs. Above method seems to work, few edge cases to iron out(like <= 0, etc), will try them.
There was a problem hiding this comment.
I found three edge cases:
- res is
0, then possible negative divisor/dividend - divisor is 0 or -0, handled by putting inside else
- dividend is 0, handled by the same logic as 1.
Let me know if any more are there.
[EDIT]: Can use the same logic in sliding as well.
a94afb9 to
285d810
Compare
seberg
left a comment
There was a problem hiding this comment.
OK, thanks for all the quick followups. I am happy with the current loop macro setups (we can always change them easily anyway).
There is only one big issue remaining from my side, and that is the empty-array case, which we have to guard against in the constant branch. (And it is slightly scary tests did not find it, although it may be that we have tests for it, and we would just have to use valgrind or so to notice the issue reliably).
| assert_equal(x % 100, [5, 10, 90, 0, 95, 90, 10, 0, 80]) | ||
|
|
||
| @pytest.mark.parametrize("input_dtype", | ||
| [np.int8, np.int16, np.int32, np.int64]) |
There was a problem hiding this comment.
Probably more than necessary, but that is fine. The different unit cases are not super interesting for the division code, but frankly a bit many tests are fine :).
benchmarks/benchmarks/bench_ufunc.py
Outdated
| self.x = np.arange(size) | ||
|
|
||
| def time_floor_divide(self, size): | ||
| self.x//8 |
There was a problem hiding this comment.
I suppose dividing by 8 is one of the best case scenarios? A bit curious how things behave for dividing by a less weird number? (but honestly, just curious, I trust that libdivide is worth it).
One thing I am more curious about is how much the speedup is for the small integers, like int8, etc? I guess there are not many specialized registers for those, so the upcast is probably just as well?
There was a problem hiding this comment.
I was not happy with the initial tests, so I rewrote them, please let me know
[EDIT] After improved commit a5e1235
before after ratio
[3dca0c71] [8912ffd9]
<master> <enh_14959-libdivide>
- 2.36±0.01μs 2.13±0.04μs 0.90 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, 0)
- 2.11±0.01μs 1.84±0.05μs 0.87 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, -43)
- 2.11±0.02μs 1.79±0.01μs 0.85 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, 43)
- 2.19±0.01μs 1.66±0.01μs 0.76 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, -8)
- 2.18±0.03μs 1.65±0.01μs 0.76 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, 8)
- 45.1±0.3ms 25.6±0.05ms 0.57 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, 0)
- 285±2μs 137±0.2μs 0.48 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, -43)
- 287±0.5μs 138±1μs 0.48 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, 43)
- 121±0.8ms 53.5±0.2ms 0.44 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, 43)
- 122±0.2ms 53.3±0.3ms 0.44 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, -43)
- 115±0.08ms 44.8±1ms 0.39 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, 43)
- 116±0.6ms 44.4±0.7ms 0.38 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, -43)
- 304±0.5μs 107±8μs 0.35 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, 8)
- 126±1ms 42.3±0.3ms 0.34 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, -8)
- 126±0.1ms 42.4±0.07ms 0.34 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, 8)
- 302±1μs 100±2μs 0.33 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, -8)
- 42.7±0.1ms 12.9±0.04ms 0.30 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, 0)
- 120±0.1ms 35.1±0.09ms 0.29 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, 8)
- 118±0.2ms 34.6±0.05ms 0.29 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, -8)
- 98.8±1μs 22.5±0.2μs 0.23 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, 0)
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.
There was a problem hiding this comment.
I did choose 8 to show nice results :) . But yeah we get some speedup for a number like 43 with int8 itself.
f5a66f3 to
ca4ba20
Compare
|
Hey @seberg, any other changes needed here? |
seberg
left a comment
There was a problem hiding this comment.
OK, I am happy to put this in, may merge in a bit. But maybe someone else wants to do a quick pass. (marked a few nitpicks, but nothing that matters). There might be some followup with other division loops possible, but not sure.
An exhaustive test for int32 might be neat (not as a test, just to be sure), but it should not be necessary.
|
Thanks @seberg , I was planning to work on the universal intrinsics. We still have not used the SSE and AVX versions of libdivide. I will make a POC like last time to see the improvements and we can take a call if it's worth the change.
We have covered ints and timedeltas, are there more? Will be happy to port to them as well in a follow-up. |
|
Hmm, might have thought in the wrong direction about other loops, I guess libdivide probably does not have any unsigned integer versions? Should just double check which |
|
libdivide does support unsigned 32 and 64 bit ints. I did a quick browse of the code, the other division loops are for float only. Universal intrinsics will be a follow up PR.
Anything in particular I can test here? Any testing strategy? The current added UT tests for all boundry case numbers. |
|
Thanks @ganesh-k13 |
|
Further implementation and test improvements can be in follow-on PRs. |

Improvements:
Specs: ryzen 3600(12 cores), 32 GB memory
Changes:
Added a macro check to use libdivide or not:NPY_USE_LIBDIVIDE%in floor divide code.*new
Profiler code:
https://gist.github.com/ganesh-k13/6201227c5d3d65902c6eaf71357e72b1
Note: This is just a POC to show some performance improvementsresolves: #14959
cc:
@eric-wieser
@mattip
@seberg