MAINT: AVX512 implementation with intrinsic for float64 input np.exp() #15648
mattip merged 12 commits into numpy:master from
Conversation
Cool. A few questions:

For the first question, I run

sorry,
Thanks for the clarifications. Is there a link to the paper that is not behind a paywall? I would hope for a larger than 2.5x speedup; I guess libm (probably using sse under the hood) does pretty well. While significant, I am not sure a 2.5x speedup justifies the complexity. I would be interested in how the performance changes with array length, since there is a cost to the loop setup as well as swapping the table into memory. Do you know if libm uses a similar technique to calculate exp?
If this algorithm benefits arm-based architectures, then it should consider using the universal intrinsics framework described in 13516, which will be merged within a few weeks. I believe this would be a good example of activating that framework.

What are the ULP errors of the libm implementation? Can we do better with mpmath or similar?

@Qiyu8 in other avx512 PRs we took the approach that we would prefer to have a solid reference implementation in place, and migrate it to universal intrinsics (assuming the migration cost is low), since then we have benchmarks and accuracy tests for comparison. That was just my perspective; perhaps there are other opinions.
numpy/core/src/umath/simd.inc.src
This seems really inefficient (a store to memory followed by a load again). I bet you could optimize it either with a gather instruction, or alternatively by loading your table EXP_T into 8 registers at the very beginning and using shuffle/permute instructions to select what you want.

There is no dearth of zmm registers, so loading 64 doubles into 8 registers at the very beginning should help.

I'm reading some introductions to the shuffle/permute intrinsics. I'll put some effort into it. Thanks.
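For illustration only (the helper names here are hypothetical, not NumPy APIs), a scalar Python model of the permute-based lookup being suggested: _mm512_permutex2var_pd selects each output lane from the 16-entry concatenation of two 8-lane registers, so a 64-entry double-precision table held in eight zmm "registers" can be indexed with four such selects plus mask blends on the top index bits.

```python
def permutex2var(a, b, idx):
    """Scalar model of _mm512_permutex2var_pd: each output lane picks an
    element from the 16-entry concatenation of the 8-lane vectors a and b,
    using the low 4 bits of the corresponding index."""
    table = list(a) + list(b)
    return [table[i & 0xF] for i in idx]

def lookup64(table, idx):
    """Look up 8 doubles from a 64-entry table kept in eight 8-lane
    'registers': four permutex2var selects (each covering a 16-entry pair),
    then a blend on the top two index bits (mask blends on real hardware)."""
    regs = [table[8 * k:8 * (k + 1)] for k in range(8)]
    low = [i & 0xF for i in idx]
    quads = [permutex2var(regs[2 * k], regs[2 * k + 1], low) for k in range(4)]
    return [quads[i >> 4][lane] for lane, i in enumerate(idx)]

tbl = [float(v) for v in range(64)]
print(lookup64(tbl, [0, 7, 13, 21, 34, 42, 55, 63]))
# -> [0.0, 7.0, 13.0, 21.0, 34.0, 42.0, 55.0, 63.0]
```

This avoids the store/load round-trip entirely: the whole table lives in registers for the duration of the loop.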
numpy/core/src/umath/simd.inc.src
Perhaps this should be moved to some common header?
I've moved the tables into numpy/core/include/numpy/npy_math.h. But this causes a lot of unused-variable warnings. Should I create a new header file or move the tables into another header file?

Sorry, I missed mattip's comment. I'll add #if defined HAVE_ATTRIBUTE_TARGET_AVX512F_WITH_INTRINSICS && defined NPY_HAVE_SSE2_INTRINSICS in the header file.

Should I create a new header file or move the tables into another header file?

You should not be adding anything to numpy/core/include; those are all public headers. I see why you were misled: there is content in npy_math.h which does not belong there. You should create a new header file under numpy/core/src/umath. What do you think of the name numpy/core/src/umath/npy_tables.h? It would be nice to move the other non-public constants from npy_math.h into the new file; I think all the "constants used in vector ..." can be safely moved.
How was the reference output computed?

The output data is generated by libm.
I would recommend adding more tests, especially ones that cover special cases and strides (negative and positive strides).

About the paper: could I send the full text by e-mail? I haven't found a free link. libm has a similar implementation in its scalar version; the libm scalar version code is under the path
BTW, I see a 26% speed up in the logistic regression benchmark with this patch. The benchmark currently runs with float32; it can easily be updated to run with float64, where you will see this speed up. numpy/benchmarks/benchmarks/bench_avx.py Line 133 in eb167a3

Using permute intrinsics instead of loops to get lookup table data improves the performance of the new code.
The previous test result of the scalar version is 0.229s, so the new code is 3.7x (vs 0.229s) or 5x (vs 0.308s) faster than the scalar version.
The new code speeds up the logistic regression benchmark by 11%.
I have run
numpy/core/include/numpy/npy_math.h
    0x3C99D3E12DD8A18B,
};
This code does not belong in a public header, and needs to be bracketed with
#if defined HAVE_ATTRIBUTE_TARGET_AVX512F_WITH_INTRINSICS && defined NPY_HAVE_SSE2_INTRINSICS
to avoid compiler warnings about unused variables.
Bracketing the data still causes unused-variable warnings. Moving them to a separate header file avoids these warnings. Where should I add the new file?
You should create a new header file under numpy/core/src/umath. What do you think of the name numpy/core/src/umath/npy_tables.h? It would be nice to move the other non-public constants from npy_math.h into the new file; I think all the "constants used in vector ..." can be safely moved.
The lookup table and the "constants used in vector ..." have been moved into numpy/core/src/umath/npy_simd_data.h.
I have provided a float64 case of LogisticRegression. You can see the report by running
When
my virtualenv version is 16.7.9 and works fine. 20.0.10 seems unstable or to be lacking components.
@Qiyu8 correct - this is a known issue with
$ clang --version
Apple clang version 11.0.3 (clang-1103.0.32.29)
Target: x86_64-apple-darwin19.4.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
@xiegengxin could you make those changes?
It would be really nice if we could somehow print out which loops are in play, but I guess that is something for a different PR
There have been a few questions about how the known values for the tests were generated: @mattip asked
and @r-devulap asked
@xiegengxin replied
@matthew-brett asked
So, for what it's worth, I wrote a script to check the reference values against results computed using mpmath. The "correct" value is computed with mpmath using 40 digits of precision and then cast back to double precision float. It finds one value computed with mpmath that doesn't agree with the expected value in the input file. Does the method of using mpmath to compute the expected value look correct? Should
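For comparison, the same kind of check can be sketched with only the standard library, using decimal's high-precision exp in place of mpmath (exp_reference is a hypothetical helper name, not the script's actual code):

```python
import math
from decimal import Decimal, getcontext

getcontext().prec = 40  # 40 significant digits, matching the check described

def exp_reference(x):
    """exp(x) computed in high precision, then rounded back to a double."""
    return float(Decimal(x).exp())

# libm (math.exp here) should agree with the rounded reference to within 1 ulp
for x in (-5.0, -0.5, 0.0, 1.0, 3.25, 10.0):
    ref = exp_reference(x)
    assert abs(math.exp(x) - ref) <= math.ulp(ref)
```

The key idea is that the high-precision value, once rounded to binary64, is the correctly rounded result, so any libm disagreement beyond 1 ulp is a real accuracy issue.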
That value is off by 1 ULP, right? So maybe rounding differences
Right, and I'm not sure the one computed via mpmath is necessarily a better "true" value, so perhaps not worth changing. Mainly the mpmath code verifies that the reference values in the PR are good, with verification independent of libm.
That script looks good, could you make a PR to add it as a test with a +-1 ULP allowance?
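A +-1 ULP allowance can be implemented by mapping doubles onto a monotone integer scale of their bit patterns, so adjacent floats map to adjacent integers; a sketch (ulp_diff is a hypothetical name, and NaNs/infinities are not handled):

```python
import struct

def _ordered(x):
    """Map a finite double to an integer such that adjacent floats map to
    adjacent integers (compensates for the sign-magnitude bit layout)."""
    n = struct.unpack('<q', struct.pack('<d', x))[0]
    return n if n >= 0 else -(1 << 63) - n

def ulp_diff(a, b):
    """Number of representable doubles between a and b."""
    return abs(_ordered(a) - _ordered(b))

assert ulp_diff(1.0, 1.0 + 2.0 ** -52) == 1   # neighbours at 1.0
assert ulp_diff(-0.0, 0.0) == 0               # +-0 compare equal
```

A test would then assert ulp_diff(computed, expected) <= 1 for every reference entry.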
.. and it would require installing mpmath. Maybe only run the test if import succeeds? |
What about hardcoding the results from
Testing the values would ensure that the underlying platform's conversion from the input-bytes -> input-platform-float/complex -> the
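The reference files store values as 64-bit hex patterns (such as 0x3C99D3E12DD8A18B above); the bytes-to-float leg of that conversion can be modeled with struct (hypothetical helper names):

```python
import struct

def bits_to_double(u):
    """Reinterpret a 64-bit integer pattern as an IEEE-754 binary64 value."""
    return struct.unpack('<d', struct.pack('<Q', u))[0]

def double_to_bits(x):
    """Inverse: the 64-bit pattern of a double."""
    return struct.unpack('<Q', struct.pack('<d', x))[0]

assert bits_to_double(0x3FF0000000000000) == 1.0
# for finite values the round-trip is exact, so hex patterns make lossless
# test data regardless of decimal formatting
assert double_to_bits(bits_to_double(0x3C99D3E12DD8A18B)) == 0x3C99D3E12DD8A18B
```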
Updated the test cases.
Thanks. @WarrenWeckesser
@mattip About the azure yml file:
Cool, macOS is now passing. I made PR #15903 to update xcode in CI. @r-devulap any thoughts? @xiegengxin this needs a release note
@mattip: looks good to me. Benchmark numbers look great on SkylakeX too. If it isn't there already, I would recommend adding one more test to ensure the overflow flag is raised for large floating-point inputs. Something like this: numpy/numpy/core/tests/test_umath.py Line 653 in 8c2112d
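As a sketch of the boundary such a test would exercise (stdlib math.exp shown here; the actual NumPy test would assert the overflow warning/flag, e.g. under np.errstate):

```python
import math
import sys

# The float64 overflow threshold for exp is ln(DBL_MAX), about 709.78;
# anything safely above it must set the overflow condition.
finite = math.exp(709.0)          # below the threshold: representable
try:
    math.exp(710.0)               # above the threshold: overflows
    overflowed = False
except OverflowError:
    overflowed = True
assert overflowed
```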
Use AVX512 intrinsic to implement ``np.exp`` when input is ``np.float64``
-------------------------------------------------------------------------
Use AVX512 intrinsic to implement ``np.exp`` when input is ``np.float64``,
which can imporve the performance of ``np.exp`` with ``np.float64`` input 5-7x

Suggested change:
- which can imporve the performance of ``np.exp`` with ``np.float64`` input 5-7x
+ which can improve the performance of ``np.exp`` with ``np.float64`` input 5-7x
Can anybody help me with a +1?
Thanks @xiegengxin |
Currently only the float32 exp function has been implemented with AVX512 intrinsics in numpy; when the input is np.float64, np.exp() still uses the scalar version. I've implemented the float64 exp function with AVX512 intrinsics. The algorithm is described in Tang, P.T.P., "Table-driven implementation of the exponential function in IEEE floating-point arithmetic," ACM Transactions on Mathematical Software, vol. 15, pp. 144-157, 1989. Read here
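As a rough illustration of the table-driven reduction from Tang's paper, here is a pure-Python sketch with 32 table entries; the constants, table size, and polynomial degree are illustrative and not necessarily those used in the PR:

```python
import math

L = 32
LN2_L = math.log(2.0) / L                 # step of the reduction grid
T = [2.0 ** (j / L) for j in range(L)]    # lookup table: T[j] = 2^(j/32)

def exp_table_driven(x):
    """Sketch of Tang-style table-driven exp for finite, moderate x.

    Write x = n*(ln2/32) + r with n = round(x*32/ln2), so |r| <= ln2/64,
    and split n = 32*m + j. Then exp(x) = 2^m * T[j] * e^r, with e^r from
    a short polynomial on the tiny reduced argument r.
    """
    n = round(x / LN2_L)
    r = x - n * LN2_L          # real implementations split ln2/32 into
                               # hi/lo parts here to shrink rounding error
    m, j = divmod(n, L)        # Python's divmod keeps 0 <= j < 32 for n < 0
    # degree-6 Taylor polynomial for e^r; truncation error ~ r^7/5040
    p = 1.0 + r * (1.0 + r * (1/2 + r * (1/6 + r * (1/24 + r * (1/120 + r * (1/720))))))
    return math.ldexp(T[j] * p, m)

for x in (-20.0, -1.0, 0.0, 0.3, 5.5, 50.0):
    assert abs(exp_table_driven(x) - math.exp(x)) <= 1e-13 * math.exp(x)
```

The production code additionally handles overflow/underflow, NaN/inf, and uses carefully rounded table and polynomial constants to reach the sub-ulp bound proved in the paper.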
I measured the execution time of np.exp using the timeit package. The test data is a 2*1024*1024 1-d array generated by np.random.randn(), and np.exp() is executed 16 times. The avx512 implementation is about 2.5x faster than the scalar version. I tested it on Centos8; the CPU is an Intel(R) Xeon(R) Platinum 8280L CPU @ 2.70GHz.
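The measurement described above can be sketched roughly as follows (stdlib math.exp over a smaller list stands in for np.exp over the 2*1024*1024 array, since the exact harness isn't shown):

```python
import math
import random
import timeit

n = 1 << 16                                   # the PR measured 2 * 1024 * 1024
data = [random.gauss(0.0, 1.0) for _ in range(n)]

def run():
    return [math.exp(v) for v in data]        # elementwise exp over the array

elapsed = timeit.timeit(run, number=16)       # executed 16 times, as described
print(f"exp over {n} elements x 16 runs: {elapsed:.3f}s")
```

With NumPy installed, run() would simply be np.exp(a) on the full-size array, and the comparison is the same timing before and after the patch.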
Tang's paper proves that the error of his algorithm is less than 0.77 ulp. I provided 104 accuracy test entries in numpy/core/tests/data/umath-validation-set-exp; the accuracy tests show that none of the errors is larger than 1 ulp.