
MAINT: AVX512 implementation with intrinsic for float64 input np.exp()#15648

Merged

mattip merged 12 commits into numpy:master from xiegengxin:avx512-exp-float64 on Apr 23, 2020
Conversation

@xiegengxin (Contributor) commented Feb 26, 2020

Only the float32 exp function has been implemented with AVX512 intrinsics in NumPy; when the input is np.float64, np.exp() still uses the scalar version. I've implemented a float64 exp function with AVX512 intrinsics. The algorithm is described in Tang, P.T.P., "Table-driven implementation of the exponential function in IEEE floating-point arithmetic," ACM Transactions on Mathematical Software, vol. 15, pp. 144-157, 1989.

I measured the execution time of np.exp using the timeit package. The test data is a 2*1024*1024 1-d array generated by np.random.randn(), and np.exp() is executed 16 times. The AVX512 implementation is about 2.5x faster than the scalar version. I tested on CentOS 8 with an Intel(R) Xeon(R) Platinum 8280L CPU @ 2.70GHz.

function    avx512    scalar version (libm)
time (s)    0.09      0.229

Tang's paper proves that the accuracy of his algorithm is better than 0.77 ulp. I provided 104 accuracy test cases in numpy/core/tests/data/umath-validation-set-exp; the accuracy tests show that none of the errors is larger than 1 ulp.
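For readers without access to the paper, the core of Tang's table-driven scheme can be sketched in a few lines of Python. The table size (32), the degree-5 polynomial, and the constants below are illustrative choices of mine, not the PR's actual kernel, which uses carefully rounded table entries and exact split constants:

```python
import math

# Table-driven exp (after Tang, TOMS 1989), illustrative sketch:
# write x = n*(ln2/32) + r with |r| <= ln2/64, and n = 32*m + j, so that
# exp(x) = 2^m * 2^(j/32) * exp(r), with 2^(j/32) taken from a table.
L = 32
TBL = [2.0 ** (j / L) for j in range(L)]        # T[j] = 2^(j/32)
LN2_L = math.log(2.0) / L

def exp_table(x):
    n = round(x / LN2_L)                         # nearest multiple of ln2/32
    r = x - n * LN2_L                            # tiny remainder, |r| <= ln2/64
    m, j = divmod(n, L)                          # n = 32*m + j, 0 <= j < 32
    # degree-5 Taylor polynomial approximates exp(r) on the tiny interval
    p = 1.0 + r * (1.0 + r * (0.5 + r * (1/6 + r * (1/24 + r / 120))))
    return math.ldexp(TBL[j], m) * p             # 2^m * 2^(j/32) * exp(r)
```

The vectorized kernel does the same reduction per SIMD lane; the table lookup is the step discussed in the review comments below.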

@mattip (Member) commented Feb 26, 2020

Cool. A few questions:

  • Does this show up in the benchmark suite (python -mpip install asv virtualenv; python runtests.py --benchmark-compare HEAD)?
  • How much larger does the _multiarray_umath.so module grow with this code addition? We are a little concerned about code bloat.
  • This reimplements a new algorithm. Can any code be shared with the float32 version?
  • How did you generate the "correct" answers for the accuracy tests?

@xiegengxin (Contributor, Author)

Cool. A few questions:

  • Does this show up in the benchmark suite (python -mpip install asv virtualenv; python runtests.py --benchmark-compare HEAD)?
  • How much larger does the _multiarray_umath.so module grow with this code addition? We are a little concerned about code bloat.
  • This reimplements a new algorithm. Can any code be shared with the float32 version?
  • How did you generate the "correct" answers for the accuracy tests?
  • With this PR, _multiarray_umath.so is 23494216 bytes; without it, 23456872 bytes. Its size has grown by 37344 bytes.
  • The output data in the accuracy tests is generated by libm.
  • The range reduction and polynomial approximation differ from the float32 version, and the data type is different as well. If necessary, I'll try to share some code with the float32 version to make it shorter.

For the first question, I ran python runtests.py --benchmark-compare HEAD, but it printed 'runtests.py: error: unrecognized arguments: --benchmark-compare'.

@mattip (Member) commented Feb 27, 2020

For the first question, I ran python runtests.py --benchmark-compare HEAD, but it printed 'runtests.py: error: unrecognized arguments: --benchmark-compare'.

sorry, python runtests.py --bench-compare HEAD

@mattip (Member) commented Feb 27, 2020

Thanks for the clarifications.

Is there a link to the paper that is not behind a paywall?

I would hope for a larger than 2.5x speedup, I guess libm (probably using sse under the hood) does pretty well. While significant, I am not sure a 2.5x speedup justifies the complexity. I would be interested how the performance changes with array length, since there is a cost to the loop setup as well as swapping the table into memory. Do you know if libm uses a similar technique to calculate exp?
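One way to probe the length dependence asked about here is a simple timeit sweep over array sizes; small arrays expose the per-call setup cost, large ones show steady-state throughput. This is a hedged sketch of mine, not the PR's benchmark; sizes and repeat counts are illustrative:

```python
import timeit
import numpy as np

def bench_exp(n, repeats=16):
    # time `repeats` calls of np.exp on an n-element float64 array
    x = np.random.randn(n)
    return timeit.timeit(lambda: np.exp(x), number=repeats)

for n in (1 << 10, 1 << 15, 1 << 21):
    print(f"n={n:>8}: {bench_exp(n):.4f} s for 16 calls")
```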

@Qiyu8 (Member) commented Feb 27, 2020

If this algorithm benefits ARM-based architectures, then it should consider using the universal intrinsics framework described in #13516, which will be merged within a few weeks. I believe this would be a good example of exercising that framework.

@matthew-brett (Contributor)

What are the ULP errors of the libm implementation? Can we do better with mpmath or similar?

@mattip (Member) commented Feb 27, 2020

@Qiyu8 in other AVX512 PRs we took the approach that we would prefer to have a solid reference implementation in place, and migrate it to universal intrinsics (assuming the migration cost is low), since then we have benchmarks and accuracy tests for comparison. That is just my perspective; perhaps there are other opinions.

Review comment (Member):

This seems really inefficient (a store to memory followed by a load again). I bet you could optimize it with a gather instruction. Alternatively, you could load your table EXP_T into 8 registers at the very beginning and use shuffle/permute instructions to select what you want.

Review comment (Member):

There is no dearth of zmm registers, so loading 64 doubles into 8 registers at the very beginning should help.
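As a NumPy illustration of the register-resident lookup being suggested (purely an emulation of mine; the table contents and indices are made up): the 64-entry table is split into 8 blocks of 8 lanes, and each lane's 6-bit index selects a block with its high 3 bits and a lane within that block with its low 3 bits, which is roughly what the AVX512 permute instructions do across preloaded registers.

```python
import numpy as np

# stand-in for the 64-entry double-precision lookup table (contents hypothetical)
table = np.exp(np.arange(64) / 64.0)
regs = table.reshape(8, 8)           # 8 "zmm registers" of 8 doubles each

idx = np.array([3, 17, 42, 63, 0, 8, 31, 55])    # per-lane table indices
# high 3 bits pick the register, low 3 bits pick the lane (the permute step)
gathered = regs[idx >> 3, idx & 7]

assert np.array_equal(gathered, table[idx])      # same result as a direct gather
```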

Review comment (Contributor, Author):

I'm reading some introductory material on the shuffle/permute intrinsics. I'll put some effort into it. Thanks.

Review comment (Member):

Perhaps this should be moved to some common header?

Review comment (Contributor, Author):

I've moved the tables into numpy/core/include/numpy/npy_math.h, but this causes a lot of unused-variable warnings. Should I create a new header file or move the tables into another header file?

Review comment (Contributor, Author):

Sorry, I missed mattip's comment. I'll add #if defined HAVE_ATTRIBUTE_TARGET_AVX512F_WITH_INTRINSICS && defined NPY_HAVE_SSE2_INTRINSICS in the header file.

Review comment (Member):

Should I create a new header file or move the tables into other header file?

You should not be adding anything to numpy/core/include; those are all public headers. I see why you were misled: there is content in npy_math.h which does not belong there. You should create a new header file under numpy/core/src/umath. What do you think of the name numpy/core/src/umath/npy_tables.h? It would be nice to move the other non-public constants from npy_math.h into the new file; I think all the "constants used in vector ..." can be safely moved.

Review comment (Member):

How was the reference output computed?

Review comment (Contributor, Author):

The output data is generated by libm.

@r-devulap (Member)

I would recommend adding more tests, especially ones that cover special cases and strides (negative and positive strides).

@xiegengxin (Contributor, Author)

Thanks for the clarifications.

Is there a link to the paper that is not behind a paywall?

I would hope for a larger than 2.5x speedup, I guess libm (probably using sse under the hood) does pretty well. While significant, I am not sure a 2.5x speedup justifies the complexity. I would be interested how the performance changes with array length, since there is a cost to the loop setup as well as swapping the table into memory. Do you know if libm uses a similar technique to calculate exp?

About the paper: could I send you the full text by e-mail? I haven't found a free link.

libm has a similar scalar implementation; the code is under glibc-2.30/sysdeps/ieee754/dbl-64/e_exp.c. libm also has vector versions; the AVX512 source is under glibc-2.30/sysdeps/x86_64/fpu/multiarch/svml_d_exp8_core_avx512.S, and there is AVX2 and SSE code as well. The AVX512 code in libm is a similar implementation of the table-driven method.

@r-devulap (Member)

BTW, I see a 26% speed up in the logistic regression benchmark with this patch. The benchmark currently runs with float32, it can be easily updated to run with float64 where you will see this speed up.

class LogisticRegression(Benchmark):

@xiegengxin (Contributor, Author) commented Mar 2, 2020

BTW, I see a 26% speed up in the logistic regression benchmark with this patch. The benchmark currently runs with float32, it can be easily updated to run with float64 where you will see this speed up.

class LogisticRegression(Benchmark):

Using permute intrinsics instead of loops to fetch the lookup-table data improves the performance of the new code.

  1. 2*1024*1024 1-d array, executed 16 times

     function    avx512    scalar version (libm)
     time (s)    0.061     0.308

     The previous scalar-version result was 0.229 s, so the new code is 3.7x (vs 0.229 s) or 5x (vs 0.308 s) faster than the scalar version.

  2. Logistic regression benchmark

     function    avx512    scalar version (libm)
     time (s)    3.53      3.92

     The new code gives an 11% speedup in the logistic regression benchmark.
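The exp-heavy step of such a benchmark might look roughly like this. This is a hypothetical sketch of an asv-style class, not the actual bench_avx.LogisticRegression code; in the real suite the class subclasses Benchmark and is parameterized over dtypes, and all names and sizes below are mine:

```python
import numpy as np

class LogisticRegressionSketch:
    """Hypothetical stand-in for an asv benchmark class."""

    def setup(self, dtype=np.float64):
        rng = np.random.RandomState(42)
        self.X = rng.randn(1000, 10).astype(dtype)   # feature matrix
        self.w = rng.randn(10).astype(dtype)         # weight vector

    def time_sigmoid(self):
        # the np.exp call is the hot spot that the AVX512 float64 loop accelerates
        z = self.X @ self.w
        return 1.0 / (1.0 + np.exp(-z))
```

Running time_sigmoid with a float64 dtype exercises exactly the new vectorized loop, which is why the benchmark shows a speedup once a float64 case is added.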

@charris charris changed the title Avx512 implementation with intrinsic for float64 input np.exp() ENH: AVX512 implementation with intrinsic for float64 input np.exp() Mar 2, 2020
@charris charris changed the title ENH: AVX512 implementation with intrinsic for float64 input np.exp() MAINT: AVX512 implementation with intrinsic for float64 input np.exp() Mar 2, 2020
@xiegengxin xiegengxin force-pushed the avx512-exp-float64 branch from ce6c0fd to 3c8a0be on March 4, 2020 01:50
@xiegengxin (Contributor, Author)

I ran python setup.py build -j 4 install --prefix $HOME/.local to build this code on macosx-10.15, and it succeeded. I don't know why it fails in the CI check.

0x3C99D3E12DD8A18B,
};


Review comment (Member):

This code does not belong in a public header, and needs to be bracketed with

#if defined HAVE_ATTRIBUTE_TARGET_AVX512F_WITH_INTRINSICS && defined NPY_HAVE_SSE2_INTRINSICS

to avoid compiler warnings about unused variables.

Review comment (Contributor, Author):

Bracketing the data still causes unused warnings; moving it to a separate header file avoids them. Where should I add the new file?

Review comment (Member):

You should create a new header file under numpy/core/src/umath. What do you think of the name numpy/core/src/umath/npy_tables.h? It would be nice to move the other non-public constants from npy_math.h into the new file; I think all the "constants used in vector ..." can be safely moved.

Review comment (Contributor, Author):

The lookup table and the "constants used in vector ..." have been moved into numpy/core/src/umath/npy_simd_data.h.

@xiegengxin (Contributor, Author) commented Mar 17, 2020

For the first question, I run python runtests.py --benchmark-compare HEAD, but there print 'runtests.py: error: unrecognized arguments: --benchmark-compare'.

sorry, python runtests.py --bench-compare HEAD

I have added a float64 case to LogisticRegression. You can see the report by running python runtests.py --bench-compare HEAD or python runtests.py --bench bench_avx.LogisticRegression. bench_avx.AVX_UFunc can also test the performance.

  • LogisticRegression

     dtype            avx512        scalar version (libm)
     numpy.float64    4.27±0.01s    4.69±0.01s

  • AVX_UFunc

     stride    avx512         scalar version (libm)
     1         18.6±0.01μs    110±0.05μs
     2         18.8±0.03μs    109±0.8μs
     4         18.8±0.03μs    110±0.1μs

When asv creates the virtual environment, there is an error about the virtualenv option --no-site-packages; virtualenv 20.0.10 doesn't have this option.

@Qiyu8 (Member) commented Mar 17, 2020

When asv creates the virtual environment, there is an error about the virtualenv option --no-site-packages; virtualenv 20.0.10 doesn't have this option.

My virtualenv version is 16.7.9 and works fine; 20.0.10 seems unstable or missing components.

@rossbar (Contributor) commented Mar 17, 2020

@Qiyu8 correct, this is a known issue with asv and virtualenv>=20.0. The workaround is to pin virtualenv==16.7.9.

@hameerabbasi (Contributor)

python setup.py install succeeds for me.

$ clang --version
Apple clang version 11.0.3 (clang-1103.0.32.29)
Target: x86_64-apple-darwin19.4.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

@mattip (Member) commented Apr 2, 2020

@xiegengxin could you make those changes?

  • put in some macros to limit this optimization to appropriate versions of clang
  • change the azure script to install a better compiler

It would be really nice if we could somehow print out which loops are in play, but I guess that is something for a different PR.

@WarrenWeckesser (Member)

There have been a few questions about how the known values for the tests were generated:

@mattip asked

How did you generate the "correct" answers for the accuracy tests?

and @r-devulap asked

How was the reference output computed?

@xiegengxin replied

The output data in accuracy test is generated by libm.

@matthew-brett asked

What are the ULP errors of the libm implementation? Can we do better with a mpmath or similar?

So, for what it's worth, I wrote a script to check the reference values with results computed using mpmath. The "correct" value is computed with mpmath using 40 digits of precision and then cast back to double precision float.

import numpy as np
import mpmath

# 40 significant digits: far more precision than float64 needs
mpmath.mp.dps = 40

# each row of the file: dtype, hex input, hex expected output, ulp bound
data = np.loadtxt('umath-validation-set-exp', delimiter=',',
                  dtype=str, skiprows=1)
# strip the "0x" prefixes and keep only the float64 rows
values = [[int(inp[2:], 16), int(out[2:], 16)]
          for typ, inp, out, ulp in data if typ == 'np.float64']
arr = np.array(values, dtype=np.uint64)
fltvals = arr.view(np.float64)     # reinterpret the bit patterns as doubles

for k, (x, y) in enumerate(fltvals):
    # round the 40-digit result back to the nearest double
    mpy = float(mpmath.exp(x))
    if y != mpy:
        print(f"index of float64 values: {k}")
        print(f"x input from file:         0x{values[k][0]:016x}")
        print(f"expected exp(x) from file: 0x{values[k][1]:016x}")
        u = np.array([mpy])
        print(f"mpmath result:             0x{u.view(np.uint64)[0]:016x}")

It finds one value computed with mpmath that doesn't agree with the expected value in the input file.

index of float64 values: 63
x input from file:         0xc086234f00000000
expected exp(x) from file: 0x000fba54632dddbf
mpmath result:             0x000fba54632dddc0

Does the method of using mpmath to compute the expected value look correct? Should 0x000fba54632dddc0 be the reference value in that case?

@mattip (Member) commented Apr 2, 2020

That value is off by 1 ULP, right? So maybe rounding differences

@WarrenWeckesser (Member)

Right, and I'm not sure the one computed via mpmath is necessarily a better "true" value, so perhaps not worth changing. Mainly the mpmath code verifies that the reference values in the PR are good, with verification independent of libm.

@mattip (Member) commented Apr 2, 2020

That script looks good, could you make a PR to add it as a test with a +-1 ULP allowance?
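A ±1 ULP comparison can be done with the usual bit-pattern trick: for same-signed finite doubles, the IEEE-754 bit patterns viewed as integers are ordered, so their difference counts representable steps. A minimal sketch (the helper name is mine, and it is limited to positive finite values, which covers the exp reference data):

```python
import numpy as np

def ulp_diff(a, b):
    """ULP distance between two positive finite float64 values."""
    # view the two doubles' bit patterns as signed 64-bit integers
    bits = np.array([a, b], dtype=np.float64).view(np.int64)
    return int(abs(bits[0] - bits[1]))

# the value found above differs from the mpmath result by exactly 1 ulp
pair = np.array([0x000fba54632dddbf, 0x000fba54632dddc0],
                dtype=np.uint64).view(np.float64)
assert ulp_diff(pair[0], pair[1]) == 1
```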

@mattip (Member) commented Apr 2, 2020

... and it would require installing mpmath. Maybe only run the test if the import succeeds?

@hameerabbasi (Contributor)

Maybe only run the test if import succeeds?

What about hardcoding the results from mpmath?

@mattip (Member) commented Apr 2, 2020

Testing the values would ensure that the underlying platform's conversion chain (input bytes -> platform float/complex -> the mpmath function -> platform float/complex -> output bytes) is consistent with the values in the file, both via the libc math functions and via mpmath.

@xiegengxin (Contributor, Author)

Updated the test cases.

Does the method of using mpmath to compute the expected value look correct? Should 0x000fba54632dddc0 be the reference value in that case?

Thanks. @WarrenWeckesser

@xiegengxin (Contributor, Author)

@mattip About the azure yml file:

  1. Should I create a new PR to update the macOS version?
  2. Where can I test my change?

@mattip (Member) commented Apr 3, 2020

Cool, macOS is now passing. I made PR #15903 to update xcode in CI.

@r-devulap any thoughts?

@xiegengxin this needs a release note numpy/doc/release/upcoming_changes/15648.improvement.rst. It should state the benchmark improvement (something like 5-7x if I read your numbers correctly?) and the overall executable size increase on linux64.

@mattip mattip added the "56 - Needs Release Note (needs an entry in doc/release/upcoming_changes)" label on Apr 3, 2020
@r-devulap (Member)

@mattip: looks good to me. Benchmark numbers look great on SkylakeX too:

106±0.0us       17.5±0.3us     0.16  bench_avx.AVX_UFunc.time_ufunc('exp', 4, 'd')
107±1us          17.1±0.4us     0.16  bench_avx.AVX_UFunc.time_ufunc('exp', 2, 'd')
106±4us          16.2±0.3us     0.15  bench_avx.AVX_UFunc.time_ufunc('exp', 1, 'd')

If it isn't there already, I would recommend adding one more test to ensure overflow flag is raised for large floating points. Something like this:

assert_raises(FloatingPointError, np.exp, np.float32(100.))
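Such a test relies on the vectorized loop setting the FPU overflow flag; under np.errstate(over='raise') the flag surfaces as a FloatingPointError. A minimal sketch for the float64 case (the helper name is mine; the threshold comment assumes IEEE double, where exp overflows above roughly 709.78):

```python
import numpy as np

def exp_overflows(x):
    """Return True if np.exp(x) raises the floating-point overflow flag."""
    with np.errstate(over='raise'):
        try:
            np.exp(np.float64(x))
        except FloatingPointError:
            return True
    return False

# exp(1000) exceeds DBL_MAX, exp(100) is representable in float64
assert exp_overflows(1000.0)
assert not exp_overflows(100.0)
```

Note that r-devulap's one-liner uses np.float32, where the overflow threshold is much lower (around 88.7), so np.float32(100.) already overflows.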

Use AVX512 intrinsic to implement ``np.exp`` when input is ``np.float64``
-------------------------------------------------------------------------
Use AVX512 intrinsic to implement ``np.exp`` when input is ``np.float64``,
which can imporve the performance of ``np.exp`` with ``np.float64`` input 5-7x
Review comment (Member), suggested change:

- which can imporve the performance of ``np.exp`` with ``np.float64`` input 5-7x
+ which can improve the performance of ``np.exp`` with ``np.float64`` input 5-7x

@mattip (Member) left a comment:

Looks good to me.

@xiegengxin (Contributor, Author)

Could anybody help me with a +1?

@mattip mattip merged commit 96c21f0 into numpy:master Apr 23, 2020
@mattip (Member) commented Apr 23, 2020

Thanks @xiegengxin

9 participants