
MAINT: AVX512 implementation with intrinsic for float64 input np.exp()#15648

Merged

mattip merged 12 commits into numpy:master from xiegengxin:avx512-exp-float64 on Apr 23, 2020
Conversation

@xiegengxin (Contributor) commented Feb 26, 2020

Only the float32 exp function has been implemented with AVX512 intrinsics in NumPy; when the input is np.float64, np.exp() still uses the scalar version. I've implemented a float64 exp function with AVX512 intrinsics. The algorithm is described in Tang, P.T.P., "Table-driven implementation of the exponential function in IEEE floating-point arithmetic," ACM Transactions on Mathematical Software, vol. 15, pp. 144-157, 1989.

I measured the execution time of np.exp using the timeit package. The test data is a 2*1024*1024 1-d array generated by np.random.randn(), and np.exp() is executed 16 times. The AVX512 implementation is about 2.5x faster than the scalar version. I tested on CentOS 8 with an Intel(R) Xeon(R) Platinum 8280L CPU @ 2.70GHz.

function    avx512    scalar version (libm)
time (s)    0.09      0.229

Tang's paper proves that the accuracy of his algorithm is better than 0.77 ulp. I provided 104 accuracy test cases in numpy/core/tests/data/umath-validation-set-exp; the accuracy tests show that none of the errors is larger than 1 ulp.
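For readers without access to the paper, the core of Tang's table-driven scheme can be sketched in a few lines of Python. The table size (32), the degree-5 polynomial, and the constants below are illustrative choices of mine, not the PR's actual kernel, which uses carefully rounded table entries and exact split constants:

```python
import math

# Table-driven exp (after Tang, TOMS 1989), illustrative sketch:
# write x = n*(ln2/32) + r with |r| <= ln2/64, and n = 32*m + j, so that
# exp(x) = 2^m * 2^(j/32) * exp(r), with 2^(j/32) taken from a table.
L = 32
TBL = [2.0 ** (j / L) for j in range(L)]        # T[j] = 2^(j/32)
LN2_L = math.log(2.0) / L

def exp_table(x):
    n = round(x / LN2_L)                         # nearest multiple of ln2/32
    r = x - n * LN2_L                            # tiny remainder, |r| <= ln2/64
    m, j = divmod(n, L)                          # n = 32*m + j, 0 <= j < 32
    # degree-5 Taylor polynomial approximates exp(r) on the tiny interval
    p = 1.0 + r * (1.0 + r * (0.5 + r * (1/6 + r * (1/24 + r / 120))))
    return math.ldexp(TBL[j], m) * p             # 2^m * 2^(j/32) * exp(r)
```

The vectorized kernel does the same reduction per SIMD lane; the table lookup is the step discussed in the review comments below.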

@mattip (Member) commented Feb 26, 2020

Cool. A few questions:

  • Does this show up in the benchmark suite (python -mpip install asv virtualenv; python runtests.py --benchmark-compare HEAD)?
  • How much larger does the _multiarray_umath.so module grow with this code addition? We are a little concerned about code bloat.
  • This reimplements a new algorithm. Can any code be shared with the float32 version?
  • How did you generate the "correct" answers for the accuracy tests?

@xiegengxin (Contributor, Author)

Cool. A few questions:

  • Does this show up in the benchmark suite (python -mpip install asv virtualenv; python runtests.py --benchmark-compare HEAD)?
  • How much larger does the _multiarray_umath.so module grow with this code addition? We are a little concerned about code bloat.
  • This reimplements a new algorithm. Can any code be shared with the float32 version?
  • How did you generate the "correct" answers for the accuracy tests?
  • With this PR, _multiarray_umath.so is 23494216 bytes; without it, 23456872 bytes. Its size has grown by 37344 bytes.
  • The output data in the accuracy tests is generated by libm.
  • The range reduction and polynomial approximation differ from the float32 version, and the data type is different as well. If necessary, I'll try to share some code with the float32 version to make it shorter.

For the first question, I ran python runtests.py --benchmark-compare HEAD, but it printed 'runtests.py: error: unrecognized arguments: --benchmark-compare'.

@mattip (Member) commented Feb 27, 2020

For the first question, I ran python runtests.py --benchmark-compare HEAD, but it printed 'runtests.py: error: unrecognized arguments: --benchmark-compare'.

sorry, python runtests.py --bench-compare HEAD

@mattip (Member) commented Feb 27, 2020

Thanks for the clarifications.

Is there a link to the paper that is not behind a paywall?

I would hope for a larger than 2.5x speedup, I guess libm (probably using sse under the hood) does pretty well. While significant, I am not sure a 2.5x speedup justifies the complexity. I would be interested how the performance changes with array length, since there is a cost to the loop setup as well as swapping the table into memory. Do you know if libm uses a similar technique to calculate exp?
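One way to probe the length dependence asked about here is a simple timeit sweep over array sizes; small arrays expose the per-call setup cost, large ones show steady-state throughput. This is a hedged sketch of mine, not the PR's benchmark; sizes and repeat counts are illustrative:

```python
import timeit
import numpy as np

def bench_exp(n, repeats=16):
    # time `repeats` calls of np.exp on an n-element float64 array
    x = np.random.randn(n)
    return timeit.timeit(lambda: np.exp(x), number=repeats)

for n in (1 << 10, 1 << 15, 1 << 21):
    print(f"n={n:>8}: {bench_exp(n):.4f} s for 16 calls")
```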

@Qiyu8 (Member) commented Feb 27, 2020

If this algorithm benefits ARM-based architectures, then it should consider using the universal intrinsics framework described in #13516, which will be merged within a few weeks. I believe this would be a good example of exercising that framework.

@matthew-brett (Contributor)

What are the ULP errors of the libm implementation? Can we do better with mpmath or similar?

@mattip (Member) commented Feb 27, 2020

@Qiyu8 in other AVX512 PRs we took the approach that we would prefer to have a solid reference implementation in place, and migrate it to universal intrinsics (assuming the migration cost is low), since then we have benchmarks and accuracy tests for comparison. That is just my perspective; perhaps there are other opinions.

Review comment (Member):

This seems really inefficient (a store to memory followed by a load again). I bet you could optimize it with a gather instruction. Alternatively, you could load your table EXP_T into 8 registers at the very beginning and use shuffle/permute instructions to select what you want.

Review comment (Member):

There is no dearth of zmm registers, so loading 64 doubles into 8 registers at the very beginning should help.
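As a NumPy illustration of the register-resident lookup being suggested (purely an emulation of mine; the table contents and indices are made up): the 64-entry table is split into 8 blocks of 8 lanes, and each lane's 6-bit index selects a block with its high 3 bits and a lane within that block with its low 3 bits, which is roughly what the AVX512 permute instructions do across preloaded registers.

```python
import numpy as np

# stand-in for the 64-entry double-precision lookup table (contents hypothetical)
table = np.exp(np.arange(64) / 64.0)
regs = table.reshape(8, 8)           # 8 "zmm registers" of 8 doubles each

idx = np.array([3, 17, 42, 63, 0, 8, 31, 55])    # per-lane table indices
# high 3 bits pick the register, low 3 bits pick the lane (the permute step)
gathered = regs[idx >> 3, idx & 7]

assert np.array_equal(gathered, table[idx])      # same result as a direct gather
```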

Review comment (Contributor, Author):

I'm reading some introductory material on the shuffle/permute intrinsics. I'll put some effort into it. Thanks.

Review comment (Member):

Perhaps this should be moved to some common header?

Review comment (Contributor, Author):

I've moved the tables into numpy/core/include/numpy/npy_math.h, but this causes a lot of unused-variable warnings. Should I create a new header file or move the tables into another header file?

Review comment (Contributor, Author):

Sorry, I missed mattip's comment. I'll add #if defined HAVE_ATTRIBUTE_TARGET_AVX512F_WITH_INTRINSICS && defined NPY_HAVE_SSE2_INTRINSICS in the header file.

Review comment (Member):

Should I create a new header file or move the tables into other header file?

You should not be adding anything to numpy/core/include; those are all public headers. I see why you were misled: there is content in npy_math.h which does not belong there. You should create a new header file under numpy/core/src/umath. What do you think of the name numpy/core/src/umath/npy_tables.h? It would be nice to move the other non-public constants from npy_math.h into the new file; I think all the "constants used in vector ..." can be safely moved.

Review comment (Member):

How was the reference output computed?

Review comment (Contributor, Author):

The output data is generated by libm.

@r-devulap (Member)

I would recommend adding more tests, especially ones that cover special cases and strides (negative and positive strides).

@xiegengxin (Contributor, Author)

Thanks for the clarifications.

Is there a link to the paper that is not behind a paywall?

I would hope for a larger than 2.5x speedup, I guess libm (probably using sse under the hood) does pretty well. While significant, I am not sure a 2.5x speedup justifies the complexity. I would be interested how the performance changes with array length, since there is a cost to the loop setup as well as swapping the table into memory. Do you know if libm uses a similar technique to calculate exp?

About the paper: could I send you the full text by e-mail? I haven't found a free link.

libm has a similar scalar implementation; the code is under glibc-2.30/sysdeps/ieee754/dbl-64/e_exp.c. libm also has vector versions; the AVX512 source is under glibc-2.30/sysdeps/x86_64/fpu/multiarch/svml_d_exp8_core_avx512.S, and there is AVX2 and SSE code as well. The AVX512 code in libm is a similar implementation of the table-driven method.

@r-devulap (Member)

BTW, I see a 26% speed up in the logistic regression benchmark with this patch. The benchmark currently runs with float32, it can be easily updated to run with float64 where you will see this speed up.

class LogisticRegression(Benchmark):

@xiegengxin (Contributor, Author) commented Mar 2, 2020

BTW, I see a 26% speed up in the logistic regression benchmark with this patch. The benchmark currently runs with float32, it can be easily updated to run with float64 where you will see this speed up.

class LogisticRegression(Benchmark):

Using permute intrinsics instead of loops to fetch the lookup-table data improves the performance of the new code.

  1. 2*1024*1024 1-d array, executed 16 times

     function    avx512    scalar version (libm)
     time (s)    0.061     0.308

     The previous scalar-version result was 0.229 s, so the new code is 3.7x (vs 0.229 s) or 5x (vs 0.308 s) faster than the scalar version.

  2. Logistic regression benchmark

     function    avx512    scalar version (libm)
     time (s)    3.53      3.92

     The new code gives an 11% speedup in the logistic regression benchmark.
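The exp-heavy step of such a benchmark might look roughly like this. This is a hypothetical sketch of an asv-style class, not the actual bench_avx.LogisticRegression code; in the real suite the class subclasses Benchmark and is parameterized over dtypes, and all names and sizes below are mine:

```python
import numpy as np

class LogisticRegressionSketch:
    """Hypothetical stand-in for an asv benchmark class."""

    def setup(self, dtype=np.float64):
        rng = np.random.RandomState(42)
        self.X = rng.randn(1000, 10).astype(dtype)   # feature matrix
        self.w = rng.randn(10).astype(dtype)         # weight vector

    def time_sigmoid(self):
        # the np.exp call is the hot spot that the AVX512 float64 loop accelerates
        z = self.X @ self.w
        return 1.0 / (1.0 + np.exp(-z))
```

Running time_sigmoid with a float64 dtype exercises exactly the new vectorized loop, which is why the benchmark shows a speedup once a float64 case is added.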

@charris charris changed the title Avx512 implementation with intrinsic for float64 input np.exp() ENH: AVX512 implementation with intrinsic for float64 input np.exp() Mar 2, 2020
@charris charris changed the title ENH: AVX512 implementation with intrinsic for float64 input np.exp() MAINT: AVX512 implementation with intrinsic for float64 input np.exp() Mar 2, 2020
@xiegengxin xiegengxin force-pushed the avx512-exp-float64 branch from ce6c0fd to 3c8a0be on March 4, 2020 01:50
@xiegengxin (Contributor, Author)

I ran python setup.py build -j 4 install --prefix $HOME/.local to build this code on macosx-10.15, and it succeeded. I don't know why it fails in the CI check.

0x3C99D3E12DD8A18B,
};


Review comment (Member):

This code does not belong in a public header, and needs to be bracketed with

#if defined HAVE_ATTRIBUTE_TARGET_AVX512F_WITH_INTRINSICS && defined NPY_HAVE_SSE2_INTRINSICS

to avoid compiler warnings about unused variables.

Review comment (Contributor, Author):

Bracketing the data still causes unused warnings; moving it to a separate header file avoids them. Where should I add the new file?

Review comment (Member):

You should create a new header file under numpy/core/src/umath. What do you think of the name numpy/core/src/umath/npy_tables.h? It would be nice to move the other non-public constants from npy_math.h into the new file; I think all the "constants used in vector ..." can be safely moved.

Review comment (Contributor, Author):

The lookup table and the "constants used in vector ..." have been moved into numpy/core/src/umath/npy_simd_data.h.

@xiegengxin (Contributor, Author) commented Mar 17, 2020

For the first question, I run python runtests.py --benchmark-compare HEAD, but there print 'runtests.py: error: unrecognized arguments: --benchmark-compare'.

sorry, python runtests.py --bench-compare HEAD

I have added a float64 case to LogisticRegression. You can see the report by running python runtests.py --bench-compare HEAD or python runtests.py --bench bench_avx.LogisticRegression. bench_avx.AVX_UFunc can also test the performance.

  • LogisticRegression

     dtype            avx512        scalar version (libm)
     numpy.float64    4.27±0.01s    4.69±0.01s

  • AVX_UFunc

     stride    avx512         scalar version (libm)
     1         18.6±0.01μs    110±0.05μs
     2         18.8±0.03μs    109±0.8μs
     4         18.8±0.03μs    110±0.1μs

When asv creates the virtual environment, there is an error about the virtualenv option --no-site-packages; virtualenv 20.0.10 doesn't have this option.

@Qiyu8 (Member) commented Mar 17, 2020

When asv creates the virtual environment, there is an error about the virtualenv option --no-site-packages; virtualenv 20.0.10 doesn't have this option.

My virtualenv version is 16.7.9 and works fine; 20.0.10 seems unstable or missing components.

@rossbar (Contributor) commented Mar 17, 2020

@Qiyu8 correct, this is a known issue with asv and virtualenv>=20.0. The workaround is to pin virtualenv==16.7.9.

@hameerabbasi (Contributor)

python setup.py install succeeds for me.

$ clang --version
Apple clang version 11.0.3 (clang-1103.0.32.29)
Target: x86_64-apple-darwin19.4.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

@mattip (Member) commented Apr 2, 2020

@xiegengxin could you make those changes?

  • put in some macros to limit this optimization to appropriate versions of clang
  • change the azure script to install a better compiler

It would be really nice if we could somehow print out which loops are in play, but I guess that is something for a different PR.

@WarrenWeckesser (Member)

There have been a few questions about how the known values for the tests were generated:

@mattip asked

How did you generate the "correct" answers for the accuracy tests?

and @r-devulap asked

How was the reference output computed?

@xiegengxin replied

The output data in accuracy test is generated by libm.

@matthew-brett asked

What are the ULP errors of the libm implementation? Can we do better with a mpmath or similar?

So, for what it's worth, I wrote a script to check the reference values with results computed using mpmath. The "correct" value is computed with mpmath using 40 digits of precision and then cast back to double precision float.

import numpy as np
import mpmath

# 40 significant digits: far more precision than float64 needs
mpmath.mp.dps = 40

# each row of the file: dtype, hex input, hex expected output, ulp bound
data = np.loadtxt('umath-validation-set-exp', delimiter=',',
                  dtype=str, skiprows=1)
# strip the "0x" prefixes and keep only the float64 rows
values = [[int(inp[2:], 16), int(out[2:], 16)]
          for typ, inp, out, ulp in data if typ == 'np.float64']
arr = np.array(values, dtype=np.uint64)
fltvals = arr.view(np.float64)     # reinterpret the bit patterns as doubles

for k, (x, y) in enumerate(fltvals):
    # round the 40-digit result back to the nearest double
    mpy = float(mpmath.exp(x))
    if y != mpy:
        print(f"index of float64 values: {k}")
        print(f"x input from file:         0x{values[k][0]:016x}")
        print(f"expected exp(x) from file: 0x{values[k][1]:016x}")
        u = np.array([mpy])
        print(f"mpmath result:             0x{u.view(np.uint64)[0]:016x}")

It finds one value computed with mpmath that doesn't agree with the expected value in the input file.

index of float64 values: 63
x input from file:         0xc086234f00000000
expected exp(x) from file: 0x000fba54632dddbf
mpmath result:             0x000fba54632dddc0

Does the method of using mpmath to compute the expected value look correct? Should 0x000fba54632dddc0 be the reference value in that case?

@mattip (Member) commented Apr 2, 2020

That value is off by 1 ULP, right? So maybe rounding differences

@WarrenWeckesser (Member)

Right, and I'm not sure the one computed via mpmath is necessarily a better "true" value, so perhaps not worth changing. Mainly the mpmath code verifies that the reference values in the PR are good, with verification independent of libm.

@mattip (Member) commented Apr 2, 2020

That script looks good, could you make a PR to add it as a test with a +-1 ULP allowance?
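A ±1 ULP comparison can be done with the usual bit-pattern trick: for same-signed finite doubles, the IEEE-754 bit patterns viewed as integers are ordered, so their difference counts representable steps. A minimal sketch (the helper name is mine, and it is limited to positive finite values, which covers the exp reference data):

```python
import numpy as np

def ulp_diff(a, b):
    """ULP distance between two positive finite float64 values."""
    # view the two doubles' bit patterns as signed 64-bit integers
    bits = np.array([a, b], dtype=np.float64).view(np.int64)
    return int(abs(bits[0] - bits[1]))

# the value found above differs from the mpmath result by exactly 1 ulp
pair = np.array([0x000fba54632dddbf, 0x000fba54632dddc0],
                dtype=np.uint64).view(np.float64)
assert ulp_diff(pair[0], pair[1]) == 1
```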

@mattip (Member) commented Apr 2, 2020

... and it would require installing mpmath. Maybe only run the test if the import succeeds?

@hameerabbasi (Contributor)

Maybe only run the test if import succeeds?

What about hardcoding the results from mpmath?

@mattip (Member) commented Apr 2, 2020

Testing the values would ensure that the underlying platform's conversion chain (input bytes -> platform float/complex -> the mpmath function -> platform float/complex -> output bytes) is consistent with the values in the file, both via the libc math functions and via mpmath.

@xiegengxin (Contributor, Author)

Updated the test cases.

Does the method of using mpmath to compute the expected value look correct? Should 0x000fba54632dddc0 be the reference value in that case?

Thanks. @WarrenWeckesser

@xiegengxin (Contributor, Author)

@mattip About the azure yml file:

  1. Should I create a new PR to update the macOS version?
  2. Where can I test my change?

@mattip (Member) commented Apr 3, 2020

Cool, macOS is now passing. I made PR #15903 to update xcode in CI.

@r-devulap any thoughts?

@xiegengxin this needs a release note numpy/doc/release/upcoming_changes/15648.improvement.rst. It should state the benchmark improvement (something like 5-7x if I read your numbers correctly?) and the overall executable size increase on linux64.

@mattip mattip added the "56 - Needs Release Note (needs an entry in doc/release/upcoming_changes)" label on Apr 3, 2020
@r-devulap (Member)

@mattip: looks good to me. Benchmark numbers look great on SkylakeX too:

106±0.0us       17.5±0.3us     0.16  bench_avx.AVX_UFunc.time_ufunc('exp', 4, 'd')
107±1us          17.1±0.4us     0.16  bench_avx.AVX_UFunc.time_ufunc('exp', 2, 'd')
106±4us          16.2±0.3us     0.15  bench_avx.AVX_UFunc.time_ufunc('exp', 1, 'd')

If it isn't there already, I would recommend adding one more test to ensure overflow flag is raised for large floating points. Something like this:

assert_raises(FloatingPointError, np.exp, np.float32(100.))
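Such a test relies on the vectorized loop setting the FPU overflow flag; under np.errstate(over='raise') the flag surfaces as a FloatingPointError. A minimal sketch for the float64 case (the helper name is mine; the threshold comment assumes IEEE double, where exp overflows above roughly 709.78):

```python
import numpy as np

def exp_overflows(x):
    """Return True if np.exp(x) raises the floating-point overflow flag."""
    with np.errstate(over='raise'):
        try:
            np.exp(np.float64(x))
        except FloatingPointError:
            return True
    return False

# exp(1000) exceeds DBL_MAX, exp(100) is representable in float64
assert exp_overflows(1000.0)
assert not exp_overflows(100.0)
```

Note that r-devulap's one-liner uses np.float32, where the overflow threshold is much lower (around 88.7), so np.float32(100.) already overflows.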

Use AVX512 intrinsic to implement ``np.exp`` when input is ``np.float64``
-------------------------------------------------------------------------
Use AVX512 intrinsic to implement ``np.exp`` when input is ``np.float64``,
which can imporve the performance of ``np.exp`` with ``np.float64`` input 5-7x
Review comment (Member), suggested change:

- which can imporve the performance of ``np.exp`` with ``np.float64`` input 5-7x
+ which can improve the performance of ``np.exp`` with ``np.float64`` input 5-7x

@mattip (Member) left a comment:

Looks good to me.

@xiegengxin (Contributor, Author)

Could anybody help me with a +1?

@mattip mattip merged commit 96c21f0 into numpy:master Apr 23, 2020
@mattip (Member) commented Apr 23, 2020

Thanks @xiegengxin

9 participants