Skip to content

ENH: Convert tanh from C universal intrinsics to C++ using Highway#25934

Merged
Mousius merged 10 commits into
numpy:mainfrom
sterrettm2:tanh-hwy
Dec 4, 2024
Merged

ENH: Convert tanh from C universal intrinsics to C++ using Highway#25934
Mousius merged 10 commits into
numpy:mainfrom
sterrettm2:tanh-hwy

Conversation

@sterrettm2

@sterrettm2 sterrettm2 commented Mar 4, 2024

Copy link
Copy Markdown
Contributor

This is another patch demonstrating how the current NumPy SIMD code could be converted to Highway, similar to #25781. All tests pass on my local AVX512 and AVX2 machine.

On x86, AVX2 has a major performance improvement, while AVX512 shows a mix of small regressions and speedups.

AVX512

AVX512 stays within 10% of baseline, most being within 5%.

| Change   | Before [15691c33] <main>   | After [ddbf7be8] <tanh-hwy>   |   Ratio | Benchmark (Parameter)                                                    |
|----------|----------------------------|-------------------------------|---------|--------------------------------------------------------------------------|
| +        | 14.4±0.1μs                 | 15.5±0.01μs                   |    1.08 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tanh'>, 4, 1, 'f')        |
| +        | 46.9±0.01μs                | 49.5±2μs                      |    1.06 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tanh'>, 4, 1, 'd')        |
| +        | 8.22±0.1μs                 | 8.44±0.05μs                   |    1.03 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tanh'>, 1, 2, 'f')        |
| +        | 893±0.1μs                  | 899±1μs                       |    1.01 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tanh'>, 4, 1, 'd') |
| +        | 894±0.7μs                  | 904±0.9μs                     |    1.01 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tanh'>, 4, 2, 'd') |
| +        | 882±0.07μs                 | 883±0.5μs                     |    1    | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tanh'>, 1, 2, 'd') |
| -        | 178±0.04μs                 | 178±0.2μs                     |    1    | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tanh'>, 4, 2, 'f') |
| -        | 17.2±0.01μs                | 17.1±0.01μs                   |    0.99 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tanh'>, 4, 2, 'f')        |
| -        | 172±0.2μs                  | 171±0.04μs                    |    0.99 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tanh'>, 1, 2, 'f') |
| -        | 167±3μs                    | 165±0.03μs                    |    0.99 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tanh'>, 4, 2, 'e') |
| -        | 4.90±0.01μs                | 4.83±0μs                      |    0.98 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tanh'>, 1, 1, 'f')        |
| -        | 169±3μs                    | 165±0.06μs                    |    0.98 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tanh'>, 4, 1, 'e') |
| -        | 176±6μs                    | 169±0.02μs                    |    0.96 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tanh'>, 1, 2, 'e') |
| -        | 19.1±1μs                   | 17.9±0.01μs                   |    0.94 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tanh'>, 1, 1, 'd')        |

AVX2

AVX2 shows a major performance improvement; thanks @jan-wassenberg for the idea to avoid slow gathers with the LUTs.

| Change   | Before [15691c33] <main>   | After [0267cd21] <tanh-hwy>   |   Ratio | Benchmark (Parameter)                                                    |
|----------|----------------------------|-------------------------------|---------|--------------------------------------------------------------------------|
| -        | 1.83±0.01ms                | 1.50±0ms                      |    0.82 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tanh'>, 4, 2, 'd') |
| -        | 429±0.3μs                  | 351±0.8μs                     |    0.82 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tanh'>, 4, 2, 'f') |
| -        | 410±0.2μs                  | 333±2μs                       |    0.81 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tanh'>, 1, 1, 'f') |
| -        | 415±0.2μs                  | 337±0.7μs                     |    0.81 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tanh'>, 1, 2, 'f') |
| -        | 1.84±0.01ms                | 1.49±0ms                      |    0.81 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tanh'>, 4, 1, 'd') |
| -        | 430±0.3μs                  | 350±0.2μs                     |    0.81 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tanh'>, 4, 1, 'f') |
| -        | 1.81±0.01ms                | 1.44±0ms                      |    0.8  | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tanh'>, 1, 1, 'd') |
| -        | 1.81±0.01ms                | 1.44±0ms                      |    0.8  | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tanh'>, 1, 2, 'd') |
| -        | 3.46±0ms                   | 2.22±0ms                      |    0.64 | bench_ufunc.UFunc.time_ufunc_types('tanh')                               |
| -        | 117±0.01μs                 | 43.4±0.04μs                   |    0.37 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tanh'>, 4, 2, 'f')        |
| -        | 115±0.04μs                 | 41.2±0.01μs                   |    0.36 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tanh'>, 4, 1, 'f')        |
| -        | 498±8μs                    | 149±0.08μs                    |    0.3  | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tanh'>, 4, 1, 'd')        |
| -        | 499±8μs                    | 150±0.5μs                     |    0.3  | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tanh'>, 4, 2, 'd')        |
| -        | 475±8μs                    | 110±5μs                       |    0.23 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tanh'>, 1, 2, 'd')        |
| -        | 102±0.02μs                 | 23.1±0.02μs                   |    0.23 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tanh'>, 1, 2, 'f')        |
| -        | 475±9μs                    | 103±0.05μs                    |    0.22 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tanh'>, 1, 1, 'd')        |
| -        | 99.7±0.02μs                | 19.7±0.01μs                   |    0.2  | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tanh'>, 1, 1, 'f')        |

@Mousius

Mousius commented Mar 5, 2024

Copy link
Copy Markdown
Member

This is another patch demonstrating how the current NumPy SIMD code could be converted to Highway, similar to #25781. All tests pass on my local AVX512 and AVX2 machine.

AVX512

AVX512 stays within 10% of baseline, most being within 5%.

AVX2

AVX2 has almost exact performance parity.

I'd suggest that the 1-10% variation could also be just due to it being low numbers of μs and the code being slightly different. We've had other similar things happen just by introducing code which leads to a few μs differences in unrelated code.

I'll run this on some AArch64, and see what we get, this is very exciting though and I look forward to reviewing this further 😸

@Mousius

Mousius commented Mar 5, 2024

Copy link
Copy Markdown
Member

cc @jan-wassenberg because the review box doesn't seem to want me to request review that way 🙀

@jan-wassenberg jan-wassenberg left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice :) Some suggestions:

Comment thread meson_cpu/x86/meson.build Outdated
)
AVX2 = mod_features.new(
'AVX2', 25, implies: F16C, args: '-mavx2',
'AVX2', 25, implies: F16C, args: '-march=skylake',

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Skylake is considerably more than AVX2. To enable HWY_AVX2, -maes -march=haswell are sufficient.

using opmask_t = hn::Mask<decltype(f32)>;

struct hwy_f32x2 {
vec_f32 val[2];

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is unfortunately not portable to SVE and RISC-V. We can either use Create2 to return a native tuple (caveat: this requires relatively recent compilers, check HWY_HAVE_TUPLE), or we can just use individual in/out function args to pass the two members.

if constexpr(hn::Lanes(f64) == 8){
const vec_f64 lut0 = hn::Load(f64, lut);
const vec_f64 lut1 = hn::Load(f64, lut + 8);
return hn::TwoTablesLookupLanes(f64, lut0, lut1, hn::IndicesFromVec(f64, idx));

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might actually be worthwhile to implement an AVX2 version with 4 table lookups and blending, because gather is quite slow again.

idxs = npyv_min_s32(idxs, npyv_setall_s32(0x3e00000));
npyv_u32 idx = npyv_shri_u32(npyv_reinterpret_u32_s32(idxs), 21);
auto special_m = hn::Le(ndnan, hn::Set(s32, 0x7f000000));
auto nnan_m = hn::Not(hn::IsNaN(x));

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can remove the Not by swapping the arg order in the IfThenElse where this is used.

auto special_m = hn::Le(ndnan, hn::Set(s32, 0x7f000000));
auto nnan_m = hn::Not(hn::IsNaN(x));
vec_s32 idxs = hn::Sub(ndnan, hn::Set(s32, 0x3d400000));
idxs = hn::Max(idxs, hn::Set(s32, 0)); // TODO probably faster zero method

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you can either use hn::Max(idxs, hn::Zero(s32)) or even hn::ZeroIfNegative(idxs).

@jan-wassenberg

Copy link
Copy Markdown
Contributor

Oh nice, great to see the AVX2 speedup, congrats!

#include "fast_loop_macros.h"


#if NPY_SIMD_FMA3 && (NPY_SIMD_F32 || NPY_SIMD_F64) // requires native FMA

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Highway always has MulAdd, but we should indeed check #if HWY_HAVE_FLOAT64 here. That is only false on Armv7. We can skip the previous checks, I think.

@rgommers rgommers added the component: SIMD Issues in SIMD (fast instruction sets) code or machinery label Mar 12, 2024
@Rohanjames1997

Copy link
Copy Markdown

Thanks for this PR!
Do you happen to know the perf impact on aarch64?

@Mousius

Mousius commented Nov 19, 2024

Copy link
Copy Markdown
Member

Thanks for the effort here @sterrettm2. I benchmarked this and got some troubling results:

| Change   | Before [f3cca787] <main>   | After [db31dce1]    |   Ratio | Benchmark (Parameter)                                                             |
|----------|----------------------------|---------------------|---------|-----------------------------------------------------------------------------------|
| +        | 29.6±0.01μs                | 87.0±0.05μs         |    2.94 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tanh'>, 4, 1, 'f')                 |
| +        | 29.5±0.02μs                | 86.9±0.04μs         |    2.94 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tanh'>, 4, 1, 'f')          |
| +        | 37.4±0.05μs                | 106±1μs             |    2.84 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tanh'>, 4, 2, 'f')                 |
| +        | 37.3±0.07μs                | 106±1μs             |    2.83 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tanh'>, 4, 2, 'f')          |
| +        | 88.1±0.05μs                | 114±0.4μs           |    1.3  | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tanh'>, 4, 2, 'd')          |
| +        | 88.4±0.07μs                | 114±0.5μs           |    1.29 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tanh'>, 4, 2, 'd')                 |
| +        | 84.0±0.01μs                | 103±0.2μs           |    1.23 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tanh'>, 4, 1, 'd')                 |
| +        | 84.1±0.04μs                | 103±0.1μs           |    1.23 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tanh'>, 4, 1, 'd')          |

Which seems to indicate that gather/scatter is performing sub-optimally. I'll see if I can figure out why 🤔

@Mousius

Mousius commented Nov 19, 2024

Copy link
Copy Markdown
Member

@jan-wassenberg this looks like we might have a regression in Highway, rolling back to google/highway@f525867 results in the gather performance improving whereas with current HEAD I'm seeing the 3x slower loads.

@jan-wassenberg

Copy link
Copy Markdown
Contributor

Interesting. If I understand correctly, the Apr12 version is faster than HEAD? Changes since then to the gather functions seem to be only google/highway@e20d36d from Apr15, is that the one that makes the difference in speed?

@Mousius

Mousius commented Nov 19, 2024

Copy link
Copy Markdown
Member

@jan-wassenberg see google/highway#2382

@Mousius Mousius left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few suggestions 😸

Comment thread numpy/_core/src/umath/loops_hyperbolic.dispatch.cpp.src Outdated
Comment thread numpy/_core/src/umath/loops_hyperbolic.dispatch.cpp.src Outdated
Comment thread numpy/_core/src/umath/loops_hyperbolic.dispatch.cpp.src Outdated
// index linearly. In order to keep the same vertical calculation, we
// transpose the coef. into lanes. A 4x4 transpose is all that's
// supported so we require `npyv_nlanes_f32` == 4.
#if npyv_nlanes_f32 == 4

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we replace this with if constexpr(hn::Lanes(f32) == 4 && HWY_TARGET <= HWY_SSE4){ below?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As written, the HWY_TARGET condition means SSE4 or better. Lanes also isn't constexpr for SVE, so I think we'd have to check MaxLanes(f32) <= 4. Maybe we can drop the target check, because avoiding gather would be a win on any platform, right?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe on AArch64, the gather was still faster when I benchmarked it.

@sterrettm2 sterrettm2 Nov 22, 2024

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've made this change; I changed the target condition to HWY_TARGET >= HWY_SSE4 which I think is correct, so AVX2 and AVX512 don't use the transposed LUTs. It'd be great if someone could double check that that's correct.

One other thing, with the way I have the code currently ordered it might be possible that both lookup tables end up in the final object, since they are above the if constexpr. I'm inclined to assume the compiler should eliminate the unused one, but I haven't verified this.

@Mousius

Mousius commented Nov 20, 2024

Copy link
Copy Markdown
Member

Overall, this seems good; I'll try to bump the highway version once the fix has landed for AArch64

@sterrettm2

Copy link
Copy Markdown
Contributor Author

I've updated the version of highway, so the performance bug on AArch64 should be resolved. I've also done everything suggested in the review; the only thing is, it would be good if someone could check the condition I used for HWY_TARGET in that if constexpr.

@sterrettm2 sterrettm2 requested a review from Mousius November 28, 2024 00:41
#define TANH_TRANSPOSED_LUT
#endif // npyv_nlanes_f64 == 2
HWY_ATTR NPY_FINLINE vec_f64 lut_16_f64(const double * lut, vec_u64 idx){
if constexpr(hn::Lanes(f64) == 8){

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This if constexpr will only work #if !HWY_HAVE_SCALABLE. We can wrap them, up to the final } else, in an #if. Then for SVE, the compiler will only see the { GatherIndex } codepath, which is fine.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason is: Lanes is only constexpr on some targets, not SVE nor RVV.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or you can use MaxLanes as we do below.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is fine for now, as we don't use scalable vectors for this routine (yet).

@jan-wassenberg

Copy link
Copy Markdown
Contributor

I've updated the version of highway, so the performance bug on AArch64 should be resolved. I've also done everything suggested in the review; the only thing is, it would be good if someone could check the condition I used for HWY_TARGET in that if constexpr.

Nice. The HWY_TARGET condition looks good to me, we are selecting all 128-bit targets, and excluding half-length AVX2 or AVX-512.

@Mousius Mousius left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, unsure how to nudge CircleCI, maybe rebase @sterrettm2 ?

@sterrettm2

Copy link
Copy Markdown
Contributor Author

I've rebased it, and those weird Circle CI failures seem to have vanished.

@Mousius Mousius merged commit 1fcd5f8 into numpy:main Dec 4, 2024
@Mousius

Mousius commented Dec 4, 2024

Copy link
Copy Markdown
Member

Thanks @sterrettm2!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

01 - Enhancement component: SIMD Issues in SIMD (fast instruction sets) code or machinery

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants