Skip to content

BUG: fix win32 np.clip slowness#20134

Merged
charris merged 1 commit intonumpy:mainfrom
pijyoi:fix_slow_win32_clip
Oct 21, 2021
Merged

BUG: fix win32 np.clip slowness#20134
charris merged 1 commit intonumpy:mainfrom
pijyoi:fix_slow_win32_clip

Conversation

@pijyoi
Copy link
Contributor

@pijyoi pijyoi commented Oct 19, 2021

The use of the macro _NPY_CLIP results in multiple re-evaluations of the
input arguments. Thus for floating point types, the check of NaNs is
performed multiple times.
This manifests itself as a slowness on Win32 builds. See #18673.

The use of the macro _NPY_CLIP results in multiple re-evaluations of the
input arguments. Thus for floating point types, the check of NaNs is
performed multiple times.
This manifests itself as a slowness on Win32 builds. See numpy#18673.
@seberg
Copy link
Member

seberg commented Oct 19, 2021

The code here looks completely equivalent to the old one? What do you mean by multiple re-evaluations? If there is different on windows, it seems like weird MSVC optimization fluctuations?!

A speed problem on windows is usually (or was, it might be fixed), is that MSVC is unable to spit out reasonable machine code for isnan, isgreater, etc. functions that are defined in math.h. IMO, that is a clear MSVC bug. I have seen a fix for C++ isnan in MSVC once, but have no idea if they ever bothered to fix the rest (or even C!)

@seberg
Copy link
Member

seberg commented Oct 19, 2021

That said, if such a small manual macro expansion makes a big difference on windows, I guess we might as well put it in. Could you try to run the benchmarks? Hopefully it would work to go into the benchmarks folder in NumPy and run something like:

python ../runtests.py --bench-compare main--bench bench

@pijyoi
Copy link
Contributor Author

pijyoi commented Oct 19, 2021

The code here looks completely equivalent to the old one? What do you mean by multiple re-evaluations? If there is different on windows, it seems like weird MSVC optimization fluctuations?!

Run the following code through the C pre-processor:

#define _NPY_CLIP(x, min, max) _NPY_FLOAT_MIN(_NPY_FLOAT_MAX((x), (min)), (max))
#define _NPY_FLOAT_MIN(a, b) (npy_isnan(a) ? (a) : PyArray_MIN(a, b))
#define _NPY_FLOAT_MAX(a, b) (npy_isnan(a) ? (a) : PyArray_MAX(a, b))

float func1(float x, float min, float max)
{
    return _NPY_CLIP(x, min, max);
}

float func2(float x, float min, float max)
{
    float t = x;
    t = _NPY_FLOAT_MAX(t, min);
    t = _NPY_FLOAT_MIN(t, max);
    return t;
}

You will get:

float func1(float x, float min, float max)
{
    return (npy_isnan((npy_isnan((x)) ? ((x)) : PyArray_MAX((x), (min)))) ? ((npy_isnan((x)) ? ((x)) : PyArray_MAX((x), (min)))) : PyArray_MIN((npy_isnan((x)) ? ((x)) : PyArray_MAX((x), (min))), (max)));
}

float func2(float x, float min, float max)
{
    float t = x;
    t = (npy_isnan(t) ? (t) : PyArray_MAX(t, min));
    t = (npy_isnan(t) ? (t) : PyArray_MIN(t, max));
    return t;
}
npy_isnan PyArray_MAX PyArray_MIN
func1 4 3 1
func2 2 1 1

func1 has expanded to extraneous calls of npy_isnan and PyArray_MAX.
Whether MSVC optimizer is at fault here, func2's formulation makes the job easier for the compiler (MSVC or otherwise).
If it were possible, _NPY_CLIP should have been defined as an inline function, then manual macro expansion wouldn't have been necessary.

@pijyoi
Copy link
Contributor Author

pijyoi commented Oct 19, 2021

If I run the test script in #18673 (comment)
(The allqt venv is running 1.21.2 installed from pypi, while the pyqt6 venv is running this PR)

PS C:\work\python> C:\work\venv\allqt\Scripts\Activate.ps1
(allqt) PS C:\work\python> python .\time_numpy_clip.py
0.1126141
0.03663050000000001
(allqt) PS C:\work\python> deactivate
PS C:\work\python> C:\work\venv\pyqt6\Scripts\Activate.ps1
(pyqt6) PS C:\work\python> python .\time_numpy_clip.py
0.03107929999999992
0.03300119999999995

i.e. np.clip(a, mn, mx) on Windows no longer runs slower than np.clip(np.clip(a, None, mx), mn, None)) on this PR.

Could you try to run the benchmarks? Hopefully it would work to go into the benchmarks folder in NumPy and run something like: python ../runtests.py --bench-compare main--bench bench

Sorry, I am not familiar with the numpy development workflow. Running the suggested line verbatim gave me the following error.

(pyqt6) PS C:\work\github\numpy\benchmarks> python ../runtests.py --bench-compare main--bench bench
fatal: ambiguous argument 'main--bench': unknown revision or path not in the working tree.
Use '--' to separate paths from revisions, like this:
'git <command> [<revision>...] -- [<file>...]'
Traceback (most recent call last):
  File "../runtests.py", line 695, in <module>
    main(argv=sys.argv[1:])
  File "../runtests.py", line 338, in main
    out = subprocess.check_output(['git', 'rev-parse', commit_a])
  File "C:\Users\XXX\AppData\Local\Programs\Python\Python38\lib\subprocess.py", line 415, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "C:\Users\XXX\AppData\Local\Programs\Python\Python38\lib\subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['git', 'rev-parse', 'main--bench']' returned non-zero exit status 128.

Using the following gave the same error.

python ../runtests.py --bench-compare main--fix_slow_win32_clip bench

@seberg
Copy link
Member

seberg commented Oct 19, 2021

Sorry, something went wrong editing that line:

python ../runtests.py --bench-compare main --bench bench_itemselection

EDIT: ARRRrrg, I had grepped for clip, but that benchmark doesn't include clip at all anyway...

@seberg
Copy link
Member

seberg commented Oct 19, 2021

But yeah, seems reasonable. I missed that it gets copied unnecessarily for the isnan call. MSVC has problems with optimizing isnan (it is always a function call when it should be a single CPU instruction), and that would probably also make it almost impossible to do any simplification of the statement.

Copy link
Member

@seberg seberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, nevermind the discussion. The change looks good to me, and while the macro was there for a reason, the snippet is so small that it doesn't matter IMO.
I guess you could put that multi-line statement into the macro itself but I am not even sure it would be nicer (at least as long as all the paths are right next to each other).

If you are up for it, it would be great to add a benchmark to avoid a performance regression (and it would be generally nice to have). Probably as part of bench_ufunc.py. But it doesn't seem like we have a whole lot similar ones yet.

@seberg
Copy link
Member

seberg commented Oct 20, 2021

Actually, gh-19713 will conflict with this, and should also fix it. So it may be we just skip this and go straight there. Thanks though!

@pijyoi
Copy link
Contributor Author

pijyoi commented Oct 21, 2021

Ah yes, C++ should help. Guess this PR came a few months too late... I will wait for that PR to get merged.

I came across this blog which says that isnan uses intrinsics since VS2019 16.2
https://devblogs.microsoft.com/cppblog/improving-the-performance-of-standard-library-functions/
So I recompiled the main branch and re-ran the #18673 (comment) test script and I got

0.024094700000000024
0.03238609999999997

i.e. When both compiled with VS2019, main branch is faster than this PR. (24ms vs 31ms)
Which MSVC was the pypi wheel compiled with?

@charris
Copy link
Member

charris commented Oct 21, 2021

Which MSVC was the pypi wheel compiled with?

With vs2017 since 1.19.x, previously vs2015.

@charris charris merged commit 1f0cd59 into numpy:main Oct 21, 2021
@charris
Copy link
Member

charris commented Oct 21, 2021

Thanks @pijyoi .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants