BUG: fix win32 np.clip slowness by pijyoi · Pull Request #20134 · numpy/numpy

pijyoi · 2021-10-19T06:44:51Z

The use of the macro _NPY_CLIP results in multiple re-evaluations of the
input arguments. Thus for floating point types, the check of NaNs is
performed multiple times.
This manifests itself as a slowness on Win32 builds. See #18673.

The use of the macro _NPY_CLIP results in multiple re-evaluations of the input arguments. Thus for floating point types, the check of NaNs is performed multiple times. This manifests itself as a slowness on Win32 builds. See numpy#18673.

seberg · 2021-10-19T13:07:32Z

The code here looks completely equivalent to the old one? What do you mean by multiple re-evaluations? If there is different on windows, it seems like weird MSVC optimization fluctuations?!

A speed problem on windows is usually (or was, it might be fixed), is that MSVC is unable to spit out reasonable machine code for isnan, isgreater, etc. functions that are defined in math.h. IMO, that is a clear MSVC bug. I have seen a fix for C++ isnan in MSVC once, but have no idea if they ever bothered to fix the rest (or even C!)

seberg · 2021-10-19T13:14:53Z

That said, if such a small manual macro expansion makes a big difference on windows, I guess we might as well put it in. Could you try to run the benchmarks? Hopefully it would work to go into the benchmarks folder in NumPy and run something like:

python ../runtests.py --bench-compare main--bench bench

pijyoi · 2021-10-19T20:27:18Z

The code here looks completely equivalent to the old one? What do you mean by multiple re-evaluations? If there is different on windows, it seems like weird MSVC optimization fluctuations?!

Run the following code through the C pre-processor:

#define _NPY_CLIP(x, min, max) _NPY_FLOAT_MIN(_NPY_FLOAT_MAX((x), (min)), (max))
#define _NPY_FLOAT_MIN(a, b) (npy_isnan(a) ? (a) : PyArray_MIN(a, b))
#define _NPY_FLOAT_MAX(a, b) (npy_isnan(a) ? (a) : PyArray_MAX(a, b))

float func1(float x, float min, float max)
{
    return _NPY_CLIP(x, min, max);
}

float func2(float x, float min, float max)
{
    float t = x;
    t = _NPY_FLOAT_MAX(t, min);
    t = _NPY_FLOAT_MIN(t, max);
    return t;
}

You will get:

float func1(float x, float min, float max)
{
    return (npy_isnan((npy_isnan((x)) ? ((x)) : PyArray_MAX((x), (min)))) ? ((npy_isnan((x)) ? ((x)) : PyArray_MAX((x), (min)))) : PyArray_MIN((npy_isnan((x)) ? ((x)) : PyArray_MAX((x), (min))), (max)));
}

float func2(float x, float min, float max)
{
    float t = x;
    t = (npy_isnan(t) ? (t) : PyArray_MAX(t, min));
    t = (npy_isnan(t) ? (t) : PyArray_MIN(t, max));
    return t;
}

	npy_isnan	PyArray_MAX	PyArray_MIN
func1	4	3	1
func2	2	1	1

func1 has expanded to extraneous calls of npy_isnan and PyArray_MAX.
Whether MSVC optimizer is at fault here, func2's formulation makes the job easier for the compiler (MSVC or otherwise).
If it were possible, _NPY_CLIP should have been defined as an inline function, then manual macro expansion wouldn't have been necessary.

pijyoi · 2021-10-19T20:48:24Z

If I run the test script in #18673 (comment)
(The allqt venv is running 1.21.2 installed from pypi, while the pyqt6 venv is running this PR)

PS C:\work\python> C:\work\venv\allqt\Scripts\Activate.ps1
(allqt) PS C:\work\python> python .\time_numpy_clip.py
0.1126141
0.03663050000000001
(allqt) PS C:\work\python> deactivate
PS C:\work\python> C:\work\venv\pyqt6\Scripts\Activate.ps1
(pyqt6) PS C:\work\python> python .\time_numpy_clip.py
0.03107929999999992
0.03300119999999995

i.e. np.clip(a, mn, mx) on Windows no longer runs slower than np.clip(np.clip(a, None, mx), mn, None)) on this PR.

Could you try to run the benchmarks? Hopefully it would work to go into the benchmarks folder in NumPy and run something like: python ../runtests.py --bench-compare main--bench bench

Sorry, I am not familiar with the numpy development workflow. Running the suggested line verbatim gave me the following error.

(pyqt6) PS C:\work\github\numpy\benchmarks> python ../runtests.py --bench-compare main--bench bench
fatal: ambiguous argument 'main--bench': unknown revision or path not in the working tree.
Use '--' to separate paths from revisions, like this:
'git <command> [<revision>...] -- [<file>...]'
Traceback (most recent call last):
  File "../runtests.py", line 695, in <module>
    main(argv=sys.argv[1:])
  File "../runtests.py", line 338, in main
    out = subprocess.check_output(['git', 'rev-parse', commit_a])
  File "C:\Users\XXX\AppData\Local\Programs\Python\Python38\lib\subprocess.py", line 415, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "C:\Users\XXX\AppData\Local\Programs\Python\Python38\lib\subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['git', 'rev-parse', 'main--bench']' returned non-zero exit status 128.

Using the following gave the same error.

python ../runtests.py --bench-compare main--fix_slow_win32_clip bench

seberg · 2021-10-19T20:54:50Z

Sorry, something went wrong editing that line:

python ../runtests.py --bench-compare main --bench bench_itemselection

EDIT: ARRRrrg, I had grepped for clip, but that benchmark doesn't include clip at all anyway...

seberg · 2021-10-19T21:01:06Z

But yeah, seems reasonable. I missed that it gets copied unnecessarily for the isnan call. MSVC has problems with optimizing isnan (it is always a function call when it should be a single CPU instruction), and that would probably also make it almost impossible to do any simplification of the statement.

seberg

OK, nevermind the discussion. The change looks good to me, and while the macro was there for a reason, the snippet is so small that it doesn't matter IMO.
I guess you could put that multi-line statement into the macro itself but I am not even sure it would be nicer (at least as long as all the paths are right next to each other).

If you are up for it, it would be great to add a benchmark to avoid a performance regression (and it would be generally nice to have). Probably as part of bench_ufunc.py. But it doesn't seem like we have a whole lot similar ones yet.

seberg · 2021-10-20T20:32:53Z

Actually, gh-19713 will conflict with this, and should also fix it. So it may be we just skip this and go straight there. Thanks though!

pijyoi · 2021-10-21T00:45:38Z

Ah yes, C++ should help. Guess this PR came a few months too late... I will wait for that PR to get merged.

I came across this blog which says that isnan uses intrinsics since VS2019 16.2
https://devblogs.microsoft.com/cppblog/improving-the-performance-of-standard-library-functions/
So I recompiled the main branch and re-ran the #18673 (comment) test script and I got

0.024094700000000024
0.03238609999999997

i.e. When both compiled with VS2019, main branch is faster than this PR. (24ms vs 31ms)
Which MSVC was the pypi wheel compiled with?

charris · 2021-10-21T01:34:51Z

Which MSVC was the pypi wheel compiled with?

With vs2017 since 1.19.x, previously vs2015.

charris · 2021-10-21T18:09:59Z

Thanks @pijyoi .

BUG: fix win32 np.clip slowness

52b5935

The use of the macro _NPY_CLIP results in multiple re-evaluations of the input arguments. Thus for floating point types, the check of NaNs is performed multiple times. This manifests itself as a slowness on Win32 builds. See numpy#18673.

github-actions bot added the 00 - Bug label Oct 19, 2021

seberg approved these changes Oct 19, 2021

View reviewed changes

charris merged commit 1f0cd59 into numpy:main Oct 21, 2021

pijyoi mentioned this pull request Oct 22, 2021

np.clip(a, mn, mx) on Windows runs slower than np.clip(np.clip(a, None, mx), mn, None)) #18673

Closed

pijyoi deleted the fix_slow_win32_clip branch October 22, 2021 01:08

seberg mentioned this pull request Oct 27, 2021

np.clip slower in numpy 1.17 #14281

Closed

pijyoi mentioned this pull request Nov 27, 2021

win32 np.clip slowness fixed in numpy pyqtgraph/pyqtgraph#2125

Merged

pijyoi mentioned this pull request Apr 20, 2024

ENH: Speedup clip for floating point #26280

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

BUG: fix win32 np.clip slowness#20134

BUG: fix win32 np.clip slowness#20134
charris merged 1 commit intonumpy:mainfrom
pijyoi:fix_slow_win32_clip

pijyoi commented Oct 19, 2021

Uh oh!

seberg commented Oct 19, 2021

Uh oh!

seberg commented Oct 19, 2021

Uh oh!

pijyoi commented Oct 19, 2021

Uh oh!

pijyoi commented Oct 19, 2021

Uh oh!

seberg commented Oct 19, 2021 •

edited

Loading

Uh oh!

seberg commented Oct 19, 2021

Uh oh!

seberg left a comment

Uh oh!

seberg commented Oct 20, 2021

Uh oh!

pijyoi commented Oct 21, 2021

Uh oh!

charris commented Oct 21, 2021

Uh oh!

charris commented Oct 21, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

pijyoi commented Oct 19, 2021

Uh oh!

seberg commented Oct 19, 2021

Uh oh!

seberg commented Oct 19, 2021

Uh oh!

pijyoi commented Oct 19, 2021

Uh oh!

pijyoi commented Oct 19, 2021

Uh oh!

seberg commented Oct 19, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

seberg commented Oct 19, 2021

Uh oh!

seberg left a comment

Choose a reason for hiding this comment

Uh oh!

seberg commented Oct 20, 2021

Uh oh!

pijyoi commented Oct 21, 2021

Uh oh!

charris commented Oct 21, 2021

Uh oh!

charris commented Oct 21, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

seberg commented Oct 19, 2021 •

edited

Loading