
refactor code to replace np.clip#1641

Merged
j9ac9k merged 1 commit into pyqtgraph:master from pijyoi:clip_refactor
Mar 20, 2021

Conversation

@pijyoi
Contributor

@pijyoi pijyoi commented Mar 20, 2021

This PR refactors the np.clip regression workaround in #1632 into pg.clip_array.
It documents why np.core.umath.clip is not used (info from #1617)

@j9ac9k
Member

j9ac9k commented Mar 20, 2021

@pijyoi Thanks so much for your research into this; and for tracking everything down. I'll merge this, @NilsNemitz as this impacts one of your PRs, I would consider rebasing your branch with master (after the merge goes through of course).

@j9ac9k j9ac9k merged commit 8e32eeb into pyqtgraph:master Mar 20, 2021
@pijyoi
Contributor Author

pijyoi commented Mar 20, 2021

It documents why np.core.umath.clip is not used (info from #1617)

I made a mistake in the comments. It's not "slightly better" but "significantly better". I got mixed up with the fits_int32 code from #1617.

  • umath.clip on GCC is slightly better than fits_int32
  • umath.clip on GCC is significantly faster than using umath.{maximum, minimum}

The main purpose of this PR was to refactor the code to avoid proliferation of the workaround.

@NilsNemitz
Contributor

NilsNemitz commented Mar 20, 2021

@pijyoi , thanks for the work digging into this. On my system, np.clip is indeed three times slower than the alternatives.

However, I think we can improve the code a little, even if it is mostly for style points.
Here is my alternative proposal:

def clip_array(arr, vmin, vmax, out=None):
    # replacement for np.clip due to regression in
    # performance since numpy 1.17
    # https://github.com/numpy/numpy/issues/14281

    if vmin is not None:
        arr = np.core.umath.maximum(arr, vmin, out=out)
        out = arr # do not allocate again if clipping vmax
    if vmax is not None:
        arr = np.core.umath.minimum(arr, vmax, out=out)

    # np.core.umath.clip performs slightly better than
    # the above on platforms compiled with GCC (e.g. Linux),
    # but worse for CLANG (e.g. macOS) and MSVC (Windows)
    return arr
  • We can let numpy allocate the copied array entirely in C land. The explicit call to np.empty_like() is not needed.
  • When out is not given, we want the first call to umath.maximum() to reallocate a new array, but the call to umath.minimum() can clip in place.
  • Since your code already treats None for either clipping point as "no clipping on that side", the logical behavior for vmin=None and vmax=None is to return the original array.

I think this version does all that with only minor tweaks to what you wrote.
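For readers following along, here is a self-contained sketch of how the proposal behaves. It uses np.maximum/np.minimum, which are the same ufunc objects the PR reaches as np.core.umath.{maximum, minimum}; the expected results simply match np.clip:

```python
import numpy as np

def clip_array(arr, vmin, vmax, out=None):
    # sketch of the proposal above; np.maximum/np.minimum are the same
    # ufuncs as np.core.umath.maximum/minimum
    if vmin is not None:
        arr = np.maximum(arr, vmin, out=out)
        out = arr  # reuse the first result so the vmax pass clips in place
    if vmax is not None:
        arr = np.minimum(arr, vmax, out=out)
    return arr

data = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(clip_array(data, -1.0, 1.0))  # matches np.clip(data, -1.0, 1.0)
print(clip_array(data, None, 1.0))  # one-sided: only the upper bound applies
```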

@NilsNemitz
Contributor

Here's some testing code.
I wanted to keep the clipping busy, instead of just clipping the same array (in-place) to the same values repeatedly, so I included a loop that gradually shrinks the clipping region. I couldn't figure out how to do that with timeit.

Inelegant code:

import numpy as np
import time
from pyqtgraph.pyqtgraph import functions as fn

def clip_alternative(arr, vmin, vmax, out=None):
    # replacement for np.clip due to regression in
    # performance since numpy 1.17
    # https://github.com/numpy/numpy/issues/14281

    if vmin is not None:
        arr = np.core.umath.maximum(arr, vmin, out=out)
        out = arr # do not allocate again if clipping vmax
    if vmax is not None:
        arr = np.core.umath.minimum(arr, vmax, out=out)

    # np.core.umath.clip performs slightly better than
    # the above on platforms compiled with GCC (e.g. Linux),
    # but worse for CLANG (e.g. macOS) and MSVC (Windows)
    return arr

print('Numpy version:',np.__version__,'\n')
original = {
    '10'    : np.random.random(    10 ),
    '100'   : np.random.random(   100 ),
    '1000'  : np.random.random(  1000 ),
    '10000' : np.random.random( 10000 )
}
for key in original:
    reference = np.clip( original[key], -0.5, 0.5 )
    
    realloc_clip_array = fn.clip_array( original[key], -0.5, 0.5)
    assert (realloc_clip_array == reference).all()

    realloc_alternative = clip_alternative( original[key], -0.5, 0.5)
    assert (realloc_alternative == reference).all()

    in_place_clip_array = original[key].copy()
    fn.clip_array( in_place_clip_array, -0.5, 0.5, out=in_place_clip_array )
    assert (in_place_clip_array == reference).all()

    in_place_alternative = original[key].copy()
    clip_alternative( in_place_alternative, -0.5, 0.5, out=in_place_alternative )
    assert (in_place_alternative == reference).all()
print('checked that all methods return the same results.')

for key in original:
    assert ( clip_alternative( original[key], None, None ) == original[key] ).all()
print('checked that clipping by None, None returns original array.')

print('\nTiming "copy-and-clip operation":')
for key in original:
    print('\n'+key+' array elements:')
    runs = 100_000

    t0 = time.perf_counter()
    for idx in range(runs):
        lim = 1.0 - (idx/runs) # gradually tighten the clipping range from 1.0 to 0.0
        res = np.clip( original[ key ], -lim, lim)
    t1 = time.perf_counter()
    us_per_loop = ((t1-t0)/runs)*1e6
    print('np.clip          took {:.2f} us/loop, measured over {:d} loops.'.format(us_per_loop, runs) )

    t0 = time.perf_counter()
    for idx in range(runs):
        lim = 1.0 - (idx/runs) # gradually tighten the clipping range from 1.0 to 0.0
        res = fn.clip_array( original[ key ], -lim, lim)
    t1 = time.perf_counter()
    us_per_loop = ((t1-t0)/runs)*1e6
    print('clip_array       took {:.2f} us/loop, measured over {:d} loops.'.format(us_per_loop, runs) )

    t0 = time.perf_counter()
    for idx in range(runs):
        lim = 1.0 - (idx/runs) # gradually tighten the clipping range from 1.0 to 0.0
        res = clip_alternative( original[ key ], -lim, lim)
    t1 = time.perf_counter()
    us_per_loop = ((t1-t0)/runs)*1e6
    print('alternative clip took {:.2f} us/loop, measured over {:d} loops.'.format(us_per_loop, runs) )


print('\nTiming "clip-in-place operation":')
for key in original:
    print('\n'+key+' array elements:')
    runs = 100_000

    arr = original[ key ].copy()
    t0 = time.perf_counter()
    for idx in range(runs):
        lim = 1.0 - (idx/runs) # gradually tighten the clipping range from 1.0 to 0.0
        np.clip( arr, -lim, lim, out=arr )
    t1 = time.perf_counter()
    us_per_loop = ((t1-t0)/runs)*1e6
    print('np.clip          took {:.2f} us/loop, measured over {:d} loops.'.format(us_per_loop, runs) )

    arr = original[ key ].copy()
    t0 = time.perf_counter()
    for idx in range(runs):
        lim = 1.0 - (idx/runs) # gradually tighten the clipping range from 1.0 to 0.0
        fn.clip_array( arr, -lim, lim, out=arr )
    t1 = time.perf_counter()
    us_per_loop = ((t1-t0)/runs)*1e6
    print('clip_array       took {:.2f} us/loop, measured over {:d} loops.'.format(us_per_loop, runs) )

    arr = original[ key ].copy()
    t0 = time.perf_counter()
    for idx in range(runs):
        lim = 1.0 - (idx/runs) # gradually tighten the clipping range from 1.0 to 0.0
        clip_alternative( arr, -lim, lim, out=arr )
    t1 = time.perf_counter()
    us_per_loop = ((t1-t0)/runs)*1e6
    print('alternative clip took {:.2f} us/loop, measured over {:d} loops.'.format(us_per_loop, runs) )
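For what it's worth, the shrinking-range idea can also be expressed with timeit by advancing the limit from inside the timed callable. This is just a sketch, not part of the posted benchmark:

```python
import timeit
import numpy as np

data = np.random.random(1000)
runs = 10_000
lims = iter(1.0 - idx / runs for idx in range(runs))

def step():
    # each call pulls the next, slightly tighter limit,
    # mirroring the manual shrinking-range loop above
    lim = next(lims)
    np.clip(data, -lim, lim)

t = timeit.timeit(step, number=runs)
print('np.clip took {:.2f} us/loop, measured over {:d} loops.'.format(t / runs * 1e6, runs))
```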

@NilsNemitz
Contributor

For the "copy and clip" operation, the alternative code seems to slightly outperform the current version by allocating a new array once rather than twice. But mostly, style points :)
For "clip in place" both take the same time.

Timing "copy-and-clip operation":

10 array elements:
np.clip          took 30.47 us/loop, measured over 100000 loops.
clip_array       took 8.29 us/loop, measured over 100000 loops.
alternative clip took 6.19 us/loop, measured over 100000 loops.

100 array elements:
np.clip          took 31.85 us/loop, measured over 100000 loops.
clip_array       took 9.57 us/loop, measured over 100000 loops.
alternative clip took 5.62 us/loop, measured over 100000 loops.

1000 array elements:
np.clip          took 52.09 us/loop, measured over 100000 loops.
clip_array       took 17.54 us/loop, measured over 100000 loops.
alternative clip took 14.70 us/loop, measured over 100000 loops.

10000 array elements:
np.clip          took 241.52 us/loop, measured over 100000 loops.
clip_array       took 83.69 us/loop, measured over 100000 loops.
alternative clip took 78.67 us/loop, measured over 100000 loops.

@pijyoi
Contributor Author

pijyoi commented Mar 20, 2021

Style aside, there are indeed two bugs (or omissions):

  1. if both vmin and vmax are None, then an uninitialized array is returned
  2. for numpy compiled with GCC, umath.clip is actually significantly faster than the combination of umath.maximum and umath.minimum. However, it will be slower for macOS (CLANG) and Windows (MSVC).

Where clipping performance particularly affects pyqtgraph is with largish images, on the order of 1e6 pixels. There, clipping is only one part of the pipeline, so it is not easy to quantify small improvements in the timing of just one stage. In addition, a speedup in one part of the pipeline may sometimes end up slowing down other parts!

slightly outperform the current version by allocating a new array once rather than twice.

I am not sure I understand this. Whether explicitly allocating an empty array or letting numpy allocate one by itself, there's only one allocation involved?

@NilsNemitz
Contributor

True, I was thinking about dropping the empty_like() part when I wrote that.
Without that part, there is a double allocation originally:

if vmin is not None:
    arr = np.core.umath.maximum(arr, vmin, out=out)
if vmax is not None:
    arr = np.core.umath.minimum(arr, vmax, out=out)

If out is None during umath.maximum, a new array arr is allocated.
If out is None during umath.minimum, a new array arr is allocated.
If out is None during both, two new arrays are allocated, although we only need one.

Your pre-allocation of out does avoid that.
(But we can also avoid it by setting out to the first allocated array)
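The difference between the two allocation patterns is easy to see with np.shares_memory. A sketch, again using np.maximum/np.minimum (the same ufuncs as the umath ones):

```python
import numpy as np

data = np.random.random(8)

# out=None on both calls: each ufunc allocates a fresh result array
tmp = np.maximum(data, 0.25)   # allocation 1
res = np.minimum(tmp, 0.75)    # allocation 2; tmp is now garbage
assert not np.shares_memory(tmp, res)

# feeding the first result back in as out: the second pass clips in place
tmp2 = np.maximum(data, 0.25)  # the only allocation
res2 = np.minimum(tmp2, 0.75, out=tmp2)
assert res2 is tmp2
```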

@NilsNemitz
Contributor

I can confirm the bad performance of umath.clip on my Windows, conda, numpy 1.20.1 system:

1000000 array elements:
np.clip          took 18954.27 us/loop, measured over 10000 loops.
umath.clip       took 18887.56 us/loop, measured over 10000 loops.
clip_array       took 8619.22 us/loop, measured over 10000 loops.
alternative clip took 8835.83 us/loop, measured over 10000 loops.

If you'd rather not strip the empty_like() call out of your code, I think the undefined return value can be fixed by returning arr instead of out.

@pijyoi
Contributor Author

pijyoi commented Mar 20, 2021

For vmin and vmax being None, I was thinking along the lines of:

  1. raise an exception (i.e. consider such a use-case as illegal)
  2. return a copy of arr (if out is None)

Consider:

data = np.arange(10)
clipped = clip_array(data, None, None, out=None)
data[:] = 0

Would it be surprising that clipped now contains zeros?

@NilsNemitz
Contributor

NilsNemitz commented Mar 20, 2021

I noticed that, but I was wondering how friendly we need to be for the None, None case.
If we find a way to do either 1 or 2 with the same amount of overhead, I prefer 2.

Well, how about this then?

def clip_array(arr, vmin, vmax, out=None):
    if vmin is not None:
        arr = np.core.umath.maximum(arr, vmin, out=out)
        out = arr # do not allocate again if clipping vmax
    if vmax is not None:
        arr = np.core.umath.minimum(arr, vmax, out=out)
        out = arr
    if out is None:
        out = arr.copy()    
    return out
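A quick self-contained check that this version insulates the None, None result from later writes to the input (restated with np.maximum/np.minimum, the same ufuncs):

```python
import numpy as np

def clip_array(arr, vmin, vmax, out=None):
    if vmin is not None:
        arr = np.maximum(arr, vmin, out=out)
        out = arr
    if vmax is not None:
        arr = np.minimum(arr, vmax, out=out)
        out = arr
    if out is None:
        out = arr.copy()  # None, None: hand back a copy, not the original
    return out

data = np.arange(10)
clipped = clip_array(data, None, None)
data[:] = 0
print(clipped)  # still 0..9: the copy is insulated from the later write
```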

@j9ac9k
Member

j9ac9k commented Mar 20, 2021

for numpy compiled with GCC, umath.clip is actually significantly faster than the combination of umath.maximum and umath.minimum. However, it will be slower for macOS (CLANG) and Windows (MSVC).

I'm not sure how large the performance difference is here, but would it be worth having something like

import sys

def clip_array_nonlinux(arr, vmin, vmax, out=None):
    # current implementation (umath.maximum + umath.minimum)
    ...

def clip_array_linux(arr, vmin, vmax, out=None):
    # implementation using np.core.umath.clip
    ...

if sys.platform.startswith("linux"):
    clip_array = clip_array_linux
else:
    clip_array = clip_array_nonlinux

@pijyoi
Contributor Author

pijyoi commented Mar 20, 2021

On the same AMD machine running Windows and WSL2 Ubuntu:

data = np.random.random(1000000).astype(np.float32)
out = np.ones_like(data)
%timeit np.clip(data, 0.1, 0.9, out)
%timeit np.core.umath.clip(data, 0.1, 0.9, out)
%timeit np.core.umath.minimum(np.core.umath.maximum(data, 0.1, out), 0.9, out)

Windows (np 1.20.1): 14.7ms, 14.3ms, 6ms
Ubuntu (np 1.17.4) : 3.27ms, 3.08ms, 5.33ms
Ubuntu (np 1.20.1) : 1.09ms, 1.03ms, 5.26ms

Release notes for NumPy 1.20 say that there is wider use of SIMD for ufuncs.

@pijyoi
Contributor Author

pijyoi commented Mar 20, 2021

np.clip(np.arange(10), None, None)

yields

ValueError: One of max or min must be given

So I don't think people expect a sensible result from passing in both vmin and vmax as None.
As it is, the current implementation returns an undefined result.
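The np.clip behavior is easy to confirm:

```python
import numpy as np

try:
    np.clip(np.arange(10), None, None)
except ValueError as e:
    print(e)  # the "One of max or min must be given" error quoted above
```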

@j9ac9k
Member

j9ac9k commented Mar 20, 2021

This block generates the following results on various platforms:

data = np.random.random(1000000).astype(np.float32)
out = np.ones_like(data)
%timeit np.clip(data, 0.1, 0.9, out)
%timeit np.core.umath.clip(data, 0.1, 0.9, out)
%timeit np.core.umath.minimum(np.core.umath.maximum(data, 0.1, out), 0.9, out)

Putting some of the values into table format here; I ran some tests on macOS.

                     np.clip()   np.core.umath.clip()   umath.minimum(umath.maximum())
macOS  (np 1.20.0)   622 µs      613 µs                 2.84 ms
macOS  (np 1.17.5)   619 µs      582 µs                 2.75 ms
Win    (np 1.20.1)   14.7 ms     14.3 ms                6 ms
Ubuntu (np 1.17.4)   3.27 ms     3.08 ms                5.33 ms
Ubuntu (np 1.20.1)   1.09 ms     1.03 ms                5.26 ms

I can get some benchmarks on a Windows machine with an Intel processor (and I can likely do it with WSLv2 as well).

@j9ac9k
Member

j9ac9k commented Mar 20, 2021

I don't think it's worth trying to optimize for older versions of numpy. If pyqtgraph users really need the best performance at their disposal, they can update to the current version of numpy IMO.

@pijyoi
Contributor Author

pijyoi commented Mar 20, 2021

It's interesting that in #1617, we took measurements and concluded that on macOS it was better to use umath.minimum(umath.maximum(...)) instead of umath.clip within rescaleData_blocked().

It would seem that, taken in isolation, the workaround in clip_array should only apply to Windows.

@NilsNemitz
Contributor

Dear @pijyoi, I had a potentially elegant use for checking top and bottom clips separately in PlotDataItem, with the clipping function just "falling through" to the original data when neither bound requires clipping.

But I agree that sticking to the np.clip behavior is a good idea. Since checking for vmin is None and vmax is None at the beginning should essentially be free, maybe we can explicitly raise the same ValueError: One of max or min must be given? Might save some developer the surprise of having data come back random from an unintended None, None call.

I hope we will be able to strip out the helper function again once the current slowdown issue in np.clip is removed. My understanding was that there is some issue with checking old/new implementations of NaN?

@pijyoi
Contributor Author

pijyoi commented Mar 21, 2021

@NilsNemitz , I think I have gotten mixed up with 2 different slow-downs. This is my current understanding.

  1. np.clip since 1.17 has high startup overhead
    • for small arrays, this overhead is significant. hence the workaround of using umath.clip / umath.maximum + umath.minimum
    • for large arrays, this overhead is amortized and is negligible
  2. on Windows (MSVC), the numpy compiled code for floating point comparison operations turns out to be very slow
    • it turns out that the compiled code for clip runs slower than running the combination of maximum + minimum
    • this is not true for Linux and I believe also not true for macOS

Hence, the current implementation of clip_array is penalizing Linux and macOS performance.
I am collecting some more systematic timings in #1648 to verify this.

I will fix this together with the vmin=vmax=None issue.

I hope we will be able to strip out the helper function again once the current slowdown issue in np.clip is removed. My understanding was that there is some issue with checking old/new implementations of NaN?

The question is: are you able to perceive or measure the slowdown? Is np.clip the predominant cause of slowdown in your function? If not, sticking to np.clip should be just fine.

@NilsNemitz
Contributor

Dear @pijyoi ,

To start with the important bit:
I only replaced np.clip in PlotDataItem since it seemed like a free speed-up at the time.
It is used only in clipToView (which I generally don't use or test) and in my #1140 dynamic range limiting code. But that activates only for extreme data sets (zoomed in on features <1e-6 of the overall range), and I don't expect it to run in applications that require extreme speed.

**As such, I personally have no performance concerns about just using np.clip().**

I can confirm the np.clip problem in the tests I showed above, where I see a 4x to 5x slowdown for short data sets (slow startup) and a 3x slowdown for very long data sets (overall slowdown), relative to your min()/max() code.

I don't know if @j9ac9k's measurements can be compared across platforms, or if this involves different systems or emulation. If the numbers can be compared, it looks like the Linux/macOS systems are so much faster overall that the relative slowness of the min/max approach doesn't really matter: even the "slow" version outperforms the "fast" Windows version.

If true, it might be acceptable to optimize for the terribly broken Windows implementation, and hope to remove the workaround again as soon as the numpy developers figure out how to fix this.

What do you think?

@pijyoi
Contributor Author

pijyoi commented Mar 21, 2021

@NilsNemitz I have created #1649 to address the issues.
I realized that you were using clip_array() to clip scalars?

it looks like the Linux / Mac OSX systems are so much faster overall, that the relative slowness of the min // max approach doesn't really matter:

I uploaded some before and after timing records in #1648. For at least Linux Intel, the penalty is non-negligible.

@NilsNemitz
Contributor

Grrr... that is in the clipToView code. Sorry, my brain went from "clipping the data to the view" to assuming that the clips actually work on the data. But that just uses np.clip for min/max checks, which has never been ideal.

How urgently do we need to revert that? Is it just a little slower? The time sink in clipToView would be the np.searchsorted() call; even slow clipping should be marginal in comparison.

Do you have a favorite style for clipping scalars? min(max(x, vmin), vmax)?

In the code I wrote for the dynamic range limiter, I am indeed clipping arrays.

Thank you for the explanation and measurements of the slow-downs!

@pijyoi
Contributor Author

pijyoi commented Mar 22, 2021

How urgently do we need to revert that? Is it just a little slower? The time sink in clipToView would be the np.searchsorted() call, even slow clipping should be marginal in comparison.

Nothing urgent. An additional 2 lines were added to make it support scalars. However, a scalar input will get back a single-element array. https://github.com/pyqtgraph/pyqtgraph/blob/master/pyqtgraph/functions.py#L1017-L1018

The intention was only to support arrays. Ironically, it was the manual allocation in the original version that made it support scalars!

Do you have a favorite style for clipping scalars? min( max( x, vmin ), vmax )

Personally, I would wrap the above snippet in a python function because I have not trained myself to recognize it by sight.
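For example, a hypothetical helper along those lines (clip_scalar is an illustrative name here, not something introduced by this PR):

```python
def clip_scalar(val, vmin, vmax):
    """Clip a plain Python scalar to [vmin, vmax].

    Illustrative helper wrapping min(max(x, vmin), vmax) so the
    intent reads at a glance; not part of this PR.
    """
    return vmin if val < vmin else vmax if val > vmax else val

print(clip_scalar(5, 0, 3))   # 3
print(clip_scalar(-1, 0, 3))  # 0
print(clip_scalar(2, 0, 3))   # 2
```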
