refactor code to replace np.clip #1641
Conversation
|
@pijyoi Thanks so much for your research into this, and for tracking everything down. I'll merge this. @NilsNemitz, as this impacts one of your PRs, I would consider rebasing your branch with master (after the merge goes through, of course). |
I made a mistake in the comments. It's not "slightly better" but "significantly better". I got mixed up with the fits_int32 code from #1617.
The main purpose of this PR was to refactor the code to avoid proliferation of the workaround. |
|
@pijyoi, thanks for the work digging into this. On my system, np.clip is indeed three times slower than the alternatives. However, I think we can improve the code a little, even if it is mostly for style points.

```python
def clip_array(arr, vmin, vmax, out=None):
    # replacement for np.clip due to regression in
    # performance since numpy 1.17
    # https://github.com/numpy/numpy/issues/14281
    if vmin is not None:
        arr = np.core.umath.maximum(arr, vmin, out=out)
        out = arr  # do not allocate again if clipping vmax
    if vmax is not None:
        arr = np.core.umath.minimum(arr, vmax, out=out)
    # np.core.umath.clip performs slightly better than
    # the above on platforms compiled with GCC (e.g. Linux),
    # but worse for CLANG (e.g. macOS) and MSVC (Windows)
    return arr
```

I think this version does all that with only minor tweaks to what you wrote. |
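A quick sanity check of the idea behind that version, sketched here with the public ufuncs np.maximum/np.minimum (which behave the same as their np.core.umath counterparts for this purpose): the `out = arr` reassignment lets the second ufunc write into the buffer the first one just produced, so the copy-and-clip path allocates only once.

```python
import numpy as np

def clip_array(arr, vmin, vmax, out=None):
    if vmin is not None:
        arr = np.maximum(arr, vmin, out=out)
        out = arr  # reuse the result buffer for the second ufunc
    if vmax is not None:
        arr = np.minimum(arr, vmax, out=out)
    return arr

data = np.random.random(1000)
res = clip_array(data, -0.5, 0.5)
assert res is not data                        # input is untouched
assert (res == np.clip(data, -0.5, 0.5)).all()

buf = np.empty_like(data)
res2 = clip_array(data, -0.5, 0.5, out=buf)
assert res2 is buf                            # in-place into the caller's buffer
```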
|
Here's some testing code (inelegant, but it does the job):

```python
import numpy as np
import time
from pyqtgraph.pyqtgraph import functions as fn

def clip_alternative(arr, vmin, vmax, out=None):
    # replacement for np.clip due to regression in
    # performance since numpy 1.17
    # https://github.com/numpy/numpy/issues/14281
    if vmin is not None:
        arr = np.core.umath.maximum(arr, vmin, out=out)
        out = arr  # do not allocate again if clipping vmax
    if vmax is not None:
        arr = np.core.umath.minimum(arr, vmax, out=out)
    # np.core.umath.clip performs slightly better than
    # the above on platforms compiled with GCC (e.g. Linux),
    # but worse for CLANG (e.g. macOS) and MSVC (Windows)
    return arr

print('Numpy version:', np.__version__, '\n')
original = {
    '10'    : np.random.random(10),
    '100'   : np.random.random(100),
    '1000'  : np.random.random(1000),
    '10000' : np.random.random(10000)
}

for key in original:
    reference = np.clip(original[key], -0.5, 0.5)
    realloc_clip_array = fn.clip_array(original[key], -0.5, 0.5)
    assert (realloc_clip_array == reference).all()
    realloc_alternative = clip_alternative(original[key], -0.5, 0.5)
    assert (realloc_alternative == reference).all()
    in_place_clip_array = original[key].copy()
    fn.clip_array(in_place_clip_array, -0.5, 0.5, out=in_place_clip_array)
    assert (in_place_clip_array == reference).all()
    in_place_alternative = original[key].copy()
    clip_alternative(in_place_alternative, -0.5, 0.5, out=in_place_alternative)
    assert (in_place_alternative == reference).all()
print('checked that all methods return the same results.')

for key in original:
    assert (clip_alternative(original[key], None, None) == original[key]).all()
print('checked that clipping by None, None returns original array.')

print('\nTiming "copy-and-clip operation":')
for key in original:
    print('\n' + key + ' array elements:')
    runs = 100_000
    t0 = time.perf_counter()
    for idx in range(runs):
        lim = 1.0 - (idx/runs)  # gradually tighten the clipping range from 1.0 to 0.0
        res = np.clip(original[key], -lim, lim)
    t1 = time.perf_counter()
    us_per_loop = ((t1-t0)/runs) * 1e6
    print('np.clip took {:.2f} us/loop, measured over {:d} loops.'.format(us_per_loop, runs))
    t0 = time.perf_counter()
    for idx in range(runs):
        lim = 1.0 - (idx/runs)
        res = fn.clip_array(original[key], -lim, lim)
    t1 = time.perf_counter()
    us_per_loop = ((t1-t0)/runs) * 1e6
    print('clip_array took {:.2f} us/loop, measured over {:d} loops.'.format(us_per_loop, runs))
    t0 = time.perf_counter()
    for idx in range(runs):
        lim = 1.0 - (idx/runs)
        res = clip_alternative(original[key], -lim, lim)
    t1 = time.perf_counter()
    us_per_loop = ((t1-t0)/runs) * 1e6
    print('alternative clip took {:.2f} us/loop, measured over {:d} loops.'.format(us_per_loop, runs))

print('\nTiming "clip-in-place operation":')
for key in original:
    print('\n' + key + ' array elements:')
    runs = 100_000
    arr = original[key].copy()
    t0 = time.perf_counter()
    for idx in range(runs):
        lim = 1.0 - (idx/runs)  # gradually tighten the clipping range from 1.0 to 0.0
        np.clip(arr, -lim, lim, out=arr)
    t1 = time.perf_counter()
    us_per_loop = ((t1-t0)/runs) * 1e6
    print('np.clip took {:.2f} us/loop, measured over {:d} loops.'.format(us_per_loop, runs))
    arr = original[key].copy()
    t0 = time.perf_counter()
    for idx in range(runs):
        lim = 1.0 - (idx/runs)
        fn.clip_array(arr, -lim, lim, out=arr)
    t1 = time.perf_counter()
    us_per_loop = ((t1-t0)/runs) * 1e6
    print('clip_array took {:.2f} us/loop, measured over {:d} loops.'.format(us_per_loop, runs))
    arr = original[key].copy()
    t0 = time.perf_counter()
    for idx in range(runs):
        lim = 1.0 - (idx/runs)
        clip_alternative(arr, -lim, lim, out=arr)
    t1 = time.perf_counter()
    us_per_loop = ((t1-t0)/runs) * 1e6
    print('alternative clip took {:.2f} us/loop, measured over {:d} loops.'.format(us_per_loop, runs))
``` |
|
For the "copy and clip" operation, the alternative code seems to slightly outperform the current version by allocating a new array once rather than twice. But mostly, style points :) |
|
Style aside, there are indeed 2 bugs (or omissions).
Where the performance of pyqtgraph is particularly affected by the performance of clipping is for largish images, on the order of 1e6 pixels. There, clipping is but one part of the pipeline, so it is not easy to quantify small improvements in the timing of just one part. In addition, speedups in one part of the pipeline may sometimes end up slowing down other parts of the pipeline!
I am not sure I understand this. Whether explicitly allocating an empty array or letting numpy allocate one by itself, there's only one allocation involved? |
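A small illustration of the point about allocations, using the public np.maximum ufunc: whether the caller pre-allocates `out` or lets the ufunc allocate its own result, exactly one new array is created either way.

```python
import numpy as np

arr = np.random.random(10)

# letting numpy allocate the result: one new array
res1 = np.maximum(arr, 0.5)

# pre-allocating explicitly: one allocation total (np.empty_like);
# the ufunc itself then allocates nothing and writes into the buffer
buf = np.empty_like(arr)
res2 = np.maximum(arr, 0.5, out=buf)
assert res2 is buf            # the ufunc returns the buffer it was given
assert (res1 == res2).all()   # identical values either way
```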
|
True, I was thinking about dropping the `out = arr` line:

```python
if vmin is not None:
    arr = np.core.umath.maximum(arr, vmin, out=out)
if vmax is not None:
    arr = np.core.umath.minimum(arr, vmax, out=out)
```

If […] Your pre-allocation of […] |
|
I can confirm the bad performance of umath.clip on my Windows, conda, numpy 1.20.1 system: If you don't like to strip out the |
|
For vmin and vmax being None, I was thinking along the following lines. Consider:

```python
data = np.arange(10)
clipped = clip_array(data, None, None, out=None)
data[:] = 0
```

Would it be surprising that clipped now contains zeros? |
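For the record, the surprise is easy to verify as a script. This is a sketch of the pre-fix behavior, using the public np.maximum/np.minimum ufuncs rather than np.core.umath:

```python
import numpy as np

def clip_array(arr, vmin, vmax, out=None):
    # pre-fix version: when both bounds are None, neither branch runs
    # and the input array itself is returned
    if vmin is not None:
        arr = np.maximum(arr, vmin, out=out)
        out = arr
    if vmax is not None:
        arr = np.minimum(arr, vmax, out=out)
    return arr

data = np.arange(10)
clipped = clip_array(data, None, None)
assert clipped is data        # no copy was made: both names alias one array
data[:] = 0
assert (clipped == 0).all()   # mutating data silently changed "clipped" too
```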
|
I noticed that, but I was wondering how friendly we need to be for the None, None case. Well, how about this then?

```python
def clip_array(arr, vmin, vmax, out=None):
    if vmin is not None:
        arr = np.core.umath.maximum(arr, vmin, out=out)
        out = arr  # do not allocate again if clipping vmax
    if vmax is not None:
        arr = np.core.umath.minimum(arr, vmax, out=out)
        out = arr
    if out is None:
        out = arr.copy()
    return out
``` |
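A sketch (using the public ufuncs rather than np.core.umath) verifying that a version with the trailing copy no longer returns an alias of the input when both bounds are None:

```python
import numpy as np

def clip_array(arr, vmin, vmax, out=None):
    if vmin is not None:
        arr = np.maximum(arr, vmin, out=out)
        out = arr
    if vmax is not None:
        arr = np.minimum(arr, vmax, out=out)
        out = arr
    if out is None:
        out = arr.copy()  # both bounds were None: still hand back a fresh array
    return out

data = np.arange(10)
clipped = clip_array(data, None, None)
assert clipped is not data               # a copy was returned
data[:] = 0
assert (clipped == np.arange(10)).all()  # mutating data no longer affects clipped
```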
I'm not sure what the performance difference here is, but would it be worth having something like

```python
def clip_array_nonlinux(arr, vmin, vmax, out=None):
    # current implementation
    ...

def clip_array_linux(arr, vmin, vmax, out=None):
    # implementation using np.core.umath.clip
    ...

if platform == "linux":
    clip_array = clip_array_linux
else:
    clip_array = clip_array_nonlinux
``` |
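A runnable sketch of that dispatch idea, choosing the implementation once at import time via sys.platform. The function bodies are illustrative stand-ins (public ufuncs and np.clip), not the actual pyqtgraph implementations:

```python
import sys
import numpy as np

def clip_array_minmax(arr, vmin, vmax, out=None):
    # two-ufunc workaround, reportedly faster on MSVC/Clang-compiled NumPy
    if vmin is not None:
        arr = np.maximum(arr, vmin, out=out)
        out = arr
    if vmax is not None:
        arr = np.minimum(arr, vmax, out=out)
    return arr

def clip_array_direct(arr, vmin, vmax, out=None):
    # single-call variant, reportedly faster on GCC-compiled NumPy
    return np.clip(arr, vmin, vmax, out=out)

# choose once, at import time, instead of branching on every call
clip_array = clip_array_direct if sys.platform.startswith("linux") else clip_array_minmax

data = np.array([-2.0, 0.5, 2.0])
result = clip_array(data, -1.0, 1.0)
assert (result == np.array([-1.0, 0.5, 1.0])).all()
```

The module-level rebinding keeps the per-call overhead at zero, at the cost of two near-duplicate function bodies to maintain.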
|
On the same AMD machine running Windows and WSL2 Ubuntu:

```python
data = np.random.random(1000000).astype(np.float32)
out = np.ones_like(data)
%timeit np.clip(data, 0.1, 0.9, out)
%timeit np.core.umath.clip(data, 0.1, 0.9, out)
%timeit np.core.umath.minimum(np.core.umath.maximum(data, 0.1, out), 0.9, out)
```

Windows (np 1.20.1): 14.7 ms, 14.3 ms, 6 ms. Release notes for NumPy 1.20 say that there is wider use of SIMD for ufuncs. |
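As an aside, %timeit is IPython-only. The same comparison can be run as a plain script with the stdlib timeit module; array size and loop count here are arbitrary choices:

```python
import timeit
import numpy as np

data = np.random.random(100_000).astype(np.float32)
out = np.ones_like(data)

runs = 50
for label, stmt in [
    ("np.clip             ", lambda: np.clip(data, 0.1, 0.9, out)),
    ("minimum(maximum(..))", lambda: np.minimum(np.maximum(data, 0.1, out), 0.9, out)),
]:
    # timeit accepts a callable; report the mean time per call
    secs = timeit.timeit(stmt, number=runs)
    print("{}: {:.3f} ms/loop".format(label, secs / runs * 1e3))
```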
Calling np.clip with both bounds set to None yields `ValueError: One of max or min must be given`. So I don't think people expect a sensible result from passing in both vmin and vmax as None. |
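This is easy to confirm in a few lines; np.clip refuses to run when both bounds are missing:

```python
import numpy as np

data = np.arange(5)
try:
    np.clip(data, None, None)
    raised = False
except ValueError as exc:
    # message is "One of max or min must be given" on the NumPy
    # versions discussed here
    raised = True
assert raised
```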
|
This block is generating the following results on various platforms:

```python
data = np.random.random(1000000).astype(np.float32)
out = np.ones_like(data)
%timeit np.clip(data, 0.1, 0.9, out)
%timeit np.core.umath.clip(data, 0.1, 0.9, out)
%timeit np.core.umath.minimum(np.core.umath.maximum(data, 0.1, out), 0.9, out)
```

Putting some of the values into a table format here. I ran some tests on macOS.
I can get some benchmarks on a Windows machine with an Intel processor (and I can likely do it with WSL2 as well). |
|
I don't think it's worth trying to optimize for older versions of numpy. If pyqtgraph users really need the best performance at their disposal, they can update to the current version of numpy IMO. |
|
It's interesting that in #1617, we got measurement results and concluded that on macOS, it was better off to use […]. It would seem that, taken by itself in isolation, the workaround in […] |
|
Dear @pijyoi, I had a potentially elegant use for checking top and bottom clips separately in PlotDataItem, having the clipping function just "fall through" to the original data when neither requires clipping. But I agree that sticking to the np.clip behavior is a good idea. Since checking for […] I hope we will be able to strip out the helper function again once the current slowdown issue in np.clip is resolved. My understanding was that there is some issue with checking old/new implementations of NaN? |
|
@NilsNemitz , I think I have gotten mixed up with 2 different slow-downs. This is my current understanding.
Hence, the current implementation of clip_array is penalizing Linux and macOS performance. I will fix this together with the vmin=vmax=None issue.
The question is: are you able to perceive or measure the slowdown? Is np.clip the predominant cause of slowdown in your function? If not, sticking to np.clip should be just fine. |
|
Dear @pijyoi, to start with the important bit: **As such, I personally have no performance concerns about just using np.clip().** I can confirm the np.clip problem in the tests I showed above, where I see a 4x to 5x slowdown for short data sets (slow startup), and a 3x slowdown for very long data sets (overall slowdown). This is relative to your min()/max() code. I don't know if @j9ac9k's measurements can be compared across platforms, or if this involves different systems or emulation. If the numbers can be compared, it looks like the Linux / macOS systems are so much faster overall that the relative slowness of the min/max approach doesn't really matter: even the "slow" version outperforms the "fast" Windows version. If true, it might be acceptable to optimize for the terribly broken Windows implementation, and hope to remove the workaround again as soon as the numpy developers figure out how to fix this. What do you think? |
|
@NilsNemitz I have created #1649 to address the issues.
I uploaded some before and after timing records in #1648. For at least Linux Intel, the penalty is non-negligible. |
|
Grrr.... that is in the "clipToView" code. Sorry, my brain went from "clipping the data to the view" to assuming that the clips actually work on the data. But that just uses np.clip for min/max checks. That has never been ideal. How urgently do we need to revert that? Is it just a little slower? The time sink in clipToView would be the […]. Do you have a favorite style for clipping scalars? In the code I wrote for the dynamic range limiter, I am indeed clipping arrays. Thank you for the explanation and measurements of the slow-downs! |
Nothing urgent. An additional 2 lines were added to make it support scalars; however, a scalar input will get a single-element array returned. https://github.com/pyqtgraph/pyqtgraph/blob/master/pyqtgraph/functions.py#L1017-L1018 The intention was only to support arrays. Ironically, it was the manual allocation in the original version that made it support scalars!
Personally, I would wrap the above snippet in a python function because I have not trained myself to recognize it by sight. |
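Assuming the snippet in question is the usual scalar min/max clamping idiom, a named wrapper (the name clip_scalar is illustrative here) might look like:

```python
def clip_scalar(val, vmin, vmax):
    # named wrapper around the scalar min/max idiom, for readability;
    # returns the value itself, not a single-element array
    return vmin if val < vmin else vmax if val > vmax else val

assert clip_scalar(5, 0, 10) == 5     # inside the range: unchanged
assert clip_scalar(-3, 0, 10) == 0    # below: clamped to vmin
assert clip_scalar(42, 0, 10) == 10   # above: clamped to vmax
```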
This PR refactors the np.clip regression workaround from #1632 into pg.clip_array.
It documents why np.core.umath.clip is not used (info from #1617).