Optimize makeARGB for ubyte images #1617
Conversation
Yes! I feel like the long-term plan for makeARGB should be something like "determine the shape of the data+args, send off to a purpose-built function". The general nature of that function has made optimization too complicated for me to think about.
Looking at the function, I see a few places for optimization (not counting numba, which is 100% on the table IMO), but I'm not feeling comfortable making a go at it until
Huge improvement, by the way. Using 2048x2048, my system sees a 4.5x (py3.7+qt5) and 8x (py3.9+pyside2) improvement. (Aside: I should really set up ASV to consolidate and standardize these measurements.) Adding a Scale (

Specifying levels (even if equal to the dtype min,max) will trigger #1590.
@outofculture I would review this, but given the complexity and how much this impacts your work, I figure I'll give you the honors haha. @pijyoi thanks for keeping at this.
```python
data = xp.empty((frames,) + frame_shape, dtype=dt)
view = data.reshape((-1,) + chan_shape)
for idx in range(view.shape[0]):
    subdata = xp.random.normal(loc=loc, scale=scale, size=chan_shape)
```
This omits the gaussian blurring of the previous code. I'm not sure if that was in there for any important reason. I read through the history, and it has always been a part of this code, and never with explanation. It does visibly alter the image, which is maybe more characteristic of e.g. microscope images. I didn't notice any fps differences either way, so it might be safest to leave it in place.
The gaussian blurring is the function that slows down array creation the most. This becomes observable for larger matrices like 2048x2048, and large matrices are needed to make the profiling timings more measurable. It also makes changing the size on the fly from the GUI very slow, which makes experimentation harder.

It doesn't affect the frame rate, but removing it makes experimenting easier.
That's legit. Could you leave something in as a comment for now? I'll dig a bit further to see if there's ever a good cause for smoothing things first.
I took a look at pg.gaussianFilter(). I counted at least 4 temporary copies of the input data in flight within that function. So for a 4096x4096 uint8 array, the initial random float64 data would be 128MB, and while passing through pg.gaussianFilter(), another 512MB would be allocated.

This would be rather surprising to a user who thought their array was only 16MB.
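For reference, the arithmetic behind those numbers can be sketched as follows (a back-of-the-envelope check of the figures above, not pyqtgraph code, and not a claim about gaussianFilter's exact internals):

```python
import numpy as np

n = 4096 * 4096
uint8_mb = n * np.dtype(np.uint8).itemsize / 2**20      # the user's array
float64_mb = n * np.dtype(np.float64).itemsize / 2**20  # same data as float64
temporaries_mb = 4 * float64_mb                         # ~4 in-flight copies

print(uint8_mb, float64_mb, temporaries_mb)  # 16.0 128.0 512.0
```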
```python
ain_view = ain[row_beg:row_end, ...]
aout_view = aout[row_beg:row_end, ...]
qimg = QtGui.QImage(ain_view, ncols, ain_view.shape[0], ain.strides[0], in_fmt)
qimg = qimg.convertToFormat(out_fmt)
```
Nothing else in makeARGB does batching, though maybe more of it should. How did you decide on the batch size of 512 KPixels? Does this perform well on Windows? (I'm unable to access my Windows-based test systems today.) Is this a special case that makes batching particularly valuable, or is it worth exploring this strategy elsewhere?
The primary purpose of the batching here was to restrict the amount of extra memory used on the fast path, since QImage.convertToFormat() returns a new image.

So for example, without batching, if the input data was 4096x4096 uint8 (16MB), the ARGB output of QImage.convertToFormat() would result in the (temporary) consumption of another 64MB. The idea was that the user should not have to pay for more memory usage.

In this case, being cache-friendly was a secondary consideration. That said, 512 KBytes was chosen as a value that hopefully fits within common processor L2 cache sizes, while not being so small that looping overhead becomes dominant. But I would agree with you if it seems like a magic value.

As I said, here it's less about performance than about reducing memory usage. After the speedup here, there were much larger bottlenecks elsewhere that would dwarf any minor gains from further tweaking.

I was trying out this batching strategy within rescaleData(), where it seemed that the multiple operations working on the same data would benefit, until I found that np.clip() was the issue. I will revisit the batching there some other time.

In my opinion, the benefit has to be large enough to warrant the added code complexity / ugliness.
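The batch sizing described above can be sketched with concrete numbers (the image dimensions here are illustrative, not from the PR):

```python
# Cap each chunk at ~512 KPixels so the temporary buffer produced by
# QImage.convertToFormat() stays small and hopefully L2-resident.
ncols, nchans = 4096, 1          # a 4096-wide single-channel image
npixels_chunk = 512 * 1024
batch = max(1, npixels_chunk // (ncols * nchans))   # rows per chunk
print(batch)   # 128 rows, i.e. a 512 KB uint8 slice per iteration
```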
Thanks for explaining all that.
Batching might have a pretty substantial performance impact in this case. On a whim, I played with the batch size while processing a 3072x3072 test, and saw:
32-128k: 80fps
256-1024k: 90fps
2048k: 70fps
no batching: 50fps
So at least by that measure, you picked a pretty good magic number! I've only got 256KB L2 cache here, so I don't know how this performs under different conditions.
Regarding memory concerns, I just noticed these docs, which say it'll do the conversion in place. It works to replace the last line referenced above with:

```python
qimg.convertTo(out_fmt)
```

Batching might still be desired here; my system still reports a significantly higher fps when batching at 512k. Yet I also just now observed that the PySide2 jitter and slowness go away when batching is disabled. So maybe batching is bad somehow? Ugh, this is still confusing me.
Looks like convertTo was introduced in Qt 5.13; as we still support 5.12, this might be problematic (but we can certainly incorporate it within the library and give 5.12 folks an alternative path, even if it's slower).
I implemented Variation 1: using a) The (b) observation might be due to the different CPUs used for Windows vs Linux.
I've been thinking about this looping: would it be feasible to make use of https://numpy.org/doc/stable/reference/generated/numpy.apply_along_axis.html#numpy.apply_along_axis? Sorry for the drive-by comment; looking at the loops and the transpose handling, I've always thought this might be a good candidate for

EDIT: just realized we would want
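For anyone unfamiliar, here is a toy sketch of what np.apply_along_axis does (the function and data are made up for illustration; note it makes a Python-level call per 1-D slice, so it is not a vectorization win by itself):

```python
import numpy as np

def rescale_row(row):
    # normalize one 1-D slice to the range [0, 1]
    lo, hi = row.min(), row.max()
    return (row - lo) / (hi - lo) if hi > lo else np.zeros_like(row)

data = np.array([[0.0, 5.0, 10.0],
                 [2.0, 4.0, 6.0]])
out = np.apply_along_axis(rescale_row, 1, data)  # axis=1: apply per row
print(out)  # both rows become [0.0, 0.5, 1.0]
```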
Thanks @j9ac9k for the input! It reminded me of

I noticed that the batching size used was 8192 elements, so it seems like it is targeting the L1 cache size instead.

oh wow, that is a lot cleaner, you're right.
I'm having trouble quantifying something that showed up while I was testing this (sorry I'm not a better QA person). When using PySide2 to process a 3072-square image, the reported fps is high, but my perceived fps is much, much lower, and inconsistent at that. PyQt5 (90fps) has smoothly flickering static. PySide2 (180fps) has a jerky, slow mess. This behavior goes away when I'm on master (20fps). This behavior goes away when I increase my image size to 4096-square (110fps) or decrease it to 512-square (1050fps). This behavior goes away when I use cupy (110fps). This is all conducted on Linux, and if need be I can dump all relevant versions and hw stats here. My one thought is that this is just an artifact of the framerate being higher than the actual refresh rate of my screen. That's not exactly corroborated by the other measurements, which are nearly all higher than 60fps. I still want to put some research into what happens when I vsync-throttle this operation, but I wanted to alert you to this behavior in the meantime. We're also going to need to come up with a better way of measuring perceived fps, eventually.
Is this repeatable? I wouldn't mind seeing if I can give this a go on my local machines; I have a high-refresh-rate monitor for my laptop external display, so it's easy for me to see if things are different on my 60 fps display vs. my 144 Hz display.
Not sure if this would help, but I was experimenting with a rolling average for computing FPS using the

You can see my implementation here:
I don't have any development systems right now except for my Linux laptop, so I don't know. I'm just running

That might help, but I'm worried we'll have to measure things from outside of Python to get a definitive result.
I switched umath.clip() back to umath.{minimum(),maximum()} in

I think the part that adds a lot of variation to the timings from frame to frame is that
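The two clip formulations being swapped are interchangeable in result; only their speed differs across platforms and dtypes. A quick sanity check (my own snippet, not the PR's code, using the public np.* spellings of the same ufuncs):

```python
import numpy as np

data = (np.random.rand(10000) * 300.0).astype(np.float32)
clipped = np.clip(data, 0, 255)
minmaxed = np.minimum(np.maximum(data, 0), 255)
assert np.array_equal(clipped, minmaxed)   # identical output, different speed
```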
To investigate, I added an ugly throttle to

```python
from datetime import datetime
from time import sleep

now = datetime.now()
if self._lastRender is None or (now - self._lastRender).total_seconds() >= (1 / 60):
    self._lastRender = now
else:
    sleep(1 / 120)
    return
```

With this code in place, perceived fps was smooth and fast. I'm not entirely sure what to do with this information, though. We should probably try to throttle some function or functions to happen not much more often than vsync regardless, as anything more than that is just wasting cpu time. One other note: I also checked that using PySide2 on master with a 1024-square image does not produce the jittery slow speeds, despite running at about the same reported fps as the 3072-square gets on this branch. This would suggest that it's not PySide2 always misbehaving at certain framerates.
A script just to time

```python
import time
import numpy as np
from pyqtgraph import functions

shape = 3072, 3072
data = np.random.randint(0, 255, size=shape, dtype=np.uint8)
niters = 10

for scale in [1.1, 1.5]:
    params = data, scale, 1, np.uint8

    t0 = time.perf_counter()
    for _ in range(niters):
        output = functions.rescaleData(*params)
    t1 = time.perf_counter()
    print('A', scale, t1 - t0)

    t2 = time.perf_counter()
    for _ in range(niters):
        output = functions.rescaleData_blocked(*params)
    t3 = time.perf_counter()
    print('B', scale, t3 - t2)
```

I got:

So

For comparison, the timings for
After running the script on Ubuntu 20.04 and Windows 10, switching between using

Using MINMAX on Windows or Linux yields the same observation that clipping timings depend on the number of elements that actually got clipped.

On Windows, using

On Linux, using
@pijyoi is the performance difference big enough that we should try to utilize the optimal method on each platform? I can get some benchmarks on macOS. |
The difference is significant. On Linux, for the scale 1.1, it was 2x. For the scale 1.5, it was 3x.
I have added in the code to use
I think I understand why clipping in NumPy is slow relative to the other operations within rescaleData(). On platform runtimes where the NaN check is expensive, this greatly slows down float comparison operations. In normal C/C++ code, we usually "know" when the data doesn't have NaNs; however, there's no mechanism to let NumPy know that, e.g. that our source data was uint8 or uint16 and thus couldn't contain NaNs.
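A quick illustration of the point (my own example, not from the PR): float comparisons must account for NaN, while integer dtypes cannot represent one at all.

```python
import numpy as np

# NaN propagates through float minimum/maximum, so every float
# comparison has to account for it:
assert np.isnan(np.minimum(np.float64('nan'), 5.0))

# uint16 data can never contain NaN, so the same clip reduces to
# plain integer comparisons:
u = np.arange(300, dtype=np.uint16)
assert np.array_equal(np.clip(u, 10, 200),
                      np.minimum(np.maximum(u, 10), 200))
```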
In that case, we should probably add a

I'll try and get to this later today.
@pijyoi Running your benchmark code on macOS 11.2.3 with numpy 1.19.5 and with 1.20.1, with

On 1.20.1, A and B for 1.5 scale were effectively equivalent.

with

Let me know if I should be testing something a bit different.
QImage.convertTo() is "in-place" only shallowly. A new buffer is allocated, and the QImage instance is internally re-pointed to it.
So, when using umath.{minimum,maximum},

Could I verify that (approximately) the same timings were obtained if you ran the benchmark a few times? I have added a small-integer clipping path to rescaleData_blocked(). It runs faster on Windows. Could you please also try that on macOS?
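One way a small-integer fast path can work (a hypothetical sketch; I don't know the branch's exact implementation) is a lookup table covering the dtype's whole value range, so the clip is applied with a single fancy-index and no per-element float comparisons:

```python
import numpy as np

# precompute the clip once for every possible uint16 value...
lut = np.clip(np.arange(2**16), 1000, 60000).astype(np.uint16)

# ...then apply it to the image with a single table lookup
data = np.random.randint(0, 2**16, size=(512, 512), dtype=np.uint16)
out = lut[data]
assert out.min() >= 1000 and out.max() <= 60000
```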
I pulled your branch, re-added removing

...that's quite the improvement.
This LGTM, @outofculture you have anything to add?
I still can't explain the jittery, slow behavior I see only on this branch, but am I the only person seeing it? If mine is the only system experiencing it, and only for images of a certain size, it starts to feel like it's just a weird problem on my system. My fear is that this problem will be more widespread for our users, or that it will pop up later and be hard to track back here. Long-term, we need to come up with benchmarks that measure the actual visible framerate, and benchmarks that can capture and quantify "jittery". I'll work on those, and hopefully at some later time we can measure this problem, bisect to find its cause, and address it then. Otherwise, this branch is an excellent set of optimizations.
I have replicated the issue @outofculture is describing on macOS. The best way I can describe it is that there are effectively two images; one of the two is shown the bulk of the time, but occasionally the other is shown momentarily as a flicker.

Screen.Recording.2021-03-22.at.10.36.36.PM.mov

Screen.Recording.2021-03-22.at.10.37.48.PM.mov

Command line usage I used:

```shell
python examples/VideoSpeedTest.py --size 3072x3072
```

Edit: it goes without saying I'm happy to test alternate configurations and so on, doubly so if it turns out that this issue is most noticeable on macOS.
I have added a commit to #1648 that draws the white arrows at different positions for different frames. This makes the effect easier to see. From my testing on Windows, the culprit seems to be related to disabling vsync in #1599:

```python
if RawImageGLWidget is not None:
    # don't limit frame rate to vsync
    sfmt = QtGui.QSurfaceFormat()
    sfmt.setSwapInterval(0)
    QtGui.QSurfaceFormat.setDefaultFormat(sfmt)
```
I commented out the vsync code, and added the code from the arrows, but it largely looks the same. Here is another video in case it helps.

Screen.Recording.2021-03-23.at.8.34.53.AM.mov

Not sure it helps, but on the current master branch I commented out

```python
# if RawImageGLWidget is not None:
#     # don't limit frame rate to vsync
#     sfmt = QtGui.QSurfaceFormat()
#     sfmt.setSwapInterval(0)
#     QtGui.QSurfaceFormat.setDefaultFormat(sfmt)
```

and reran the test and got this result:

Screen.Recording.2021-03-23.at.8.42.37.AM.mov
Some remarks: the disable-vsync code only kicks in when OpenGL is enabled. On the fastpath_argb branch, you are running with OpenGL disabled (i.e. vsync is left enabled by default anyway). On the master branch, you are running with OpenGL enabled (vsync is enabled because you commented out the code).
Hi @pijyoi, sorry, I was tinkering around with a bunch of variations and unintentionally added far more variables into the equation here; the last thing I meant to do was add to the noise. Just so I give helpful information, what configurations do you want me to test? Also, testing configurations with OpenGL is a bit tricky on macOS Big Sur (what I'm running); it's doable, but I have to be running effectively a brand new Python version (3.9.1+). If I understand your comment right, you'd like me to report the fps on the master branch, with OpenGL enabled and the vsync code commented out, with smaller image sizes?
On Windows, just having a RawImageGLWidget instantiated (i.e. selectable in the VideoSpeedTest GUI but not actually selected) affects the frame rate: it slows down the other 2 widgets. This is true whether vsync is disabled or not. With vsync on, all 3 widgets are capped to the monitor refresh rate. It would be good to establish whether the effect is already present on the master branch.

So using a lower image size is just a means to increase the frame rate on the master branch. To simplify things, you could just disable OpenGL. The fps value isn't that important; the question is whether the jitter (?) effect already exists on the master branch.

OpenGL disabled. Vsync code has no effect. Smaller image size.
I was just looking at this a bit more; maybe the jitter effect isn't as noticeable on the master branch due to the blurring (which I assume is the gaussian filter at work?).
I haven't been able to reproduce this effect on master. I wouldn't trust that finding completely, though; I want a way to programmatically detect this behavior, so I can run bisects and try extensive combinations of image sizes and options. I haven't been able to come up with a way to do that yet.

Yeah, I've been commenting the gaussian filter out on master when I run tests there.

I made attempts with smaller image sizes on master, such that the reported framerates were the same (~150fps) as when I experienced this behavior on fastpath_argb. Really small images (256x256) on master produce a similar effect, but not nearly as consistent or dramatic. They also report much higher fps (1240ish), so it's enough to feel like anything that test shows could be something else entirely. Maybe it only happens on large-ish images when the fps is also high?
I added d7ca338 to #1648 that draws a moving white line that should be easier to visualize. I think the issue is:

Something like the immobile helicopter blades in this video.
@pijyoi I can see the white line flickering and moving up.

Screen.Recording.2021-03-23.at.8.29.21.PM.mov

Screen.Recording.2021-03-23.at.8.30.17.PM.mov
I guess the line would be flickering because you have more pixels in the image than on the screen.
P.S.: For fun, pan the image during the animation. I find it very satisfying that it doesn't stop updating while it is "floating" :)

Another way that doesn't require a code change may be to use non-square images, say 10000x512. (I found out that the .ui file limits the maximum value to 10000.)
random.normal() generates float64 and gets converted to a smaller dtype, so generating all the needed data in a single call uses a lot more memory than necessary. This changes it so that smaller chunks are generated. Data clipping is also changed to be in-place. The gaussian filtering, which gave the video a washed-out look, is removed as well; it also contributed to data generation time.
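A minimal sketch of the chunked approach the commit describes (shapes and distribution parameters here are illustrative, not the example's actual values):

```python
import numpy as np

# Generate frame-by-frame so only one float64 chunk exists at a time,
# instead of materializing the whole stack as float64 up front.
frames, h, w = 8, 256, 256
data = np.empty((frames, h, w), dtype=np.uint8)
for i in range(frames):
    chunk = np.random.normal(loc=128.0, scale=32.0, size=(h, w))
    np.clip(chunk, 0, 255, out=chunk)    # in-place clip, no extra copy
    data[i] = chunk                      # cast to uint8 on assignment
```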
Okay, I have finally reproduced this behavior on master! It took using non-square images, weirdly. 7501x128 on my system (on master, using PySide2) shows the jitter; 128x7501 does not. Gah. This is probably an important clue about the ultimate cause. There's also a 3x difference in reported fps between those two scenarios, so the jitter might still just be tied to fps. Well, whatever, I'm willing to take this as sufficient evidence that the problem exists regardless of the changes in this branch. Sorry to have held up an otherwise excellent PR.
Master branch's VideoSpeedTest leaves the image axis order at its default of column-major. This results in cases where the processing source buffer is F-contiguous while the destination buffer is C-contiguous. This really exercises the memory system and causes seemingly hard-to-explain variations in timings.
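The contiguity mismatch is easy to reproduce in isolation (a standalone illustration, not VideoSpeedTest code):

```python
import numpy as np

img = np.zeros((3072, 3072), dtype=np.uint8)
assert img.flags['C_CONTIGUOUS']        # row-major buffer

col_major = img.T                       # column-major view of the same memory
assert col_major.flags['F_CONTIGUOUS']
assert not col_major.flags['C_CONTIGUOUS']
# Copying between the two orders walks memory with a large stride on
# one side, which is what stresses the cache hierarchy.
```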
If we now have confirmation that this PR doesn't cause new breakage, is this ready to merge?
Yes, ready to merge.
@pijyoi I know I'm sounding like a broken record, but thanks for working on this. As you know, makeARGB is a bottleneck for a lot of folks, so no doubt you have made a positive change here. Merging!


This PR optimizes the channel reordering portion of makeARGB by using the conversion routines of QImage.
It only optimizes for row-major ubyte images and defers to the original code for the rest. The improvement in fps is measurable with VideoSpeedTest.
Would such a PR that optimizes for specific data types be acceptable?
Tagging @outofculture