Optimize makeARGB for ubyte images #1617
Conversation
Yes! I feel like the long-term plan for makeARGB should be something like "determine the shape of the data+args, send off to a purpose-built function". The general nature of that function has made optimization too complicated for me to think about.
Looking at the function, I see a few places for optimization (not counting numba, which is 100% on the table IMO), but I'm not feeling comfortable making a go at it until
Huge improvement, by the way. Using 2048x2048, my system sees a 4.5x (py3.7+qt5) and 8x (py3.9+pyside2) improvement. (Aside: I should really set up ASV to consolidate and standardize these measurements.) Adding a Scale (

Specifying levels (even if equal to the dtype min,max) will trigger #1590.
@outofculture I would review this, but given the complexity and how much this impacts your work, I figure I'll give you the honors haha. @pijyoi thanks for keeping at this.
```python
data = xp.empty((frames,) + frame_shape, dtype=dt)
view = data.reshape((-1,) + chan_shape)
for idx in range(view.shape[0]):
    subdata = xp.random.normal(loc=loc, scale=scale, size=chan_shape)
```
This omits the gaussian blurring of the previous code. I'm not sure if that was in there for any important reason. I read through the history, and it has always been a part of this code, and never with explanation. It does visibly alter the image, which is maybe more characteristic of e.g. microscope images. I didn't notice any fps differences either way, so it might be safest to leave it in place.
The gaussian blurring is the function that slows down array creation the most. This becomes observable for larger matrices like 2048x2048, and large matrices are needed to make the profiling timings more measurable. It also makes changing the size on the fly from the GUI very slow, which makes experimentation harder.

It doesn't affect the frame rate, but removing it makes experimenting easier.
That's legit. Could you leave something in as a comment for now? I'll dig a bit further to see if there's ever a good cause for smoothing things first.
I took a look at pg.gaussianFilter(). I counted at least 4 temporary copies of the input data in flight within that function. So for a 4096x4096 uint8 array, the initial random float64 data would be 128MB, and while passing through pg.gaussianFilter(), another 512MB would be allocated.

This would be rather surprising to a user who thought their array was only 16MB.
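For reference, the arithmetic behind those numbers can be sketched as follows (a back-of-the-envelope check of the figures above, not pyqtgraph code, and not a claim about gaussianFilter's exact internals):

```python
import numpy as np

n = 4096 * 4096
uint8_mb = n * np.dtype(np.uint8).itemsize / 2**20      # the user's array
float64_mb = n * np.dtype(np.float64).itemsize / 2**20  # same data as float64
temporaries_mb = 4 * float64_mb                         # ~4 in-flight copies

print(uint8_mb, float64_mb, temporaries_mb)  # 16.0 128.0 512.0
```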
```python
ain_view = ain[row_beg:row_end, ...]
aout_view = aout[row_beg:row_end, ...]
qimg = QtGui.QImage(ain_view, ncols, ain_view.shape[0], ain.strides[0], in_fmt)
qimg = qimg.convertToFormat(out_fmt)
```
Nothing else in makeARGB does batching, though maybe more of it should. How did you decide on the batch size of 512 KPixels? Does this perform well on Windows? (I'm unable to access my Windows-based test systems today.) Is this a special case that makes batching particularly valuable, or is it worth exploring this strategy elsewhere?
The primary purpose of the batching here was to restrict the amount of extra memory used on the fast path, since QImage.convertToFormat() returns a new image.

So for example, without batching, if the input data was 4096x4096 uint8 (16MB), the ARGB output of QImage.convertToFormat() would result in the (temporary) consumption of another 64MB. The idea was that the user should not have to pay for more memory usage.

In this case, being cache-friendly was a secondary consideration. That said, 512 KBytes was chosen as a value that hopefully fits within common processor L2 cache sizes, while not being so small that looping overhead becomes dominant. But I would agree with you if it seems like a magic value.

As I said, here it's less about performance than about reducing memory usage. After the speedup here, there were much larger bottlenecks elsewhere that would dwarf any minor gains from further tweaking.

I was trying out this batching strategy within rescaleData(), where it seemed that the multiple operations working on the same data would benefit, until I found that np.clip() was the issue. I will revisit the batching there some other time.

In my opinion, the benefit has to be large enough to warrant the added code complexity / ugliness.
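The batch sizing described above can be sketched with concrete numbers (the image dimensions here are illustrative, not from the PR):

```python
# Cap each chunk at ~512 KPixels so the temporary buffer produced by
# QImage.convertToFormat() stays small and hopefully L2-resident.
ncols, nchans = 4096, 1          # a 4096-wide single-channel image
npixels_chunk = 512 * 1024
batch = max(1, npixels_chunk // (ncols * nchans))   # rows per chunk
print(batch)   # 128 rows, i.e. a 512 KB uint8 slice per iteration
```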
Thanks for explaining all that.
Batching might have a pretty substantial performance impact in this case. On a whim, I played with the batch size while processing a 3072x3072 test, and saw:
32-128k: 80fps
256-1024k: 90fps
2048k: 70fps
no batching: 50fps
So at least by that measure, you picked a pretty good magic number! I've only got 256KB L2 cache here, so I don't know how this performs under different conditions.
Regarding memory concerns, I just noticed these docs, which say it'll do the conversion in place. It works to replace the last line referenced above with:

```python
qimg.convertTo(out_fmt)
```

Batching might still be desired here; my system still reports a significantly higher fps when batching at 512k. Yet I also just now observed that the PySide2 jitter and slowness go away when batching is disabled. So maybe batching is bad somehow? Ugh, this is still confusing me.
Looks like convertTo was introduced in Qt 5.13; as we still support 5.12, this might be problematic (but we can certainly incorporate it within the library and give 5.12 folks an alternative path, even if it's slower).
I implemented Variation 1: using a) The (b) observation might be due to the different CPUs used for Windows vs Linux.
I've been thinking about this looping: would it be feasible to make use of https://numpy.org/doc/stable/reference/generated/numpy.apply_along_axis.html#numpy.apply_along_axis? Sorry for the drive-by comment; looking at the loops and the transpose handling, I've always thought this might be a good candidate for

EDIT: just realized we would want
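For anyone unfamiliar, here is a toy sketch of what np.apply_along_axis does (the function and data are made up for illustration; note it makes a Python-level call per 1-D slice, so it is not a vectorization win by itself):

```python
import numpy as np

def rescale_row(row):
    # normalize one 1-D slice to the range [0, 1]
    lo, hi = row.min(), row.max()
    return (row - lo) / (hi - lo) if hi > lo else np.zeros_like(row)

data = np.array([[0.0, 5.0, 10.0],
                 [2.0, 4.0, 6.0]])
out = np.apply_along_axis(rescale_row, 1, data)  # axis=1: apply per row
print(out)  # both rows become [0.0, 0.5, 1.0]
```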
Thanks @j9ac9k for the input! It reminded me of

I noticed that the batching size used was 8192 elements, so it seems like it is targeting the L1 cache size instead.

oh wow, that is a lot cleaner, you're right.
I'm having trouble quantifying something that showed up while I was testing this (sorry I'm not a better QA person). When using PySide2 to process a 3072-square image, the reported fps is high, but my perceived fps is much, much lower, and inconsistent at that. PyQt5 (90fps) has smoothly flickering static. PySide2 (180fps) has a jerky, slow mess. This behavior goes away when I'm on master (20fps). This behavior goes away when I increase my image size to 4096-square (110fps) or decrease it to 512-square (1050fps). This behavior goes away when I use cupy (110fps). This is all conducted on Linux, and if need be I can dump all relevant versions and hw stats here. My one thought is that this is just an artifact of the framerate being higher than the actual refresh rate of my screen. That's not exactly corroborated by the other measurements, which are nearly all higher than 60fps. I still want to put some research into what happens when I vsync-throttle this operation, but I wanted to alert you to this behavior in the meantime. We're also going to need to come up with a better way of measuring perceived fps, eventually.
Is this repeatable? I wouldn't mind seeing if I can give this a go on my local machines; I have a high-refresh-rate monitor for my laptop external display, so it's easy for me to see if things are different on my 60 fps display vs. my 144 Hz display.
Not sure if this would help, but I was experimenting with a rolling average for computing FPS using the

You can see my implementation here:
I don't have any development systems right now except for my Linux laptop, so I don't know. I'm just running

That might help, but I'm worried we'll have to measure things from outside of Python to get a definitive result.
I switched umath.clip() back to umath.{minimum(),maximum()} in

I think the part that adds a lot of variation to the timings from frame to frame is that
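The two clip formulations being swapped are interchangeable in result; only their speed differs across platforms and dtypes. A quick sanity check (my own snippet, not the PR's code, using the public np.* spellings of the same ufuncs):

```python
import numpy as np

data = (np.random.rand(10000) * 300.0).astype(np.float32)
clipped = np.clip(data, 0, 255)
minmaxed = np.minimum(np.maximum(data, 0), 255)
assert np.array_equal(clipped, minmaxed)   # identical output, different speed
```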
To investigate, I added an ugly throttle to

```python
from datetime import datetime
from time import sleep

now = datetime.now()
if self._lastRender is None or (now - self._lastRender).total_seconds() >= (1 / 60):
    self._lastRender = now
else:
    sleep(1 / 120)
    return
```

With this code in place, perceived fps was smooth and fast. I'm not entirely sure what to do with this information, though. We should probably try to throttle some function or functions to happen not much more often than vsync regardless, as anything more than that is just wasting cpu time. One other note: I also checked that using PySide2 on master with a 1024-square image does not produce the jittery slow speeds, despite running at about the same reported fps as the 3072-square gets on this branch. This would suggest that it's not PySide2 always misbehaving at certain framerates.
A script just to time

```python
import time
import numpy as np
from pyqtgraph import functions

shape = 3072, 3072
data = np.random.randint(0, 255, size=shape, dtype=np.uint8)
niters = 10

for scale in [1.1, 1.5]:
    params = data, scale, 1, np.uint8

    t0 = time.perf_counter()
    for _ in range(niters):
        output = functions.rescaleData(*params)
    t1 = time.perf_counter()
    print('A', scale, t1 - t0)

    t2 = time.perf_counter()
    for _ in range(niters):
        output = functions.rescaleData_blocked(*params)
    t3 = time.perf_counter()
    print('B', scale, t3 - t2)
```

I got:

So

For comparison, the timings for
After running the script on Ubuntu 20.04 and Windows 10, switching between using

Using MINMAX on Windows or Linux yields the same observation that clipping timings depend on the number of elements that actually got clipped.

On Windows, using

On Linux, using
@pijyoi is the performance difference big enough that we should try to utilize the optimal method on each platform? I can get some benchmarks on macOS. |
The difference is significant. On Linux, for the scale 1.1, it was 2x. For the scale 1.5, it was 3x.
I have added in the code to use
I think I understand why clipping in NumPy is slow relative to the other operations within rescaleData(). On platform runtimes where the NaN check is expensive, this greatly slows down float comparison operations. In normal C/C++ code, we usually "know" when the data doesn't have NaNs; however, there's no mechanism to let NumPy know that, e.g. that our source data was uint8 or uint16 and thus couldn't contain NaNs.
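A quick illustration of the point (my own example, not from the PR): float comparisons must account for NaN, while integer dtypes cannot represent one at all.

```python
import numpy as np

# NaN propagates through float minimum/maximum, so every float
# comparison has to account for it:
assert np.isnan(np.minimum(np.float64('nan'), 5.0))

# uint16 data can never contain NaN, so the same clip reduces to
# plain integer comparisons:
u = np.arange(300, dtype=np.uint16)
assert np.array_equal(np.clip(u, 10, 200),
                      np.minimum(np.maximum(u, 10), 200))
```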
In that case, we should probably add a

I'll try and get to this later today.
@pijyoi Running your benchmark code on macOS 11.2.3 with numpy 1.19.5 and with 1.20.1, with

On 1.20.1, A and B for 1.5 scale were effectively equivalent.

with

Let me know if I should be testing something a bit different.
QImage.convertTo() is "in-place" only shallowly. A new buffer is allocated, and the QImage instance is internally re-pointed to it.
So, when using umath.{minimum,maximum},

Could I verify that (approximately) the same timings were obtained if you ran the benchmark a few times? I have added a small-integer clipping path to rescaleData_blocked(). It runs faster on Windows. Could you please also try that on macOS?
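One way a small-integer fast path can work (a hypothetical sketch; I don't know the branch's exact implementation) is a lookup table covering the dtype's whole value range, so the clip is applied with a single fancy-index and no per-element float comparisons:

```python
import numpy as np

# precompute the clip once for every possible uint16 value...
lut = np.clip(np.arange(2**16), 1000, 60000).astype(np.uint16)

# ...then apply it to the image with a single table lookup
data = np.random.randint(0, 2**16, size=(512, 512), dtype=np.uint16)
out = lut[data]
assert out.min() >= 1000 and out.max() <= 60000
```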
I pulled your branch, re-added removing

...that's quite the improvement.
This LGTM, @outofculture you have anything to add?
I still can't explain the jittery, slow behavior I see only on this branch, but am I the only person seeing it? If mine is the only system experiencing it, and only for images of a certain size, it starts to feel like it's just a weird problem on my system. My fear is that this problem will be more widespread for our users, or that it will pop up later and be hard to track back here. Long-term, we need to come up with benchmarks that measure the actual visible framerate, and benchmarks that can capture and quantify "jittery". I'll work on those, and hopefully at some later time we can measure this problem, bisect to find its cause, and address it then. Otherwise, this branch is an excellent set of optimizations.
I have replicated the issue @outofculture is describing on macOS. The best way I can describe it is that there are effectively two images; one of the two is shown the bulk of the time, but occasionally the other is shown momentarily as a flicker.

Screen.Recording.2021-03-22.at.10.36.36.PM.mov

Screen.Recording.2021-03-22.at.10.37.48.PM.mov

Command line usage I used:

```shell
python examples/VideoSpeedTest.py --size 3072x3072
```

Edit: it goes without saying I'm happy to test alternate configurations and so on, doubly so if it turns out that this issue is most noticeable on macOS.
I have added a commit to #1648 that draws the white arrows at different positions for different frames. This makes the effect easier to see. From my testing on Windows, the culprit seems to be related to disabling vsync in #1599:

```python
if RawImageGLWidget is not None:
    # don't limit frame rate to vsync
    sfmt = QtGui.QSurfaceFormat()
    sfmt.setSwapInterval(0)
    QtGui.QSurfaceFormat.setDefaultFormat(sfmt)
```
I commented out the vsync code, and added the code from the arrows, but it largely looks the same. Here is another video in case it helps.

Screen.Recording.2021-03-23.at.8.34.53.AM.mov

Not sure it helps, but on the current master branch I commented out

```python
# if RawImageGLWidget is not None:
#     # don't limit frame rate to vsync
#     sfmt = QtGui.QSurfaceFormat()
#     sfmt.setSwapInterval(0)
#     QtGui.QSurfaceFormat.setDefaultFormat(sfmt)
```

and reran the test and got this result:

Screen.Recording.2021-03-23.at.8.42.37.AM.mov
Some remarks: the disable-vsync code only kicks in when OpenGL is enabled. On the fastpath_argb branch, you are running with OpenGL disabled (i.e. vsync is left enabled by default anyway). On the master branch, you are running with OpenGL enabled (vsync is enabled because you commented out the code).
Hi @pijyoi, sorry, I was tinkering around with a bunch of variations and unintentionally added far more variables into the equation here; the last thing I meant to do was add to the noise. Just so I give helpful information, what configurations do you want me to test? Also, testing configurations with OpenGL is a bit tricky on macOS Big Sur (what I'm running); it's doable, but I have to be running effectively a brand new Python version (3.9.1+). If I understand your comment right, you'd like me to report the fps on the master branch, with OpenGL enabled and the vsync code commented out, with smaller image sizes?
On Windows, just having a RawImageGLWidget instantiated (i.e. selectable in the VideoSpeedTest GUI but not actually selected) affects the frame rate: it slows down the other 2 widgets. This is true whether vsync is disabled or not. With vsync on, all 3 widgets are capped to the monitor refresh rate. It would be good to establish whether the effect is already present on the master branch.

So using a lower image size is just a means to increase the frame rate on the master branch. To simplify things, you could just disable OpenGL. The fps value isn't that important; the question is whether the jitter (?) effect already exists on the master branch.

OpenGL disabled. Vsync code has no effect. Smaller image size.
I was just looking at this a bit more; maybe the jitter effect isn't as noticeable on the master branch due to the blurring (which I assume is the gaussian filter at work?).
I haven't been able to reproduce this effect on master. I wouldn't trust that finding completely, though; I want a way to programmatically detect this behavior, so I can run bisects and try extensive combinations of image sizes and options. I haven't been able to come up with a way to do that yet.

Yeah, I've been commenting the gaussian filter out on master when I run tests there.

I made attempts with smaller image sizes on master, such that the reported framerates were the same (~150fps) as when I experienced this behavior on fastpath_argb. Really small images (256x256) on master produce a similar effect, but not nearly as consistent or dramatic. They also report much higher fps (1240ish), so it's enough to feel like anything that test shows could be something else entirely. Maybe it only happens on large-ish images when the fps is also high?
I added d7ca338 to #1648 that draws a moving white line that should be easier to visualize. I think the issue is:

Something like the immobile helicopter blades in this video.
@pijyoi I can see the white line flickering and moving up.

Screen.Recording.2021-03-23.at.8.29.21.PM.mov

Screen.Recording.2021-03-23.at.8.30.17.PM.mov
I guess the line would be flickering because you have more pixels in the image than on the screen.
P.S.: For fun, pan the image during the animation. I find it very satisfying that it doesn't stop updating while it is "floating" :)

Another way that doesn't require a code change may be to use non-square images, say 10000x512. (I found out that the .ui file limits the maximum value to 10000.)
random.normal() generates float64 and gets converted to a smaller dtype, so generating all the needed data in a single call uses a lot more memory than necessary. This changes it so that smaller chunks are generated. Data clipping is also changed to be in-place. The gaussian filtering, which gave the video a washed-out look, is removed as well; it also contributed to data generation time.
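A minimal sketch of the chunked approach the commit describes (shapes and distribution parameters here are illustrative, not the example's actual values):

```python
import numpy as np

# Generate frame-by-frame so only one float64 chunk exists at a time,
# instead of materializing the whole stack as float64 up front.
frames, h, w = 8, 256, 256
data = np.empty((frames, h, w), dtype=np.uint8)
for i in range(frames):
    chunk = np.random.normal(loc=128.0, scale=32.0, size=(h, w))
    np.clip(chunk, 0, 255, out=chunk)    # in-place clip, no extra copy
    data[i] = chunk                      # cast to uint8 on assignment
```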
Okay, I have finally reproduced this behavior on master! It took using non-square images, weirdly. 7501x128 on my system (on master, using PySide2) shows the jitter; 128x7501 does not. Gah. This is probably an important clue about the ultimate cause. There's also a 3x difference in reported fps between those two scenarios, so the jitter might still just be tied to fps. Well, whatever, I'm willing to take this as sufficient evidence that the problem exists regardless of the changes in this branch. Sorry to have held up an otherwise excellent PR.
Master branch's VideoSpeedTest leaves the image axis order at its default of column-major. This results in cases where the processing source buffer is F-contiguous while the destination buffer is C-contiguous. This really exercises the memory system and causes seemingly hard-to-explain variations in timings.
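The contiguity mismatch is easy to reproduce in isolation (a standalone illustration, not VideoSpeedTest code):

```python
import numpy as np

img = np.zeros((3072, 3072), dtype=np.uint8)
assert img.flags['C_CONTIGUOUS']        # row-major buffer

col_major = img.T                       # column-major view of the same memory
assert col_major.flags['F_CONTIGUOUS']
assert not col_major.flags['C_CONTIGUOUS']
# Copying between the two orders walks memory with a large stride on
# one side, which is what stresses the cache hierarchy.
```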
If we now have confirmation that this PR doesn't cause new breakage, is this ready to merge?
Yes, ready to merge.
@pijyoi I know I'm sounding like a broken record, but thanks for working on this. As you know, makeARGB is a bottleneck for a lot of folks, so no doubt you have made a positive change here. Merging!


This PR optimizes the channel reordering portion of makeARGB by using the conversion routines of QImage.
It only optimizes for row-major ubyte images and defers to the original code for the rest. The improvement in fps is measurable with VideoSpeedTest.
Would such a PR that optimizes for specific data types be acceptable?
Tagging @outofculture