
Optimize makeARGB for ubyte images#1617

Merged
j9ac9k merged 4 commits into pyqtgraph:master from pijyoi:fastpath_argb
Mar 26, 2021

Conversation

@pijyoi
Contributor

@pijyoi pijyoi commented Mar 2, 2021

This PR optimizes the channel reordering portion of makeARGB by using the conversion routines of QImage.

It only optimizes for row-major ubyte images and defers to the original code for the rest. The improvement in fps is measurable with VideoSpeedTest.

Would such a PR that optimizes for specific data types be acceptable?
Tagging @outofculture

@outofculture
Contributor

Would such a PR that optimizes for specific data types be acceptable?

Yes! I feel like the long-term plan for makeARGB should be something like "determine the shape of the data+args, send off to a purpose-built function". The general nature of that function has made optimization too complicated for me to think about.

@j9ac9k
Member

j9ac9k commented Mar 2, 2021

Would such a PR that optimizes for specific data types be acceptable?

Yes! I feel like the long-term plan for makeARGB should be something like "determine the shape of the data+args, send off to a purpose-built function". The general nature of that function has made optimization too complicated for me to think about.

Looking at the function, I see a few places for some optimizations (not counting numba, which is 100% on the table IMO); but I'm not feeling comfortable making a go at it until

  1. We're confident that the current function returns the correct results
  2. We have more parametrized test cases

@outofculture
Contributor

Huge improvement, by the way. Using 2048x2048, my system sees a 4.5x (py3.7+qt5) and 8x (py3.9+pyside2) improvement. (Aside: I should really set up ASV to consolidate and standardize these measurements.)

Adding a Scale (--levels 0,255) or LUT drops the improvement to 1.5x. It's still an improvement, so no complaints. I don't grok the nature of this PR sufficiently to guess why it loses so much with the LUT.

@pijyoi
Contributor Author

pijyoi commented Mar 4, 2021

Specifying levels (even if equal to the dtype min,max) will trigger #1590.

@pijyoi
Contributor Author

pijyoi commented Mar 14, 2021

Rebased to include #1630 and #1632
#1630 makes it such that adding "--levels vmin,vmax" will not trigger #1590. If vmin,vmax are the dtype limits, then rescale will get bypassed entirely.
#1632 improves scaling performance for all dtypes.

@j9ac9k
Member

j9ac9k commented Mar 14, 2021

@outofculture I would review this, but given the complexity and how much this impacts your work, I figure I'll give you the honors, haha.

@pijyoi thanks for keeping at this.

data = xp.empty((frames,) + frame_shape, dtype=dt)
view = data.reshape((-1,) + chan_shape)
for idx in range(view.shape[0]):
subdata = xp.random.normal(loc=loc, scale=scale, size=chan_shape)
Contributor

@outofculture outofculture Mar 15, 2021

This omits the gaussian blurring of the previous code. I'm not sure if that was in there for any important reason. I read through the history, and it has always been a part of this code, and never with explanation. It does visibly alter the image, which is maybe more characteristic of e.g. microscope images. I didn't notice any fps differences either way, so it might be safest to leave it in place.

Contributor Author

The gaussian blurring is the function that slows down array creation the most. This becomes observable for larger matrices like 2048x2048, and large matrices are needed to make the profiling timings more measurable. It also makes changing the size on-the-fly from the GUI very slow, which makes experimentation harder.

It doesn't affect the frame rate, but removing it makes experimenting easier.

Contributor

That's legit. Could you leave something in as a comment for now? I'll dig a bit further to see if there's ever a good cause for smoothing things first.

Contributor Author

I took a look at pg.gaussianFilter(). I counted at least 4 temporary copies of the input data in-flight within that function. So for a 4096x4096 uint8 array, the initial random float64 data would be 128MB, and while passing through pg.gaussianFilter(), another 512MB would be allocated.
This would be rather surprising to a user who thought their array was only 16MB.
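The quoted figures check out as simple arithmetic (reading MB loosely as MiB):

```python
# Sizes for the 4096x4096 example quoted above.
n = 4096 * 4096                 # number of elements

uint8_bytes = n * 1             # the user's uint8 array: 16 MiB
float64_bytes = n * 8           # the initial random float64 data: 128 MiB
temp_bytes = 4 * float64_bytes  # ~4 float64 temporaries in-flight: 512 MiB
```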

ain_view = ain[row_beg:row_end, ...]
aout_view = aout[row_beg:row_end, ...]
qimg = QtGui.QImage(ain_view, ncols, ain_view.shape[0], ain.strides[0], in_fmt)
qimg = qimg.convertToFormat(out_fmt)
Contributor

Nothing else in makeARGB does batching, though maybe more should. How did you decide on the batch size of 512 KPixels? Does this perform well on Windows? (I'm unable to access my Windows-based test systems today.) Is this a special case that makes batching particularly valuable, or is it worth exploring this strategy elsewhere?

Contributor Author

The primary purpose of the batching here was to limit the extra memory used by the fast path, since QImage.convertToFormat() returns a new image.
For example, without batching, if the input data were 4096x4096 uint8 (16MB), the ARGB output of QImage.convertToFormat() would temporarily consume another 64MB. The idea was that the user should not have to pay for that extra memory.

In this case, being cache-friendly was a secondary consideration. However, 512KBytes was chosen as a value that hopefully fits within common processor L2 cache sizes, while not being so small that the looping overhead becomes dominant. But I would agree with you that it looks like a magic value.

As I said, here it's less about performance than about reducing memory usage. After the speed-up here, much larger bottlenecks elsewhere would dwarf any minor gains from further tweaking.

I was trying out this batching strategy within rescaleData() where it seemed that the multiple operations operating on the same data would benefit, until I found that np.clip() was the issue. I will revisit the batching there some other time.

In my opinion, the benefit has to be large enough to warrant the additional code complexity / ugliness.
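The batching strategy under discussion can be sketched without Qt: split the rows into fixed-size batches and convert each batch on its own, so any temporary the conversion allocates stays batch-sized. `convert_in_batches` and `convert_rows` are hypothetical names; in the PR, the per-batch work is the QImage format conversion.

```python
import numpy as np

def convert_in_batches(ain, convert_rows, batch_bytes=512 * 1024):
    """Apply convert_rows() to row-major data in ~batch_bytes row batches.

    Sketch only: any temporary that convert_rows() allocates (in the PR,
    the new image returned by QImage.convertToFormat) is only batch-sized
    rather than image-sized.
    """
    aout = np.empty_like(ain)
    rows_per_batch = max(1, batch_bytes // ain.strides[0])
    nrows = ain.shape[0]
    for row_beg in range(0, nrows, rows_per_batch):
        row_end = min(row_beg + rows_per_batch, nrows)
        aout[row_beg:row_end] = convert_rows(ain[row_beg:row_end])
    return aout
```

For example, with `convert_rows=lambda x: x[..., ::-1]` (a BGRA-to-ARGB-style channel reversal), the batched result matches the unbatched one.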

Contributor

Thanks for explaining all that.

Batching might have a pretty substantial performance impact in this case. On a whim, I played with the batch size while processing a 3072x3072 test, and saw:

32-128k: 80fps
256-1024k: 90fps
2048k: 70fps
no batching: 50fps

So at least by that measure, you picked a pretty good magic number! I've only got 256KB L2 cache here, so I don't know how this performs under different conditions.

Contributor

Regarding memory concerns, I just noticed these docs, which say the conversion can be done in place. It works to replace the last line referenced above with:

        qimg.convertTo(out_fmt)

Batching might still be desired here; my system still reports a significantly higher fps when batching at 512k. Yet I also just now observed that the PySide2 jitter and slowness go away when batching is disabled. So maybe batching is bad somehow? Ugh, this is still confusing me.

Member

@j9ac9k j9ac9k Mar 17, 2021

Looks like convertTo was introduced in Qt 5.13; as we still support 5.12, this might be problematic (but we can certainly incorporate it within the library and give 5.12 folks an alternative path, even if it's slower).

http://doc.qt.io/qt-5/qimage.html#convertTo

@pijyoi
Contributor Author

pijyoi commented Mar 16, 2021

I implemented rescaleData_blocked() and tested it on Windows (AMD octa-core) and Linux (Intel dual-core).
VideoSpeedTest.py --size=1024x1024 --levels=1,254

Variation 1: using umath.minimum() and umath.maximum()
Variation 2: using umath.clip()

a) rescaleData_blocked() using Variation 1 did not perform any better on either Windows or Linux.
b) rescaleData_blocked() using Variation 2 did not perform any better on Windows, but performed measurably better on Linux.
c) rescaleData() performed similarly whether using Variation 1 or Variation 2, on both Windows and Linux.

The (b) observation might be due to the different CPUs used for Windows vs Linux.

rescaleData_blocked() is not limited to ubyte images but it does assume that the data is row-major since it loops on the 1st dimension.

@j9ac9k
Member

j9ac9k commented Mar 16, 2021

rescaleData_blocked() is not limited to ubyte images but it does assume that the data is row-major since it loops on the 1st dimension.

I've been thinking about this looping: would it be feasible to make use of np.apply_along_axis(func1d, 1 if pg.getConfigOption('imageAxisOrder') == 'col-major' else 0)?

https://numpy.org/doc/stable/reference/generated/numpy.apply_along_axis.html#numpy.apply_along_axis

Sorry for the drive by comment, looking at the loops, and handling the transpose, I've always thought this might be a good candidate for apply_along_axis and potentially regain some vectorization.

EDIT: just realized we would want np.apply_over_axes due to ndim being 3.

@pijyoi
Contributor Author

pijyoi commented Mar 16, 2021

Thanks @j9ac9k for the input! It reminded me of np.nditer. Now the batching code doesn't look ugly anymore, and C and F order are handled without any effort. rescaleData_blocked could fully replace rescaleData.

I noticed that the batching size used was 8192 elements, so it seems to be targeting the L1 cache size instead.
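A minimal sketch of what an np.nditer-based blocked rescale could look like; the function name, the float64 intermediate, and clipping to the full dtype range are assumptions for illustration, not the actual rescaleData_blocked implementation:

```python
import numpy as np

def rescale_blocked(data, scale, offset, dtype, blocksize=8192):
    # A buffered nditer hands back ~blocksize-element chunks and
    # iterates C- and F-ordered inputs without any special-casing.
    out = np.empty_like(data, dtype=dtype)
    lo, hi = np.iinfo(dtype).min, np.iinfo(dtype).max
    it = np.nditer([data, out],
                   flags=['external_loop', 'buffered'],
                   op_flags=[['readonly'], ['writeonly']],
                   buffersize=blocksize)
    with it:  # context manager ensures buffered write-back
        for chunk_in, chunk_out in it:
            tmp = (chunk_in.astype(np.float64) - offset) * scale
            np.minimum(tmp, hi, out=tmp)   # clip via minimum/maximum
            np.maximum(tmp, lo, out=tmp)
            chunk_out[...] = tmp           # cast back to target dtype
    return out
```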

@j9ac9k
Member

j9ac9k commented Mar 16, 2021

Thanks @j9ac9k for the input! It reminded me of np.nditer. Now the batching code doesn't look ugly anymore, and C and F order are handled without any effort. rescaleData_blocked could fully replace rescaleData.

I noticed that the batching size used was 8192 elements, so it seems to be targeting the L1 cache size instead.

oh wow, that is a lot cleaner, you're right.

@outofculture
Contributor

I'm having trouble quantifying something that showed up while I was testing this (sorry I'm not a better QA person). When using PySide2 to process a 3072-square image, the reported fps is high, but my perceived fps is much, much lower, and inconsistent at that. PyQt5 (90fps) shows smoothly flickering static; PySide2 (180fps) shows a jerky, slow mess. This behavior goes away on master (20fps). It also goes away when I increase the image size to 4096-square (110fps), decrease it to 512-square (1050fps), or use cupy (110fps). This is all on Linux; if need be, I can dump all relevant versions and hardware stats here.

My one thought is that this is just an artifact of the framerate being higher than the actual refresh rate of my screen. That's not exactly corroborated by the other measurements, which are nearly all higher than 60fps. I still want to research what happens when I vsync-throttle this operation, but I wanted to alert you to this behavior in the meantime.

We're also going to need to come up with a better way of measuring perceived fps, eventually.

@j9ac9k
Member

j9ac9k commented Mar 17, 2021

I'm having trouble quantifying something that showed up while I was testing this (sorry I'm not a better QA person).

Is this repeatable? I wouldn't mind seeing if I can give this a go on my local machines; I have a high refresh rate monitor for my laptop external display, so easy for me to see if things are different on my 60 fps display vs. my 144 hz display.

We're also going to need to come up with a better way of measuring perceived fps, eventually.

Not sure if this would help, but I was experimenting with a rolling average for computing FPS using collections.deque; I create a deque with a max size (like 1,000) and keep appending frame times to it. Every ~50 frames (there's probably a better way of doing it) I calculate the average elapsed time and update the FPS display with the new value.

You can see my implementation here:
https://github.com/pyqtgraph/pyqtgraph/pull/1610/files
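The scheme reads roughly like this sketch (the class name and details are made up for illustration, not the code from #1610):

```python
import time
from collections import deque

class RollingFps:
    """Rolling-average FPS over the last `maxlen` frame timestamps."""

    def __init__(self, maxlen=1000):
        self._stamps = deque(maxlen=maxlen)  # old stamps fall off the left

    def tick(self, now=None):
        # Record one frame; `now` is injectable to make testing easy.
        self._stamps.append(time.perf_counter() if now is None else now)

    def fps(self):
        if len(self._stamps) < 2:
            return 0.0
        elapsed = self._stamps[-1] - self._stamps[0]
        return (len(self._stamps) - 1) / elapsed
```

Usage would be to call `tick()` once per frame and refresh the on-screen FPS label only every ~50 frames.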

@outofculture
Contributor

Is this repeatable?

I don't have any development systems right now except for my linux laptop, so I don't know. I'm just running python examples/VideoSpeedTest.py --size 3072x3072 for all these numbers. Play with sizes, and maybe aim for getting some of the same reported fps values I was seeing?

I was experimenting w/ a rolling average for computing FPS using the deque module

That might help, but I'm worried we'll have to measure things from outside of python to get a definitive result.

@pijyoi
Contributor Author

pijyoi commented Mar 17, 2021

I switched umath.clip() back to umath.{minimum(),maximum()} in rescaleData_blocked. On another Windows machine, umath.clip() is slower (visibly lower frame rate) despite comments on numpy/numpy#14281 saying otherwise.

I think the part that adds a lot of frame-to-frame variation to the timings is that the rescaleData API returns a freshly allocated array.
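For reference, the two variations compare roughly as below; this sketch uses the public `np.minimum`/`np.maximum`/`np.clip` wrappers, whereas the branch calls the `numpy.core.umath` routines directly:

```python
import numpy as np

def clip_minmax(arr, lo, hi, out=None):
    # Variation 1: two passes, minimum() then maximum()
    out = np.minimum(arr, hi, out=out)
    return np.maximum(out, lo, out=out)

def clip_fused(arr, lo, hi, out=None):
    # Variation 2: a single fused clip() pass
    return np.clip(arr, lo, hi, out=out)
```

Both produce identical results; which is faster turned out to depend on platform and NumPy build.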

@outofculture
Contributor

To investigate, I added an ugly throttle to ImageItem.render:

        # assumes: from datetime import datetime; from time import sleep
        # and that self._lastRender is initialized to None elsewhere
        now = datetime.now()
        if self._lastRender is None or (now - self._lastRender).total_seconds() >= (1 / 60):
            self._lastRender = now
        else:
            sleep(1 / 120)
            return

With this code in place, perceived fps was smooth and fast.

I'm not entirely sure what to do with this information, though. We should probably throttle some function or functions to run not much more often than vsync, regardless, as anything beyond that just wastes CPU time.

One other note: I also checked that using PySide2 on master with a 1024-square image does not produce the jittery slow speeds, despite running at about the same reported fps as the 3072-square gets on this branch. This would suggest that it's not simply PySide2 misbehaving at certain framerates.

@pijyoi
Contributor Author

pijyoi commented Mar 17, 2021

A script just to time rescaleData vs rescaleData_blocked

import time
import numpy as np

from pyqtgraph import functions

shape = 3072, 3072
data = np.random.randint(0, 255, size=shape, dtype=np.uint8)
niters = 10

for scale in [1.1, 1.5]:
    params = data, scale, 1, np.uint8

    t0 = time.perf_counter()
    for _ in range(niters):
        output = functions.rescaleData(*params)
    t1 = time.perf_counter()
    print('A', scale, t1 - t0)

    t2 = time.perf_counter()
    for _ in range(niters):
        output = functions.rescaleData_blocked(*params)
    t3 = time.perf_counter()
    print('B', scale, t3 - t2)

I got:

A 1.1 0.7497713000000001
B 1.1 0.6815833000000002
A 1.5 1.0996170999999997
B 1.5 1.0091117

So rescaleData_blocked is ~10% faster than rescaleData.
What was unexpected was that the clipping time depends on the number of elements that actually get clipped (the larger scale value of 1.5 causes more elements to need clipping).

For comparison, here are the timings for rescaleData when it was using np.clip:

A 1.1 1.8461401000000002
A 1.5 2.1559471999999995

@pijyoi
Contributor Author

pijyoi commented Mar 17, 2021

After running the script on Ubuntu 20.04 and Windows 10, switching between using umath.clip vs umath.{minimum,maximum}, these are the findings (both using numpy 1.20.1):

Using MINMAX on Windows or Linux yields the same observation that clipping timings depend on the number of elements that actually got clipped.

On Windows, using umath.clip yields much worse performance than using MINMAX. Timings are similar to using np.clip.

On Linux, using umath.clip yields much better performance than using MINMAX. In addition, clipping timings do not depend on the number of elements that actually got clipped.

@j9ac9k
Member

j9ac9k commented Mar 17, 2021

@pijyoi is the performance difference big enough that we should try to utilize the optimal method on each platform? I can get some benchmarks on macOS.

@pijyoi
Contributor Author

pijyoi commented Mar 17, 2021

@pijyoi is the performance difference big enough that we should try to utilize the optimal method on each platform? I can get some benchmarks on macOS.

The difference is significant. On Linux, for the scale 1.1, it was 2x. For the scale 1.5, it was 3x.

@pijyoi
Contributor Author

pijyoi commented Mar 17, 2021

I have added in the code to use umath.clip on Linux. @j9ac9k , if you can test for macOS, that would be great.

@pijyoi
Contributor Author

pijyoi commented Mar 17, 2021

I think I understand why clipping in NumPy is slow relative to the other operations within rescaleData().
We perform clipping on float data, so clipping involves float comparisons. To be NaN-correct, NumPy precedes every float comparison with a NaN check.
https://github.com/numpy/numpy/blob/main/numpy/core/src/umath/clip.c.src

On platform runtimes where NaN check is expensive, this will greatly slow down float comparison operations.
Note that NumPy int comparisons are much faster.

In normal C/C++ code, we usually "know" when the data doesn't have NaNs; however, there's no mechanism to let NumPy know that. E.g. our source data was uint8 or uint16, so there can't be any NaNs.
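One practical workaround, in the spirit of the small-integer clipping path added later in this PR: when the source is integer-typed, clip before converting to float, so NumPy uses its integer comparison loops (which carry no per-element NaN checks). `clip_int_source` is a hypothetical name for illustration, not pyqtgraph API:

```python
import numpy as np

def clip_int_source(data, lo, hi):
    # data is integer-typed, so no NaNs can occur; NumPy dispatches
    # to its integer minimum/maximum loops, skipping NaN handling.
    assert np.issubdtype(data.dtype, np.integer)
    out = np.minimum(data, hi)
    return np.maximum(out, lo, out=out)
```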

@j9ac9k
Member

j9ac9k commented Mar 17, 2021

The difference is significant. On Linux, for the scale 1.1, it was 2x. For the scale 1.5, it was 3x.

In that case, we should probably add a clip method to functions.py that selects the code path based on platform.

I have added in the code to use umath.clip on Linux. @j9ac9k , if you can test for macOS, that would be great.

I'll try and get to this later today.

@j9ac9k
Member

j9ac9k commented Mar 17, 2021

@pijyoi Running your benchmark code on macOS 11.2.3 with numpy 1.19.5 and with 1.20.1,

with shape = 3072, 3072 and fast_umath_clip = sys.platform in ['linux']

A 1.1 0.4938556839999999
B 1.1 0.446205808
A 1.5 0.46635472700000014
B 1.5 0.47485443699999985

On 1.20.1, A and B for 1.5 scale were effectively equivalent.

with shape = 3072, 3072 and fast_umath_clip = sys.platform in ['linux', 'darwin']

A 1.1 0.506588347
B 1.1 0.5065373409999999
A 1.5 0.47844122499999986
B 1.5 0.8859278949999998

Let me know if I should be testing something a bit different.

@pijyoi
Contributor Author

pijyoi commented Mar 17, 2021 via email

@pijyoi
Contributor Author

pijyoi commented Mar 18, 2021

On 1.20.1, A and B for 1.5 scale were effectively equivalent.

So, when using umath.{minimum,maximum},

  • Windows and Linux clip timings are affected by the number of elements that actually need clipping.
  • macOS clip timings are unaffected.

with shape = 3072, 3072 and fast_umath_clip = sys.platform in ['linux', 'darwin']
A 1.1 0.506588347
B 1.1 0.5065373409999999
A 1.5 0.47844122499999986
B 1.5 0.8859278949999998

Could you verify whether approximately the same timings are obtained if you run the benchmark a few times?

I have added a small-integer clipping path to rescaleData_blocked(). It runs faster on Windows. Could you please also try that on macOS?

@j9ac9k
Member

j9ac9k commented Mar 18, 2021

On 1.20.1, A and B for 1.5 scale were effectively equivalent.

So, when using umath.{minimum,maximum},

  • Windows and Linux clip timings are affected by the number of elements that actually need clipping.
  • macOS clip timings are unaffected.

with shape = 3072, 3072 and fast_umath_clip = sys.platform in ['linux', 'darwin']
A 1.1 0.506588347
B 1.1 0.5065373409999999
A 1.5 0.47844122499999986
B 1.5 0.8859278949999998

Could you verify whether approximately the same timings are obtained if you run the benchmark a few times?

I pulled your branch, re-added darwin to the list of platforms to run the fast_umath_clip and got the following results:

~/Developer/pyqtgraph fastpath_argb*
pyqtgraph-pyqt5_515-py38 ❯ python perf.py
A 1.1 0.4658261450000001
B 1.1 0.467011818
A 1.5 0.43373908400000016
B 1.5 0.7629562230000002

~/Developer/pyqtgraph fastpath_argb*
pyqtgraph-pyqt5_515-py38 ❯ python perf.py
A 1.1 0.45256862400000003
B 1.1 0.42464828099999996
A 1.5 0.42674245099999997
B 1.5 0.711061419

~/Developer/pyqtgraph fastpath_argb*
pyqtgraph-pyqt5_515-py38 ❯ python perf.py
A 1.1 0.46632524099999995
B 1.1 0.428802864
A 1.5 0.430931647
B 1.5 0.7163260119999999

~/Developer/pyqtgraph fastpath_argb*
pyqtgraph-pyqt5_515-py38 ❯

Removing darwin from the list of platforms to run fast_umath_clip (in other words, running as you have it on your branch): 🔥

~/Developer/pyqtgraph fastpath_argb*
pyqtgraph-pyqt5_515-py38 ❯ python perf.py
A 1.1 0.487717837
B 1.1 0.192059957
A 1.5 0.4247874890000001
B 1.5 0.2040620019999999

~/Developer/pyqtgraph fastpath_argb*
pyqtgraph-pyqt5_515-py38 ❯ python perf.py
A 1.1 0.449667276
B 1.1 0.191115272
A 1.5 0.42235494299999987
B 1.5 0.19903449500000003

~/Developer/pyqtgraph fastpath_argb*
pyqtgraph-pyqt5_515-py38 ❯ python perf.py
A 1.1 0.473250743
B 1.1 0.18906075700000002
A 1.5 0.419320012
B 1.5 0.19165177300000003

...that's quite the improvement.

@j9ac9k
Member

j9ac9k commented Mar 21, 2021

This LGTM, @outofculture you have anything to add?

@outofculture
Contributor

I still can't explain the jittery, slow behavior I see only on this branch, but am I the only person seeing it? If mine is the only system experiencing it, and only for images of a certain size, it starts to feel like a weird problem with my setup. My fear is that this problem will be more widespread for our users, or that it will pop up later and be hard to track back here. Long-term, we need benchmarks that measure the actual visible framerate and that can capture and quantify "jittery". I'll work on those, and hopefully at some later time we can measure this problem, bisect to its cause, and address it then.

Otherwise, this branch contains excellent optimizations.

@j9ac9k
Member

j9ac9k commented Mar 23, 2021

I have replicated the issue @outofculture is describing on macOS across all configurations I have locally (tried switching bindings among pyqt5 5.12-6.0 and pyside2 5.12-6.0, numpy 1.17.5 to 1.20.1, and python 3.7, 3.8 and 3.9). This occurred on both my high refresh rate external display and my 60 fps native display. The only thing left to check is Windows.

The best way I can describe it is there are effectively two images; and one of the two images is shown the bulk of the time, but occasionally the other image is shown momentarily as a flicker.

fastpath_argb

Screen.Recording.2021-03-22.at.10.36.36.PM.mov

master

Screen.Recording.2021-03-22.at.10.37.48.PM.mov

Command line usage I used:

python examples/VideoSpeedTest.py --size 3072x3072

Edit: it goes without saying I'm happy to test alternate configurations and so on, doubly so if it turns out that this issue is most noticeable on macOS.

@pijyoi
Contributor Author

pijyoi commented Mar 23, 2021

I have added a commit to #1648 that draws the white arrows at different positions for different frames. This makes the effect easier to see.

From my testing on Windows, the culprit seems to be related to disabling vsync in #1599.

if RawImageGLWidget is not None:
    # don't limit frame rate to vsync
    sfmt = QtGui.QSurfaceFormat()
    sfmt.setSwapInterval(0)
    QtGui.QSurfaceFormat.setDefaultFormat(sfmt)

@j9ac9k
Member

j9ac9k commented Mar 23, 2021

I commented out the vsync code, and added the code from the arrows, but it largely looks the same. Here is another video in case it helps.

Screen.Recording.2021-03-23.at.8.34.53.AM.mov

Not sure it helps, but on the current master branch, I took the change you made with the arrows; and I commented out this bit:

# if RawImageGLWidget is not None:
#     # don't limit frame rate to vsync
#     sfmt = QtGui.QSurfaceFormat()
#     sfmt.setSwapInterval(0)
#     QtGui.QSurfaceFormat.setDefaultFormat(sfmt)

and reran the test and got this result

Screen.Recording.2021-03-23.at.8.42.37.AM.mov

@pijyoi
Contributor Author

pijyoi commented Mar 23, 2021

Some remarks:

The disable vsync code only kicks in when OpenGL is enabled.

On the fastpath_argb branch, you are running with OpenGL disabled. (i.e. vsync is left enabled by default anyway)
You are getting 70 fps. The frame rate is high enough that multiple arrows seem to appear on screen at the same time.

On the master branch, you are running with OpenGL enabled. (vsync is enabled because you commented out the code)
You are getting 13 fps. The frame rate is slow enough that each different frame is distinct. (Only one arrow appears at a time.)
What happens if you were to reduce the size of the frame here such that the frame rate increases? Does the effect appear on master branch then?

@j9ac9k
Member

j9ac9k commented Mar 23, 2021

Hi @pijyoi, sorry, I was tinkering with a bunch of variations and unintentionally added far more variables into the equation; the last thing I meant to do was add to the noise.

Just so I give helpful information, what configurations are you wanting me to test?

Also, testing configurations with OpenGL is a bit tricky on macOS Big Sur (what I'm running); it's doable, but I have to run effectively a brand-new python version (3.9.1+).

If I understand your comment right, you'd like me to report the fps on the master branch, with openGL enabled, and the vsync code commented out, with smaller image sizes?

@pijyoi
Contributor Author

pijyoi commented Mar 23, 2021

On Windows, just having a RawImageGLWidget instantiated (i.e. selectable in the VideoSpeedTest GUI but not actually selected) affects the frame rate: it slows down the other 2 widgets. This is true whether vsync is disabled or not. With vsync on, all 3 widgets are capped to the monitor refresh rate.
This doesn't seem to happen on Linux. I don't know what to expect on macOS.

It would be good to establish if the effect is already present on master branch.
I believe that fastpath_argb just makes it more visible by:

  1. not using the gaussian filter
  2. increasing the frame rate

So using a lower image size is just a means to increase the frame rate on master branch. To simplify things, you could just disable OpenGL. The fps value isn't that important, the question is whether the jitter (?) effect already exists on master branch.

If I understand your comment right, you'd like me to report the fps on the master branch, with openGL enabled, and the vsync code commented out, with smaller image sizes?

OpenGL disabled. Vsync code has no effect. Smaller image size.

@j9ac9k
Member

j9ac9k commented Mar 23, 2021

OpenGL disabled. Vsync code has no effect. Smaller image size.

master

[screenshot]

fastpath_argb

[screenshot]

@j9ac9k
Member

j9ac9k commented Mar 23, 2021

I was just looking at this a bit more; maybe the jitter effect isn't as noticeable on the master branch due to the blurring (which I assume is the gaussian filter at work?).

@outofculture
Contributor

I haven't been able to reproduce this effect on master. I wouldn't trust that finding completely, though; I want a way to programmatically detect this behavior, so I can run bisects and try extensive combinations of image sizes and options. I haven't been able to come up with a way to do that, though.

1. not using the gaussian filter

Yeah, I've been commenting the gaussian filter out on master when I run tests there.

2. increasing the frame rate

I made attempts with smaller image sizes on master, such that the reported framerates were the same (~150fps) as when I experienced this behavior on fastpath_argb. Really small images (256x256) on master produce a similar effect, but not nearly as consistent or dramatic. They also report much higher fps (1240ish), so anything that test shows could be due to something else entirely.

Maybe it only happens on large-ish images if the fps is also high?

@pijyoi
Contributor Author

pijyoi commented Mar 24, 2021

I added d7ca338 to #1648 that draws a moving white line that should be easier to visualize.

I think the issue is:

  1. we are displaying 3 frames in round-robin only (by default)
  2. if you blit at a frame rate that is not a (constant) factor of the monitor refresh, certain frames don't get a chance to be actually painted on screen.

Something like the immobile helicopter blades in this video.
https://www.youtube.com/watch?v=VXJ0u3ZNdNg

@j9ac9k
Member

j9ac9k commented Mar 24, 2021

@pijyoi I can see the white line flickering and moving up

rescale_blocked branch with 1024x1024 image

Screen.Recording.2021-03-23.at.8.29.21.PM.mov

rescale_blocked branch with 3072x3072 image

Screen.Recording.2021-03-23.at.8.30.17.PM.mov

@NilsNemitz
Contributor

NilsNemitz commented Mar 24, 2021

I guess the line would flicker because you have more pixels in the image than on the screen.
I think the questions are whether

  • the line moves smoothly
  • the line ever jumps back and forth
  • the line crosses the picture in a time that matches resolution and FPS.
    @pijyoi, maybe make a wider white bar so that the (nearest neighbor?) sampling doesn't cause on/off flickering?

P.S.: For fun, pan the image during the animation. I find it very satisfying that it doesn't stop updating while it is "floating" :)

@pijyoi
Contributor Author

pijyoi commented Mar 24, 2021

maybe make a wider white bar so that the (nearest neighbor?) sampling doesn't cause on/off flickering ?

Another way that doesn't require code change may be to use non-square images, say 10000x512. (I found out that the .ui file limits the maximum value to 10000)

pijyoi added 4 commits March 26, 2021 04:07
random.normal() generates as float64 and gets converted to a smaller
dtype. generating all the needed data in a single call thus uses a lot
more memory than is necessary.

this changes it such that smaller chunks are generated.
data clipping is also changed to be in-place.

the gaussian filtering which gave the video a washed-out look is also
removed. this also contributed to data generation time.
@outofculture
Contributor

Okay, I have finally reproduced this behavior on master! It took using non-square images, weirdly. 7501x128 on my system (on master, using pyside2) shows the jitter. 128x7501 does not. Gah. This is probably an important clue about the ultimate cause. There's also a 3x difference in reported fps between those two scenarios, so the jitter might still just be tied to fps.

Well, whatever, I'm willing to take this as sufficient evidence that the problem exists regardless of the changes in this branch. Sorry to have held up an otherwise excellent PR.

@pijyoi
Contributor Author

pijyoi commented Mar 25, 2021

Master branch's VideoSpeedTest leaves the image axis order at the default of column-major. This results in cases where the processing source buffer is F-contiguous while the destination buffer is C-contiguous. This really exercises the memory system and causes seemingly hard-to-explain variations in timings.

@j9ac9k
Member

j9ac9k commented Mar 26, 2021

If we now have confirmation that this PR doesn't cause new breakage, is this ready to merge?

@pijyoi
Contributor Author

pijyoi commented Mar 26, 2021 via email

@j9ac9k
Member

j9ac9k commented Mar 26, 2021

@pijyoi I know I'm sounding like a broken record, but thanks for working on this. As you know, makeARGB is a bottleneck for a lot of folks, so no doubt you have made a positive change here.

Merging changes!

@j9ac9k j9ac9k merged commit 4dc76ed into pyqtgraph:master Mar 26, 2021
@pijyoi pijyoi deleted the fastpath_argb branch March 30, 2021 06:33