Bypass makeARGB() for uint8, uint16 and float images #1693
pijyoi wants to merge 42 commits into pyqtgraph:master
|
I don't have a technical comment here, just wanted to say I appreciate these diffs @pijyoi as with your changes, especially with the comments, I feel like I'm starting to get a better handle on the image components of this library, which is an area I really don't use much in my own projects. One thing that I had always wondered was how come we couldn't pass the data a bit more directly to the display, and you seem to have recognized that potential optimization:

    if is_passthru:
        # both levels and lut are None
        # these images are suitable for display directly
        if image.ndim == 2:
            fmt = QtGui.QImage.Format.Format_Grayscale8
        elif image.shape[2] == 3:
            fmt = QtGui.QImage.Format.Format_RGB888
        elif image.shape[2] == 4:
            fmt = QtGui.QImage.Format.Format_RGBA8888

As with most things image related, I'm going to make myself available for testing, but defer to @outofculture for providing much technical feedback. Thanks again for the PR to further improve image performance within the library.
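The passthrough branch quoted above can be tried without Qt installed. The following is only a sketch under stated assumptions: `passthru_format_name` is a hypothetical helper (not part of pyqtgraph) that returns the name of the QImage format as a string, and it assumes that only `uint8` arrays qualify for direct display.

```python
import numpy as np


def passthru_format_name(image):
    """Hypothetical helper: name of a QImage format that could show `image` as-is.

    Returns None when the array would need further processing first.
    """
    if image.dtype != np.uint8:
        return None  # assumption: only 8-bit data is passed through directly
    if image.ndim == 2:
        return "Format_Grayscale8"
    if image.ndim == 3 and image.shape[2] == 3:
        return "Format_RGB888"
    if image.ndim == 3 and image.shape[2] == 4:
        return "Format_RGBA8888"
    return None


gray = np.zeros((4, 5), dtype=np.uint8)
rgb = np.zeros((4, 5, 3), dtype=np.uint8)
rgba = np.zeros((4, 5, 4), dtype=np.uint8)
floats = np.zeros((4, 5), dtype=np.float32)

for arr in (gray, rgb, rgba, floats):
    print(arr.shape, arr.dtype, passthru_format_name(arr))
```

In real code the returned name would be replaced by the corresponding `QtGui.QImage.Format` member, as in the snippet from the diff.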
|
The following table documents some useful combinations and the intended outcomes:
|
|
It turns out that

Windows

    In [83]: indices = np.random.randint(65536, size=(6144, 10240), dtype=np.uint16)
    In [84]: lut = np.random.randint(256, size=65536, dtype=np.uint8)
    In [85]: %timeit lut.take(indices)
    285 ms ± 13.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    In [86]: %timeit lut[indices]
    172 ms ± 209 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

WSL2

    In [7]: indices = np.random.randint(65536, size=(6144, 10240), dtype=np.uint16)
    In [8]: lut = np.random.randint(256, size=65536, dtype=np.uint8)
    In [9]: %timeit lut.take(indices)
    248 ms ± 476 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
    In [10]: %timeit lut[indices]
    101 ms ± 241 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
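For anyone who wants to repeat the comparison outside IPython, here is a small standalone sketch. It uses a much smaller array than the transcripts above so that it finishes quickly, and the variable names `t_take` and `t_index` are just illustrative. It first checks that the two spellings give the same result, then times each one; the numbers will differ by machine and platform.

```python
import timeit

import numpy as np

rng = np.random.default_rng(0)
indices = rng.integers(65536, size=(512, 512), dtype=np.uint16)
lut = rng.integers(256, size=65536, dtype=np.uint8)

# Both spellings perform the same 1-D table lookup.
via_take = lut.take(indices)
via_index = lut[indices]
same = np.array_equal(via_take, via_index)
print("identical results:", same)

# Time each spelling; absolute values depend on the platform.
t_take = timeit.timeit(lambda: lut.take(indices), number=20)
t_index = timeit.timeit(lambda: lut[indices], number=20)
print(f"lut.take(indices): {t_take:.4f} s for 20 runs")
print(f"lut[indices]:      {t_index:.4f} s for 20 runs")
```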
|
I've previously found that much of the Windows/Linux difference is from anything that might call malloc (or whatever the modern equivalent is). What do your timings look like if you pre-allocate an |
|
Well, we could just test it out for

Windows

    In [101]: out = np.ones(indices.shape, dtype=lut.dtype)
    In [103]: %timeit lut.take(indices)
    292 ms ± 14.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    In [104]: %timeit lut.take(indices, out=out)
    300 ms ± 9.92 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

WSL2

    In [11]: out = np.ones(indices.shape, dtype=lut.dtype)
    In [12]: %timeit lut.take(indices)
    246 ms ± 390 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
    In [13]: %timeit lut.take(indices, out=out)
    247 ms ± 933 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
|
Huh! I guess different functions are different? I really wouldn't have predicted a slowdown when pre-allocating, though. When I added the _processingBuffer to ImageItem, it had a significant impact on Windows, but there's no arguing with observation.
|
You can try this on Windows:
This leads to the conclusion that the

    def take_external(a, indices, out=None):
        res = take_internal(a, indices)  # always let numpy allocate
        if out is None:
            out = res
        else:
            out[:] = res
        return out

i.e. the
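One possibly related detail: NumPy's documentation for `take` notes that `out` is always buffered when `mode='raise'` (the default) and suggests other modes for better performance. The sketch below is not from this PR; it only shows how a pre-allocated `out` can be combined with `mode="clip"`. With `uint16` indices into a 65536-entry table no index can be out of range, so clipping changes nothing in the result. Whether it changes the timings on a given platform would need to be measured.

```python
import numpy as np

rng = np.random.default_rng(0)
indices = rng.integers(65536, size=(256, 256), dtype=np.uint16)
lut = rng.integers(256, size=65536, dtype=np.uint8)

# Pre-allocated destination with the same shape as indices and dtype as lut.
out = np.empty(indices.shape, dtype=lut.dtype)

# mode="clip" skips the bounds-error path; safe here because every uint16
# value is a valid index into a 65536-entry table.
lut.take(indices, out=out, mode="clip")

reference = lut[indices]
print("matches plain indexing:", np.array_equal(out, reference))
```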
Grayscale8 and RGB888 images are those that are ready for display without further processing.
you can index (y, x) into a lookup table of shape (nentry, 3) or (nentry, 4) and get an output of shape (y, x, 3) or (y, x, 4)
This reverts commit 45cf310.
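The commit note above about indexing `(y, x)` into a lookup table of shape `(nentry, 3)` can be illustrated with plain NumPy integer-array indexing. This is a minimal sketch with a made-up red-ramp table; `lut_rgb` and `image` are illustrative names only.

```python
import numpy as np

nentry = 256
lut_rgb = np.zeros((nentry, 3), dtype=np.uint8)
lut_rgb[:, 0] = np.arange(nentry)  # red channel ramps from 0 to 255

# A tiny (y, x) = (3, 4) image of table indices.
image = np.arange(12, dtype=np.uint8).reshape(3, 4)

# Indexing the first axis of the table with a 2-D index array appends the
# table's trailing axis to the index shape: (3, 4) -> (3, 4, 3).
rgb = lut_rgb[image]
print(rgb.shape)
```

A table of shape `(nentry, 4)` would give an output of shape `(3, 4, 4)` in the same way.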
|
my goodness @pijyoi .... I'm going to leave it to @outofculture to comment on the specifics of the PR here ... but these performance improvements 😍 Would it be of interest/beneficial to get benchmark info on one of the Apple ARM M1 processors? I do have access to a machine now that I could run the benchmark suite on. Qt will be releasing a macOS Universal Binary (native ARM support) version for Qt 6.2, but I believe numpy has native support right now. While out of scope for this PR, and only loosely related, we should add a blurb in the documentation for performance considerations when calling
Not particularly useful. The render timings table was more of a sanity check. That said, the render timings are a pure computation measurement and don't take into account the painting step. The latter, I believe, would be quite dependent on the platform.
Off-hand (will update if I think of more):
|
every one of the four branches now does its own return. this makes it easier to follow.
|
Okay, I'm back. Is this ready @pijyoi? |
|
Yes it's ready.
On Wed, 19 May 2021, 05:49 Martin Chase wrote:
> Okay, I'm back. Is this ready @pijyoi?
|
|
So, the cupy-enabled benchmarks are showing a ~2-4x slowdown on this branch, depending on argument combinations. I'm having a lot of trouble caring, though, seeing as how the numba times often crush the old cupy times... But no, the main branch's cupy deserves to be protected. @pijyoi it sounds like working on cupy isn't convenient for you? I have time to take this on, if you want to just hand it off to me. |
Yes, please do. |
|
I suppose not converting the lut to a cupy ndarray is an oversight. The conversion is done in makeARGB():

    if lut is not None and not isinstance(lut, xp.ndarray):
        lut = xp.array(lut)

Both benchmarks/renderImageItem.py and examples/VideoSpeedTest.py create luts in cuda memory, that's why they work.
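The conversion guard quoted from makeARGB() can be exercised without a GPU. In this sketch `xp` falls back to NumPy when CuPy cannot be imported, and `ensure_xp_lut` is a hypothetical wrapper name used only for illustration, not a pyqtgraph function.

```python
import numpy as np

try:
    import cupy as xp  # used only if CuPy is installed
except ImportError:
    xp = np  # fallback so the sketch also runs on CPU-only machines


def ensure_xp_lut(lut):
    """Hypothetical helper: make sure `lut` lives in the array module `xp`."""
    if lut is not None and not isinstance(lut, xp.ndarray):
        lut = xp.array(lut)
    return lut


converted = ensure_xp_lut([0, 64, 128, 255])  # plain list gets converted
untouched = ensure_xp_lut(None)               # None is passed through
print(type(converted), untouched)
```

Callers that already hold a lut in the right memory space, like the two scripts mentioned above, pass through the guard unchanged.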
|
This got merged with #1786 |
This PR supersedes #1668.
8-bit grayscale images and 256-entry lut colormap images can skip makeARGB() entirely by using Format_Grayscale8 and Format_Indexed8 respectively instead of Format_ARGB32.
For such a use case, folding a levels + lut combination into a lut-only one becomes an optimization and would be the fastest codepath available.
Sample program to compare performance against master.
Try changing the colormap and changing the levels.
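The idea of turning a levels + lut combination into a lut-only lookup for 8-bit images can be sketched as follows. This is not the PR's code: the function names are hypothetical, and the linear rescale used here is an assumed convention that may differ from pyqtgraph's exact rounding. The point is only that a `uint8` image has 256 possible values, so the levels step can be applied once to those 256 values instead of to every pixel.

```python
import numpy as np


def rescale_to_lut_index(values, levels, nlut):
    """Assumed convention: map [lo, hi) linearly onto lut indices 0..nlut-1."""
    lo, hi = levels
    scaled = (values.astype(np.float64) - lo) * (nlut / (hi - lo))
    return np.clip(scaled, 0, nlut - 1).astype(np.intp)


def fold_levels_into_lut(levels, lut):
    """Build a 256-entry table equivalent to applying levels, then lut."""
    all_inputs = np.arange(256, dtype=np.uint8)
    return lut[rescale_to_lut_index(all_inputs, levels, len(lut))]


rng = np.random.default_rng(0)
image = rng.integers(256, size=(32, 48), dtype=np.uint8)
lut = rng.integers(256, size=(512, 4), dtype=np.uint8)
levels = (20, 200)

# Two-step path: rescale every pixel, then look up.
two_step = lut[rescale_to_lut_index(image, levels, len(lut))]

# One-step path: rescale 256 values once, then look up with the raw image.
folded = fold_levels_into_lut(levels, lut)
one_step = folded[image]

print(folded.shape, np.array_equal(two_step, one_step))
```

The folded table has exactly 256 entries, which is the size an indexed 8-bit image format expects for its color table.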