Bypass makeARGB() for uint8, uint16 and float images #1693

Closed
pijyoi wants to merge 42 commits into pyqtgraph:master from pijyoi:bypass_makeargb

Conversation

pijyoi (Contributor) commented Apr 5, 2021

This PR supersedes #1668.
8-bit grayscale images and 256-entry lut colormap images can skip makeARGB() entirely by using Format_Grayscale8 and Format_Indexed8 respectively, instead of Format_ARGB32.
For such use cases, fusing the levels + lut combination into a single lut becomes an optimization and is the fastest codepath available.
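As a rough sketch of what that fusion means for uint8 data, in plain numpy (illustrative only; the helper name is mine and not from this PR):

```python
import numpy as np

def fuse_levels_into_lut(levels, lut):
    """Fold a (lo, hi) levels pair into a 256-entry lut for uint8 input.

    After fusion, rendering a uint8 image is a single fancy-indexing
    operation: out = fused[image].
    """
    lo, hi = levels
    scale = 255.0 / (hi - lo)
    # map each possible uint8 value through the levels rescale first
    idx = np.clip((np.arange(256) - lo) * scale, 0, 255).astype(np.uint8)
    return lut[idx]

# usage: a grayscale ramp lut with levels (64, 192)
lut = np.arange(256, dtype=np.uint8)
fused = fuse_levels_into_lut((64, 192), lut)
image = np.array([[0, 64, 192, 255]], dtype=np.uint8)
out = fused[image]  # one lookup instead of rescale + lookup
```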

Sample program to compare performance against master.
Try changing the colormap and changing the levels.

import pyqtgraph as pg
import numpy as np
pg.setConfigOptions(imageAxisOrder='row-major')
app = pg.mkQApp()
imv = pg.ImageView()
imv.show()
size = (8192, 8192)
data = np.random.randint(256, size=size, dtype=np.uint8)
imv.setImage(data)
app.exec_()

@pijyoi pijyoi force-pushed the bypass_makeargb branch from 3f3dff3 to eddb2e2 Compare April 6, 2021 06:14
j9ac9k (Member) commented Apr 6, 2021

I don't have a technical comment here; I just wanted to say I appreciate these diffs, @pijyoi. With your changes, and especially the comments, I feel like I'm starting to get a better handle on the image components of this library, an area I really don't use much in my own projects.

One thing I had always wondered about was why we couldn't pass the data a bit more directly to the display, and you seem to have recognized that potential optimization:

    if is_passthru:
        # both levels and lut are None
        # these images are suitable for display directly
        if image.ndim == 2:
            fmt = QtGui.QImage.Format.Format_Grayscale8
        elif image.shape[2] == 3:
            fmt = QtGui.QImage.Format.Format_RGB888
        elif image.shape[2] == 4:
            fmt = QtGui.QImage.Format.Format_RGBA8888

As with most things image related, I'm going to make myself available for testing, but defer to @outofculture for providing much technical feedback.

Thanks again for the PR to further improve image performance within the library.

pijyoi (Contributor, Author) commented Apr 8, 2021

The following table documents some useful combinations and the intended outcomes:

  1. If levels are 2d, it will take the extra-slow path.
  2. If levels are 1d and the user supplied an alpha channel, the levels also get applied to the alpha channel; this is probably erroneous usage.
  3. Multichannel images are not supposed to have a user-supplied lut.
  4. The table below refers to floats without NaNs.
| data   | channels | lvl_1d | lut | fmt                                       | remarks |
|--------|----------|--------|-----|-------------------------------------------|---------|
| uint8  | 1,3,4    | N      | N   | Grayscale8, RGB888, RGBA8888 respectively |         |
| uint8  | 1        | Y      | N   | Indexed8                                  |         |
| uint8  | 1        | *      | Y   | Indexed8                                  |         |
| uint8  | 3        | Y      | N   | RGB888                                    |         |
| uint16 | 1        | N      | N   | Grayscale16                               | if Qt >= 5.13 |
| uint16 | 3        | N      | N   | RGB888                                    | handled as levels=[0, 65535] |
| uint16 | 4        | N      | N   | RGBA64                                    |         |
| uint16 | 1        | Y      | N   | Grayscale8                                |         |
| uint16 | 1        | *      | Y   | Indexed8                                  | for lut with <= 256 entries |
| uint16 | 1        | *      | Y   | Grayscale8, RGBX8888, RGBA8888            | for lut with > 256 entries; levels and colors lut combination kicks in |
| uint16 | 3        | Y      | N   | RGB888                                    |         |
| float  | 1,3,4    | Y      | N   | Grayscale8, RGB888, RGBA8888 respectively |         |
| float  | 1        | Y      | Y   | Indexed8                                  | for lut with <= 256 entries |
| float  | 1        | Y      | Y   | Grayscale8, RGBX8888, RGBA8888            | for lut with > 256 entries |
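For instance, the uint16 row with 1d levels and no lut (Grayscale8 output) amounts to a clip-and-rescale. A plain-numpy sketch (the helper name is mine, not the PR's):

```python
import numpy as np

def rescale_to_grayscale8(image, levels):
    """Rescale a uint16 image through (lo, hi) levels into uint8,
    suitable for a QImage with Format_Grayscale8."""
    lo, hi = levels
    scale = 255.0 / (hi - lo)
    # compute in float32 to avoid uint16 overflow, then clip and narrow
    out = (image.astype(np.float32) - lo) * scale
    return np.clip(out, 0, 255).astype(np.uint8)

img = np.array([[0, 1000, 2000, 65535]], dtype=np.uint16)
g8 = rescale_to_grayscale8(img, (1000, 2000))
```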

@pijyoi pijyoi force-pushed the bypass_makeargb branch from b269ee6 to dda5bb2 Compare April 9, 2021 02:30
@pijyoi pijyoi marked this pull request as ready for review April 9, 2021 02:43
pijyoi (Contributor, Author) commented Apr 9, 2021

It turns out that lut.take(indices) is slower than lut[indices].
As usual, Linux runs faster than Windows on the same machine.

Windows

In [83]: indices = np.random.randint(65536, size=(6144, 10240), dtype=np.uint16)
In [84]: lut = np.random.randint(256, size=65536, dtype=np.uint8)
In [85]: %timeit lut.take(indices)
285 ms ± 13.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [86]: %timeit lut[indices]
172 ms ± 209 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

WSL2

In [7]: indices = np.random.randint(65536, size=(6144, 10240), dtype=np.uint16)
In [8]: lut = np.random.randint(256, size=65536, dtype=np.uint8)
In [9]: %timeit lut.take(indices)
248 ms ± 476 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [10]: %timeit lut[indices]
101 ms ± 241 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
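The two indexing forms produce identical results, so the choice is purely about speed. A smaller self-contained comparison (array sizes reduced from the numbers above) can be run anywhere:

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
indices = rng.integers(65536, size=(512, 512), dtype=np.uint16)
lut = rng.integers(256, size=65536, dtype=np.uint8)

# verify the two forms agree before timing them
same = np.array_equal(lut.take(indices), lut[indices])

t_take = timeit.timeit(lambda: lut.take(indices), number=20)
t_fancy = timeit.timeit(lambda: lut[indices], number=20)
```

Which form wins, and by how much, is platform- and numpy-version-dependent, as the timings above show.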

outofculture (Contributor):

I've previously found that much of the Windows/Linux difference comes from anything that might call malloc (or whatever the modern equivalent is). What do your timings look like if you pre-allocate and pass out=out? Although, I don't actually know how to call lut[indices] with an output; lut.__getitem__(indices) won't accept out=out...

pijyoi (Contributor, Author) commented Apr 9, 2021

Well, we can just test np.take() with and without pre-allocation.

Windows

In [101]: out = np.ones(indices.shape, dtype=lut.dtype)
In [103]: %timeit lut.take(indices)
292 ms ± 14.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [104]: %timeit lut.take(indices, out=out)
300 ms ± 9.92 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

WSL2

In [11]: out = np.ones(indices.shape, dtype=lut.dtype)
In [12]: %timeit lut.take(indices)
246 ms ± 390 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [13]: %timeit lut.take(indices, out=out)
247 ms ± 933 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

outofculture (Contributor):

Huh! I guess different functions are different? I really wouldn't have predicted a slowdown from pre-allocating, though. When I added the _processingBuffer to ImageItem, it had a significant impact on Windows, but there's no arguing with observation.

pijyoi (Contributor, Author) commented Apr 10, 2021

You can try this on Windows:
Open Task Manager and watch the memory usage of the Python process.
While running %timeit lut.take(indices, out=out), memory usage increases during the run.
While running %timeit np.add(indices, 0, out=indices), memory usage does not increase during the run.

This leads to the conclusion that the out parameter in np.take() is implemented as:

def take_external(a, indices, out=None):
    res = take_internal(a, indices)    # always let numpy allocate
    if out is None:
        out = res
    else:
        out[:] = res
    return out

i.e. the out parameter exists purely to present a uniform interface consistent with the rest of the numpy library.
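Whatever the internal allocation story, one behaviour that is easy to verify is that take with out= fills and returns the pre-allocated buffer, with values identical to plain fancy indexing:

```python
import numpy as np

rng = np.random.default_rng(1)
lut = rng.integers(256, size=65536, dtype=np.uint8)
indices = rng.integers(65536, size=(256, 256), dtype=np.uint16)

out = np.empty(indices.shape, dtype=lut.dtype)
res = np.take(lut, indices, out=out)

# the returned array is the pre-allocated buffer itself,
# and its contents match lut[indices]
assert res is out
assert np.array_equal(out, lut[indices])
```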

j9ac9k (Member) commented May 10, 2021

my goodness @pijyoi .... I'm going to leave it to @outofculture to comment on the specifics of the PR here ... but these performance improvements 😍

Would it be of interest/beneficial to get benchmark info on one of the Apple ARM M1 processors? I do have access to a machine now that I could run the benchmark suite. Qt will be releasing a macOS Universal Binary (native ARM support) version for Qt 6.2, but I believe numpy has native support right now.

While out of scope for this PR, but loosely related: we should add a blurb in the documentation about performance considerations when calling ImageItem.setImage. While not all use cases allow for flexibility here, many do, and it would be good to provide some guidance...

pijyoi (Contributor, Author) commented May 11, 2021

> Would it be of interest/beneficial to get benchmark info on one of the Apple ARM M1 processors? I do have access to a machine now that I could run the benchmark suite. Qt will be releasing a macOS Universal Binary (native ARM support) version for Qt 6.2, but I believe numpy has native support right now.

Not particularly useful. The render-timings table was more of a sanity check. That said, render timing is a pure computation measurement and doesn't take the painting step into account; the latter, I believe, would be quite platform-dependent.

> While out of scope for this PR, but loosely related: we should add a blurb in the documentation about performance considerations when calling ImageItem.setImage. While not all use cases allow for flexibility here, many do, and it would be good to provide some guidance...

Off-hand (will update if I think of more):

  1. Use row-major order.
    • instantiate as ImageItem(axisOrder='row-major'), or
    • pg.setConfigOption('imageAxisOrder', 'row-major')
  2. Call ImageItem.setImage(data, autoLevels=False).
    • autoLevels is an "opt-out" parameter; it defaults to True if levels is not also set
  3. Use C-contiguous image data.
  4. Use a recent version of numpy (1.20 has SIMD improvements, noticeable on Linux platforms).
  5. Enable numba with pg.setConfigOption('useNumba', True).
    • won't be useful for uint8 image data
    • not useful if you are only displaying 1 image, since the JIT overhead is quite large (and we didn't enable JIT caching on disk)
    • useful on Windows, less useful on Linux with the new numpy 1.20
  6. If using floating-point image data, prefer float32 to float64.
    • many examples use np.random.{random, normal}, which return float64
  7. If using color lookup tables, use <= 256 entries.
    • the colormaps included with pyqtgraph have 256 entries
    • the ImageView widget will, however, interpolate to 512 entries for non-uint8 images
  8. Avoid using different levels per channel for RGB images; it is an unoptimized codepath.
  9. Avoid NaNs in float data; it is an unoptimized codepath.
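The C-contiguity tip is easy to check and fix in numpy before handing data to setImage; a transposed view, for example, is not C-contiguous and can be repacked:

```python
import numpy as np

# a transpose is a strided view: same buffer, no longer C-contiguous
data = np.zeros((1024, 768), dtype=np.uint8).T
print(data.flags['C_CONTIGUOUS'])   # False

# repack into a fresh C-contiguous buffer before display
fixed = np.ascontiguousarray(data)
print(fixed.flags['C_CONTIGUOUS'])  # True
```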

@pijyoi pijyoi changed the title Bypass makeARGB() for uint8 and uint16 images Bypass makeARGB() for uint8, uint16 and float images May 12, 2021
outofculture (Contributor):

Okay, I'm back. Is this ready @pijyoi?

pijyoi (Contributor, Author) commented May 18, 2021 via email

outofculture (Contributor):

So, the cupy-enabled benchmarks are showing a ~2-4x slowdown on this branch, depending on argument combinations. I'm having a lot of trouble caring, though, seeing as how the numba times often crush the old cupy times... But no, the main branch's cupy deserves to be protected. @pijyoi it sounds like working on cupy isn't convenient for you? I have time to take this on, if you want to just hand it off to me.

pijyoi (Contributor, Author) commented May 19, 2021

> it sounds like working on cupy isn't convenient for you? I have time to take this on, if you want to just hand it off to me.

Yes, please do.
Not only am I unfamiliar with cupy, but the machine on which I do have an Nvidia card is on PCIe 1.0, which is surely not representative of current machines.

pijyoi (Contributor, Author) commented May 19, 2021

I suppose not converting the lut to a cupy ndarray is an oversight; makeARGB() does it like this:

    if lut is not None and not isinstance(lut, xp.ndarray):
        lut = xp.array(lut)

Both benchmarks/renderImageItem.py and examples/VideoSpeedTest.py create luts in CUDA memory; that's why they work.
ImageView.py would probably fail.
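A hedged sketch of such a guard (the function name and the numpy fallback are mine, not pyqtgraph's): keep the lut in the same array namespace as the image, so a host-memory lut gets copied over when the image is a cupy array.

```python
import numpy as np

try:
    import cupy
except ImportError:
    cupy = None  # cupy is optional; fall back to numpy-only behaviour

def match_lut_namespace(image, lut):
    """Return (xp, lut) where lut lives in the same array module
    (numpy or cupy) as the image."""
    xp = cupy if (cupy is not None and isinstance(image, cupy.ndarray)) else np
    if lut is not None and not isinstance(lut, xp.ndarray):
        lut = xp.asarray(lut)
    return xp, lut

# usage: a plain-list lut is promoted to the image's namespace
xp, lut = match_lut_namespace(np.zeros((4, 4)), [10, 20, 30])
```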

@outofculture outofculture mentioned this pull request May 19, 2021
outofculture (Contributor):

This got merged with #1786
