Bypass makeARGB() for uint8, uint16 and float images #1693

Closed
pijyoi wants to merge 42 commits into pyqtgraph:master from pijyoi:bypass_makeargb

Conversation

pijyoi (Contributor) commented Apr 5, 2021

This PR supersedes #1668.
8-bit grayscale images and 256-entry lut colormap images can skip makeARGB() entirely by using Format_Grayscale8 and Format_Indexed8 respectively, instead of Format_ARGB32.
For such use cases, fusing the levels + lut combination into a single lut becomes an optimization and is the fastest codepath available.
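As a rough sketch of what that fusion means for uint8 data, in plain numpy (illustrative only; the helper name is mine and not from this PR):

```python
import numpy as np

def fuse_levels_into_lut(levels, lut):
    """Fold a (lo, hi) levels pair into a 256-entry lut for uint8 input.

    After fusion, rendering a uint8 image is a single fancy-indexing
    operation: out = fused[image].
    """
    lo, hi = levels
    scale = 255.0 / (hi - lo)
    # map each possible uint8 value through the levels rescale first
    idx = np.clip((np.arange(256) - lo) * scale, 0, 255).astype(np.uint8)
    return lut[idx]

# usage: a grayscale ramp lut with levels (64, 192)
lut = np.arange(256, dtype=np.uint8)
fused = fuse_levels_into_lut((64, 192), lut)
image = np.array([[0, 64, 192, 255]], dtype=np.uint8)
out = fused[image]  # one lookup instead of rescale + lookup
```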

Sample program to compare performance against master.
Try changing the colormap and changing the levels.

import pyqtgraph as pg
import numpy as np
pg.setConfigOptions(imageAxisOrder='row-major')
app = pg.mkQApp()
imv = pg.ImageView()
imv.show()
size = (8192, 8192)
data = np.random.randint(256, size=size, dtype=np.uint8)
imv.setImage(data)
app.exec_()

@pijyoi pijyoi force-pushed the bypass_makeargb branch from 3f3dff3 to eddb2e2 Compare April 6, 2021 06:14
j9ac9k (Member) commented Apr 6, 2021

I don't have a technical comment here; I just wanted to say I appreciate these diffs, @pijyoi. With your changes, and especially the comments, I feel like I'm starting to get a better handle on the image components of this library, an area I really don't use much in my own projects.

One thing I had always wondered about was why we couldn't pass the data a bit more directly to the display, and you seem to have recognized that potential optimization:

    if is_passthru:
        # both levels and lut are None
        # these images are suitable for display directly
        if image.ndim == 2:
            fmt = QtGui.QImage.Format.Format_Grayscale8
        elif image.shape[2] == 3:
            fmt = QtGui.QImage.Format.Format_RGB888
        elif image.shape[2] == 4:
            fmt = QtGui.QImage.Format.Format_RGBA8888

As with most things image related, I'm going to make myself available for testing, but defer to @outofculture for providing much technical feedback.

Thanks again for the PR to further improve image performance within the library.

pijyoi (Contributor, Author) commented Apr 8, 2021

The following table documents some useful combinations and the intended outcomes:

  1. If levels are 2d, it will take the extra-slow path.
  2. If levels are 1d and the user supplied an alpha channel, the levels also get applied to the alpha channel; this is probably erroneous usage.
  3. Multichannel images are not supposed to have a user-supplied lut.
  4. The table below refers to floats without NaNs.
| data   | channels | lvl_1d | lut | fmt                                       | remarks |
|--------|----------|--------|-----|-------------------------------------------|---------|
| uint8  | 1,3,4    | N      | N   | Grayscale8, RGB888, RGBA8888 respectively |         |
| uint8  | 1        | Y      | N   | Indexed8                                  |         |
| uint8  | 1        | *      | Y   | Indexed8                                  |         |
| uint8  | 3        | Y      | N   | RGB888                                    |         |
| uint16 | 1        | N      | N   | Grayscale16                               | if Qt >= 5.13 |
| uint16 | 3        | N      | N   | RGB888                                    | handled as levels=[0, 65535] |
| uint16 | 4        | N      | N   | RGBA64                                    |         |
| uint16 | 1        | Y      | N   | Grayscale8                                |         |
| uint16 | 1        | *      | Y   | Indexed8                                  | for lut with <= 256 entries |
| uint16 | 1        | *      | Y   | Grayscale8, RGBX8888, RGBA8888            | for lut with > 256 entries; levels and colors lut combination kicks in |
| uint16 | 3        | Y      | N   | RGB888                                    |         |
| float  | 1,3,4    | Y      | N   | Grayscale8, RGB888, RGBA8888 respectively |         |
| float  | 1        | Y      | Y   | Indexed8                                  | for lut with <= 256 entries |
| float  | 1        | Y      | Y   | Grayscale8, RGBX8888, RGBA8888            | for lut with > 256 entries |
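For instance, the uint16 row with 1d levels and no lut (Grayscale8 output) amounts to a clip-and-rescale. A plain-numpy sketch (the helper name is mine, not the PR's):

```python
import numpy as np

def rescale_to_grayscale8(image, levels):
    """Rescale a uint16 image through (lo, hi) levels into uint8,
    suitable for a QImage with Format_Grayscale8."""
    lo, hi = levels
    scale = 255.0 / (hi - lo)
    # compute in float32 to avoid uint16 overflow, then clip and narrow
    out = (image.astype(np.float32) - lo) * scale
    return np.clip(out, 0, 255).astype(np.uint8)

img = np.array([[0, 1000, 2000, 65535]], dtype=np.uint16)
g8 = rescale_to_grayscale8(img, (1000, 2000))
```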

@pijyoi pijyoi force-pushed the bypass_makeargb branch from b269ee6 to dda5bb2 Compare April 9, 2021 02:30
@pijyoi pijyoi marked this pull request as ready for review April 9, 2021 02:43
pijyoi (Contributor, Author) commented Apr 9, 2021

It turns out that lut.take(indices) is slower than lut[indices].
As usual, Linux runs faster than Windows on the same machine.

Windows

In [83]: indices = np.random.randint(65536, size=(6144, 10240), dtype=np.uint16)
In [84]: lut = np.random.randint(256, size=65536, dtype=np.uint8)
In [85]: %timeit lut.take(indices)
285 ms ± 13.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [86]: %timeit lut[indices]
172 ms ± 209 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

WSL2

In [7]: indices = np.random.randint(65536, size=(6144, 10240), dtype=np.uint16)
In [8]: lut = np.random.randint(256, size=65536, dtype=np.uint8)
In [9]: %timeit lut.take(indices)
248 ms ± 476 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [10]: %timeit lut[indices]
101 ms ± 241 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
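The two indexing forms produce identical results, so the choice is purely about speed. A smaller self-contained comparison (array sizes reduced from the numbers above) can be run anywhere:

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
indices = rng.integers(65536, size=(512, 512), dtype=np.uint16)
lut = rng.integers(256, size=65536, dtype=np.uint8)

# verify the two forms agree before timing them
same = np.array_equal(lut.take(indices), lut[indices])

t_take = timeit.timeit(lambda: lut.take(indices), number=20)
t_fancy = timeit.timeit(lambda: lut[indices], number=20)
```

Which form wins, and by how much, is platform- and numpy-version-dependent, as the timings above show.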

outofculture (Contributor):

I've previously found that much of the Windows/Linux difference comes from anything that might call malloc (or whatever the modern equivalent is). What do your timings look like if you pre-allocate and pass out=out? Although, I don't actually know how to call lut[indices] with an output; lut.__getitem__(indices) won't accept out=out...

pijyoi (Contributor, Author) commented Apr 9, 2021

Well, we can just test np.take() with and without pre-allocation.

Windows

In [101]: out = np.ones(indices.shape, dtype=lut.dtype)
In [103]: %timeit lut.take(indices)
292 ms ± 14.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [104]: %timeit lut.take(indices, out=out)
300 ms ± 9.92 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

WSL2

In [11]: out = np.ones(indices.shape, dtype=lut.dtype)
In [12]: %timeit lut.take(indices)
246 ms ± 390 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [13]: %timeit lut.take(indices, out=out)
247 ms ± 933 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

outofculture (Contributor):

Huh! I guess different functions are different? I really wouldn't have predicted a slowdown from pre-allocating, though. When I added the _processingBuffer to ImageItem, it had a significant impact on Windows, but there's no arguing with observation.

pijyoi (Contributor, Author) commented Apr 10, 2021

You can try this on Windows:
Open Task Manager and watch the memory usage of the Python process.
While running %timeit lut.take(indices, out=out), memory usage increases during the run.
While running %timeit np.add(indices, 0, out=indices), memory usage does not increase during the run.

This leads to the conclusion that the out parameter in np.take() is implemented as:

def take_external(a, indices, out=None):
    res = take_internal(a, indices)    # always let numpy allocate
    if out is None:
        out = res
    else:
        out[:] = res
    return out

i.e. the out parameter exists purely to present a uniform interface consistent with the rest of the numpy library.
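Whatever the internal allocation story, one behaviour that is easy to verify is that take with out= fills and returns the pre-allocated buffer, with values identical to plain fancy indexing:

```python
import numpy as np

rng = np.random.default_rng(1)
lut = rng.integers(256, size=65536, dtype=np.uint8)
indices = rng.integers(65536, size=(256, 256), dtype=np.uint16)

out = np.empty(indices.shape, dtype=lut.dtype)
res = np.take(lut, indices, out=out)

# the returned array is the pre-allocated buffer itself,
# and its contents match lut[indices]
assert res is out
assert np.array_equal(out, lut[indices])
```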

j9ac9k (Member) commented May 10, 2021

my goodness @pijyoi .... I'm going to leave it to @outofculture to comment on the specifics of the PR here ... but these performance improvements 😍

Would it be of interest/beneficial to get benchmark info on one of the Apple ARM M1 processors? I do have access to a machine now that I could run the benchmark suite. Qt will be releasing a macOS Universal Binary (native ARM support) version for Qt 6.2, but I believe numpy has native support right now.

While out of scope for this PR, but loosely related: we should add a blurb in the documentation about performance considerations when calling ImageItem.setImage. While not all use cases allow for flexibility here, many do, and it would be good to provide some guidance...

pijyoi (Contributor, Author) commented May 11, 2021

> Would it be of interest/beneficial to get benchmark info on one of the Apple ARM M1 processors? I do have access to a machine now that I could run the benchmark suite. Qt will be releasing a macOS Universal Binary (native ARM support) version for Qt 6.2, but I believe numpy has native support right now.

Not particularly useful. The render-timings table was more of a sanity check. That said, render timing is a pure computation measurement and doesn't take the painting step into account; the latter, I believe, would be quite platform-dependent.

> While out of scope for this PR, but loosely related: we should add a blurb in the documentation about performance considerations when calling ImageItem.setImage. While not all use cases allow for flexibility here, many do, and it would be good to provide some guidance...

Off-hand (will update if I think of more):

  1. Use row-major order.
    • instantiate as ImageItem(axisOrder='row-major'), or
    • pg.setConfigOption('imageAxisOrder', 'row-major')
  2. Call ImageItem.setImage(data, autoLevels=False).
    • autoLevels is an "opt-out" parameter; it defaults to True if levels is not also set
  3. Use C-contiguous image data.
  4. Use a recent version of numpy (1.20 has SIMD improvements, noticeable on Linux platforms).
  5. Enable numba with pg.setConfigOption('useNumba', True).
    • won't be useful for uint8 image data
    • not useful if you are only displaying 1 image, since the JIT overhead is quite large (and we didn't enable JIT caching on disk)
    • useful on Windows, less useful on Linux with the new numpy 1.20
  6. If using floating-point image data, prefer float32 to float64.
    • many examples use np.random.{random, normal}, which return float64
  7. If using color lookup tables, use <= 256 entries.
    • the colormaps included with pyqtgraph have 256 entries
    • the ImageView widget will, however, interpolate to 512 entries for non-uint8 images
  8. Avoid using different levels per channel for RGB images; it is an unoptimized codepath.
  9. Avoid NaNs in float data; it is an unoptimized codepath.
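The C-contiguity tip is easy to check and fix in numpy before handing data to setImage; a transposed view, for example, is not C-contiguous and can be repacked:

```python
import numpy as np

# a transpose is a strided view: same buffer, no longer C-contiguous
data = np.zeros((1024, 768), dtype=np.uint8).T
print(data.flags['C_CONTIGUOUS'])   # False

# repack into a fresh C-contiguous buffer before display
fixed = np.ascontiguousarray(data)
print(fixed.flags['C_CONTIGUOUS'])  # True
```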

@pijyoi pijyoi changed the title Bypass makeARGB() for uint8 and uint16 images Bypass makeARGB() for uint8, uint16 and float images May 12, 2021
outofculture (Contributor):

Okay, I'm back. Is this ready @pijyoi?

pijyoi (Contributor, Author) commented May 18, 2021 via email

outofculture (Contributor):

So, the cupy-enabled benchmarks are showing a ~2-4x slowdown on this branch, depending on argument combinations. I'm having a lot of trouble caring, though, seeing as how the numba times often crush the old cupy times... But no, the main branch's cupy deserves to be protected. @pijyoi it sounds like working on cupy isn't convenient for you? I have time to take this on, if you want to just hand it off to me.

pijyoi (Contributor, Author) commented May 19, 2021

> it sounds like working on cupy isn't convenient for you? I have time to take this on, if you want to just hand it off to me.

Yes, please do.
Not only am I unfamiliar with cupy, but the machine on which I do have an Nvidia card is on PCIe 1.0, which is surely not representative of current machines.

pijyoi (Contributor, Author) commented May 19, 2021

I suppose not converting the lut to a cupy ndarray is an oversight; makeARGB() does it like this:

    if lut is not None and not isinstance(lut, xp.ndarray):
        lut = xp.array(lut)

Both benchmarks/renderImageItem.py and examples/VideoSpeedTest.py create luts in CUDA memory; that's why they work.
ImageView.py would probably fail.
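A hedged sketch of such a guard (the function name and the numpy fallback are mine, not pyqtgraph's): keep the lut in the same array namespace as the image, so a host-memory lut gets copied over when the image is a cupy array.

```python
import numpy as np

try:
    import cupy
except ImportError:
    cupy = None  # cupy is optional; fall back to numpy-only behaviour

def match_lut_namespace(image, lut):
    """Return (xp, lut) where lut lives in the same array module
    (numpy or cupy) as the image."""
    xp = cupy if (cupy is not None and isinstance(image, cupy.ndarray)) else np
    if lut is not None and not isinstance(lut, xp.ndarray):
        lut = xp.asarray(lut)
    return xp, lut

# usage: a plain-list lut is promoted to the image's namespace
xp, lut = match_lut_namespace(np.zeros((4, 4)), [10, 20, 30])
```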

@outofculture outofculture mentioned this pull request May 19, 2021
outofculture (Contributor):

This got merged with #1786
