Add CUB support for argmax() and argmin() #2596

Merged
emcastillo merged 6 commits into cupy:master from leofang:cub_argmax_min
Nov 11, 2019
@leofang leofang commented Nov 1, 2019

This PR builds on the refactoring in #2562, which made it straightforward. Part of #2519.

Notes:

  1. An explicit axis argument is not yet supported. I am reluctant to add CUB support for it, for two reasons:
    • In NumPy, axis can only be an integer, not a tuple (see #2595, "Signatures and behaviors of argmax and argmin are incompatible with NumPy"). So only the special case axis=-1 (searching over the last axis) could benefit from the device_segmented_reduce added in #2562 ("Refactor CUB to support an explicit axis argument; Fix alignments for Thrust's complex types"), which doesn't seem worth the time.
    • On the technical side, the output of device_segmented_reduce would be an array of key-value pairs, and I am not sure of the best way to retrieve the keys (i.e. the wanted array indices). It seems I would need one extra kernel launch to loop over the data and copy the keys to another device array. Any comments or suggestions are welcome, since the core devs probably have experience handling this.
      (You already did this in _argmax and _argmin, although I don't fully understand how it works there.)
  2. The implementation already has NumPy compatibility in mind (see #2595).
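
To make the key-value concern in note 1 concrete, here is a minimal pure-Python sketch of a segmented argmax over the last axis. It is not CUB and not CuPy's implementation; the names `segmented_argmax` and `extract_keys` are hypothetical, and the second pass only mirrors the extra copy step the note speculates about.

```python
def segmented_argmax(flat, segment_size):
    """Return one (key, value) pair per contiguous segment.

    `key` is the index of the first maximum within its segment, which is
    the tie-breaking rule NumPy's argmax uses.
    """
    if segment_size <= 0 or len(flat) % segment_size != 0:
        raise ValueError("flat must split evenly into segments")
    pairs = []
    for start in range(0, len(flat), segment_size):
        segment = flat[start:start + segment_size]
        best_key = 0
        for key, value in enumerate(segment):
            # Strict comparison keeps the first occurrence on ties.
            if value > segment[best_key]:
                best_key = key
        pairs.append((best_key, segment[best_key]))
    return pairs


def extract_keys(pairs):
    """Second pass: keep only the indices, dropping the values."""
    return [key for key, _ in pairs]


# A 2 x 3 array stored row-major; each row is one segment (axis=-1).
data = [3, 9, 1,
        7, 7, 2]
pairs = segmented_argmax(data, segment_size=3)
indices = extract_keys(pairs)
print(pairs)    # [(1, 9), (0, 7)]
print(indices)  # [1, 0]
```

On a GPU the second pass would correspond to a separate kernel (or a fused output iterator) that writes only the keys to the result array.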

leofang commented Nov 1, 2019

As always, below is a performance test on a K40.

Script:

import cupy as cp


n_runs = 10
shape = (512, 256, 256)
axis_cases = [(0, 1, 2),]   # dummy

for dtype in (cp.int64, cp.float32, cp.float64, cp.complex64, cp.complex128):
    if dtype in (cp.float32, cp.float64):
        x = cp.random.random(shape, dtype=dtype)
    elif dtype in (cp.int32, cp.int64):
        x = cp.random.randint(0, 10, size=shape, dtype=dtype)
    else:
        x = cp.random.random(shape).astype(dtype) + 1j * cp.random.random(shape).astype(dtype)
    x_np = cp.asnumpy(x)  # move to CPU

    for axis in axis_cases:
        for func in ('argmax', 'argmin'):
            keepdims = False
            print("testing", axis, "+", str(dtype), "+", "keepdims={}".format(keepdims), "+", func, "...")
            start = cp.cuda.Event()
            end = cp.cuda.Event()

            cp.cuda.cub_enabled = False
            w = None
            start.record()
            for i in range(n_runs):
                w = getattr(x, func)()
            end.record()
            end.synchronize()
            t_cp_disabled = cp.cuda.get_elapsed_time(start, end)

            cp.cuda.cub_enabled = True
            y = None
            start.record()
            for i in range(n_runs):
                y = getattr(x, func)()
            end.record()
            end.synchronize()
            t_cp_enabled = cp.cuda.get_elapsed_time(start, end)

            z = None
            start.record()
            for i in range(n_runs):
                z = getattr(x_np, func)()
            end.record()
            end.synchronize()
            t_np = cp.cuda.get_elapsed_time(start, end)

            print("CUB enabled: {}, CUB disabled: {}, numpy: {} (ms for {} runs)\n".format(t_cp_enabled, t_cp_disabled, t_np, n_runs))

            try:
                assert cp.allclose(w, y)
            except AssertionError:
                print("**************** RESULTS DO NOT MATCH: CUB & reduction ****************")
                print(w, y)
            try:
                assert cp.allclose(y, z)
            except AssertionError:
                print("**************** RESULTS DO NOT MATCH: CUB & NumPy ****************")
                print(y, z)
        print()
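
The script measures the NumPy calls with the same CUDA events it uses for the GPU runs. A host-side clock is a common alternative for CPU-only work. Below is a minimal sketch, assuming a hypothetical helper `time_host_ms` (not part of CuPy or the script above) that reports milliseconds so the figure stays comparable with `cp.cuda.get_elapsed_time`:

```python
import time


def time_host_ms(func, n_runs=10):
    """Call `func` n_runs times; return (total milliseconds, last result)."""
    result = None
    start = time.perf_counter()
    for _ in range(n_runs):
        result = func()
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return elapsed_ms, result


# Stand-in workload; in the script this would be e.g. `x_np.argmax`.
values = [4, 8, 15, 16, 23, 42]
elapsed_ms, index = time_host_ms(lambda: values.index(max(values)))
print("host time: {:.3f} ms, argmax index: {}".format(elapsed_ms, index))
```

This only suits the NumPy measurement; GPU work is launched asynchronously, so the CuPy runs still need events or an explicit synchronization.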

Result:

testing (0, 1, 2) + <class 'numpy.int64'> + keepdims=False + argmax ...
CUB enabled: 14.8439359664917, CUB disabled: 562.7112426757812, numpy: 274.7495422363281 (ms for 10 runs)

testing (0, 1, 2) + <class 'numpy.int64'> + keepdims=False + argmin ...
CUB enabled: 14.812224388122559, CUB disabled: 562.5385131835938, numpy: 225.92294311523438 (ms for 10 runs)


testing (0, 1, 2) + <class 'numpy.float32'> + keepdims=False + argmax ...
CUB enabled: 7.563583850860596, CUB disabled: 475.1551818847656, numpy: 200.45379638671875 (ms for 10 runs)

testing (0, 1, 2) + <class 'numpy.float32'> + keepdims=False + argmin ...
CUB enabled: 7.555967807769775, CUB disabled: 475.4429931640625, numpy: 202.73033142089844 (ms for 10 runs)


testing (0, 1, 2) + <class 'numpy.float64'> + keepdims=False + argmax ...
CUB enabled: 14.66256046295166, CUB disabled: 523.36669921875, numpy: 241.46902465820312 (ms for 10 runs)

testing (0, 1, 2) + <class 'numpy.float64'> + keepdims=False + argmin ...
CUB enabled: 14.680447578430176, CUB disabled: 523.625244140625, numpy: 237.07350158691406 (ms for 10 runs)


testing (0, 1, 2) + <class 'numpy.complex64'> + keepdims=False + argmax ...
CUB enabled: 15.278176307678223, CUB disabled: 626.9349975585938, numpy: 585.0951538085938 (ms for 10 runs)

testing (0, 1, 2) + <class 'numpy.complex64'> + keepdims=False + argmin ...
CUB enabled: 14.97152042388916, CUB disabled: 627.1948852539062, numpy: 589.4969482421875 (ms for 10 runs)


testing (0, 1, 2) + <class 'numpy.complex128'> + keepdims=False + argmax ...
CUB enabled: 29.66339111328125, CUB disabled: 679.6466674804688, numpy: 612.7395629882812 (ms for 10 runs)

testing (0, 1, 2) + <class 'numpy.complex128'> + keepdims=False + argmin ...
CUB enabled: 29.575040817260742, CUB disabled: 680.1937866210938, numpy: 651.1036987304688 (ms for 10 runs)
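
For a quick read of the log, the speedups can be computed directly from the argmax timings. The snippet below is only a convenience sketch: the numbers are copied from the output above and rounded to two decimals, and the dictionary layout is my own.

```python
# argmax timings in ms for 10 runs, rounded from the log above:
# dtype -> (CUB enabled, CUB disabled, NumPy)
timings_ms = {
    "int64": (14.84, 562.71, 274.75),
    "float32": (7.56, 475.16, 200.45),
    "float64": (14.66, 523.37, 241.47),
    "complex64": (15.28, 626.93, 585.10),
    "complex128": (29.66, 679.65, 612.74),
}

speedups = {}
for dtype, (cub, no_cub, numpy_time) in timings_ms.items():
    # Ratio > 1 means the CUB path is faster.
    speedups[dtype] = (no_cub / cub, numpy_time / cub)
    print("{:>10}: {:5.1f}x vs. CUB disabled, {:5.1f}x vs. NumPy".format(
        dtype, *speedups[dtype]))
```

With these figures the CUB path comes out roughly 23x to 63x faster than the non-CUB reduction and roughly 16x to 38x faster than NumPy on this machine; the argmin timings are nearly identical.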

@emcastillo emcastillo added the cat:performance Performance in terms of speed or memory consumption label Nov 6, 2019
@emcastillo emcastillo added this to the v7.0.0 milestone Nov 6, 2019
@emcastillo emcastillo left a comment

LGTM

@emcastillo

Jenkins, test this please

@pfn-ci-bot

Successfully created a job for commit 534a2ea:

@chainer-ci

Jenkins CI test (for commit 534a2ea, target branch master) succeeded!

@emcastillo emcastillo merged commit bb3ab7a into cupy:master Nov 11, 2019
@leofang leofang deleted the cub_argmax_min branch November 11, 2019 02:32