Add CUB support for argmax() and argmin() #2596
Merged: emcastillo merged 6 commits into cupy:master on Nov 11, 2019
Conversation
Member (Author):
As always, below is a performance test on a K40. Script:

```python
import cupy as cp

n_runs = 10
shape = (512, 256, 256)
axis_cases = [(0, 1, 2),]  # dummy

for dtype in (cp.int64, cp.float32, cp.float64, cp.complex64, cp.complex128):
    if dtype in (cp.float32, cp.float64):
        x = cp.random.random(shape, dtype=dtype)
    elif dtype in (cp.int32, cp.int64):
        x = cp.random.randint(0, 10, size=shape, dtype=dtype)
    else:
        x = (cp.random.random(shape).astype(dtype)
             + 1j * cp.random.random(shape).astype(dtype))
    x_np = cp.asnumpy(x)  # move to CPU

    for axis in axis_cases:
        for func in ('argmax', 'argmin'):
            keepdims = False
            print("testing", axis, "+", str(dtype), "+",
                  "keepdims={}".format(keepdims), "+", func, "...")
            start = cp.cuda.Event()
            end = cp.cuda.Event()

            # time with CUB disabled
            cp.cuda.cub_enabled = False
            w = None
            start.record()
            for i in range(n_runs):
                w = getattr(x, func)()
            end.record()
            end.synchronize()
            t_cp_disabled = cp.cuda.get_elapsed_time(start, end)

            # time with CUB enabled
            cp.cuda.cub_enabled = True
            y = None
            start.record()
            for i in range(n_runs):
                y = getattr(x, func)()
            end.record()
            end.synchronize()
            t_cp_enabled = cp.cuda.get_elapsed_time(start, end)

            # time NumPy on the CPU copy
            z = None
            start.record()
            for i in range(n_runs):
                z = getattr(x_np, func)()
            end.record()
            end.synchronize()
            t_np = cp.cuda.get_elapsed_time(start, end)

            print("CUB enabled: {}, CUB disabled: {}, numpy: {} (ms for {} runs)\n"
                  .format(t_cp_enabled, t_cp_disabled, t_np, n_runs))

            try:
                assert cp.allclose(w, y)
            except AssertionError:
                print("**************** RESULTS DO NOT MATCH: CUB & reduction ****************")
                print(w, y)
            try:
                assert cp.allclose(y, z)
            except AssertionError:
                print("**************** RESULTS DO NOT MATCH: CUB & NumPy ****************")
                print(y, z)
            print()
```

Result:
emcastillo reviewed on Nov 8, 2019
(force-pushed from 5e23f36 to a347333)
Member: Jenkins, test this please
Collaborator: Successfully created a job for commit 534a2ea
Member: Jenkins CI test (for commit 534a2ea, target branch master) succeeded!
This PR is made easy by the refactoring in #2562. Part of #2519.

Notes:
- The `axis` argument is not supported. In fact, I am a bit reluctant to add CUB support for it, for two reasons:
  1. `axis` can only be an integer, not a tuple (see #2595), meaning that only the special case `axis=-1` (searching over the last axis) can benefit from the `device_segmented_reduce` added in #2562 ("Refactor CUB to support an explicit `axis` argument; Fix alignments for Thrust's complex types"), which doesn't seem worth the time...
  2. The output of `device_segmented_reduce` would be an array of key-value pairs. I am not sure what the best way is to retrieve the keys (i.e. the wanted array indices). It seems I would need one extra kernel launch to loop over the data and copy the keys to another device array? Any comment or suggestion is welcome, as I think the core devs should have some experience with how to handle this. (You already did this in `_argmax` and `_argmin`, although I don't fully understand how it works there.)
- See also #2595 ("Signatures and behaviors of `argmax` and `argmin` are incompatible with NumPy").
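For illustration, here is a minimal NumPy sketch (hypothetical, not the actual CuPy/CUB API) of what a segmented arg-reduction over the last axis produces: one (key, value) pair per segment, where the key is the within-segment index of the extremum. Recovering the indices then amounts to copying out the key field of each pair, which on the GPU would cost one extra elementwise kernel launch.

```python
import numpy as np

def segmented_argmax(x):
    """Emulate a segmented arg-reduction over the last axis.

    Each row of the flattened view is one segment; the reduction yields a
    (key, value) pair per segment, analogous to cub::KeyValuePair results
    from a hypothetical device_segmented_reduce with an ArgMax operator.
    """
    flat = x.reshape(-1, x.shape[-1])              # one segment per row
    keys = flat.argmax(axis=1)                     # "key" of each pair
    values = flat[np.arange(flat.shape[0]), keys]  # "value" of each pair
    # On the device, extracting `keys` from the pair array would be the
    # extra copy kernel discussed above.
    return keys.reshape(x.shape[:-1]), values.reshape(x.shape[:-1])

x = np.arange(24).reshape(2, 3, 4)
keys, values = segmented_argmax(x)
assert np.array_equal(keys, x.argmax(axis=-1))
assert np.array_equal(values, x.max(axis=-1))
```

This also shows why only `axis=-1` maps naturally onto a segmented reduce: segments must be contiguous runs of memory, which for a C-contiguous array is exactly the last axis.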