Refactor CUB to support an explicit axis argument; Fix alignments for Thrust's complex types#2562
Refactor CUB to support an explicit axis argument; Fix alignments for Thrust's complex types#2562emcastillo merged 11 commits intocupy:masterfrom
axis argument; Fix alignments for Thrust's complex types#2562Conversation
No segfault or invalid memory access happens in this version, but the result is incorrect. Probably reduced axis is wrong?
Some slowdown (compared to CUB_DISABLED=1) is observed, especially with non-contiguous axes. Oddly, sometimes axis=(1, 2) for ndim=3 is also slower, not sure why...
|
Test script for timing: import cupy as cp
from itertools import permutations
n_runs = 10
shape = (512, 256, 256)
#axis_cases = []
#for r in range(1, len(shape)):
# data = list(permutations((0, 1, 2), r))
# axis_cases += list(set(tuple(sorted(item)) for item in data))
#print(axis_cases)
axis_cases = [(2,), (1, 2)]
for dtype in (cp.int64, cp.float32, cp.float64, cp.complex64, cp.complex128):
if dtype in (cp.float32, cp.float64):
x = cp.random.random(shape, dtype=dtype)
elif dtype in (cp.int32, cp.int64):
x = cp.random.randint(0, 10, size=shape, dtype=dtype)
else:
x = cp.random.random(shape).astype(dtype) + 1j * cp.random.random(shape).astype(dtype)
x_np = cp.asnumpy(x) #move to cpu
for axis in axis_cases:
for func in ('sum', 'max', 'min'):
for keepdims in (False, True):
print("testing", axis, "+", str(dtype), "+", "keepdims={}".format(keepdims), "+", func, "...")
start = cp.cuda.Event()
end = cp.cuda.Event()
cp.cuda.cub_enabled = False
w = None
start.record()
for i in range(n_runs):
w = getattr(x, func)(axis=axis, keepdims=keepdims)
end.record()
end.synchronize()
t_cp_disabled = cp.cuda.get_elapsed_time(start, end)
cp.cuda.cub_enabled = True
y = None
start.record()
for i in range(n_runs):
y = getattr(x, func)(axis=axis, keepdims=keepdims)
end.record()
end.synchronize()
t_cp_enabled = cp.cuda.get_elapsed_time(start, end)
z = None
start.record()
for i in range(n_runs):
z = getattr(x_np, func)(axis=axis, keepdims=keepdims)
end.record()
end.synchronize()
t_np = cp.cuda.get_elapsed_time(start, end)
print("CUB enabled: {}, CUB disabled: {}, numpy: {} (ms for {} runs)\n".format(t_cp_enabled, t_cp_disabled, t_np, n_runs))
try:
assert cp.allclose(y, z)
except AssertionError:
#print(y, z)
print("**************** RESULTS DO NOT MATCH ****************")
print()Results (on a Tesla K40): testing (2,) + <class 'numpy.int64'> + keepdims=False + sum ...
CUB enabled: 23.21049690246582, CUB disabled: 97.80284881591797, numpy: 293.8869323730469 (ms for 10 runs)
testing (2,) + <class 'numpy.int64'> + keepdims=True + sum ...
CUB enabled: 21.285184860229492, CUB disabled: 96.49353790283203, numpy: 291.5436706542969 (ms for 10 runs)
testing (2,) + <class 'numpy.int64'> + keepdims=False + max ...
CUB enabled: 21.833375930786133, CUB disabled: 238.48565673828125, numpy: 426.699951171875 (ms for 10 runs)
testing (2,) + <class 'numpy.int64'> + keepdims=True + max ...
CUB enabled: 21.72879981994629, CUB disabled: 236.48611450195312, numpy: 339.5528869628906 (ms for 10 runs)
testing (2,) + <class 'numpy.int64'> + keepdims=False + min ...
CUB enabled: 22.16092872619629, CUB disabled: 238.17999267578125, numpy: 340.62969970703125 (ms for 10 runs)
testing (2,) + <class 'numpy.int64'> + keepdims=True + min ...
CUB enabled: 21.669696807861328, CUB disabled: 236.42681884765625, numpy: 341.7230529785156 (ms for 10 runs)
testing (1, 2) + <class 'numpy.int64'> + keepdims=False + sum ...
CUB enabled: 16.709823608398438, CUB disabled: 23.147584915161133, numpy: 204.2683563232422 (ms for 10 runs)
testing (1, 2) + <class 'numpy.int64'> + keepdims=True + sum ...
CUB enabled: 16.697439193725586, CUB disabled: 23.1177921295166, numpy: 202.71197509765625 (ms for 10 runs)
testing (1, 2) + <class 'numpy.int64'> + keepdims=False + max ...
CUB enabled: 16.687583923339844, CUB disabled: 24.119712829589844, numpy: 315.8702392578125 (ms for 10 runs)
testing (1, 2) + <class 'numpy.int64'> + keepdims=True + max ...
CUB enabled: 16.700576782226562, CUB disabled: 24.170143127441406, numpy: 310.73388671875 (ms for 10 runs)
testing (1, 2) + <class 'numpy.int64'> + keepdims=False + min ...
CUB enabled: 16.720800399780273, CUB disabled: 24.135583877563477, numpy: 314.0164794921875 (ms for 10 runs)
testing (1, 2) + <class 'numpy.int64'> + keepdims=True + min ...
CUB enabled: 16.514080047607422, CUB disabled: 24.13929557800293, numpy: 313.4980773925781 (ms for 10 runs)
testing (2,) + <class 'numpy.float32'> + keepdims=False + sum ...
CUB enabled: 28.248544692993164, CUB disabled: 92.67865753173828, numpy: 137.1543731689453 (ms for 10 runs)
testing (2,) + <class 'numpy.float32'> + keepdims=True + sum ...
CUB enabled: 28.250463485717773, CUB disabled: 91.4178237915039, numpy: 135.07212829589844 (ms for 10 runs)
testing (2,) + <class 'numpy.float32'> + keepdims=False + max ...
CUB enabled: 30.03673553466797, CUB disabled: 129.5340118408203, numpy: 183.66697692871094 (ms for 10 runs)
testing (2,) + <class 'numpy.float32'> + keepdims=True + max ...
CUB enabled: 30.036544799804688, CUB disabled: 128.01181030273438, numpy: 180.21347045898438 (ms for 10 runs)
testing (2,) + <class 'numpy.float32'> + keepdims=False + min ...
CUB enabled: 30.048927307128906, CUB disabled: 129.79168701171875, numpy: 178.98159790039062 (ms for 10 runs)
testing (2,) + <class 'numpy.float32'> + keepdims=True + min ...
CUB enabled: 30.19603157043457, CUB disabled: 128.01644897460938, numpy: 182.43055725097656 (ms for 10 runs)
testing (1, 2) + <class 'numpy.float32'> + keepdims=False + sum ...
CUB enabled: 7.862656116485596, CUB disabled: 19.890527725219727, numpy: 120.35234832763672 (ms for 10 runs)
testing (1, 2) + <class 'numpy.float32'> + keepdims=True + sum ...
CUB enabled: 7.823200225830078, CUB disabled: 19.897472381591797, numpy: 121.12246704101562 (ms for 10 runs)
testing (1, 2) + <class 'numpy.float32'> + keepdims=False + max ...
CUB enabled: 7.857312202453613, CUB disabled: 24.197856903076172, numpy: 89.19709014892578 (ms for 10 runs)
testing (1, 2) + <class 'numpy.float32'> + keepdims=True + max ...
CUB enabled: 7.873824119567871, CUB disabled: 24.279071807861328, numpy: 88.42854309082031 (ms for 10 runs)
testing (1, 2) + <class 'numpy.float32'> + keepdims=False + min ...
CUB enabled: 7.861631870269775, CUB disabled: 24.255168914794922, numpy: 88.70995330810547 (ms for 10 runs)
testing (1, 2) + <class 'numpy.float32'> + keepdims=True + min ...
CUB enabled: 7.879424095153809, CUB disabled: 24.293344497680664, numpy: 89.95187377929688 (ms for 10 runs)
testing (2,) + <class 'numpy.float64'> + keepdims=False + sum ...
CUB enabled: 20.730464935302734, CUB disabled: 91.43590545654297, numpy: 172.49183654785156 (ms for 10 runs)
testing (2,) + <class 'numpy.float64'> + keepdims=True + sum ...
CUB enabled: 20.73676872253418, CUB disabled: 89.9533462524414, numpy: 170.43568420410156 (ms for 10 runs)
testing (2,) + <class 'numpy.float64'> + keepdims=False + max ...
CUB enabled: 20.852800369262695, CUB disabled: 236.86781311035156, numpy: 251.40383911132812 (ms for 10 runs)
testing (2,) + <class 'numpy.float64'> + keepdims=True + max ...
CUB enabled: 20.828704833984375, CUB disabled: 235.07571411132812, numpy: 251.66355895996094 (ms for 10 runs)
testing (2,) + <class 'numpy.float64'> + keepdims=False + min ...
CUB enabled: 20.82080078125, CUB disabled: 236.87501525878906, numpy: 253.11004638671875 (ms for 10 runs)
testing (2,) + <class 'numpy.float64'> + keepdims=True + min ...
CUB enabled: 20.833696365356445, CUB disabled: 235.10073852539062, numpy: 253.92617797851562 (ms for 10 runs)
testing (1, 2) + <class 'numpy.float64'> + keepdims=False + sum ...
CUB enabled: 16.528640747070312, CUB disabled: 21.84921646118164, numpy: 160.9127655029297 (ms for 10 runs)
testing (1, 2) + <class 'numpy.float64'> + keepdims=True + sum ...
CUB enabled: 16.55824089050293, CUB disabled: 21.872127532958984, numpy: 160.8041229248047 (ms for 10 runs)
testing (1, 2) + <class 'numpy.float64'> + keepdims=False + max ...
CUB enabled: 16.505695343017578, CUB disabled: 26.589216232299805, numpy: 169.4571533203125 (ms for 10 runs)
testing (1, 2) + <class 'numpy.float64'> + keepdims=True + max ...
CUB enabled: 16.563007354736328, CUB disabled: 26.587488174438477, numpy: 170.53878784179688 (ms for 10 runs)
testing (1, 2) + <class 'numpy.float64'> + keepdims=False + min ...
CUB enabled: 16.656415939331055, CUB disabled: 26.599679946899414, numpy: 167.86175537109375 (ms for 10 runs)
testing (1, 2) + <class 'numpy.float64'> + keepdims=True + min ...
CUB enabled: 16.577119827270508, CUB disabled: 26.634592056274414, numpy: 167.5244140625 (ms for 10 runs)
testing (2,) + <class 'numpy.complex64'> + keepdims=False + sum ...
test_CUB_reduction_segmented2.py:50: PerformanceWarning: CUB reduction for complex numbers may not be highly performant. If concerned, set cupy.cuda.cub_enabled=False to switch to CuPy's internal reduction routine and compare the timings.
y = getattr(x, func)(axis=axis, keepdims=keepdims)
CUB enabled: 23.10438346862793, CUB disabled: 100.88428497314453, numpy: 256.8711242675781 (ms for 10 runs)
testing (2,) + <class 'numpy.complex64'> + keepdims=True + sum ...
CUB enabled: 22.03398323059082, CUB disabled: 99.3512954711914, numpy: 254.5529022216797 (ms for 10 runs)
testing (2,) + <class 'numpy.complex64'> + keepdims=False + max ...
CUB enabled: 38.65875244140625, CUB disabled: 208.4864044189453, numpy: 756.0455322265625 (ms for 10 runs)
testing (2,) + <class 'numpy.complex64'> + keepdims=True + max ...
CUB enabled: 38.64329528808594, CUB disabled: 206.7027587890625, numpy: 757.1634521484375 (ms for 10 runs)
testing (2,) + <class 'numpy.complex64'> + keepdims=False + min ...
CUB enabled: 39.79123306274414, CUB disabled: 208.6664581298828, numpy: 761.587890625 (ms for 10 runs)
testing (2,) + <class 'numpy.complex64'> + keepdims=True + min ...
CUB enabled: 39.73331069946289, CUB disabled: 206.92723083496094, numpy: 758.5726928710938 (ms for 10 runs)
testing (1, 2) + <class 'numpy.complex64'> + keepdims=False + sum ...
CUB enabled: 39.48934555053711, CUB disabled: 22.78950309753418, numpy: 245.1492462158203 (ms for 10 runs)
testing (1, 2) + <class 'numpy.complex64'> + keepdims=True + sum ...
CUB enabled: 38.345375061035156, CUB disabled: 22.80735969543457, numpy: 242.8602294921875 (ms for 10 runs)
testing (1, 2) + <class 'numpy.complex64'> + keepdims=False + max ...
CUB enabled: 34.917823791503906, CUB disabled: 35.21001434326172, numpy: 660.6533203125 (ms for 10 runs)
testing (1, 2) + <class 'numpy.complex64'> + keepdims=True + max ...
CUB enabled: 35.011104583740234, CUB disabled: 35.20755386352539, numpy: 656.068359375 (ms for 10 runs)
testing (1, 2) + <class 'numpy.complex64'> + keepdims=False + min ...
CUB enabled: 35.034175872802734, CUB disabled: 34.48457717895508, numpy: 653.321044921875 (ms for 10 runs)
testing (1, 2) + <class 'numpy.complex64'> + keepdims=True + min ...
CUB enabled: 34.84735870361328, CUB disabled: 34.49331283569336, numpy: 651.9085693359375 (ms for 10 runs)
testing (2,) + <class 'numpy.complex128'> + keepdims=False + sum ...
CUB enabled: 34.13123321533203, CUB disabled: 125.0881576538086, numpy: 332.6822509765625 (ms for 10 runs)
testing (2,) + <class 'numpy.complex128'> + keepdims=True + sum ...
CUB enabled: 34.13324737548828, CUB disabled: 123.69213104248047, numpy: 330.7550964355469 (ms for 10 runs)
testing (2,) + <class 'numpy.complex128'> + keepdims=False + max ...
CUB enabled: 53.477088928222656, CUB disabled: 402.5926208496094, numpy: 784.7593383789062 (ms for 10 runs)
testing (2,) + <class 'numpy.complex128'> + keepdims=True + max ...
CUB enabled: 53.6071662902832, CUB disabled: 400.9013671875, numpy: 785.705810546875 (ms for 10 runs)
testing (2,) + <class 'numpy.complex128'> + keepdims=False + min ...
CUB enabled: 53.289344787597656, CUB disabled: 404.80047607421875, numpy: 798.3240356445312 (ms for 10 runs)
testing (2,) + <class 'numpy.complex128'> + keepdims=True + min ...
CUB enabled: 53.289119720458984, CUB disabled: 403.0973205566406, numpy: 785.0084838867188 (ms for 10 runs)
testing (1, 2) + <class 'numpy.complex128'> + keepdims=False + sum ...
CUB enabled: 77.00498962402344, CUB disabled: 35.658241271972656, numpy: 316.6164855957031 (ms for 10 runs)
testing (1, 2) + <class 'numpy.complex128'> + keepdims=True + sum ...
CUB enabled: 76.99961853027344, CUB disabled: 35.57235336303711, numpy: 317.3782958984375 (ms for 10 runs)
testing (1, 2) + <class 'numpy.complex128'> + keepdims=False + max ...
CUB enabled: 70.05232238769531, CUB disabled: 42.08127975463867, numpy: 693.8490600585938 (ms for 10 runs)
testing (1, 2) + <class 'numpy.complex128'> + keepdims=True + max ...
CUB enabled: 70.0739517211914, CUB disabled: 42.076255798339844, numpy: 691.0841064453125 (ms for 10 runs)
testing (1, 2) + <class 'numpy.complex128'> + keepdims=False + min ...
CUB enabled: 69.90096282958984, CUB disabled: 41.422977447509766, numpy: 691.4490356445312 (ms for 10 runs)
testing (1, 2) + <class 'numpy.complex128'> + keepdims=True + min ...
CUB enabled: 69.95222473144531, CUB disabled: 41.337310791015625, numpy: 690.6539306640625 (ms for 10 runs) |
cupy/core/_routines_statistics.pyx
Outdated
| elif cub.can_use_device_segmented_reduce( | ||
| cub.CUPY_CUB_MAX, self.dtype, self.ndim, axis, dtype): | ||
| if self.dtype in (numpy.complex64, numpy.complex128): | ||
| warnings.warn("CUB reduction for complex numbers may not be " |
There was a problem hiding this comment.
It seems that when dim of axis > 1, and dtype is complex then is better to use the old reduce implementations,
how about not using cub in such cases?
There was a problem hiding this comment.
Note complex64 seems OK, sometimes it’s even better.
|
@emcastillo I might have found the reason for suboptimal performance for complex numbers. Let me run more experiments to confirm and make necessary changes. I'll let you know when it's ready for review. Thanks. |
|
As mentioned in the PR description, I suspect the template<typename ReductionOpT>
struct DeviceReducePolicy<complex<float>, int, ReductionOpT>
{
struct PolicyXXX : ChainedPolicy<600, PolicyXXX, PolicyXXX>
{
typedef AgentReducePolicy<
CUB_SCALED_GRANULARITIES(256, 8, complex<float>), // items per thread is cut in half
4,
BLOCK_REDUCE_WARP_REDUCTIONS,
LOAD_LDG>
ReducePolicy;
// SingleTilePolicy
typedef ReducePolicy SingleTilePolicy;
// SegmentedReducePolicy
typedef ReducePolicy SegmentedReducePolicy;
};
/// MaxPolicy
typedef PolicyXXX MaxPolicy;
};I got a bunch of long compiler errors like this: So I am unable to confirm if an alternative policy helps or not. @emcastillo any comment or suggestion? Thanks. |
|
I guess you already know it but have you tried to put the general template before your specialization? That error appears in such case |
|
Thanks a lot for sharing your thoughts, @emcastillo! Any help is appreciated here! It is really weird to get worse performance for complex numbers with this segmented approach. Without The I also tried a full specialization like |
|
I decided to tweak the CUB source directly to bypass the template specialization error. It seems on K40 the timing doesn't change by changing slightly the values in One odd thing I noticed: K40 supports up to SM35, but Now I'm experimenting looping over the array when the segment size is too large (empirically choosing 256) and calling device reduce. |
|
I can test it on V and P100 |
|
Results of your branch on V100 testing (2,) + <class 'numpy.int64'> + keepdims=False + sum ...
CUB enabled: 4.321887969970703, CUB disabled: 10.010751724243164, numpy: 242.3883819580078 (ms for 10 runs)
testing (2,) + <class 'numpy.int64'> + keepdims=True + sum ...
CUB enabled: 3.544543981552124, CUB disabled: 8.817376136779785, numpy: 238.95021057128906 (ms for 10 runs)
testing (2,) + <class 'numpy.int64'> + keepdims=False + max ...
CUB enabled: 3.5083839893341064, CUB disabled: 15.883711814880371, numpy: 273.5198974609375 (ms for 10 runs)
testing (2,) + <class 'numpy.int64'> + keepdims=True + max ...
CUB enabled: 3.390144109725952, CUB disabled: 15.018112182617188, numpy: 273.7151794433594 (ms for 10 runs)
testing (2,) + <class 'numpy.int64'> + keepdims=False + min ...
CUB enabled: 3.4464640617370605, CUB disabled: 15.858176231384277, numpy: 274.9762268066406 (ms for 10 runs)
testing (2,) + <class 'numpy.int64'> + keepdims=True + min ...
CUB enabled: 3.4031999111175537, CUB disabled: 15.016863822937012, numpy: 275.6973876953125 (ms for 10 runs)
testing (1, 2) + <class 'numpy.int64'> + keepdims=False + sum ...
CUB enabled: 3.331455945968628, CUB disabled: 3.756608009338379, numpy: 238.29052734375 (ms for 10 runs)
testing (1, 2) + <class 'numpy.int64'> + keepdims=True + sum ...
CUB enabled: 3.4645440578460693, CUB disabled: 3.7578558921813965, numpy: 232.7453155517578 (ms for 10 runs)
testing (1, 2) + <class 'numpy.int64'> + keepdims=False + max ...
CUB enabled: 3.3107519149780273, CUB disabled: 3.8134400844573975, numpy: 251.29388427734375 (ms for 10 runs)
testing (1, 2) + <class 'numpy.int64'> + keepdims=True + max ...
CUB enabled: 3.2954559326171875, CUB disabled: 3.7946560382843018, numpy: 251.232421875 (ms for 10 runs)
testing (1, 2) + <class 'numpy.int64'> + keepdims=False + min ...
CUB enabled: 3.3486080169677734, CUB disabled: 3.811392068862915, numpy: 250.1583709716797 (ms for 10 runs)
testing (1, 2) + <class 'numpy.int64'> + keepdims=True + min ...
CUB enabled: 3.2591679096221924, CUB disabled: 3.7887039184570312, numpy: 251.49017333984375 (ms for 10 runs)
testing (2,) + <class 'numpy.float32'> + keepdims=False + sum ...
CUB enabled: 3.431936025619507, CUB disabled: 9.059328079223633, numpy: 105.90569305419922 (ms for 10 runs)
testing (2,) + <class 'numpy.float32'> + keepdims=True + sum ...
CUB enabled: 3.2592320442199707, CUB disabled: 8.244640350341797, numpy: 105.4345932006836 (ms for 10 runs)
testing (2,) + <class 'numpy.float32'> + keepdims=False + max ...
CUB enabled: 3.634687900543213, CUB disabled: 11.441311836242676, numpy: 170.00799560546875 (ms for 10 runs)
testing (2,) + <class 'numpy.float32'> + keepdims=True + max ...
CUB enabled: 3.673151969909668, CUB disabled: 10.797183990478516, numpy: 173.68313598632812 (ms for 10 runs)
testing (2,) + <class 'numpy.float32'> + keepdims=False + min ...
CUB enabled: 3.8667519092559814, CUB disabled: 12.006815910339355, numpy: 172.03225708007812 (ms for 10 runs)
testing (2,) + <class 'numpy.float32'> + keepdims=True + min ...
CUB enabled: 3.691296100616455, CUB disabled: 10.80246353149414, numpy: 169.44837951660156 (ms for 10 runs)
testing (1, 2) + <class 'numpy.float32'> + keepdims=False + sum ...
CUB enabled: 1.8039040565490723, CUB disabled: 2.887648105621338, numpy: 99.24845123291016 (ms for 10 runs)
testing (1, 2) + <class 'numpy.float32'> + keepdims=True + sum ...
CUB enabled: 1.7816959619522095, CUB disabled: 2.8669118881225586, numpy: 98.97516632080078 (ms for 10 runs)
testing (1, 2) + <class 'numpy.float32'> + keepdims=False + max ...
CUB enabled: 1.7913919687271118, CUB disabled: 3.2016000747680664, numpy: 98.81046295166016 (ms for 10 runs)
testing (1, 2) + <class 'numpy.float32'> + keepdims=True + max ...
CUB enabled: 1.8213119506835938, CUB disabled: 3.195456027984619, numpy: 97.8987808227539 (ms for 10 runs)
testing (1, 2) + <class 'numpy.float32'> + keepdims=False + min ...
CUB enabled: 1.7742719650268555, CUB disabled: 3.19321608543396, numpy: 97.84492492675781 (ms for 10 runs)
testing (1, 2) + <class 'numpy.float32'> + keepdims=True + min ...
CUB enabled: 1.7611839771270752, CUB disabled: 3.1928000450134277, numpy: 97.79612731933594 (ms for 10 runs)
testing (2,) + <class 'numpy.float64'> + keepdims=False + sum ...
CUB enabled: 3.4439361095428467, CUB disabled: 9.356255531311035, numpy: 199.1514892578125 (ms for 10 runs)
testing (2,) + <class 'numpy.float64'> + keepdims=True + sum ...
CUB enabled: 3.374016046524048, CUB disabled: 8.569184303283691, numpy: 197.59791564941406 (ms for 10 runs)
testing (2,) + <class 'numpy.float64'> + keepdims=False + max ...
CUB enabled: 3.37772798538208, CUB disabled: 16.95974349975586, numpy: 275.2898254394531 (ms for 10 runs)
testing (2,) + <class 'numpy.float64'> + keepdims=True + max ...
CUB enabled: 3.4699840545654297, CUB disabled: 16.32819175720215, numpy: 283.5480651855469 (ms for 10 runs)
testing (2,) + <class 'numpy.float64'> + keepdims=False + min ...
CUB enabled: 3.4528961181640625, CUB disabled: 17.406272888183594, numpy: 275.2522277832031 (ms for 10 runs)
testing (2,) + <class 'numpy.float64'> + keepdims=True + min ...
CUB enabled: 3.421056032180786, CUB disabled: 16.317472457885742, numpy: 271.332763671875 (ms for 10 runs)
testing (1, 2) + <class 'numpy.float64'> + keepdims=False + sum ...
CUB enabled: 3.4329919815063477, CUB disabled: 3.7416319847106934, numpy: 188.58534240722656 (ms for 10 runs)
testing (1, 2) + <class 'numpy.float64'> + keepdims=True + sum ...
CUB enabled: 3.2832961082458496, CUB disabled: 3.710592031478882, numpy: 188.05235290527344 (ms for 10 runs)
testing (1, 2) + <class 'numpy.float64'> + keepdims=False + max ...
CUB enabled: 3.275007963180542, CUB disabled: 3.970815896987915, numpy: 194.90284729003906 (ms for 10 runs)
testing (1, 2) + <class 'numpy.float64'> + keepdims=True + max ...
CUB enabled: 3.273535966873169, CUB disabled: 3.9557440280914307, numpy: 195.95167541503906 (ms for 10 runs)
testing (1, 2) + <class 'numpy.float64'> + keepdims=False + min ...
CUB enabled: 3.276319980621338, CUB disabled: 3.960736036300659, numpy: 195.75875854492188 (ms for 10 runs)
testing (1, 2) + <class 'numpy.float64'> + keepdims=True + min ...
CUB enabled: 3.2704319953918457, CUB disabled: 3.962815999984741, numpy: 201.28713989257812 (ms for 10 runs)
testing (2,) + <class 'numpy.complex64'> + keepdims=False + sum ...
CUB enabled: 3.7960638999938965, CUB disabled: 10.058815956115723, numpy: 217.2782745361328 (ms for 10 runs)
testing (2,) + <class 'numpy.complex64'> + keepdims=True + sum ...
CUB enabled: 3.437727928161621, CUB disabled: 9.160863876342773, numpy: 204.71717834472656 (ms for 10 runs)
testing (2,) + <class 'numpy.complex64'> + keepdims=False + max ...
CUB enabled: 4.552000045776367, CUB disabled: 16.8470401763916, numpy: 671.4856567382812 (ms for 10 runs)
testing (2,) + <class 'numpy.complex64'> + keepdims=True + max ...
CUB enabled: 4.693408012390137, CUB disabled: 16.077632904052734, numpy: 673.7072143554688 (ms for 10 runs)
testing (2,) + <class 'numpy.complex64'> + keepdims=False + min ...
CUB enabled: 4.673855781555176, CUB disabled: 16.567903518676758, numpy: 678.3143920898438 (ms for 10 runs)
testing (2,) + <class 'numpy.complex64'> + keepdims=True + min ...
CUB enabled: 4.717343807220459, CUB disabled: 15.643936157226562, numpy: 674.6201782226562 (ms for 10 runs)
testing (1, 2) + <class 'numpy.complex64'> + keepdims=False + sum ...
CUB enabled: 3.4284160137176514, CUB disabled: 3.7642879486083984, numpy: 200.7836151123047 (ms for 10 runs)
testing (1, 2) + <class 'numpy.complex64'> + keepdims=True + sum ...
CUB enabled: 3.4229440689086914, CUB disabled: 3.7304320335388184, numpy: 200.3353271484375 (ms for 10 runs)
testing (1, 2) + <class 'numpy.complex64'> + keepdims=False + max ...
CUB enabled: 3.518399953842163, CUB disabled: 4.3056960105896, numpy: 594.2807006835938 (ms for 10 runs)
testing (1, 2) + <class 'numpy.complex64'> + keepdims=True + max ...
CUB enabled: 3.6610240936279297, CUB disabled: 4.308767795562744, numpy: 593.5577392578125 (ms for 10 runs)
testing (1, 2) + <class 'numpy.complex64'> + keepdims=False + min ...
CUB enabled: 3.523008108139038, CUB disabled: 4.229311943054199, numpy: 593.292236328125 (ms for 10 runs)
testing (1, 2) + <class 'numpy.complex64'> + keepdims=True + min ...
CUB enabled: 3.4826879501342773, CUB disabled: 4.233920097351074, numpy: 595.7320556640625 (ms for 10 runs)
testing (2,) + <class 'numpy.complex128'> + keepdims=False + sum ...
CUB enabled: 6.607872009277344, CUB disabled: 11.619680404663086, numpy: 400.3094177246094 (ms for 10 runs)
testing (2,) + <class 'numpy.complex128'> + keepdims=True + sum ...
CUB enabled: 6.541855812072754, CUB disabled: 10.719327926635742, numpy: 394.0811462402344 (ms for 10 runs)
testing (2,) + <class 'numpy.complex128'> + keepdims=False + max ...
CUB enabled: 6.878496170043945, CUB disabled: 22.427391052246094, numpy: 726.5156860351562 (ms for 10 runs)
testing (2,) + <class 'numpy.complex128'> + keepdims=True + max ...
CUB enabled: 7.016064167022705, CUB disabled: 21.668960571289062, numpy: 734.8447875976562 (ms for 10 runs)
testing (2,) + <class 'numpy.complex128'> + keepdims=False + min ...
CUB enabled: 6.941504001617432, CUB disabled: 22.77190399169922, numpy: 725.8125610351562 (ms for 10 runs)
testing (2,) + <class 'numpy.complex128'> + keepdims=True + min ...
CUB enabled: 6.9003520011901855, CUB disabled: 21.833375930786133, numpy: 726.46728515625 (ms for 10 runs)
testing (1, 2) + <class 'numpy.complex128'> + keepdims=False + sum ...
CUB enabled: 7.788832187652588, CUB disabled: 6.455840110778809, numpy: 396.5573425292969 (ms for 10 runs)
testing (1, 2) + <class 'numpy.complex128'> + keepdims=True + sum ...
CUB enabled: 7.7981438636779785, CUB disabled: 6.454432010650635, numpy: 385.9441223144531 (ms for 10 runs)
testing (1, 2) + <class 'numpy.complex128'> + keepdims=False + max ...
CUB enabled: 6.703648090362549, CUB disabled: 6.85100793838501, numpy: 632.427734375 (ms for 10 runs)
testing (1, 2) + <class 'numpy.complex128'> + keepdims=True + max ...
CUB enabled: 6.685952186584473, CUB disabled: 6.8291521072387695, numpy: 638.6754760742188 (ms for 10 runs)
testing (1, 2) + <class 'numpy.complex128'> + keepdims=False + min ...
CUB enabled: 6.735968112945557, CUB disabled: 6.819039821624756, numpy: 635.2161254882812 (ms for 10 runs)
testing (1, 2) + <class 'numpy.complex128'> + keepdims=True + min ...
CUB enabled: 6.708255767822266, CUB disabled: 6.812320232391357, numpy: 634.897705078125 (ms for 10 runs)P100 testing (2,) + <class 'numpy.int64'> + keepdims=False + sum ...
CUB enabled: 158.17459106445312, CUB disabled: 192.68701171875, numpy: 254.448486328125 (ms for 10 runs)
testing (2,) + <class 'numpy.int64'> + keepdims=True + sum ...
CUB enabled: 6.691711902618408, CUB disabled: 23.84649658203125, numpy: 275.48736572265625 (ms for 10 runs)
testing (2,) + <class 'numpy.int64'> + keepdims=False + max ...
CUB enabled: 7.471424102783203, CUB disabled: 213.34307861328125, numpy: 295.1151428222656 (ms for 10 runs)
testing (2,) + <class 'numpy.int64'> + keepdims=True + max ...
CUB enabled: 7.125919818878174, CUB disabled: 36.474143981933594, numpy: 293.6680603027344 (ms for 10 runs)
testing (2,) + <class 'numpy.int64'> + keepdims=False + min ...
CUB enabled: 7.476480007171631, CUB disabled: 218.4378204345703, numpy: 292.5193176269531 (ms for 10 runs)
testing (2,) + <class 'numpy.int64'> + keepdims=True + min ...
CUB enabled: 7.14137601852417, CUB disabled: 36.54336166381836, numpy: 293.55230712890625 (ms for 10 runs)
testing (1, 2) + <class 'numpy.int64'> + keepdims=False + sum ...
CUB enabled: 4.858687877655029, CUB disabled: 6.090015888214111, numpy: 236.162109375 (ms for 10 runs)
testing (1, 2) + <class 'numpy.int64'> + keepdims=True + sum ...
CUB enabled: 4.900608062744141, CUB disabled: 6.078688144683838, numpy: 226.48243713378906 (ms for 10 runs)
testing (1, 2) + <class 'numpy.int64'> + keepdims=False + max ...
CUB enabled: 4.90169620513916, CUB disabled: 7.151328086853027, numpy: 269.8446350097656 (ms for 10 runs)
testing (1, 2) + <class 'numpy.int64'> + keepdims=True + max ...
CUB enabled: 4.9027838706970215, CUB disabled: 7.132256031036377, numpy: 270.1627197265625 (ms for 10 runs)
testing (1, 2) + <class 'numpy.int64'> + keepdims=False + min ...
CUB enabled: 4.907104015350342, CUB disabled: 7.1220479011535645, numpy: 269.7564697265625 (ms for 10 runs)
testing (1, 2) + <class 'numpy.int64'> + keepdims=True + min ...
CUB enabled: 5.051519870758057, CUB disabled: 7.143008232116699, numpy: 268.67388916015625 (ms for 10 runs)
testing (2,) + <class 'numpy.float32'> + keepdims=False + sum ...
CUB enabled: 9.227775573730469, CUB disabled: 164.5898895263672, numpy: 109.55430603027344 (ms for 10 runs)
testing (2,) + <class 'numpy.float32'> + keepdims=True + sum ...
CUB enabled: 9.207839965820312, CUB disabled: 22.624256134033203, numpy: 109.48143768310547 (ms for 10 runs)
testing (2,) + <class 'numpy.float32'> + keepdims=False + max ...
CUB enabled: 10.012703895568848, CUB disabled: 170.60137939453125, numpy: 188.28297424316406 (ms for 10 runs)
testing (2,) + <class 'numpy.float32'> + keepdims=True + max ...
CUB enabled: 10.036447525024414, CUB disabled: 27.08883285522461, numpy: 188.72157287597656 (ms for 10 runs)
testing (2,) + <class 'numpy.float32'> + keepdims=False + min ...
CUB enabled: 10.208127975463867, CUB disabled: 249.4561004638672, numpy: 185.12342834472656 (ms for 10 runs)
testing (2,) + <class 'numpy.float32'> + keepdims=True + min ...
CUB enabled: 9.980159759521484, CUB disabled: 27.08415985107422, numpy: 184.98342895507812 (ms for 10 runs)
testing (1, 2) + <class 'numpy.float32'> + keepdims=False + sum ...
CUB enabled: 2.5949440002441406, CUB disabled: 5.799903869628906, numpy: 107.14447784423828 (ms for 10 runs)
testing (1, 2) + <class 'numpy.float32'> + keepdims=True + sum ...
CUB enabled: 2.576064109802246, CUB disabled: 5.746912002563477, numpy: 107.15792083740234 (ms for 10 runs)
testing (1, 2) + <class 'numpy.float32'> + keepdims=False + max ...
CUB enabled: 2.606816053390503, CUB disabled: 6.88259220123291, numpy: 107.11225891113281 (ms for 10 runs)
testing (1, 2) + <class 'numpy.float32'> + keepdims=True + max ...
CUB enabled: 2.5854079723358154, CUB disabled: 6.897503852844238, numpy: 107.15106964111328 (ms for 10 runs)
testing (1, 2) + <class 'numpy.float32'> + keepdims=False + min ...
CUB enabled: 2.5899200439453125, CUB disabled: 6.867455959320068, numpy: 107.16185760498047 (ms for 10 runs)
testing (1, 2) + <class 'numpy.float32'> + keepdims=True + min ...
CUB enabled: 2.589951992034912, CUB disabled: 6.889472007751465, numpy: 107.1480941772461 (ms for 10 runs)
testing (2,) + <class 'numpy.float64'> + keepdims=False + sum ...
CUB enabled: 6.411776065826416, CUB disabled: 163.3095703125, numpy: 215.0654754638672 (ms for 10 runs)
testing (2,) + <class 'numpy.float64'> + keepdims=True + sum ...
CUB enabled: 6.392255783081055, CUB disabled: 23.592191696166992, numpy: 214.8771209716797 (ms for 10 runs)
testing (2,) + <class 'numpy.float64'> + keepdims=False + max ...
CUB enabled: 7.2057600021362305, CUB disabled: 183.574951171875, numpy: 283.9091796875 (ms for 10 runs)
testing (2,) + <class 'numpy.float64'> + keepdims=True + max ...
CUB enabled: 7.1656317710876465, CUB disabled: 37.86649703979492, numpy: 280.8412170410156 (ms for 10 runs)
testing (2,) + <class 'numpy.float64'> + keepdims=False + min ...
CUB enabled: 7.2012481689453125, CUB disabled: 217.07376098632812, numpy: 283.88800048828125 (ms for 10 runs)
testing (2,) + <class 'numpy.float64'> + keepdims=True + min ...
CUB enabled: 7.159264087677002, CUB disabled: 37.8087043762207, numpy: 283.2445373535156 (ms for 10 runs)
testing (1, 2) + <class 'numpy.float64'> + keepdims=False + sum ...
CUB enabled: 4.906911849975586, CUB disabled: 6.032288074493408, numpy: 214.05580139160156 (ms for 10 runs)
testing (1, 2) + <class 'numpy.float64'> + keepdims=True + sum ...
CUB enabled: 4.892704010009766, CUB disabled: 6.031551837921143, numpy: 214.02120971679688 (ms for 10 runs)
testing (1, 2) + <class 'numpy.float64'> + keepdims=False + max ...
CUB enabled: 4.883327960968018, CUB disabled: 7.628384113311768, numpy: 214.1231689453125 (ms for 10 runs)
testing (1, 2) + <class 'numpy.float64'> + keepdims=True + max ...
CUB enabled: 4.899424076080322, CUB disabled: 7.6326398849487305, numpy: 214.26368713378906 (ms for 10 runs)
testing (1, 2) + <class 'numpy.float64'> + keepdims=False + min ...
CUB enabled: 4.891871929168701, CUB disabled: 7.645535945892334, numpy: 214.16061401367188 (ms for 10 runs)
testing (1, 2) + <class 'numpy.float64'> + keepdims=True + min ...
CUB enabled: 4.886752128601074, CUB disabled: 7.62505578994751, numpy: 214.15487670898438 (ms for 10 runs)
testing (2,) + <class 'numpy.complex64'> + keepdims=False + sum ...
CUB enabled: 6.561183929443359, CUB disabled: 195.32579040527344, numpy: 214.93238830566406 (ms for 10 runs)
testing (2,) + <class 'numpy.complex64'> + keepdims=True + sum ...
CUB enabled: 6.267519950866699, CUB disabled: 24.168127059936523, numpy: 215.0153350830078 (ms for 10 runs)
testing (2,) + <class 'numpy.complex64'> + keepdims=False + max ...
CUB enabled: 10.781439781188965, CUB disabled: 227.33648681640625, numpy: 675.6613159179688 (ms for 10 runs)
testing (2,) + <class 'numpy.complex64'> + keepdims=True + max ...
CUB enabled: 10.726655960083008, CUB disabled: 39.518367767333984, numpy: 675.7874145507812 (ms for 10 runs)
testing (2,) + <class 'numpy.complex64'> + keepdims=False + min ...
CUB enabled: 11.331647872924805, CUB disabled: 222.86463928222656, numpy: 679.201171875 (ms for 10 runs)
testing (2,) + <class 'numpy.complex64'> + keepdims=True + min ...
CUB enabled: 11.295616149902344, CUB disabled: 40.65657424926758, numpy: 679.2305297851562 (ms for 10 runs)
testing (1, 2) + <class 'numpy.complex64'> + keepdims=False + sum ...
CUB enabled: 5.213119983673096, CUB disabled: 6.094751834869385, numpy: 214.23231506347656 (ms for 10 runs)
testing (1, 2) + <class 'numpy.complex64'> + keepdims=True + sum ...
CUB enabled: 5.1902079582214355, CUB disabled: 6.089407920837402, numpy: 215.09222412109375 (ms for 10 runs)
testing (1, 2) + <class 'numpy.complex64'> + keepdims=False + max ...
CUB enabled: 5.313888072967529, CUB disabled: 11.043071746826172, numpy: 597.8594970703125 (ms for 10 runs)
testing (1, 2) + <class 'numpy.complex64'> + keepdims=True + max ...
CUB enabled: 5.155648231506348, CUB disabled: 11.02444839477539, numpy: 595.028564453125 (ms for 10 runs)
testing (1, 2) + <class 'numpy.complex64'> + keepdims=False + min ...
CUB enabled: 5.176447868347168, CUB disabled: 10.753503799438477, numpy: 592.8912353515625 (ms for 10 runs)
testing (1, 2) + <class 'numpy.complex64'> + keepdims=True + min ...
CUB enabled: 5.166143894195557, CUB disabled: 10.75494384765625, numpy: 591.92626953125 (ms for 10 runs)
testing (2,) + <class 'numpy.complex128'> + keepdims=False + sum ...
CUB enabled: 9.505727767944336, CUB disabled: 201.5979766845703, numpy: 430.1026306152344 (ms for 10 runs)
testing (2,) + <class 'numpy.complex128'> + keepdims=True + sum ...
CUB enabled: 9.487327575683594, CUB disabled: 29.772415161132812, numpy: 431.1112060546875 (ms for 10 runs)
testing (2,) + <class 'numpy.complex128'> + keepdims=False + max ...
CUB enabled: 11.57203197479248, CUB disabled: 239.65811157226562, numpy: 780.9832153320312 (ms for 10 runs)
testing (2,) + <class 'numpy.complex128'> + keepdims=True + max ...
CUB enabled: 11.535167694091797, CUB disabled: 49.23401641845703, numpy: 777.5499877929688 (ms for 10 runs)
testing (2,) + <class 'numpy.complex128'> + keepdims=False + min ...
CUB enabled: 11.541312217712402, CUB disabled: 238.08543395996094, numpy: 777.5201416015625 (ms for 10 runs)
testing (2,) + <class 'numpy.complex128'> + keepdims=True + min ...
CUB enabled: 11.510815620422363, CUB disabled: 47.544193267822266, numpy: 782.5574951171875 (ms for 10 runs)
testing (1, 2) + <class 'numpy.complex128'> + keepdims=False + sum ...
CUB enabled: 12.151040077209473, CUB disabled: 9.331232070922852, numpy: 427.8769836425781 (ms for 10 runs)
testing (1, 2) + <class 'numpy.complex128'> + keepdims=True + sum ...
CUB enabled: 12.151007652282715, CUB disabled: 9.328607559204102, numpy: 427.6109619140625 (ms for 10 runs)
testing (1, 2) + <class 'numpy.complex128'> + keepdims=False + max ...
CUB enabled: 11.710016250610352, CUB disabled: 11.771583557128906, numpy: 707.5844116210938 (ms for 10 runs)
testing (1, 2) + <class 'numpy.complex128'> + keepdims=True + max ...
CUB enabled: 11.767647743225098, CUB disabled: 11.764448165893555, numpy: 706.73388671875 (ms for 10 runs)
testing (1, 2) + <class 'numpy.complex128'> + keepdims=False + min ...
CUB enabled: 11.73151969909668, CUB disabled: 11.554464340209961, numpy: 697.577392578125 (ms for 10 runs)
testing (1, 2) + <class 'numpy.complex128'> + keepdims=True + min ...
CUB enabled: 11.666943550109863, CUB disabled: 11.488544464111328, numpy: 701.9141845703125 (ms for 10 runs) |
|
@emcastillo thanks a lot for your help! This is really encouraging: except for I do have some result for loop + device reduce (see https://github.com/leofang/cupy/tree/cub_enhance_loop), and I find that only when the number of reduced elements is too large (so that the shared memory cannot fit all elements), and when using complex numbers, can this be faster than the segment approach. For example, with (device segment reduction only) (loop + device reduction) Presumably limited by kernel launch overhead. I don't think this route is worth pursuing given the numbers you posted, do you? |
|
I just rerun again on p100 and numbers look better now, probably it was kernel compilation and loading. I think that the PR as-is "could" be merged, but the performance differences on widely used devices are a bit scary. Let us discuss it! testing (2,) + <class 'numpy.int64'> + keepdims=False + sum ...
CUB enabled: 6.818912029266357, CUB disabled: 22.72012710571289, numpy: 249.3230438232422 (ms for 10 runs)
testing (2,) + <class 'numpy.int64'> + keepdims=True + sum ...
CUB enabled: 6.091648101806641, CUB disabled: 21.556543350219727, numpy: 235.09225463867188 (ms for 10 runs)
testing (2,) + <class 'numpy.int64'> + keepdims=False + max ...
CUB enabled: 6.661407947540283, CUB disabled: 34.2281608581543, numpy: 279.7550048828125 (ms for 10 runs)
testing (2,) + <class 'numpy.int64'> + keepdims=True + max ...
CUB enabled: 6.712863922119141, CUB disabled: 33.19340896606445, numpy: 278.5088195800781 (ms for 10 runs)
testing (2,) + <class 'numpy.int64'> + keepdims=False + min ...
CUB enabled: 6.616191864013672, CUB disabled: 34.026206970214844, numpy: 277.649169921875 (ms for 10 runs)
testing (2,) + <class 'numpy.int64'> + keepdims=True + min ...
CUB enabled: 6.6544318199157715, CUB disabled: 33.179359436035156, numpy: 277.8271179199219 (ms for 10 runs)
testing (1, 2) + <class 'numpy.int64'> + keepdims=False + sum ...
CUB enabled: 4.811039924621582, CUB disabled: 5.815072059631348, numpy: 220.1744384765625 (ms for 10 runs)
testing (1, 2) + <class 'numpy.int64'> + keepdims=True + sum ...
CUB enabled: 4.875072002410889, CUB disabled: 5.779839992523193, numpy: 220.3600616455078 (ms for 10 runs)
testing (1, 2) + <class 'numpy.int64'> + keepdims=False + max ...
CUB enabled: 4.885183811187744, CUB disabled: 6.654367923736572, numpy: 255.6682586669922 (ms for 10 runs)
testing (1, 2) + <class 'numpy.int64'> + keepdims=True + max ...
CUB enabled: 5.021599769592285, CUB disabled: 6.6690239906311035, numpy: 249.99696350097656 (ms for 10 runs)
testing (1, 2) + <class 'numpy.int64'> + keepdims=False + min ...
CUB enabled: 4.867712020874023, CUB disabled: 6.650815963745117, numpy: 248.69007873535156 (ms for 10 runs)
testing (1, 2) + <class 'numpy.int64'> + keepdims=True + min ...
CUB enabled: 4.8449602127075195, CUB disabled: 6.645599842071533, numpy: 249.23257446289062 (ms for 10 runs)
testing (2,) + <class 'numpy.float32'> + keepdims=False + sum ...
CUB enabled: 8.422752380371094, CUB disabled: 21.236576080322266, numpy: 108.02063751220703 (ms for 10 runs)
testing (2,) + <class 'numpy.float32'> + keepdims=True + sum ...
CUB enabled: 8.319968223571777, CUB disabled: 20.40991973876953, numpy: 107.6957778930664 (ms for 10 runs)
testing (2,) + <class 'numpy.float32'> + keepdims=False + max ...
CUB enabled: 9.020575523376465, CUB disabled: 25.09596824645996, numpy: 179.50909423828125 (ms for 10 runs)
testing (2,) + <class 'numpy.float32'> + keepdims=True + max ...
CUB enabled: 9.005536079406738, CUB disabled: 24.457599639892578, numpy: 181.5095672607422 (ms for 10 runs)
testing (2,) + <class 'numpy.float32'> + keepdims=False + min ...
CUB enabled: 9.061247825622559, CUB disabled: 25.136991500854492, numpy: 182.08717346191406 (ms for 10 runs)
testing (2,) + <class 'numpy.float32'> + keepdims=True + min ...
CUB enabled: 9.234016418457031, CUB disabled: 24.479551315307617, numpy: 179.05197143554688 (ms for 10 runs)
testing (1, 2) + <class 'numpy.float32'> + keepdims=False + sum ...
CUB enabled: 2.5538880825042725, CUB disabled: 5.382527828216553, numpy: 107.0943374633789 (ms for 10 runs)
testing (1, 2) + <class 'numpy.float32'> + keepdims=True + sum ...
CUB enabled: 2.5557119846343994, CUB disabled: 5.4397759437561035, numpy: 107.14959716796875 (ms for 10 runs)
testing (1, 2) + <class 'numpy.float32'> + keepdims=False + max ...
CUB enabled: 2.5591039657592773, CUB disabled: 6.270815849304199, numpy: 107.04188537597656 (ms for 10 runs)
testing (1, 2) + <class 'numpy.float32'> + keepdims=True + max ...
CUB enabled: 2.5492799282073975, CUB disabled: 6.255616188049316, numpy: 107.05094146728516 (ms for 10 runs)
testing (1, 2) + <class 'numpy.float32'> + keepdims=False + min ...
CUB enabled: 2.5402560234069824, CUB disabled: 6.257984161376953, numpy: 107.02543640136719 (ms for 10 runs)
testing (1, 2) + <class 'numpy.float32'> + keepdims=True + min ...
CUB enabled: 2.534559965133667, CUB disabled: 6.2350077629089355, numpy: 107.01696014404297 (ms for 10 runs)
testing (2,) + <class 'numpy.float64'> + keepdims=False + sum ...
CUB enabled: 5.935711860656738, CUB disabled: 22.078399658203125, numpy: 214.9549102783203 (ms for 10 runs)
testing (2,) + <class 'numpy.float64'> + keepdims=True + sum ...
CUB enabled: 5.870783805847168, CUB disabled: 21.29372787475586, numpy: 214.51788330078125 (ms for 10 runs)
testing (2,) + <class 'numpy.float64'> + keepdims=False + max ...
CUB enabled: 6.488128185272217, CUB disabled: 34.81692886352539, numpy: 286.2601013183594 (ms for 10 runs)
testing (2,) + <class 'numpy.float64'> + keepdims=True + max ...
CUB enabled: 6.547520160675049, CUB disabled: 34.171329498291016, numpy: 286.7164611816406 (ms for 10 runs)
testing (2,) + <class 'numpy.float64'> + keepdims=False + min ...
CUB enabled: 6.5498881340026855, CUB disabled: 34.92444610595703, numpy: 277.0263977050781 (ms for 10 runs)
testing (2,) + <class 'numpy.float64'> + keepdims=True + min ...
CUB enabled: 6.532735824584961, CUB disabled: 34.17174530029297, numpy: 274.8606262207031 (ms for 10 runs)
testing (1, 2) + <class 'numpy.float64'> + keepdims=False + sum ...
CUB enabled: 4.863296031951904, CUB disabled: 5.78931188583374, numpy: 213.66358947753906 (ms for 10 runs)
testing (1, 2) + <class 'numpy.float64'> + keepdims=True + sum ...
CUB enabled: 4.855231761932373, CUB disabled: 5.738815784454346, numpy: 213.61859130859375 (ms for 10 runs)
testing (1, 2) + <class 'numpy.float64'> + keepdims=False + max ...
CUB enabled: 4.834496021270752, CUB disabled: 7.16326379776001, numpy: 214.7391357421875 (ms for 10 runs)
testing (1, 2) + <class 'numpy.float64'> + keepdims=True + max ...
CUB enabled: 4.897215843200684, CUB disabled: 7.1727681159973145, numpy: 213.8831024169922 (ms for 10 runs)
testing (1, 2) + <class 'numpy.float64'> + keepdims=False + min ...
CUB enabled: 4.847743988037109, CUB disabled: 7.162144184112549, numpy: 214.31671142578125 (ms for 10 runs)
testing (1, 2) + <class 'numpy.float64'> + keepdims=True + min ...
CUB enabled: 4.87446403503418, CUB disabled: 7.168767929077148, numpy: 214.6196746826172 (ms for 10 runs)
testing (2,) + <class 'numpy.complex64'> + keepdims=False + sum ...
CUB enabled: 6.229184150695801, CUB disabled: 22.571487426757812, numpy: 215.57997131347656 (ms for 10 runs)
testing (2,) + <class 'numpy.complex64'> + keepdims=True + sum ...
CUB enabled: 5.906208038330078, CUB disabled: 21.88412857055664, numpy: 215.05865478515625 (ms for 10 runs)
testing (2,) + <class 'numpy.complex64'> + keepdims=False + max ...
CUB enabled: 9.688287734985352, CUB disabled: 36.33760070800781, numpy: 676.1155395507812 (ms for 10 runs)
testing (2,) + <class 'numpy.complex64'> + keepdims=True + max ...
CUB enabled: 9.821151733398438, CUB disabled: 35.6357421875, numpy: 677.0209350585938 (ms for 10 runs)
testing (2,) + <class 'numpy.complex64'> + keepdims=False + min ...
CUB enabled: 10.258848190307617, CUB disabled: 37.816993713378906, numpy: 676.1074829101562 (ms for 10 runs)
testing (2,) + <class 'numpy.complex64'> + keepdims=True + min ...
CUB enabled: 10.242976188659668, CUB disabled: 36.916927337646484, numpy: 675.514892578125 (ms for 10 runs)
testing (1, 2) + <class 'numpy.complex64'> + keepdims=False + sum ...
CUB enabled: 5.13212776184082, CUB disabled: 5.773375988006592, numpy: 214.72213745117188 (ms for 10 runs)
testing (1, 2) + <class 'numpy.complex64'> + keepdims=True + sum ...
CUB enabled: 5.136832237243652, CUB disabled: 5.758016109466553, numpy: 215.09359741210938 (ms for 10 runs)
testing (1, 2) + <class 'numpy.complex64'> + keepdims=False + max ...
CUB enabled: 5.1075520515441895, CUB disabled: 10.019904136657715, numpy: 580.23876953125 (ms for 10 runs)
testing (1, 2) + <class 'numpy.complex64'> + keepdims=True + max ...
CUB enabled: 5.101376056671143, CUB disabled: 10.029855728149414, numpy: 584.4909057617188 (ms for 10 runs)
testing (1, 2) + <class 'numpy.complex64'> + keepdims=False + min ...
CUB enabled: 5.053152084350586, CUB disabled: 9.772128105163574, numpy: 589.012939453125 (ms for 10 runs)
testing (1, 2) + <class 'numpy.complex64'> + keepdims=True + min ...
CUB enabled: 5.097792148590088, CUB disabled: 9.786656379699707, numpy: 588.0137329101562 (ms for 10 runs)
testing (2,) + <class 'numpy.complex128'> + keepdims=False + sum ...
CUB enabled: 9.454815864562988, CUB disabled: 27.629568099975586, numpy: 431.1047058105469 (ms for 10 runs)
testing (2,) + <class 'numpy.complex128'> + keepdims=True + sum ...
CUB enabled: 9.403871536254883, CUB disabled: 26.93824005126953, numpy: 429.32647705078125 (ms for 10 runs)
testing (2,) + <class 'numpy.complex128'> + keepdims=False + max ...
CUB enabled: 11.250944137573242, CUB disabled: 45.729248046875, numpy: 764.6195068359375 (ms for 10 runs)
testing (2,) + <class 'numpy.complex128'> + keepdims=True + max ...
CUB enabled: 11.341792106628418, CUB disabled: 44.971168518066406, numpy: 753.3809814453125 (ms for 10 runs)
testing (2,) + <class 'numpy.complex128'> + keepdims=False + min ...
CUB enabled: 11.23862361907959, CUB disabled: 44.49545669555664, numpy: 757.4706420898438 (ms for 10 runs)
testing (2,) + <class 'numpy.complex128'> + keepdims=True + min ...
CUB enabled: 11.258432388305664, CUB disabled: 43.4134407043457, numpy: 758.1614990234375 (ms for 10 runs)
testing (1, 2) + <class 'numpy.complex128'> + keepdims=False + sum ...
CUB enabled: 11.57699203491211, CUB disabled: 9.222047805786133, numpy: 427.8691101074219 (ms for 10 runs)
testing (1, 2) + <class 'numpy.complex128'> + keepdims=True + sum ...
CUB enabled: 11.541760444641113, CUB disabled: 9.215007781982422, numpy: 428.3423767089844 (ms for 10 runs)
testing (1, 2) + <class 'numpy.complex128'> + keepdims=False + max ...
CUB enabled: 11.001376152038574, CUB disabled: 10.751808166503906, numpy: 678.6632690429688 (ms for 10 runs)
testing (1, 2) + <class 'numpy.complex128'> + keepdims=True + max ...
CUB enabled: 11.003199577331543, CUB disabled: 10.73964786529541, numpy: 660.0641479492188 (ms for 10 runs)
testing (1, 2) + <class 'numpy.complex128'> + keepdims=False + min ...
CUB enabled: 10.953215599060059, CUB disabled: 10.619680404663086, numpy: 677.80712890625 (ms for 10 runs)
testing (1, 2) + <class 'numpy.complex128'> + keepdims=True + min ...
CUB enabled: 10.993791580200195, CUB disabled: 10.625439643859863, numpy: 671.8070678710938 (ms for 10 runs) |
First, do we agree that the performance issue is only for dim of axis > 1 + complex numbers? For all other cases, CUB segment reduction is always superior. Next, if we want to be more conservative, we could let complex numbers always fall back to the old reduction kernel (as you originally suggested). To be slightly less conservative, we could have a The next level is then to issue a Finally, I'd like to note that users can set |
Yes!
Maybe a fallback per compute capability might be good?
This could be an option but increases the number of parameters and its not trivial for users.
I think it is the most reasonable option. |
Great! 👍
Well, I don't have low-end GPUs (lower than K40) for testing, so I am not sure which CC is impacted other than 35, 60, and 70 according to all data collected above. Plus, I am not sure if there could be performance variance within the same CC. btw, what's the lowest CC that CuPy supports?
Agree.
If this is reasonable, then let me know what else I can do to further improve this PR. Thanks! |
|
IMO this PR does not need any further change. Thanks! |
|
We discussed that is better to put this behavior in the documentation rather than to raise a warning! |
|
@emcastillo May I propose we make the doc changes in a separate PR? I will keep track the needs in #2519. There are a few things to be discussed with you folks:
Thanks. |
|
Solved it!!!!! The alignment was the right cause. This is the new statistics for Only in the case of For a bigger array ( I suppose this suggests we can take away the performance warning? Could you verify this on P/V100 @emcastillo? We still need to discuss #2562 (comment) though. |
axis argumentaxis argument; Fix alignments for Thrust's complex types
|
Btw, fixing the alignment issue also helps speed up CUB device reduce (no |
|
Awesome work, I will do the check on V100 later |
|
@emcastillo Thanks. Removed! |
|
Jenkins, test this please |
|
Jenkins CI test (for commit cabb0ca, target branch master) succeeded! |

Part of #2519.
This PR does three things:
cupy/cuda/cub.pyxandcupy/cuda/cupy_cub.*to eliminate a lot of redundant lines for making the No.2 change below much easier;sum(axis=...),min(axis=...), andmax(axis=...)using CUB's device segmented reduce API. Because this API can only reduce over contiguous segments, any reduction with non-contiguous axes (e.g.,axis=(0, 2)forndim=3) is still deferred to the old reduction kernel.(I tried making a device-device copy to transpose the array at commit 31304f4 for non-contiguous cases, but the performance was just killed by copy. I can post an
nvvpscreenshot if anyone is interested.)Note that for complex numbers with an explicitaxis, the CUB performance may not be always better than the old reduction, so I added a warning. I believe this is because the hard-codedDeviceReducePolicyincub/device/dispatch/dispatch_reduce.cuhsaturated the shared memory and caused too many memory transactions when a contiguous block has too many elements, but I don't know how to specialize it for complex numbers with a smaller block size and/or fewer items per thread, so let us leave a performance-tuning PR for the future.UPDATE: this suboptimal performance is fixed by properly aligning the data (see No.3 above).
attn: @grlee77, @anaruse