Skip to content

Refactor CUB to support an explicit axis argument; Fix alignments for Thrust's complex types#2562

Merged
emcastillo merged 11 commits intocupy:masterfrom
leofang:cub_enhance
Nov 1, 2019
Merged

Refactor CUB to support an explicit axis argument; Fix alignments for Thrust's complex types#2562
emcastillo merged 11 commits intocupy:masterfrom
leofang:cub_enhance

Conversation

@leofang
Copy link
Copy Markdown
Member

@leofang leofang commented Oct 23, 2019

Part of #2519.

This PR does three things:

  1. Refactor cupy/cuda/cub.pyx and cupy/cuda/cupy_cub.* to eliminate a lot of redundant lines for making the No.2 change below much easier;
  2. Support sum(axis=...), min(axis=...), and max(axis=...) using CUB's device segmented reduce API. Because this API can only reduce over contiguous segments, any reduction with non-contiguous axes (e.g., axis=(0, 2) for ndim=3) is still deferred to the old reduction kernel.
  3. Enforce the alignments for Thrust's complex types to ensure performance

(I tried making a device-device copy to transpose the array at commit 31304f4 for non-contiguous cases, but the performance was just killed by copy. I can post an nvvp screenshot if anyone is interested.)

Note that for complex numbers with an explicit axis, the CUB performance may not be always better than the old reduction, so I added a warning. I believe this is because the hard-coded DeviceReducePolicy in cub/device/dispatch/dispatch_reduce.cuh saturated the shared memory and caused too many memory transactions when a contiguous block has too many elements, but I don't know how to specialize it for complex numbers with a smaller block size and/or fewer items per thread, so let us leave a performance-tuning PR for the future.

UPDATE: this suboptimal performance is fixed by properly aligning the data (see No.3 above).

attn: @grlee77, @anaruse

No segfault or invalid memory access happens in this version, but the
result is incorrect. Probably reduced axis is wrong?
Some slowdown (compared to CUB_DISABLED=1) is observed, especially
with non-contiguous axes. Oddly, sometimes axis=(1, 2) for ndim=3
is also slower, not sure why...
@leofang
Copy link
Copy Markdown
Member Author

leofang commented Oct 23, 2019

Test script for timing:

import cupy as cp
from itertools import permutations


n_runs = 10
shape = (512, 256, 256)
#axis_cases = []
#for r in range(1, len(shape)):
#    data = list(permutations((0, 1, 2), r))
#    axis_cases += list(set(tuple(sorted(item)) for item in data))
#print(axis_cases)
axis_cases = [(2,), (1, 2)]

for dtype in (cp.int64, cp.float32, cp.float64, cp.complex64, cp.complex128):
    if dtype in (cp.float32, cp.float64):
        x = cp.random.random(shape, dtype=dtype)
    elif dtype in (cp.int32, cp.int64):
        x = cp.random.randint(0, 10, size=shape, dtype=dtype)
    else:
        x = cp.random.random(shape).astype(dtype) + 1j * cp.random.random(shape).astype(dtype)
    x_np = cp.asnumpy(x) #move to cpu

    for axis in axis_cases:
        for func in ('sum', 'max', 'min'):
            for keepdims in (False, True):
                print("testing", axis, "+", str(dtype), "+", "keepdims={}".format(keepdims), "+", func, "...")
                start = cp.cuda.Event()
                end = cp.cuda.Event()

                cp.cuda.cub_enabled = False
                w = None
                start.record()
                for i in range(n_runs):
                    w = getattr(x, func)(axis=axis, keepdims=keepdims)
                end.record()
                end.synchronize()
                t_cp_disabled = cp.cuda.get_elapsed_time(start, end)

                cp.cuda.cub_enabled = True
                y = None
                start.record()
                for i in range(n_runs):
                    y = getattr(x, func)(axis=axis, keepdims=keepdims)
                end.record()
                end.synchronize()
                t_cp_enabled = cp.cuda.get_elapsed_time(start, end)

                z = None
                start.record()
                for i in range(n_runs):
                    z = getattr(x_np, func)(axis=axis, keepdims=keepdims)
                end.record()
                end.synchronize()
                t_np = cp.cuda.get_elapsed_time(start, end)

                print("CUB enabled: {}, CUB disabled: {}, numpy: {} (ms for {} runs)\n".format(t_cp_enabled, t_cp_disabled, t_np, n_runs))

                try:
                    assert cp.allclose(y, z)
                except AssertionError:
                    #print(y, z)
                    print("**************** RESULTS DO NOT MATCH ****************")
        print()

Results (on a Tesla K40):

testing (2,) + <class 'numpy.int64'> + keepdims=False + sum ...
CUB enabled: 23.21049690246582, CUB disabled: 97.80284881591797, numpy: 293.8869323730469 (ms for 10 runs)

testing (2,) + <class 'numpy.int64'> + keepdims=True + sum ...
CUB enabled: 21.285184860229492, CUB disabled: 96.49353790283203, numpy: 291.5436706542969 (ms for 10 runs)

testing (2,) + <class 'numpy.int64'> + keepdims=False + max ...
CUB enabled: 21.833375930786133, CUB disabled: 238.48565673828125, numpy: 426.699951171875 (ms for 10 runs)

testing (2,) + <class 'numpy.int64'> + keepdims=True + max ...
CUB enabled: 21.72879981994629, CUB disabled: 236.48611450195312, numpy: 339.5528869628906 (ms for 10 runs)

testing (2,) + <class 'numpy.int64'> + keepdims=False + min ...
CUB enabled: 22.16092872619629, CUB disabled: 238.17999267578125, numpy: 340.62969970703125 (ms for 10 runs)

testing (2,) + <class 'numpy.int64'> + keepdims=True + min ...
CUB enabled: 21.669696807861328, CUB disabled: 236.42681884765625, numpy: 341.7230529785156 (ms for 10 runs)


testing (1, 2) + <class 'numpy.int64'> + keepdims=False + sum ...
CUB enabled: 16.709823608398438, CUB disabled: 23.147584915161133, numpy: 204.2683563232422 (ms for 10 runs)

testing (1, 2) + <class 'numpy.int64'> + keepdims=True + sum ...
CUB enabled: 16.697439193725586, CUB disabled: 23.1177921295166, numpy: 202.71197509765625 (ms for 10 runs)

testing (1, 2) + <class 'numpy.int64'> + keepdims=False + max ...
CUB enabled: 16.687583923339844, CUB disabled: 24.119712829589844, numpy: 315.8702392578125 (ms for 10 runs)

testing (1, 2) + <class 'numpy.int64'> + keepdims=True + max ...
CUB enabled: 16.700576782226562, CUB disabled: 24.170143127441406, numpy: 310.73388671875 (ms for 10 runs)

testing (1, 2) + <class 'numpy.int64'> + keepdims=False + min ...
CUB enabled: 16.720800399780273, CUB disabled: 24.135583877563477, numpy: 314.0164794921875 (ms for 10 runs)

testing (1, 2) + <class 'numpy.int64'> + keepdims=True + min ...
CUB enabled: 16.514080047607422, CUB disabled: 24.13929557800293, numpy: 313.4980773925781 (ms for 10 runs)


testing (2,) + <class 'numpy.float32'> + keepdims=False + sum ...
CUB enabled: 28.248544692993164, CUB disabled: 92.67865753173828, numpy: 137.1543731689453 (ms for 10 runs)

testing (2,) + <class 'numpy.float32'> + keepdims=True + sum ...
CUB enabled: 28.250463485717773, CUB disabled: 91.4178237915039, numpy: 135.07212829589844 (ms for 10 runs)

testing (2,) + <class 'numpy.float32'> + keepdims=False + max ...
CUB enabled: 30.03673553466797, CUB disabled: 129.5340118408203, numpy: 183.66697692871094 (ms for 10 runs)

testing (2,) + <class 'numpy.float32'> + keepdims=True + max ...
CUB enabled: 30.036544799804688, CUB disabled: 128.01181030273438, numpy: 180.21347045898438 (ms for 10 runs)

testing (2,) + <class 'numpy.float32'> + keepdims=False + min ...
CUB enabled: 30.048927307128906, CUB disabled: 129.79168701171875, numpy: 178.98159790039062 (ms for 10 runs)

testing (2,) + <class 'numpy.float32'> + keepdims=True + min ...
CUB enabled: 30.19603157043457, CUB disabled: 128.01644897460938, numpy: 182.43055725097656 (ms for 10 runs)


testing (1, 2) + <class 'numpy.float32'> + keepdims=False + sum ...
CUB enabled: 7.862656116485596, CUB disabled: 19.890527725219727, numpy: 120.35234832763672 (ms for 10 runs)

testing (1, 2) + <class 'numpy.float32'> + keepdims=True + sum ...
CUB enabled: 7.823200225830078, CUB disabled: 19.897472381591797, numpy: 121.12246704101562 (ms for 10 runs)

testing (1, 2) + <class 'numpy.float32'> + keepdims=False + max ...
CUB enabled: 7.857312202453613, CUB disabled: 24.197856903076172, numpy: 89.19709014892578 (ms for 10 runs)

testing (1, 2) + <class 'numpy.float32'> + keepdims=True + max ...
CUB enabled: 7.873824119567871, CUB disabled: 24.279071807861328, numpy: 88.42854309082031 (ms for 10 runs)

testing (1, 2) + <class 'numpy.float32'> + keepdims=False + min ...
CUB enabled: 7.861631870269775, CUB disabled: 24.255168914794922, numpy: 88.70995330810547 (ms for 10 runs)

testing (1, 2) + <class 'numpy.float32'> + keepdims=True + min ...
CUB enabled: 7.879424095153809, CUB disabled: 24.293344497680664, numpy: 89.95187377929688 (ms for 10 runs)


testing (2,) + <class 'numpy.float64'> + keepdims=False + sum ...
CUB enabled: 20.730464935302734, CUB disabled: 91.43590545654297, numpy: 172.49183654785156 (ms for 10 runs)

testing (2,) + <class 'numpy.float64'> + keepdims=True + sum ...
CUB enabled: 20.73676872253418, CUB disabled: 89.9533462524414, numpy: 170.43568420410156 (ms for 10 runs)

testing (2,) + <class 'numpy.float64'> + keepdims=False + max ...
CUB enabled: 20.852800369262695, CUB disabled: 236.86781311035156, numpy: 251.40383911132812 (ms for 10 runs)

testing (2,) + <class 'numpy.float64'> + keepdims=True + max ...
CUB enabled: 20.828704833984375, CUB disabled: 235.07571411132812, numpy: 251.66355895996094 (ms for 10 runs)

testing (2,) + <class 'numpy.float64'> + keepdims=False + min ...
CUB enabled: 20.82080078125, CUB disabled: 236.87501525878906, numpy: 253.11004638671875 (ms for 10 runs)

testing (2,) + <class 'numpy.float64'> + keepdims=True + min ...
CUB enabled: 20.833696365356445, CUB disabled: 235.10073852539062, numpy: 253.92617797851562 (ms for 10 runs)


testing (1, 2) + <class 'numpy.float64'> + keepdims=False + sum ...
CUB enabled: 16.528640747070312, CUB disabled: 21.84921646118164, numpy: 160.9127655029297 (ms for 10 runs)

testing (1, 2) + <class 'numpy.float64'> + keepdims=True + sum ...
CUB enabled: 16.55824089050293, CUB disabled: 21.872127532958984, numpy: 160.8041229248047 (ms for 10 runs)

testing (1, 2) + <class 'numpy.float64'> + keepdims=False + max ...
CUB enabled: 16.505695343017578, CUB disabled: 26.589216232299805, numpy: 169.4571533203125 (ms for 10 runs)

testing (1, 2) + <class 'numpy.float64'> + keepdims=True + max ...
CUB enabled: 16.563007354736328, CUB disabled: 26.587488174438477, numpy: 170.53878784179688 (ms for 10 runs)

testing (1, 2) + <class 'numpy.float64'> + keepdims=False + min ...
CUB enabled: 16.656415939331055, CUB disabled: 26.599679946899414, numpy: 167.86175537109375 (ms for 10 runs)

testing (1, 2) + <class 'numpy.float64'> + keepdims=True + min ...
CUB enabled: 16.577119827270508, CUB disabled: 26.634592056274414, numpy: 167.5244140625 (ms for 10 runs)


testing (2,) + <class 'numpy.complex64'> + keepdims=False + sum ...
test_CUB_reduction_segmented2.py:50: PerformanceWarning: CUB reduction for complex numbers may not be highly performant. If concerned, set cupy.cuda.cub_enabled=False to switch to CuPy's internal reduction routine and compare the timings.
  y = getattr(x, func)(axis=axis, keepdims=keepdims)
CUB enabled: 23.10438346862793, CUB disabled: 100.88428497314453, numpy: 256.8711242675781 (ms for 10 runs)

testing (2,) + <class 'numpy.complex64'> + keepdims=True + sum ...
CUB enabled: 22.03398323059082, CUB disabled: 99.3512954711914, numpy: 254.5529022216797 (ms for 10 runs)

testing (2,) + <class 'numpy.complex64'> + keepdims=False + max ...
CUB enabled: 38.65875244140625, CUB disabled: 208.4864044189453, numpy: 756.0455322265625 (ms for 10 runs)

testing (2,) + <class 'numpy.complex64'> + keepdims=True + max ...
CUB enabled: 38.64329528808594, CUB disabled: 206.7027587890625, numpy: 757.1634521484375 (ms for 10 runs)

testing (2,) + <class 'numpy.complex64'> + keepdims=False + min ...
CUB enabled: 39.79123306274414, CUB disabled: 208.6664581298828, numpy: 761.587890625 (ms for 10 runs)

testing (2,) + <class 'numpy.complex64'> + keepdims=True + min ...
CUB enabled: 39.73331069946289, CUB disabled: 206.92723083496094, numpy: 758.5726928710938 (ms for 10 runs)


testing (1, 2) + <class 'numpy.complex64'> + keepdims=False + sum ...
CUB enabled: 39.48934555053711, CUB disabled: 22.78950309753418, numpy: 245.1492462158203 (ms for 10 runs)

testing (1, 2) + <class 'numpy.complex64'> + keepdims=True + sum ...
CUB enabled: 38.345375061035156, CUB disabled: 22.80735969543457, numpy: 242.8602294921875 (ms for 10 runs)

testing (1, 2) + <class 'numpy.complex64'> + keepdims=False + max ...
CUB enabled: 34.917823791503906, CUB disabled: 35.21001434326172, numpy: 660.6533203125 (ms for 10 runs)

testing (1, 2) + <class 'numpy.complex64'> + keepdims=True + max ...
CUB enabled: 35.011104583740234, CUB disabled: 35.20755386352539, numpy: 656.068359375 (ms for 10 runs)

testing (1, 2) + <class 'numpy.complex64'> + keepdims=False + min ...
CUB enabled: 35.034175872802734, CUB disabled: 34.48457717895508, numpy: 653.321044921875 (ms for 10 runs)

testing (1, 2) + <class 'numpy.complex64'> + keepdims=True + min ...
CUB enabled: 34.84735870361328, CUB disabled: 34.49331283569336, numpy: 651.9085693359375 (ms for 10 runs)


testing (2,) + <class 'numpy.complex128'> + keepdims=False + sum ...
CUB enabled: 34.13123321533203, CUB disabled: 125.0881576538086, numpy: 332.6822509765625 (ms for 10 runs)

testing (2,) + <class 'numpy.complex128'> + keepdims=True + sum ...
CUB enabled: 34.13324737548828, CUB disabled: 123.69213104248047, numpy: 330.7550964355469 (ms for 10 runs)

testing (2,) + <class 'numpy.complex128'> + keepdims=False + max ...
CUB enabled: 53.477088928222656, CUB disabled: 402.5926208496094, numpy: 784.7593383789062 (ms for 10 runs)

testing (2,) + <class 'numpy.complex128'> + keepdims=True + max ...
CUB enabled: 53.6071662902832, CUB disabled: 400.9013671875, numpy: 785.705810546875 (ms for 10 runs)

testing (2,) + <class 'numpy.complex128'> + keepdims=False + min ...
CUB enabled: 53.289344787597656, CUB disabled: 404.80047607421875, numpy: 798.3240356445312 (ms for 10 runs)

testing (2,) + <class 'numpy.complex128'> + keepdims=True + min ...
CUB enabled: 53.289119720458984, CUB disabled: 403.0973205566406, numpy: 785.0084838867188 (ms for 10 runs)


testing (1, 2) + <class 'numpy.complex128'> + keepdims=False + sum ...
CUB enabled: 77.00498962402344, CUB disabled: 35.658241271972656, numpy: 316.6164855957031 (ms for 10 runs)

testing (1, 2) + <class 'numpy.complex128'> + keepdims=True + sum ...
CUB enabled: 76.99961853027344, CUB disabled: 35.57235336303711, numpy: 317.3782958984375 (ms for 10 runs)

testing (1, 2) + <class 'numpy.complex128'> + keepdims=False + max ...
CUB enabled: 70.05232238769531, CUB disabled: 42.08127975463867, numpy: 693.8490600585938 (ms for 10 runs)

testing (1, 2) + <class 'numpy.complex128'> + keepdims=True + max ...
CUB enabled: 70.0739517211914, CUB disabled: 42.076255798339844, numpy: 691.0841064453125 (ms for 10 runs)

testing (1, 2) + <class 'numpy.complex128'> + keepdims=False + min ...
CUB enabled: 69.90096282958984, CUB disabled: 41.422977447509766, numpy: 691.4490356445312 (ms for 10 runs)

testing (1, 2) + <class 'numpy.complex128'> + keepdims=True + min ...
CUB enabled: 69.95222473144531, CUB disabled: 41.337310791015625, numpy: 690.6539306640625 (ms for 10 runs)

elif cub.can_use_device_segmented_reduce(
cub.CUPY_CUB_MAX, self.dtype, self.ndim, axis, dtype):
if self.dtype in (numpy.complex64, numpy.complex128):
warnings.warn("CUB reduction for complex numbers may not be "
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that when dim of axis > 1, and dtype is complex then is better to use the old reduce implementations,
how about not using cub in such cases?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note complex64 seems OK, sometimes it’s even better.

@leofang
Copy link
Copy Markdown
Member Author

leofang commented Oct 28, 2019

@emcastillo I might have found the reason for suboptimal performance for complex numbers. Let me run more experiments to confirm and make necessary changes. I'll let you know when it's ready for review. Thanks.

@leofang
Copy link
Copy Markdown
Member Author

leofang commented Oct 29, 2019

As mentioned in the PR description, I suspect the DeviceReducePolicy for complex numbers is not optimal, hence causing worse performance. However, if I attempt to specialize for complex numbers in cupy/cuda/cupy_cub.cu, say like this:

template<typename ReductionOpT>
struct DeviceReducePolicy<complex<float>, int, ReductionOpT>
{
    struct PolicyXXX : ChainedPolicy<600, PolicyXXX, PolicyXXX>
    {
        typedef AgentReducePolicy<
                CUB_SCALED_GRANULARITIES(256, 8, complex<float>), // items per thread is cut in half
                4,                                      
                BLOCK_REDUCE_WARP_REDUCTIONS,         
                LOAD_LDG>                               
            ReducePolicy;

        // SingleTilePolicy
        typedef ReducePolicy SingleTilePolicy;

        // SegmentedReducePolicy
        typedef ReducePolicy SegmentedReducePolicy;
    };
    /// MaxPolicy
    typedef PolicyXXX MaxPolicy;
};

I got a bunch of long compiler errors like this:

    cupy/cuda/cupy_cub.cu(124): error: class "cub::DeviceReducePolicy<thrust::complex<float>, int, ReductionOpT>" cannot be partially specialized in the current scope

So I am unable to confirm if an alternative policy helps or not.

@emcastillo any comment or suggestion? Thanks.

@emcastillo
Copy link
Copy Markdown
Member

I guess you already know it but have you tried to put the general template before your specialization? That error appears in such case

@leofang
Copy link
Copy Markdown
Member Author

leofang commented Oct 30, 2019

Thanks a lot for sharing your thoughts, @emcastillo! Any help is appreciated here! It is really weird to get worse performance for complex numbers with this segmented approach. Without axis (and hence using "device reduce" instead of "device segmented reduce"), I see that the reduction is much faster than with any contiguous axis, even for complex numbers. So there's definitely room for improvement.

The DeviceReducePolicy template appears in cub/device/dispatch/dispatch_reduce.cuh, which is then further inherited in the same header by both DispatchReduce and DispatchSegmentedReduce. Here by including cub/device/device_segmented_reduce.cuh I suppose that general template is exposed prior to my specialization?

I also tried a full specialization like struct DeviceReducePolicy<complex<float>, int, Sum> (it's actually cub::Sum, but we use the cub namespace here), got the same error.

@leofang
Copy link
Copy Markdown
Member Author

leofang commented Oct 30, 2019

I decided to tweak the CUB source directly to bypass the template specialization error. It seems on K40 the timing doesn't change by changing slightly the values in DeviceReducePolicy. Not sure if this performance issue persists on P100 or V100. I feel K40 may not be fully supporting it (see below). I don't have the environment to test this though....

One odd thing I noticed: K40 supports up to SM35, but Policy600 (for SM60) was chosen (can tell this through nvprof). Not sure why this happens....

Now I'm experimenting looping over the array when the segment size is too large (empirically choosing 256) and calling device reduce.

@emcastillo
Copy link
Copy Markdown
Member

I can test it on V and P100

@emcastillo
Copy link
Copy Markdown
Member

emcastillo commented Oct 30, 2019

Results of your branch on V100

testing (2,) + <class 'numpy.int64'> + keepdims=False + sum ...
CUB enabled: 4.321887969970703, CUB disabled: 10.010751724243164, numpy: 242.3883819580078 (ms for 10 runs)

testing (2,) + <class 'numpy.int64'> + keepdims=True + sum ...
CUB enabled: 3.544543981552124, CUB disabled: 8.817376136779785, numpy: 238.95021057128906 (ms for 10 runs)

testing (2,) + <class 'numpy.int64'> + keepdims=False + max ...
CUB enabled: 3.5083839893341064, CUB disabled: 15.883711814880371, numpy: 273.5198974609375 (ms for 10 runs)

testing (2,) + <class 'numpy.int64'> + keepdims=True + max ...
CUB enabled: 3.390144109725952, CUB disabled: 15.018112182617188, numpy: 273.7151794433594 (ms for 10 runs)

testing (2,) + <class 'numpy.int64'> + keepdims=False + min ...
CUB enabled: 3.4464640617370605, CUB disabled: 15.858176231384277, numpy: 274.9762268066406 (ms for 10 runs)

testing (2,) + <class 'numpy.int64'> + keepdims=True + min ...
CUB enabled: 3.4031999111175537, CUB disabled: 15.016863822937012, numpy: 275.6973876953125 (ms for 10 runs)

testing (1, 2) + <class 'numpy.int64'> + keepdims=False + sum ...
CUB enabled: 3.331455945968628, CUB disabled: 3.756608009338379, numpy: 238.29052734375 (ms for 10 runs)

testing (1, 2) + <class 'numpy.int64'> + keepdims=True + sum ...
CUB enabled: 3.4645440578460693, CUB disabled: 3.7578558921813965, numpy: 232.7453155517578 (ms for 10 runs)

testing (1, 2) + <class 'numpy.int64'> + keepdims=False + max ...
CUB enabled: 3.3107519149780273, CUB disabled: 3.8134400844573975, numpy: 251.29388427734375 (ms for 10 runs)

testing (1, 2) + <class 'numpy.int64'> + keepdims=True + max ...
CUB enabled: 3.2954559326171875, CUB disabled: 3.7946560382843018, numpy: 251.232421875 (ms for 10 runs)

testing (1, 2) + <class 'numpy.int64'> + keepdims=False + min ...
CUB enabled: 3.3486080169677734, CUB disabled: 3.811392068862915, numpy: 250.1583709716797 (ms for 10 runs)

testing (1, 2) + <class 'numpy.int64'> + keepdims=True + min ...
CUB enabled: 3.2591679096221924, CUB disabled: 3.7887039184570312, numpy: 251.49017333984375 (ms for 10 runs)


testing (2,) + <class 'numpy.float32'> + keepdims=False + sum ...
CUB enabled: 3.431936025619507, CUB disabled: 9.059328079223633, numpy: 105.90569305419922 (ms for 10 runs)

testing (2,) + <class 'numpy.float32'> + keepdims=True + sum ...
CUB enabled: 3.2592320442199707, CUB disabled: 8.244640350341797, numpy: 105.4345932006836 (ms for 10 runs)

testing (2,) + <class 'numpy.float32'> + keepdims=False + max ...
CUB enabled: 3.634687900543213, CUB disabled: 11.441311836242676, numpy: 170.00799560546875 (ms for 10 runs)

testing (2,) + <class 'numpy.float32'> + keepdims=True + max ...
CUB enabled: 3.673151969909668, CUB disabled: 10.797183990478516, numpy: 173.68313598632812 (ms for 10 runs)

testing (2,) + <class 'numpy.float32'> + keepdims=False + min ...
CUB enabled: 3.8667519092559814, CUB disabled: 12.006815910339355, numpy: 172.03225708007812 (ms for 10 runs)

testing (2,) + <class 'numpy.float32'> + keepdims=True + min ...
CUB enabled: 3.691296100616455, CUB disabled: 10.80246353149414, numpy: 169.44837951660156 (ms for 10 runs)

testing (1, 2) + <class 'numpy.float32'> + keepdims=False + sum ...
CUB enabled: 1.8039040565490723, CUB disabled: 2.887648105621338, numpy: 99.24845123291016 (ms for 10 runs)

testing (1, 2) + <class 'numpy.float32'> + keepdims=True + sum ...
CUB enabled: 1.7816959619522095, CUB disabled: 2.8669118881225586, numpy: 98.97516632080078 (ms for 10 runs)

testing (1, 2) + <class 'numpy.float32'> + keepdims=False + max ...
CUB enabled: 1.7913919687271118, CUB disabled: 3.2016000747680664, numpy: 98.81046295166016 (ms for 10 runs)

testing (1, 2) + <class 'numpy.float32'> + keepdims=True + max ...
CUB enabled: 1.8213119506835938, CUB disabled: 3.195456027984619, numpy: 97.8987808227539 (ms for 10 runs)

testing (1, 2) + <class 'numpy.float32'> + keepdims=False + min ...
CUB enabled: 1.7742719650268555, CUB disabled: 3.19321608543396, numpy: 97.84492492675781 (ms for 10 runs)

testing (1, 2) + <class 'numpy.float32'> + keepdims=True + min ...
CUB enabled: 1.7611839771270752, CUB disabled: 3.1928000450134277, numpy: 97.79612731933594 (ms for 10 runs)


testing (2,) + <class 'numpy.float64'> + keepdims=False + sum ...
CUB enabled: 3.4439361095428467, CUB disabled: 9.356255531311035, numpy: 199.1514892578125 (ms for 10 runs)

testing (2,) + <class 'numpy.float64'> + keepdims=True + sum ...
CUB enabled: 3.374016046524048, CUB disabled: 8.569184303283691, numpy: 197.59791564941406 (ms for 10 runs)

testing (2,) + <class 'numpy.float64'> + keepdims=False + max ...
CUB enabled: 3.37772798538208, CUB disabled: 16.95974349975586, numpy: 275.2898254394531 (ms for 10 runs)

testing (2,) + <class 'numpy.float64'> + keepdims=True + max ...
CUB enabled: 3.4699840545654297, CUB disabled: 16.32819175720215, numpy: 283.5480651855469 (ms for 10 runs)

testing (2,) + <class 'numpy.float64'> + keepdims=False + min ...
CUB enabled: 3.4528961181640625, CUB disabled: 17.406272888183594, numpy: 275.2522277832031 (ms for 10 runs)

testing (2,) + <class 'numpy.float64'> + keepdims=True + min ...
CUB enabled: 3.421056032180786, CUB disabled: 16.317472457885742, numpy: 271.332763671875 (ms for 10 runs)

testing (1, 2) + <class 'numpy.float64'> + keepdims=False + sum ...
CUB enabled: 3.4329919815063477, CUB disabled: 3.7416319847106934, numpy: 188.58534240722656 (ms for 10 runs)

testing (1, 2) + <class 'numpy.float64'> + keepdims=True + sum ...
CUB enabled: 3.2832961082458496, CUB disabled: 3.710592031478882, numpy: 188.05235290527344 (ms for 10 runs)

testing (1, 2) + <class 'numpy.float64'> + keepdims=False + max ...
CUB enabled: 3.275007963180542, CUB disabled: 3.970815896987915, numpy: 194.90284729003906 (ms for 10 runs)

testing (1, 2) + <class 'numpy.float64'> + keepdims=True + max ...
CUB enabled: 3.273535966873169, CUB disabled: 3.9557440280914307, numpy: 195.95167541503906 (ms for 10 runs)

testing (1, 2) + <class 'numpy.float64'> + keepdims=False + min ...
CUB enabled: 3.276319980621338, CUB disabled: 3.960736036300659, numpy: 195.75875854492188 (ms for 10 runs)

testing (1, 2) + <class 'numpy.float64'> + keepdims=True + min ...
CUB enabled: 3.2704319953918457, CUB disabled: 3.962815999984741, numpy: 201.28713989257812 (ms for 10 runs)


testing (2,) + <class 'numpy.complex64'> + keepdims=False + sum ...
CUB enabled: 3.7960638999938965, CUB disabled: 10.058815956115723, numpy: 217.2782745361328 (ms for 10 runs)

testing (2,) + <class 'numpy.complex64'> + keepdims=True + sum ...
CUB enabled: 3.437727928161621, CUB disabled: 9.160863876342773, numpy: 204.71717834472656 (ms for 10 runs)

testing (2,) + <class 'numpy.complex64'> + keepdims=False + max ...
CUB enabled: 4.552000045776367, CUB disabled: 16.8470401763916, numpy: 671.4856567382812 (ms for 10 runs)

testing (2,) + <class 'numpy.complex64'> + keepdims=True + max ...
CUB enabled: 4.693408012390137, CUB disabled: 16.077632904052734, numpy: 673.7072143554688 (ms for 10 runs)

testing (2,) + <class 'numpy.complex64'> + keepdims=False + min ...
CUB enabled: 4.673855781555176, CUB disabled: 16.567903518676758, numpy: 678.3143920898438 (ms for 10 runs)

testing (2,) + <class 'numpy.complex64'> + keepdims=True + min ...
CUB enabled: 4.717343807220459, CUB disabled: 15.643936157226562, numpy: 674.6201782226562 (ms for 10 runs)

testing (1, 2) + <class 'numpy.complex64'> + keepdims=False + sum ...
CUB enabled: 3.4284160137176514, CUB disabled: 3.7642879486083984, numpy: 200.7836151123047 (ms for 10 runs)

testing (1, 2) + <class 'numpy.complex64'> + keepdims=True + sum ...
CUB enabled: 3.4229440689086914, CUB disabled: 3.7304320335388184, numpy: 200.3353271484375 (ms for 10 runs)

testing (1, 2) + <class 'numpy.complex64'> + keepdims=False + max ...
CUB enabled: 3.518399953842163, CUB disabled: 4.3056960105896, numpy: 594.2807006835938 (ms for 10 runs)

testing (1, 2) + <class 'numpy.complex64'> + keepdims=True + max ...
CUB enabled: 3.6610240936279297, CUB disabled: 4.308767795562744, numpy: 593.5577392578125 (ms for 10 runs)

testing (1, 2) + <class 'numpy.complex64'> + keepdims=False + min ...
CUB enabled: 3.523008108139038, CUB disabled: 4.229311943054199, numpy: 593.292236328125 (ms for 10 runs)

testing (1, 2) + <class 'numpy.complex64'> + keepdims=True + min ...
CUB enabled: 3.4826879501342773, CUB disabled: 4.233920097351074, numpy: 595.7320556640625 (ms for 10 runs)


testing (2,) + <class 'numpy.complex128'> + keepdims=False + sum ...
CUB enabled: 6.607872009277344, CUB disabled: 11.619680404663086, numpy: 400.3094177246094 (ms for 10 runs)

testing (2,) + <class 'numpy.complex128'> + keepdims=True + sum ...
CUB enabled: 6.541855812072754, CUB disabled: 10.719327926635742, numpy: 394.0811462402344 (ms for 10 runs)

testing (2,) + <class 'numpy.complex128'> + keepdims=False + max ...
CUB enabled: 6.878496170043945, CUB disabled: 22.427391052246094, numpy: 726.5156860351562 (ms for 10 runs)

testing (2,) + <class 'numpy.complex128'> + keepdims=True + max ...
CUB enabled: 7.016064167022705, CUB disabled: 21.668960571289062, numpy: 734.8447875976562 (ms for 10 runs)

testing (2,) + <class 'numpy.complex128'> + keepdims=False + min ...
CUB enabled: 6.941504001617432, CUB disabled: 22.77190399169922, numpy: 725.8125610351562 (ms for 10 runs)

testing (2,) + <class 'numpy.complex128'> + keepdims=True + min ...
CUB enabled: 6.9003520011901855, CUB disabled: 21.833375930786133, numpy: 726.46728515625 (ms for 10 runs)

testing (1, 2) + <class 'numpy.complex128'> + keepdims=False + sum ...
CUB enabled: 7.788832187652588, CUB disabled: 6.455840110778809, numpy: 396.5573425292969 (ms for 10 runs)

testing (1, 2) + <class 'numpy.complex128'> + keepdims=True + sum ...
CUB enabled: 7.7981438636779785, CUB disabled: 6.454432010650635, numpy: 385.9441223144531 (ms for 10 runs)

testing (1, 2) + <class 'numpy.complex128'> + keepdims=False + max ...
CUB enabled: 6.703648090362549, CUB disabled: 6.85100793838501, numpy: 632.427734375 (ms for 10 runs)

testing (1, 2) + <class 'numpy.complex128'> + keepdims=True + max ...
CUB enabled: 6.685952186584473, CUB disabled: 6.8291521072387695, numpy: 638.6754760742188 (ms for 10 runs)

testing (1, 2) + <class 'numpy.complex128'> + keepdims=False + min ...
CUB enabled: 6.735968112945557, CUB disabled: 6.819039821624756, numpy: 635.2161254882812 (ms for 10 runs)

testing (1, 2) + <class 'numpy.complex128'> + keepdims=True + min ...
CUB enabled: 6.708255767822266, CUB disabled: 6.812320232391357, numpy: 634.897705078125 (ms for 10 runs)

P100

testing (2,) + <class 'numpy.int64'> + keepdims=False + sum ...
CUB enabled: 158.17459106445312, CUB disabled: 192.68701171875, numpy: 254.448486328125 (ms for 10 runs)

testing (2,) + <class 'numpy.int64'> + keepdims=True + sum ...
CUB enabled: 6.691711902618408, CUB disabled: 23.84649658203125, numpy: 275.48736572265625 (ms for 10 runs)

testing (2,) + <class 'numpy.int64'> + keepdims=False + max ...
CUB enabled: 7.471424102783203, CUB disabled: 213.34307861328125, numpy: 295.1151428222656 (ms for 10 runs)

testing (2,) + <class 'numpy.int64'> + keepdims=True + max ...
CUB enabled: 7.125919818878174, CUB disabled: 36.474143981933594, numpy: 293.6680603027344 (ms for 10 runs)

testing (2,) + <class 'numpy.int64'> + keepdims=False + min ...
CUB enabled: 7.476480007171631, CUB disabled: 218.4378204345703, numpy: 292.5193176269531 (ms for 10 runs)

testing (2,) + <class 'numpy.int64'> + keepdims=True + min ...
CUB enabled: 7.14137601852417, CUB disabled: 36.54336166381836, numpy: 293.55230712890625 (ms for 10 runs)

testing (1, 2) + <class 'numpy.int64'> + keepdims=False + sum ...
CUB enabled: 4.858687877655029, CUB disabled: 6.090015888214111, numpy: 236.162109375 (ms for 10 runs)

testing (1, 2) + <class 'numpy.int64'> + keepdims=True + sum ...
CUB enabled: 4.900608062744141, CUB disabled: 6.078688144683838, numpy: 226.48243713378906 (ms for 10 runs)

testing (1, 2) + <class 'numpy.int64'> + keepdims=False + max ...
CUB enabled: 4.90169620513916, CUB disabled: 7.151328086853027, numpy: 269.8446350097656 (ms for 10 runs)

testing (1, 2) + <class 'numpy.int64'> + keepdims=True + max ...
CUB enabled: 4.9027838706970215, CUB disabled: 7.132256031036377, numpy: 270.1627197265625 (ms for 10 runs)

testing (1, 2) + <class 'numpy.int64'> + keepdims=False + min ...
CUB enabled: 4.907104015350342, CUB disabled: 7.1220479011535645, numpy: 269.7564697265625 (ms for 10 runs)

testing (1, 2) + <class 'numpy.int64'> + keepdims=True + min ...
CUB enabled: 5.051519870758057, CUB disabled: 7.143008232116699, numpy: 268.67388916015625 (ms for 10 runs)


testing (2,) + <class 'numpy.float32'> + keepdims=False + sum ...
CUB enabled: 9.227775573730469, CUB disabled: 164.5898895263672, numpy: 109.55430603027344 (ms for 10 runs)

testing (2,) + <class 'numpy.float32'> + keepdims=True + sum ...
CUB enabled: 9.207839965820312, CUB disabled: 22.624256134033203, numpy: 109.48143768310547 (ms for 10 runs)

testing (2,) + <class 'numpy.float32'> + keepdims=False + max ...
CUB enabled: 10.012703895568848, CUB disabled: 170.60137939453125, numpy: 188.28297424316406 (ms for 10 runs)

testing (2,) + <class 'numpy.float32'> + keepdims=True + max ...
CUB enabled: 10.036447525024414, CUB disabled: 27.08883285522461, numpy: 188.72157287597656 (ms for 10 runs)

testing (2,) + <class 'numpy.float32'> + keepdims=False + min ...
CUB enabled: 10.208127975463867, CUB disabled: 249.4561004638672, numpy: 185.12342834472656 (ms for 10 runs)

testing (2,) + <class 'numpy.float32'> + keepdims=True + min ...
CUB enabled: 9.980159759521484, CUB disabled: 27.08415985107422, numpy: 184.98342895507812 (ms for 10 runs)

testing (1, 2) + <class 'numpy.float32'> + keepdims=False + sum ...
CUB enabled: 2.5949440002441406, CUB disabled: 5.799903869628906, numpy: 107.14447784423828 (ms for 10 runs)

testing (1, 2) + <class 'numpy.float32'> + keepdims=True + sum ...
CUB enabled: 2.576064109802246, CUB disabled: 5.746912002563477, numpy: 107.15792083740234 (ms for 10 runs)

testing (1, 2) + <class 'numpy.float32'> + keepdims=False + max ...
CUB enabled: 2.606816053390503, CUB disabled: 6.88259220123291, numpy: 107.11225891113281 (ms for 10 runs)

testing (1, 2) + <class 'numpy.float32'> + keepdims=True + max ...
CUB enabled: 2.5854079723358154, CUB disabled: 6.897503852844238, numpy: 107.15106964111328 (ms for 10 runs)

testing (1, 2) + <class 'numpy.float32'> + keepdims=False + min ...
CUB enabled: 2.5899200439453125, CUB disabled: 6.867455959320068, numpy: 107.16185760498047 (ms for 10 runs)

testing (1, 2) + <class 'numpy.float32'> + keepdims=True + min ...
CUB enabled: 2.589951992034912, CUB disabled: 6.889472007751465, numpy: 107.1480941772461 (ms for 10 runs)


testing (2,) + <class 'numpy.float64'> + keepdims=False + sum ...
CUB enabled: 6.411776065826416, CUB disabled: 163.3095703125, numpy: 215.0654754638672 (ms for 10 runs)

testing (2,) + <class 'numpy.float64'> + keepdims=True + sum ...
CUB enabled: 6.392255783081055, CUB disabled: 23.592191696166992, numpy: 214.8771209716797 (ms for 10 runs)

testing (2,) + <class 'numpy.float64'> + keepdims=False + max ...
CUB enabled: 7.2057600021362305, CUB disabled: 183.574951171875, numpy: 283.9091796875 (ms for 10 runs)

testing (2,) + <class 'numpy.float64'> + keepdims=True + max ...
CUB enabled: 7.1656317710876465, CUB disabled: 37.86649703979492, numpy: 280.8412170410156 (ms for 10 runs)

testing (2,) + <class 'numpy.float64'> + keepdims=False + min ...
CUB enabled: 7.2012481689453125, CUB disabled: 217.07376098632812, numpy: 283.88800048828125 (ms for 10 runs)

testing (2,) + <class 'numpy.float64'> + keepdims=True + min ...
CUB enabled: 7.159264087677002, CUB disabled: 37.8087043762207, numpy: 283.2445373535156 (ms for 10 runs)

testing (1, 2) + <class 'numpy.float64'> + keepdims=False + sum ...
CUB enabled: 4.906911849975586, CUB disabled: 6.032288074493408, numpy: 214.05580139160156 (ms for 10 runs)

testing (1, 2) + <class 'numpy.float64'> + keepdims=True + sum ...
CUB enabled: 4.892704010009766, CUB disabled: 6.031551837921143, numpy: 214.02120971679688 (ms for 10 runs)

testing (1, 2) + <class 'numpy.float64'> + keepdims=False + max ...
CUB enabled: 4.883327960968018, CUB disabled: 7.628384113311768, numpy: 214.1231689453125 (ms for 10 runs)

testing (1, 2) + <class 'numpy.float64'> + keepdims=True + max ...
CUB enabled: 4.899424076080322, CUB disabled: 7.6326398849487305, numpy: 214.26368713378906 (ms for 10 runs)

testing (1, 2) + <class 'numpy.float64'> + keepdims=False + min ...
CUB enabled: 4.891871929168701, CUB disabled: 7.645535945892334, numpy: 214.16061401367188 (ms for 10 runs)

testing (1, 2) + <class 'numpy.float64'> + keepdims=True + min ...
CUB enabled: 4.886752128601074, CUB disabled: 7.62505578994751, numpy: 214.15487670898438 (ms for 10 runs)


testing (2,) + <class 'numpy.complex64'> + keepdims=False + sum ...
CUB enabled: 6.561183929443359, CUB disabled: 195.32579040527344, numpy: 214.93238830566406 (ms for 10 runs)

testing (2,) + <class 'numpy.complex64'> + keepdims=True + sum ...
CUB enabled: 6.267519950866699, CUB disabled: 24.168127059936523, numpy: 215.0153350830078 (ms for 10 runs)

testing (2,) + <class 'numpy.complex64'> + keepdims=False + max ...
CUB enabled: 10.781439781188965, CUB disabled: 227.33648681640625, numpy: 675.6613159179688 (ms for 10 runs)

testing (2,) + <class 'numpy.complex64'> + keepdims=True + max ...
CUB enabled: 10.726655960083008, CUB disabled: 39.518367767333984, numpy: 675.7874145507812 (ms for 10 runs)

testing (2,) + <class 'numpy.complex64'> + keepdims=False + min ...
CUB enabled: 11.331647872924805, CUB disabled: 222.86463928222656, numpy: 679.201171875 (ms for 10 runs)

testing (2,) + <class 'numpy.complex64'> + keepdims=True + min ...
CUB enabled: 11.295616149902344, CUB disabled: 40.65657424926758, numpy: 679.2305297851562 (ms for 10 runs)

testing (1, 2) + <class 'numpy.complex64'> + keepdims=False + sum ...
CUB enabled: 5.213119983673096, CUB disabled: 6.094751834869385, numpy: 214.23231506347656 (ms for 10 runs)

testing (1, 2) + <class 'numpy.complex64'> + keepdims=True + sum ...
CUB enabled: 5.1902079582214355, CUB disabled: 6.089407920837402, numpy: 215.09222412109375 (ms for 10 runs)

testing (1, 2) + <class 'numpy.complex64'> + keepdims=False + max ...
CUB enabled: 5.313888072967529, CUB disabled: 11.043071746826172, numpy: 597.8594970703125 (ms for 10 runs)

testing (1, 2) + <class 'numpy.complex64'> + keepdims=True + max ...
CUB enabled: 5.155648231506348, CUB disabled: 11.02444839477539, numpy: 595.028564453125 (ms for 10 runs)

testing (1, 2) + <class 'numpy.complex64'> + keepdims=False + min ...
CUB enabled: 5.176447868347168, CUB disabled: 10.753503799438477, numpy: 592.8912353515625 (ms for 10 runs)

testing (1, 2) + <class 'numpy.complex64'> + keepdims=True + min ...
CUB enabled: 5.166143894195557, CUB disabled: 10.75494384765625, numpy: 591.92626953125 (ms for 10 runs)


testing (2,) + <class 'numpy.complex128'> + keepdims=False + sum ...
CUB enabled: 9.505727767944336, CUB disabled: 201.5979766845703, numpy: 430.1026306152344 (ms for 10 runs)

testing (2,) + <class 'numpy.complex128'> + keepdims=True + sum ...
CUB enabled: 9.487327575683594, CUB disabled: 29.772415161132812, numpy: 431.1112060546875 (ms for 10 runs)

testing (2,) + <class 'numpy.complex128'> + keepdims=False + max ...
CUB enabled: 11.57203197479248, CUB disabled: 239.65811157226562, numpy: 780.9832153320312 (ms for 10 runs)

testing (2,) + <class 'numpy.complex128'> + keepdims=True + max ...
CUB enabled: 11.535167694091797, CUB disabled: 49.23401641845703, numpy: 777.5499877929688 (ms for 10 runs)

testing (2,) + <class 'numpy.complex128'> + keepdims=False + min ...
CUB enabled: 11.541312217712402, CUB disabled: 238.08543395996094, numpy: 777.5201416015625 (ms for 10 runs)

testing (2,) + <class 'numpy.complex128'> + keepdims=True + min ...
CUB enabled: 11.510815620422363, CUB disabled: 47.544193267822266, numpy: 782.5574951171875 (ms for 10 runs)

testing (1, 2) + <class 'numpy.complex128'> + keepdims=False + sum ...
CUB enabled: 12.151040077209473, CUB disabled: 9.331232070922852, numpy: 427.8769836425781 (ms for 10 runs)

testing (1, 2) + <class 'numpy.complex128'> + keepdims=True + sum ...
CUB enabled: 12.151007652282715, CUB disabled: 9.328607559204102, numpy: 427.6109619140625 (ms for 10 runs)

testing (1, 2) + <class 'numpy.complex128'> + keepdims=False + max ...
CUB enabled: 11.710016250610352, CUB disabled: 11.771583557128906, numpy: 707.5844116210938 (ms for 10 runs)

testing (1, 2) + <class 'numpy.complex128'> + keepdims=True + max ...
CUB enabled: 11.767647743225098, CUB disabled: 11.764448165893555, numpy: 706.73388671875 (ms for 10 runs)

testing (1, 2) + <class 'numpy.complex128'> + keepdims=False + min ...
CUB enabled: 11.73151969909668, CUB disabled: 11.554464340209961, numpy: 697.577392578125 (ms for 10 runs)

testing (1, 2) + <class 'numpy.complex128'> + keepdims=True + min ...
CUB enabled: 11.666943550109863, CUB disabled: 11.488544464111328, numpy: 701.9141845703125 (ms for 10 runs)

@leofang
Copy link
Copy Markdown
Member Author

leofang commented Oct 30, 2019

@emcastillo thanks a lot for your help! This is really encouraging: except for testing (2,) + <class 'numpy.int64'> + keepdims=False + sum ... on P100 which was being abnormal, in almost all cases CUB segment reduction is either faster or competitive with the old reduction kernel, even for complex numbers. What do you think if we proceed with this PR as is?

I do have some result for loop + device reduce (see https://github.com/leofang/cupy/tree/cub_enhance_loop), and I find that only when the number of reduced elements is too large (so that the shared memory cannot fit all elements), and when using complex numbers, can this be faster than the segment approach. For example, with shape=(64, 128, 128, 128) and axis=(1, 2, 3), this is what I got on K40:

(device segment reduction only)

(64, 128, 128, 128)
testing (1, 2, 3) + <class 'numpy.int64'> + keepdims=False + sum ...
CUB enabled: 64.27254486083984, CUB disabled: 119.3239974975586, numpy: 810.6727294921875 (ms for 10 runs)

testing (1, 2, 3) + <class 'numpy.float32'> + keepdims=False + sum ...
CUB enabled: 31.385408401489258, CUB disabled: 101.96959686279297, numpy: 478.1734924316406 (ms for 10 runs)

testing (1, 2, 3) + <class 'numpy.float64'> + keepdims=False + sum ...
CUB enabled: 62.205055236816406, CUB disabled: 112.30735778808594, numpy: 644.5498046875 (ms for 10 runs)

testing (1, 2, 3) + <class 'numpy.complex64'> + keepdims=False + sum ...
test_CUB_reduction_segmented2_profiling.py:63: PerformanceWarning: CUB reduction for complex numbers may not be highly performant. If concerned, set cupy.cuda.cub_enabled=False to switch to CuPy's internal reduction routine and compare the timings.
  y = getattr(x, func)(axis=axis, keepdims=keepdims)
CUB enabled: 144.87974548339844, CUB disabled: 116.86566162109375, numpy: 964.119140625 (ms for 10 runs)

testing (1, 2, 3) + <class 'numpy.complex128'> + keepdims=False + sum ...
CUB enabled: 287.25518798828125, CUB disabled: 183.4674835205078, numpy: 1264.8275146484375 (ms for 10 runs)

(loop + device reduction)

testing (1, 2, 3) + <class 'numpy.int64'> + keepdims=False + sum ...
CUB enabled: 71.8760986328125, CUB disabled: 119.16812896728516, numpy: 817.7650146484375 (ms for 10 runs)

testing (1, 2, 3) + <class 'numpy.float32'> + keepdims=False + sum ...
CUB enabled: 40.36908721923828, CUB disabled: 101.73426818847656, numpy: 481.5826416015625 (ms for 10 runs)

testing (1, 2, 3) + <class 'numpy.float64'> + keepdims=False + sum ...
CUB enabled: 70.78514862060547, CUB disabled: 112.27244567871094, numpy: 646.6651000976562 (ms for 10 runs)

testing (1, 2, 3) + <class 'numpy.complex64'> + keepdims=False + sum ...
test_CUB_reduction_segmented2_profiling.py:63: PerformanceWarning: CUB reduction for complex numbers may not be highly performant. If concerned, set cupy.cuda.cub_enabled=False to switch to CuPy's internal reduction routine and compare the timings.
  y = getattr(x, func)(axis=axis, keepdims=keepdims)
CUB enabled: 137.6884765625, CUB disabled: 116.92912292480469, numpy: 960.3656005859375 (ms for 10 runs)

testing (1, 2, 3) + <class 'numpy.complex128'> + keepdims=False + sum ...
CUB enabled: 262.8611145019531, CUB disabled: 183.7587890625, numpy: 1265.3023681640625 (ms for 10 runs)

Presumably limited by kernel launch overhead. I don't think this route is worth pursuing given the numbers you posted, do you?

@emcastillo
Copy link
Copy Markdown
Member

I just rerun again on p100 and numbers look better now, probably it was kernel compilation and loading.

I think that the PR as-is "could" be merged, but the performance differences on widely used devices are a bit scary. Let us discuss it!

testing (2,) + <class 'numpy.int64'> + keepdims=False + sum ...
CUB enabled: 6.818912029266357, CUB disabled: 22.72012710571289, numpy: 249.3230438232422 (ms for 10 runs)

testing (2,) + <class 'numpy.int64'> + keepdims=True + sum ...
CUB enabled: 6.091648101806641, CUB disabled: 21.556543350219727, numpy: 235.09225463867188 (ms for 10 runs)

testing (2,) + <class 'numpy.int64'> + keepdims=False + max ...
CUB enabled: 6.661407947540283, CUB disabled: 34.2281608581543, numpy: 279.7550048828125 (ms for 10 runs)

testing (2,) + <class 'numpy.int64'> + keepdims=True + max ...
CUB enabled: 6.712863922119141, CUB disabled: 33.19340896606445, numpy: 278.5088195800781 (ms for 10 runs)

testing (2,) + <class 'numpy.int64'> + keepdims=False + min ...
CUB enabled: 6.616191864013672, CUB disabled: 34.026206970214844, numpy: 277.649169921875 (ms for 10 runs)

testing (2,) + <class 'numpy.int64'> + keepdims=True + min ...
CUB enabled: 6.6544318199157715, CUB disabled: 33.179359436035156, numpy: 277.8271179199219 (ms for 10 runs)

testing (1, 2) + <class 'numpy.int64'> + keepdims=False + sum ...
CUB enabled: 4.811039924621582, CUB disabled: 5.815072059631348, numpy: 220.1744384765625 (ms for 10 runs)

testing (1, 2) + <class 'numpy.int64'> + keepdims=True + sum ...
CUB enabled: 4.875072002410889, CUB disabled: 5.779839992523193, numpy: 220.3600616455078 (ms for 10 runs)

testing (1, 2) + <class 'numpy.int64'> + keepdims=False + max ...
CUB enabled: 4.885183811187744, CUB disabled: 6.654367923736572, numpy: 255.6682586669922 (ms for 10 runs)

testing (1, 2) + <class 'numpy.int64'> + keepdims=True + max ...
CUB enabled: 5.021599769592285, CUB disabled: 6.6690239906311035, numpy: 249.99696350097656 (ms for 10 runs)

testing (1, 2) + <class 'numpy.int64'> + keepdims=False + min ...
CUB enabled: 4.867712020874023, CUB disabled: 6.650815963745117, numpy: 248.69007873535156 (ms for 10 runs)

testing (1, 2) + <class 'numpy.int64'> + keepdims=True + min ...
CUB enabled: 4.8449602127075195, CUB disabled: 6.645599842071533, numpy: 249.23257446289062 (ms for 10 runs)


testing (2,) + <class 'numpy.float32'> + keepdims=False + sum ...
CUB enabled: 8.422752380371094, CUB disabled: 21.236576080322266, numpy: 108.02063751220703 (ms for 10 runs)

testing (2,) + <class 'numpy.float32'> + keepdims=True + sum ...
CUB enabled: 8.319968223571777, CUB disabled: 20.40991973876953, numpy: 107.6957778930664 (ms for 10 runs)

testing (2,) + <class 'numpy.float32'> + keepdims=False + max ...
CUB enabled: 9.020575523376465, CUB disabled: 25.09596824645996, numpy: 179.50909423828125 (ms for 10 runs)

testing (2,) + <class 'numpy.float32'> + keepdims=True + max ...
CUB enabled: 9.005536079406738, CUB disabled: 24.457599639892578, numpy: 181.5095672607422 (ms for 10 runs)

testing (2,) + <class 'numpy.float32'> + keepdims=False + min ...
CUB enabled: 9.061247825622559, CUB disabled: 25.136991500854492, numpy: 182.08717346191406 (ms for 10 runs)

testing (2,) + <class 'numpy.float32'> + keepdims=True + min ...
CUB enabled: 9.234016418457031, CUB disabled: 24.479551315307617, numpy: 179.05197143554688 (ms for 10 runs)

testing (1, 2) + <class 'numpy.float32'> + keepdims=False + sum ...
CUB enabled: 2.5538880825042725, CUB disabled: 5.382527828216553, numpy: 107.0943374633789 (ms for 10 runs)

testing (1, 2) + <class 'numpy.float32'> + keepdims=True + sum ...
CUB enabled: 2.5557119846343994, CUB disabled: 5.4397759437561035, numpy: 107.14959716796875 (ms for 10 runs)

testing (1, 2) + <class 'numpy.float32'> + keepdims=False + max ...
CUB enabled: 2.5591039657592773, CUB disabled: 6.270815849304199, numpy: 107.04188537597656 (ms for 10 runs)

testing (1, 2) + <class 'numpy.float32'> + keepdims=True + max ...
CUB enabled: 2.5492799282073975, CUB disabled: 6.255616188049316, numpy: 107.05094146728516 (ms for 10 runs)

testing (1, 2) + <class 'numpy.float32'> + keepdims=False + min ...
CUB enabled: 2.5402560234069824, CUB disabled: 6.257984161376953, numpy: 107.02543640136719 (ms for 10 runs)

testing (1, 2) + <class 'numpy.float32'> + keepdims=True + min ...
CUB enabled: 2.534559965133667, CUB disabled: 6.2350077629089355, numpy: 107.01696014404297 (ms for 10 runs)


testing (2,) + <class 'numpy.float64'> + keepdims=False + sum ...
CUB enabled: 5.935711860656738, CUB disabled: 22.078399658203125, numpy: 214.9549102783203 (ms for 10 runs)

testing (2,) + <class 'numpy.float64'> + keepdims=True + sum ...
CUB enabled: 5.870783805847168, CUB disabled: 21.29372787475586, numpy: 214.51788330078125 (ms for 10 runs)

testing (2,) + <class 'numpy.float64'> + keepdims=False + max ...
CUB enabled: 6.488128185272217, CUB disabled: 34.81692886352539, numpy: 286.2601013183594 (ms for 10 runs)

testing (2,) + <class 'numpy.float64'> + keepdims=True + max ...
CUB enabled: 6.547520160675049, CUB disabled: 34.171329498291016, numpy: 286.7164611816406 (ms for 10 runs)

testing (2,) + <class 'numpy.float64'> + keepdims=False + min ...
CUB enabled: 6.5498881340026855, CUB disabled: 34.92444610595703, numpy: 277.0263977050781 (ms for 10 runs)

testing (2,) + <class 'numpy.float64'> + keepdims=True + min ...
CUB enabled: 6.532735824584961, CUB disabled: 34.17174530029297, numpy: 274.8606262207031 (ms for 10 runs)

testing (1, 2) + <class 'numpy.float64'> + keepdims=False + sum ...
CUB enabled: 4.863296031951904, CUB disabled: 5.78931188583374, numpy: 213.66358947753906 (ms for 10 runs)

testing (1, 2) + <class 'numpy.float64'> + keepdims=True + sum ...
CUB enabled: 4.855231761932373, CUB disabled: 5.738815784454346, numpy: 213.61859130859375 (ms for 10 runs)

testing (1, 2) + <class 'numpy.float64'> + keepdims=False + max ...
CUB enabled: 4.834496021270752, CUB disabled: 7.16326379776001, numpy: 214.7391357421875 (ms for 10 runs)

testing (1, 2) + <class 'numpy.float64'> + keepdims=True + max ...
CUB enabled: 4.897215843200684, CUB disabled: 7.1727681159973145, numpy: 213.8831024169922 (ms for 10 runs)

testing (1, 2) + <class 'numpy.float64'> + keepdims=False + min ...
CUB enabled: 4.847743988037109, CUB disabled: 7.162144184112549, numpy: 214.31671142578125 (ms for 10 runs)

testing (1, 2) + <class 'numpy.float64'> + keepdims=True + min ...
CUB enabled: 4.87446403503418, CUB disabled: 7.168767929077148, numpy: 214.6196746826172 (ms for 10 runs)


testing (2,) + <class 'numpy.complex64'> + keepdims=False + sum ...
CUB enabled: 6.229184150695801, CUB disabled: 22.571487426757812, numpy: 215.57997131347656 (ms for 10 runs)

testing (2,) + <class 'numpy.complex64'> + keepdims=True + sum ...
CUB enabled: 5.906208038330078, CUB disabled: 21.88412857055664, numpy: 215.05865478515625 (ms for 10 runs)

testing (2,) + <class 'numpy.complex64'> + keepdims=False + max ...
CUB enabled: 9.688287734985352, CUB disabled: 36.33760070800781, numpy: 676.1155395507812 (ms for 10 runs)

testing (2,) + <class 'numpy.complex64'> + keepdims=True + max ...
CUB enabled: 9.821151733398438, CUB disabled: 35.6357421875, numpy: 677.0209350585938 (ms for 10 runs)

testing (2,) + <class 'numpy.complex64'> + keepdims=False + min ...
CUB enabled: 10.258848190307617, CUB disabled: 37.816993713378906, numpy: 676.1074829101562 (ms for 10 runs)

testing (2,) + <class 'numpy.complex64'> + keepdims=True + min ...
CUB enabled: 10.242976188659668, CUB disabled: 36.916927337646484, numpy: 675.514892578125 (ms for 10 runs)

testing (1, 2) + <class 'numpy.complex64'> + keepdims=False + sum ...
CUB enabled: 5.13212776184082, CUB disabled: 5.773375988006592, numpy: 214.72213745117188 (ms for 10 runs)

testing (1, 2) + <class 'numpy.complex64'> + keepdims=True + sum ...
CUB enabled: 5.136832237243652, CUB disabled: 5.758016109466553, numpy: 215.09359741210938 (ms for 10 runs)

testing (1, 2) + <class 'numpy.complex64'> + keepdims=False + max ...
CUB enabled: 5.1075520515441895, CUB disabled: 10.019904136657715, numpy: 580.23876953125 (ms for 10 runs)

testing (1, 2) + <class 'numpy.complex64'> + keepdims=True + max ...
CUB enabled: 5.101376056671143, CUB disabled: 10.029855728149414, numpy: 584.4909057617188 (ms for 10 runs)

testing (1, 2) + <class 'numpy.complex64'> + keepdims=False + min ...
CUB enabled: 5.053152084350586, CUB disabled: 9.772128105163574, numpy: 589.012939453125 (ms for 10 runs)

testing (1, 2) + <class 'numpy.complex64'> + keepdims=True + min ...
CUB enabled: 5.097792148590088, CUB disabled: 9.786656379699707, numpy: 588.0137329101562 (ms for 10 runs)


testing (2,) + <class 'numpy.complex128'> + keepdims=False + sum ...
CUB enabled: 9.454815864562988, CUB disabled: 27.629568099975586, numpy: 431.1047058105469 (ms for 10 runs)

testing (2,) + <class 'numpy.complex128'> + keepdims=True + sum ...
CUB enabled: 9.403871536254883, CUB disabled: 26.93824005126953, numpy: 429.32647705078125 (ms for 10 runs)

testing (2,) + <class 'numpy.complex128'> + keepdims=False + max ...
CUB enabled: 11.250944137573242, CUB disabled: 45.729248046875, numpy: 764.6195068359375 (ms for 10 runs)

testing (2,) + <class 'numpy.complex128'> + keepdims=True + max ...
CUB enabled: 11.341792106628418, CUB disabled: 44.971168518066406, numpy: 753.3809814453125 (ms for 10 runs)

testing (2,) + <class 'numpy.complex128'> + keepdims=False + min ...
CUB enabled: 11.23862361907959, CUB disabled: 44.49545669555664, numpy: 757.4706420898438 (ms for 10 runs)

testing (2,) + <class 'numpy.complex128'> + keepdims=True + min ...
CUB enabled: 11.258432388305664, CUB disabled: 43.4134407043457, numpy: 758.1614990234375 (ms for 10 runs)

testing (1, 2) + <class 'numpy.complex128'> + keepdims=False + sum ...
CUB enabled: 11.57699203491211, CUB disabled: 9.222047805786133, numpy: 427.8691101074219 (ms for 10 runs)

testing (1, 2) + <class 'numpy.complex128'> + keepdims=True + sum ...
CUB enabled: 11.541760444641113, CUB disabled: 9.215007781982422, numpy: 428.3423767089844 (ms for 10 runs)

testing (1, 2) + <class 'numpy.complex128'> + keepdims=False + max ...
CUB enabled: 11.001376152038574, CUB disabled: 10.751808166503906, numpy: 678.6632690429688 (ms for 10 runs)

testing (1, 2) + <class 'numpy.complex128'> + keepdims=True + max ...
CUB enabled: 11.003199577331543, CUB disabled: 10.73964786529541, numpy: 660.0641479492188 (ms for 10 runs)

testing (1, 2) + <class 'numpy.complex128'> + keepdims=False + min ...
CUB enabled: 10.953215599060059, CUB disabled: 10.619680404663086, numpy: 677.80712890625 (ms for 10 runs)

testing (1, 2) + <class 'numpy.complex128'> + keepdims=True + min ...
CUB enabled: 10.993791580200195, CUB disabled: 10.625439643859863, numpy: 671.8070678710938 (ms for 10 runs)

@leofang
Copy link
Copy Markdown
Member Author

leofang commented Oct 30, 2019

I think that the PR as-is "could" be merged, but the performance differences on widely used devices are a bit scary. Let us discuss it!

First, do we agree that the performance issue is only for dim of axis > 1 + complex numbers? For all other cases, CUB segment reduction is always superior.

Next, if we want to be more conservative, we could let complex numbers always fall back to the old reduction kernel (as you originally suggested).

To be slightly less conservative, we could have a CUB_COMPLEX_DISABLED env variable (we already have CUB_DISABLED and CUB_PATH for this peculiar module anyway...) that controls the behavior. (And make it default to "disabled".)

The next level is then to issue a PerformanceWarning warning (like what I do here) to raise awareness.

Finally, I'd like to note that users can set cupy.cuda.cub_enabled=False to fallback (as we do in the test script), so the behavior can be changed in runtime.

@emcastillo
Copy link
Copy Markdown
Member

emcastillo commented Oct 30, 2019

First, do we agree that the performance issue is only for dim of axis > 1 + complex numbers? For all other cases, CUB segment reduction is always superior.

Yes!

Next, if we want to be more conservative, we could let complex numbers always fall back to the old reduction kernel (as you originally suggested).
I don't think this is good, as the performance improvements are large in several cases.

Maybe a fallback per compute capability might be good?

To be slightly less conservative, we could have a CUB_COMPLEX_DISABLED env variable (we already have CUB_DISABLED and CUB_PATH for this peculiar module anyway...) that controls the behavior. (And make it default to "disabled".)

This could be an option but increases the number of parameters and its not trivial for users.

The next level is then to issue a PerformanceWarning warning (like what I do here) to raise awareness.

I think it is the most reasonable option.

@leofang
Copy link
Copy Markdown
Member Author

leofang commented Oct 30, 2019

First, do we agree that the performance issue is only for dim of axis > 1 + complex numbers? For all other cases, CUB segment reduction is always superior.

Yes!

Great! 👍

Next, if we want to be more conservative, we could let complex numbers always fall back to the old reduction kernel (as you originally suggested).

I don't think this is good, as the performance improvements are large in several cases.
Maybe a fallback per compute capability might be good?

Well, I don't have low-end GPUs (lower than K40) for testing, so I am not sure which CC is impacted other than 35, 60, and 70 according to all data collected above. Plus, I am not sure if there could be performance variance within the same CC.

btw, what's the lowest CC that CuPy supports?

To be slightly less conservative, we could have a CUB_COMPLEX_DISABLED env variable (we already have CUB_DISABLED and CUB_PATH for this peculiar module anyway...) that controls the behavior. (And make it default to "disabled".)

This could be an option but increases the number of parameters and its not trivial for users.

Agree.

The next level is then to issue a PerformanceWarning warning (like what I do here) to raise awareness.

I think it is the most reasonable option.

If this is reasonable, then let me know what else I can do to further improve this PR.

Thanks!

@emcastillo
Copy link
Copy Markdown
Member

emcastillo commented Oct 31, 2019

IMO this PR does not need any further change.
But I want to show it to some members of the team today before giving the green light!

Thanks!

@emcastillo
Copy link
Copy Markdown
Member

We discussed that is better to put this behavior in the documentation rather than to raise a warning!
Thanks!

@leofang
Copy link
Copy Markdown
Member Author

leofang commented Oct 31, 2019

@emcastillo May I propose we make the doc changes in a separate PR? I will keep track the needs in #2519. There are a few things to be discussed with you folks:

Thanks.

@leofang
Copy link
Copy Markdown
Member Author

leofang commented Oct 31, 2019

Back to the performance issue. Yesterday I ran it through nvprof, and I guess because complex<T> is not properly aligned, the global memory load is worse at least by a factor of 2.

For example, for float the efficiency is 100%, so for complex<float>I'd expect a 50% efficiency, but I got only 25% (see the red circle):
螢幕快照 2019-10-30 下午4 54 36
By the way, the load efficiency for double, which has the same size as complex<float>, is also 100%...

Any suggestion how I can enforce the alignment? I tried changing this line


to struct alignas(sizeof(T)*2) complex {. Building CuPy works, but when actually running CuPy, the NVRTC compiler errors out.

@leofang
Copy link
Copy Markdown
Member Author

leofang commented Oct 31, 2019

Solved it!!!!! The alignment was the right cause. This is the new statistics for shape=(512, 256, 256):

testing (2,) + <class 'numpy.complex64'> + keepdims=False + sum ...
test_CUB_reduction_segmented4.py:43: PerformanceWarning: CUB reduction for complex numbers may not be highly performant. If concerned, set cupy.cuda.cub_enabled=False to switch to CuPy's internal reduction routine and compare the timings.
  y = getattr(x, func)(axis=axis, keepdims=keepdims)
CUB enabled: 21.589439392089844, CUB disabled: 93.0334701538086, numpy: 252.68380737304688 (ms for 10 runs)

testing (2,) + <class 'numpy.complex64'> + keepdims=True + sum ...
CUB enabled: 20.405792236328125, CUB disabled: 91.2323226928711, numpy: 251.8441619873047 (ms for 10 runs)

testing (2,) + <class 'numpy.complex64'> + keepdims=False + max ...
CUB enabled: 36.77443313598633, CUB disabled: 308.7342224121094, numpy: 750.37548828125 (ms for 10 runs)

testing (2,) + <class 'numpy.complex64'> + keepdims=True + max ...
CUB enabled: 36.791072845458984, CUB disabled: 306.9769592285156, numpy: 749.851318359375 (ms for 10 runs)

testing (2,) + <class 'numpy.complex64'> + keepdims=False + min ...
CUB enabled: 39.43900680541992, CUB disabled: 304.9297180175781, numpy: 748.7946166992188 (ms for 10 runs)

testing (2,) + <class 'numpy.complex64'> + keepdims=True + min ...
CUB enabled: 39.42758560180664, CUB disabled: 303.3099670410156, numpy: 748.8102416992188 (ms for 10 runs)


testing (1, 2) + <class 'numpy.complex64'> + keepdims=False + sum ...
CUB enabled: 19.048479080200195, CUB disabled: 21.786975860595703, numpy: 240.23968505859375 (ms for 10 runs)

testing (1, 2) + <class 'numpy.complex64'> + keepdims=True + sum ...
CUB enabled: 19.756000518798828, CUB disabled: 21.738367080688477, numpy: 239.53424072265625 (ms for 10 runs)

testing (1, 2) + <class 'numpy.complex64'> + keepdims=False + max ...
CUB enabled: 18.61587142944336, CUB disabled: 33.71635055541992, numpy: 650.2034912109375 (ms for 10 runs)

testing (1, 2) + <class 'numpy.complex64'> + keepdims=True + max ...
CUB enabled: 18.659072875976562, CUB disabled: 33.72467041015625, numpy: 649.459716796875 (ms for 10 runs)

testing (1, 2) + <class 'numpy.complex64'> + keepdims=False + min ...
CUB enabled: 18.6615047454834, CUB disabled: 32.596065521240234, numpy: 650.4429931640625 (ms for 10 runs)

testing (1, 2) + <class 'numpy.complex64'> + keepdims=True + min ...
CUB enabled: 18.658367156982422, CUB disabled: 32.60847854614258, numpy: 651.3851928710938 (ms for 10 runs)


testing (2,) + <class 'numpy.complex128'> + keepdims=False + sum ...
CUB enabled: 30.88582420349121, CUB disabled: 98.8631362915039, numpy: 330.6220703125 (ms for 10 runs)

testing (2,) + <class 'numpy.complex128'> + keepdims=True + sum ...
CUB enabled: 30.954591751098633, CUB disabled: 97.02825927734375, numpy: 327.2279357910156 (ms for 10 runs)

testing (2,) + <class 'numpy.complex128'> + keepdims=False + max ...
CUB enabled: 50.54518508911133, CUB disabled: 492.4873352050781, numpy: 782.5479736328125 (ms for 10 runs)

testing (2,) + <class 'numpy.complex128'> + keepdims=True + max ...
CUB enabled: 50.530879974365234, CUB disabled: 490.4502868652344, numpy: 783.1376342773438 (ms for 10 runs)

testing (2,) + <class 'numpy.complex128'> + keepdims=False + min ...
CUB enabled: 49.9156494140625, CUB disabled: 492.0494689941406, numpy: 782.93359375 (ms for 10 runs)

testing (2,) + <class 'numpy.complex128'> + keepdims=True + min ...
CUB enabled: 49.92601776123047, CUB disabled: 490.24945068359375, numpy: 782.4859008789062 (ms for 10 runs)


testing (1, 2) + <class 'numpy.complex128'> + keepdims=False + sum ...
CUB enabled: 39.15907287597656, CUB disabled: 33.71004867553711, numpy: 315.7526550292969 (ms for 10 runs)

testing (1, 2) + <class 'numpy.complex128'> + keepdims=True + sum ...
CUB enabled: 38.656158447265625, CUB disabled: 33.73596954345703, numpy: 315.71954345703125 (ms for 10 runs)

testing (1, 2) + <class 'numpy.complex128'> + keepdims=False + max ...
CUB enabled: 36.26723098754883, CUB disabled: 41.069664001464844, numpy: 690.4808349609375 (ms for 10 runs)

testing (1, 2) + <class 'numpy.complex128'> + keepdims=True + max ...
CUB enabled: 36.23664093017578, CUB disabled: 41.13612747192383, numpy: 689.9596557617188 (ms for 10 runs)

testing (1, 2) + <class 'numpy.complex128'> + keepdims=False + min ...
CUB enabled: 36.09084701538086, CUB disabled: 40.29679870605469, numpy: 689.118896484375 (ms for 10 runs)

testing (1, 2) + <class 'numpy.complex128'> + keepdims=True + min ...
CUB enabled: 36.07062530517578, CUB disabled: 40.30598449707031, numpy: 689.8553466796875 (ms for 10 runs)

Only in the case of testing (1, 2) + <class 'numpy.complex128'> + keepdims=True + sum ... is CUB marginally slower than the current reduction.

For a bigger array (shape = (64, 128, 128, 128)), CUB is always superior:

testing (3,) + <class 'numpy.int64'> + keepdims=False + sum ...
CUB enabled: 125.32758331298828, CUB disabled: 362.703857421875, numpy: 1027.164794921875 (ms for 10 runs)

testing (3,) + <class 'numpy.int64'> + keepdims=True + sum ...
CUB enabled: 123.53785705566406, CUB disabled: 360.95654296875, numpy: 1004.0859375 (ms for 10 runs)

testing (3,) + <class 'numpy.int64'> + keepdims=False + max ...
CUB enabled: 124.47734069824219, CUB disabled: 887.1912841796875, numpy: 1439.5657958984375 (ms for 10 runs)

testing (3,) + <class 'numpy.int64'> + keepdims=True + max ...
CUB enabled: 124.54975891113281, CUB disabled: 885.3906860351562, numpy: 1435.13671875 (ms for 10 runs)

testing (3,) + <class 'numpy.int64'> + keepdims=False + min ...
CUB enabled: 124.43251037597656, CUB disabled: 886.9925537109375, numpy: 1428.9320068359375 (ms for 10 runs)

testing (3,) + <class 'numpy.int64'> + keepdims=True + min ...
CUB enabled: 124.53241729736328, CUB disabled: 885.4948120117188, numpy: 1430.046630859375 (ms for 10 runs)

testing (2, 3) + <class 'numpy.int64'> + keepdims=False + sum ...
CUB enabled: 58.7993278503418, CUB disabled: 98.21481323242188, numpy: 810.4136352539062 (ms for 10 runs)

testing (2, 3) + <class 'numpy.int64'> + keepdims=True + sum ...
CUB enabled: 58.77036666870117, CUB disabled: 98.0830078125, numpy: 809.3361206054688 (ms for 10 runs)

testing (2, 3) + <class 'numpy.int64'> + keepdims=False + max ...
CUB enabled: 58.769630432128906, CUB disabled: 114.79561614990234, numpy: 1239.73095703125 (ms for 10 runs)

testing (2, 3) + <class 'numpy.int64'> + keepdims=True + max ...
CUB enabled: 58.808414459228516, CUB disabled: 114.75456237792969, numpy: 1246.2822265625 (ms for 10 runs)

testing (2, 3) + <class 'numpy.int64'> + keepdims=False + min ...
CUB enabled: 58.781185150146484, CUB disabled: 114.70614624023438, numpy: 1258.4639892578125 (ms for 10 runs)

testing (2, 3) + <class 'numpy.int64'> + keepdims=True + min ...
CUB enabled: 58.77888107299805, CUB disabled: 114.72930908203125, numpy: 1252.447509765625 (ms for 10 runs)

testing (1, 2, 3) + <class 'numpy.int64'> + keepdims=False + sum ...
CUB enabled: 62.06032180786133, CUB disabled: 117.04822540283203, numpy: 811.9835815429688 (ms for 10 runs)

testing (1, 2, 3) + <class 'numpy.int64'> + keepdims=True + sum ...
CUB enabled: 61.999969482421875, CUB disabled: 117.04557037353516, numpy: 807.438232421875 (ms for 10 runs)

testing (1, 2, 3) + <class 'numpy.int64'> + keepdims=False + max ...
CUB enabled: 61.93446350097656, CUB disabled: 116.19235229492188, numpy: 1246.386474609375 (ms for 10 runs)

testing (1, 2, 3) + <class 'numpy.int64'> + keepdims=True + max ...
CUB enabled: 62.0181770324707, CUB disabled: 116.21868896484375, numpy: 1243.169189453125 (ms for 10 runs)

testing (1, 2, 3) + <class 'numpy.int64'> + keepdims=False + min ...
CUB enabled: 61.93008041381836, CUB disabled: 116.20188903808594, numpy: 1240.9537353515625 (ms for 10 runs)

testing (1, 2, 3) + <class 'numpy.int64'> + keepdims=True + min ...
CUB enabled: 61.923614501953125, CUB disabled: 116.22978973388672, numpy: 1253.59033203125 (ms for 10 runs)

testing (0, 1, 2, 3) + <class 'numpy.int64'> + keepdims=False + sum ...
CUB enabled: 58.66387176513672, CUB disabled: 1606.14453125, numpy: 808.1597900390625 (ms for 10 runs)

testing (0, 1, 2, 3) + <class 'numpy.int64'> + keepdims=True + sum ...
CUB enabled: 58.65580749511719, CUB disabled: 1605.620849609375, numpy: 804.99267578125 (ms for 10 runs)

testing (0, 1, 2, 3) + <class 'numpy.int64'> + keepdims=False + max ...
CUB enabled: 58.674400329589844, CUB disabled: 1723.62841796875, numpy: 1236.4197998046875 (ms for 10 runs)

testing (0, 1, 2, 3) + <class 'numpy.int64'> + keepdims=True + max ...
CUB enabled: 58.68755340576172, CUB disabled: 1707.755615234375, numpy: 1236.203369140625 (ms for 10 runs)

testing (0, 1, 2, 3) + <class 'numpy.int64'> + keepdims=False + min ...
CUB enabled: 58.719295501708984, CUB disabled: 1723.857666015625, numpy: 1240.292236328125 (ms for 10 runs)

testing (0, 1, 2, 3) + <class 'numpy.int64'> + keepdims=True + min ...
CUB enabled: 58.663902282714844, CUB disabled: 1707.478515625, numpy: 1243.048095703125 (ms for 10 runs)


testing (3,) + <class 'numpy.float32'> + keepdims=False + sum ...
CUB enabled: 221.85862731933594, CUB disabled: 344.1480407714844, numpy: 594.65234375 (ms for 10 runs)

testing (3,) + <class 'numpy.float32'> + keepdims=True + sum ...
CUB enabled: 221.785400390625, CUB disabled: 342.599365234375, numpy: 597.0825805664062 (ms for 10 runs)

testing (3,) + <class 'numpy.float32'> + keepdims=False + max ...
CUB enabled: 233.88851928710938, CUB disabled: 476.7743225097656, numpy: 972.5061645507812 (ms for 10 runs)

testing (3,) + <class 'numpy.float32'> + keepdims=True + max ...
CUB enabled: 233.6958465576172, CUB disabled: 474.9293518066406, numpy: 957.8511352539062 (ms for 10 runs)

testing (3,) + <class 'numpy.float32'> + keepdims=False + min ...
CUB enabled: 233.73251342773438, CUB disabled: 477.0466003417969, numpy: 951.7782592773438 (ms for 10 runs)

testing (3,) + <class 'numpy.float32'> + keepdims=True + min ...
CUB enabled: 233.74351501464844, CUB disabled: 475.034423828125, numpy: 957.5420532226562 (ms for 10 runs)

testing (2, 3) + <class 'numpy.float32'> + keepdims=False + sum ...
CUB enabled: 29.1942081451416, CUB disabled: 85.63468933105469, numpy: 486.2149963378906 (ms for 10 runs)

testing (2, 3) + <class 'numpy.float32'> + keepdims=True + sum ...
CUB enabled: 29.19913673400879, CUB disabled: 85.54694366455078, numpy: 488.6822509765625 (ms for 10 runs)

testing (2, 3) + <class 'numpy.float32'> + keepdims=False + max ...
CUB enabled: 29.189504623413086, CUB disabled: 104.91270446777344, numpy: 357.5321350097656 (ms for 10 runs)

testing (2, 3) + <class 'numpy.float32'> + keepdims=True + max ...
CUB enabled: 29.193439483642578, CUB disabled: 104.93097686767578, numpy: 355.3326721191406 (ms for 10 runs)

testing (2, 3) + <class 'numpy.float32'> + keepdims=False + min ...
CUB enabled: 29.203359603881836, CUB disabled: 104.93023681640625, numpy: 358.05670166015625 (ms for 10 runs)

testing (2, 3) + <class 'numpy.float32'> + keepdims=True + min ...
CUB enabled: 29.194047927856445, CUB disabled: 104.82134246826172, numpy: 357.87371826171875 (ms for 10 runs)

testing (1, 2, 3) + <class 'numpy.float32'> + keepdims=False + sum ...
CUB enabled: 31.272480010986328, CUB disabled: 99.69491577148438, numpy: 492.3094177246094 (ms for 10 runs)

testing (1, 2, 3) + <class 'numpy.float32'> + keepdims=True + sum ...
CUB enabled: 31.282304763793945, CUB disabled: 99.73948669433594, numpy: 483.7795104980469 (ms for 10 runs)

testing (1, 2, 3) + <class 'numpy.float32'> + keepdims=False + max ...
CUB enabled: 31.27280044555664, CUB disabled: 121.34288024902344, numpy: 348.9355773925781 (ms for 10 runs)

testing (1, 2, 3) + <class 'numpy.float32'> + keepdims=True + max ...
CUB enabled: 31.293792724609375, CUB disabled: 121.29228973388672, numpy: 349.2859802246094 (ms for 10 runs)

testing (1, 2, 3) + <class 'numpy.float32'> + keepdims=False + min ...
CUB enabled: 31.29132843017578, CUB disabled: 121.38758087158203, numpy: 348.71295166015625 (ms for 10 runs)

testing (1, 2, 3) + <class 'numpy.float32'> + keepdims=True + min ...
CUB enabled: 31.268096923828125, CUB disabled: 121.35801696777344, numpy: 349.07037353515625 (ms for 10 runs)

testing (0, 1, 2, 3) + <class 'numpy.float32'> + keepdims=False + sum ...
CUB enabled: 29.234783172607422, CUB disabled: 1289.862548828125, numpy: 482.09808349609375 (ms for 10 runs)

testing (0, 1, 2, 3) + <class 'numpy.float32'> + keepdims=True + sum ...
CUB enabled: 29.21651268005371, CUB disabled: 1288.9176025390625, numpy: 484.14794921875 (ms for 10 runs)

testing (0, 1, 2, 3) + <class 'numpy.float32'> + keepdims=False + max ...
CUB enabled: 29.157920837402344, CUB disabled: 1695.0699462890625, numpy: 348.8479309082031 (ms for 10 runs)

testing (0, 1, 2, 3) + <class 'numpy.float32'> + keepdims=True + max ...
CUB enabled: 29.13759994506836, CUB disabled: 1695.281494140625, numpy: 346.9366760253906 (ms for 10 runs)

testing (0, 1, 2, 3) + <class 'numpy.float32'> + keepdims=False + min ...
CUB enabled: 29.15247917175293, CUB disabled: 1695.3565673828125, numpy: 344.9641418457031 (ms for 10 runs)

testing (0, 1, 2, 3) + <class 'numpy.float32'> + keepdims=True + min ...
CUB enabled: 29.144832611083984, CUB disabled: 1695.2708740234375, numpy: 344.3422546386719 (ms for 10 runs)


testing (3,) + <class 'numpy.float64'> + keepdims=False + sum ...
CUB enabled: 119.23423767089844, CUB disabled: 337.9628601074219, numpy: 706.6064453125 (ms for 10 runs)

testing (3,) + <class 'numpy.float64'> + keepdims=True + sum ...
CUB enabled: 119.19430541992188, CUB disabled: 336.5941162109375, numpy: 717.451904296875 (ms for 10 runs)

testing (3,) + <class 'numpy.float64'> + keepdims=False + max ...
CUB enabled: 127.32675170898438, CUB disabled: 881.11865234375, numpy: 1403.2318115234375 (ms for 10 runs)

testing (3,) + <class 'numpy.float64'> + keepdims=True + max ...
CUB enabled: 127.33439636230469, CUB disabled: 879.3928833007812, numpy: 1407.1644287109375 (ms for 10 runs)

testing (3,) + <class 'numpy.float64'> + keepdims=False + min ...
CUB enabled: 127.32816314697266, CUB disabled: 881.020751953125, numpy: 1401.936279296875 (ms for 10 runs)

testing (3,) + <class 'numpy.float64'> + keepdims=True + min ...
CUB enabled: 127.3262710571289, CUB disabled: 879.4055786132812, numpy: 1395.1922607421875 (ms for 10 runs)

testing (2, 3) + <class 'numpy.float64'> + keepdims=False + sum ...
CUB enabled: 58.80342483520508, CUB disabled: 92.93177795410156, numpy: 649.0916137695312 (ms for 10 runs)

testing (2, 3) + <class 'numpy.float64'> + keepdims=True + sum ...
CUB enabled: 58.82255935668945, CUB disabled: 92.78524780273438, numpy: 650.1878662109375 (ms for 10 runs)

testing (2, 3) + <class 'numpy.float64'> + keepdims=False + max ...
CUB enabled: 58.82179260253906, CUB disabled: 124.46988677978516, numpy: 676.6233520507812 (ms for 10 runs)

testing (2, 3) + <class 'numpy.float64'> + keepdims=True + max ...
CUB enabled: 58.78860855102539, CUB disabled: 124.54662322998047, numpy: 685.4388427734375 (ms for 10 runs)

testing (2, 3) + <class 'numpy.float64'> + keepdims=False + min ...
CUB enabled: 58.801185607910156, CUB disabled: 124.47618865966797, numpy: 682.2925415039062 (ms for 10 runs)

testing (2, 3) + <class 'numpy.float64'> + keepdims=True + min ...
CUB enabled: 58.819583892822266, CUB disabled: 124.40847778320312, numpy: 682.3870849609375 (ms for 10 runs)

testing (1, 2, 3) + <class 'numpy.float64'> + keepdims=False + sum ...
CUB enabled: 62.112064361572266, CUB disabled: 110.46361541748047, numpy: 648.2755737304688 (ms for 10 runs)

testing (1, 2, 3) + <class 'numpy.float64'> + keepdims=True + sum ...
CUB enabled: 62.1157112121582, CUB disabled: 110.44710540771484, numpy: 647.8474731445312 (ms for 10 runs)

testing (1, 2, 3) + <class 'numpy.float64'> + keepdims=False + max ...
CUB enabled: 61.95676803588867, CUB disabled: 130.0491485595703, numpy: 670.53662109375 (ms for 10 runs)

testing (1, 2, 3) + <class 'numpy.float64'> + keepdims=True + max ...
CUB enabled: 61.957889556884766, CUB disabled: 130.02613830566406, numpy: 671.9038696289062 (ms for 10 runs)

testing (1, 2, 3) + <class 'numpy.float64'> + keepdims=False + min ...
CUB enabled: 61.96745681762695, CUB disabled: 129.98297119140625, numpy: 668.9396362304688 (ms for 10 runs)

testing (1, 2, 3) + <class 'numpy.float64'> + keepdims=True + min ...
CUB enabled: 61.955135345458984, CUB disabled: 130.00575256347656, numpy: 665.1917724609375 (ms for 10 runs)

testing (0, 1, 2, 3) + <class 'numpy.float64'> + keepdims=False + sum ...
CUB enabled: 58.73561477661133, CUB disabled: 1483.6192626953125, numpy: 640.7586669921875 (ms for 10 runs)

testing (0, 1, 2, 3) + <class 'numpy.float64'> + keepdims=True + sum ...
CUB enabled: 58.71760177612305, CUB disabled: 1483.070556640625, numpy: 640.892578125 (ms for 10 runs)

testing (0, 1, 2, 3) + <class 'numpy.float64'> + keepdims=False + max ...
CUB enabled: 58.604736328125, CUB disabled: 1816.6575927734375, numpy: 670.8892211914062 (ms for 10 runs)

testing (0, 1, 2, 3) + <class 'numpy.float64'> + keepdims=True + max ...
CUB enabled: 58.60236740112305, CUB disabled: 1816.730712890625, numpy: 667.248779296875 (ms for 10 runs)

testing (0, 1, 2, 3) + <class 'numpy.float64'> + keepdims=False + min ...
CUB enabled: 58.622398376464844, CUB disabled: 1816.788818359375, numpy: 661.8549194335938 (ms for 10 runs)

testing (0, 1, 2, 3) + <class 'numpy.float64'> + keepdims=True + min ...
CUB enabled: 58.627967834472656, CUB disabled: 1816.9625244140625, numpy: 664.1939086914062 (ms for 10 runs)


testing (3,) + <class 'numpy.complex64'> + keepdims=False + sum ...
test_CUB_reduction_segmented3.py:44: PerformanceWarning: CUB reduction for complex numbers may not be highly performant. If concerned, set cupy.cuda.cub_enabled=False to switch to CuPy's internal reduction routine and compare the timings.
  y = getattr(x, func)(axis=axis, keepdims=keepdims)
CUB enabled: 118.09037017822266, CUB disabled: 344.164794921875, numpy: 1076.6517333984375 (ms for 10 runs)

testing (3,) + <class 'numpy.complex64'> + keepdims=True + sum ...
CUB enabled: 117.12425231933594, CUB disabled: 342.46142578125, numpy: 1077.8826904296875 (ms for 10 runs)

testing (3,) + <class 'numpy.complex64'> + keepdims=False + max ...
CUB enabled: 240.08023071289062, CUB disabled: 1141.5780029296875, numpy: 3310.99755859375 (ms for 10 runs)

testing (3,) + <class 'numpy.complex64'> + keepdims=True + max ...
CUB enabled: 240.1723175048828, CUB disabled: 1139.888427734375, numpy: 3331.254638671875 (ms for 10 runs)

testing (3,) + <class 'numpy.complex64'> + keepdims=False + min ...
CUB enabled: 263.5901794433594, CUB disabled: 1129.4566650390625, numpy: 3308.30908203125 (ms for 10 runs)

testing (3,) + <class 'numpy.complex64'> + keepdims=True + min ...
CUB enabled: 263.5737609863281, CUB disabled: 1127.2109375, numpy: 3321.3876953125 (ms for 10 runs)

testing (2, 3) + <class 'numpy.complex64'> + keepdims=False + sum ...
CUB enabled: 63.87919998168945, CUB disabled: 92.51001739501953, numpy: 970.0470581054688 (ms for 10 runs)

testing (2, 3) + <class 'numpy.complex64'> + keepdims=True + sum ...
CUB enabled: 63.91993713378906, CUB disabled: 92.46937561035156, numpy: 972.5693969726562 (ms for 10 runs)

testing (2, 3) + <class 'numpy.complex64'> + keepdims=False + max ...
CUB enabled: 63.5074577331543, CUB disabled: 161.8429412841797, numpy: 2616.058349609375 (ms for 10 runs)

testing (2, 3) + <class 'numpy.complex64'> + keepdims=True + max ...
CUB enabled: 63.49302291870117, CUB disabled: 161.74496459960938, numpy: 2626.90771484375 (ms for 10 runs)

testing (2, 3) + <class 'numpy.complex64'> + keepdims=False + min ...
CUB enabled: 63.56777572631836, CUB disabled: 160.23989868164062, numpy: 2629.412841796875 (ms for 10 runs)

testing (2, 3) + <class 'numpy.complex64'> + keepdims=True + min ...
CUB enabled: 63.571136474609375, CUB disabled: 160.232421875, numpy: 2624.2470703125 (ms for 10 runs)

testing (1, 2, 3) + <class 'numpy.complex64'> + keepdims=False + sum ...
CUB enabled: 74.21743774414062, CUB disabled: 109.87286376953125, numpy: 967.4530029296875 (ms for 10 runs)

testing (1, 2, 3) + <class 'numpy.complex64'> + keepdims=True + sum ...
CUB enabled: 74.20419311523438, CUB disabled: 109.93087768554688, numpy: 970.506591796875 (ms for 10 runs)

testing (1, 2, 3) + <class 'numpy.complex64'> + keepdims=False + max ...
CUB enabled: 78.08457946777344, CUB disabled: 161.56326293945312, numpy: 2618.248779296875 (ms for 10 runs)

testing (1, 2, 3) + <class 'numpy.complex64'> + keepdims=True + max ...
CUB enabled: 78.14934539794922, CUB disabled: 161.5684814453125, numpy: 2601.34619140625 (ms for 10 runs)

testing (1, 2, 3) + <class 'numpy.complex64'> + keepdims=False + min ...
CUB enabled: 74.52435302734375, CUB disabled: 153.03805541992188, numpy: 2604.09716796875 (ms for 10 runs)

testing (1, 2, 3) + <class 'numpy.complex64'> + keepdims=True + min ...
CUB enabled: 74.56649780273438, CUB disabled: 153.05775451660156, numpy: 2602.500732421875 (ms for 10 runs)

testing (0, 1, 2, 3) + <class 'numpy.complex64'> + keepdims=False + sum ...
CUB enabled: 64.68415832519531, CUB disabled: 1488.3746337890625, numpy: 962.0068969726562 (ms for 10 runs)

testing (0, 1, 2, 3) + <class 'numpy.complex64'> + keepdims=True + sum ...
CUB enabled: 64.64669036865234, CUB disabled: 1487.966796875, numpy: 970.979248046875 (ms for 10 runs)

testing (0, 1, 2, 3) + <class 'numpy.complex64'> + keepdims=False + max ...
CUB enabled: 65.50601959228516, CUB disabled: 2636.79345703125, numpy: 2607.332275390625 (ms for 10 runs)

testing (0, 1, 2, 3) + <class 'numpy.complex64'> + keepdims=True + max ...
CUB enabled: 65.61023712158203, CUB disabled: 2636.115966796875, numpy: 2632.875732421875 (ms for 10 runs)

testing (0, 1, 2, 3) + <class 'numpy.complex64'> + keepdims=False + min ...
CUB enabled: 65.39942169189453, CUB disabled: 2414.94873046875, numpy: 2636.8720703125 (ms for 10 runs)

testing (0, 1, 2, 3) + <class 'numpy.complex64'> + keepdims=True + min ...
CUB enabled: 65.47216033935547, CUB disabled: 2414.40576171875, numpy: 2612.00244140625 (ms for 10 runs)


testing (3,) + <class 'numpy.complex128'> + keepdims=False + sum ...
CUB enabled: 163.25686645507812, CUB disabled: 373.29290771484375, numpy: 1425.5550537109375 (ms for 10 runs)

testing (3,) + <class 'numpy.complex128'> + keepdims=True + sum ...
CUB enabled: 163.14767456054688, CUB disabled: 371.617919921875, numpy: 1421.293212890625 (ms for 10 runs)

testing (3,) + <class 'numpy.complex128'> + keepdims=False + max ...
CUB enabled: 285.5564880371094, CUB disabled: 1857.4620361328125, numpy: 3464.55029296875 (ms for 10 runs)

testing (3,) + <class 'numpy.complex128'> + keepdims=True + max ...
CUB enabled: 285.4814758300781, CUB disabled: 1855.87158203125, numpy: 3462.031005859375 (ms for 10 runs)

testing (3,) + <class 'numpy.complex128'> + keepdims=False + min ...
CUB enabled: 283.581787109375, CUB disabled: 1857.7886962890625, numpy: 3462.5654296875 (ms for 10 runs)

testing (3,) + <class 'numpy.complex128'> + keepdims=True + min ...
CUB enabled: 283.5067138671875, CUB disabled: 1856.091064453125, numpy: 3483.1630859375 (ms for 10 runs)

testing (2, 3) + <class 'numpy.complex128'> + keepdims=False + sum ...
CUB enabled: 124.05891418457031, CUB disabled: 134.43359375, numpy: 1274.07763671875 (ms for 10 runs)

testing (2, 3) + <class 'numpy.complex128'> + keepdims=True + sum ...
CUB enabled: 124.21289825439453, CUB disabled: 134.4215087890625, numpy: 1277.5452880859375 (ms for 10 runs)

testing (2, 3) + <class 'numpy.complex128'> + keepdims=False + max ...
CUB enabled: 126.35359954833984, CUB disabled: 207.81475830078125, numpy: 2783.3017578125 (ms for 10 runs)

testing (2, 3) + <class 'numpy.complex128'> + keepdims=True + max ...
CUB enabled: 126.45206451416016, CUB disabled: 207.38307189941406, numpy: 2790.884033203125 (ms for 10 runs)

testing (2, 3) + <class 'numpy.complex128'> + keepdims=False + min ...
CUB enabled: 125.79535675048828, CUB disabled: 205.54566955566406, numpy: 2793.48046875 (ms for 10 runs)

testing (2, 3) + <class 'numpy.complex128'> + keepdims=True + min ...
CUB enabled: 125.81260681152344, CUB disabled: 205.76588439941406, numpy: 2795.9384765625 (ms for 10 runs)

testing (1, 2, 3) + <class 'numpy.complex128'> + keepdims=False + sum ...
CUB enabled: 147.2489013671875, CUB disabled: 160.36729431152344, numpy: 1262.2879638671875 (ms for 10 runs)

testing (1, 2, 3) + <class 'numpy.complex128'> + keepdims=True + sum ...
CUB enabled: 147.2230682373047, CUB disabled: 160.50314331054688, numpy: 1266.602294921875 (ms for 10 runs)

testing (1, 2, 3) + <class 'numpy.complex128'> + keepdims=False + max ...
CUB enabled: 162.27130126953125, CUB disabled: 189.70233154296875, numpy: 2755.3515625 (ms for 10 runs)

testing (1, 2, 3) + <class 'numpy.complex128'> + keepdims=True + max ...
CUB enabled: 162.20582580566406, CUB disabled: 189.5966033935547, numpy: 2761.160400390625 (ms for 10 runs)

testing (1, 2, 3) + <class 'numpy.complex128'> + keepdims=False + min ...
CUB enabled: 151.7519989013672, CUB disabled: 184.3164825439453, numpy: 2757.995849609375 (ms for 10 runs)

testing (1, 2, 3) + <class 'numpy.complex128'> + keepdims=True + min ...
CUB enabled: 151.6597137451172, CUB disabled: 184.361083984375, numpy: 2764.547119140625 (ms for 10 runs)

testing (0, 1, 2, 3) + <class 'numpy.complex128'> + keepdims=False + sum ...
CUB enabled: 125.22147369384766, CUB disabled: 1612.7156982421875, numpy: 1259.2164306640625 (ms for 10 runs)

testing (0, 1, 2, 3) + <class 'numpy.complex128'> + keepdims=True + sum ...
CUB enabled: 125.32195281982422, CUB disabled: 1623.6519775390625, numpy: 1258.6700439453125 (ms for 10 runs)

testing (0, 1, 2, 3) + <class 'numpy.complex128'> + keepdims=False + max ...
CUB enabled: 126.17142486572266, CUB disabled: 3043.509521484375, numpy: 2754.956298828125 (ms for 10 runs)

testing (0, 1, 2, 3) + <class 'numpy.complex128'> + keepdims=True + max ...
CUB enabled: 126.16226959228516, CUB disabled: 3024.506103515625, numpy: 2753.31494140625 (ms for 10 runs)

testing (0, 1, 2, 3) + <class 'numpy.complex128'> + keepdims=False + min ...
CUB enabled: 125.57488250732422, CUB disabled: 2848.511474609375, numpy: 2755.859375 (ms for 10 runs)

testing (0, 1, 2, 3) + <class 'numpy.complex128'> + keepdims=True + min ...
CUB enabled: 125.5580825805664, CUB disabled: 2890.471435546875, numpy: 2759.63427734375 (ms for 10 runs)

I suppose this suggests we can take away the performance warning? Could you verify this on P/V100 @emcastillo?

We still need to discuss #2562 (comment) though.

@leofang leofang changed the title Refactor CUB to support an explicit axis argument Refactor CUB to support an explicit axis argument; Fix alignments for Thrust's complex types Oct 31, 2019
@leofang
Copy link
Copy Markdown
Member Author

leofang commented Oct 31, 2019

Btw, fixing the alignment issue also helps speed up CUB device reduce (no axis) for complex numbers, added in #2538.

@emcastillo
Copy link
Copy Markdown
Member

Awesome work,
We dont need to do a warning or update documentation with these numbers. :)

I will do the check on V100 later

emcastillo
emcastillo previously approved these changes Nov 1, 2019
Copy link
Copy Markdown
Member

@emcastillo emcastillo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove the warning :)

@emcastillo emcastillo dismissed their stale review November 1, 2019 02:42

not complete

@leofang
Copy link
Copy Markdown
Member Author

leofang commented Nov 1, 2019

@emcastillo Thanks. Removed!

@emcastillo
Copy link
Copy Markdown
Member

Jenkins, test this please

@chainer-ci
Copy link
Copy Markdown
Member

Jenkins CI test (for commit cabb0ca, target branch master) succeeded!

@emcastillo emcastillo merged commit 84a5194 into cupy:master Nov 1, 2019
@emcastillo emcastillo added the cat:performance Performance in terms of speed or memory consumption label Nov 1, 2019
@emcastillo emcastillo added this to the v7.0.0 milestone Nov 1, 2019
@leofang leofang deleted the cub_enhance branch November 1, 2019 05:07
@leofang leofang mentioned this pull request Jan 15, 2020
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cat:performance Performance in terms of speed or memory consumption

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants