
Add cufftXtMakePlanMany and cufftXtExec #4407

Merged
takagi merged 12 commits into cupy:master from leofang:cufftXt
Jan 7, 2021

Conversation

@leofang
Member

@leofang leofang commented Dec 4, 2020

Part of #4406 and #3370. Work in progress. Will update the PR description later.

UPDATE: This PR wraps cufftXtMakePlanMany and cufftXtExec in a new kind of cuFFT plan XtPlanNd, so that it can be constructed manually. The purpose of supporting cufftXtMakePlanMany is twofold:

  1. Support 64-bit indexing (#4406: Support FFT with more than 2^31 elements)
  2. Support exotic types such as complex32 (#3370: Support for half-precision complex numbers?) and possibly bf16.

See the discussion in this PR (and also in #4406) for why we don't integrate it with the existing PlanNd. In the next PR I'll try to integrate XtPlanNd with the high-level APIs to address #4406.

This PR does not work on HIP.

===End of UPDATE===

A very preliminary test shows it's possible to do half-precision complex FFT, with the runtime roughly halved!

import cupy as cp
import numpy as np
from cupyx.time import repeat


shape = (1024, 256, 256)
dtypes = (cp.complex64, cp.complex128, 'E')  # 'E' = cp.complex32

for t in dtypes:
    if t == 'E':
        dtype = cp.float16
        old_shape = shape
        shape = (shape[0], shape[1], 2*shape[2])  # complex32 has two fp16
    else:
        dtype = t
    idtype = odtype = edtype = np.dtype(t) if t != 'E' else t
    a = cp.random.random(shape).astype(dtype)
    out = cp.empty_like(a)
    if t == 'E':
        shape = old_shape
    plan = cp.cuda.cufft.XtPlanNd(shape[1:],
                                  shape[1:], 1, shape[1]*shape[2], idtype,
                                  shape[1:], 1, shape[1]*shape[2], odtype,
                                  shape[0], edtype,
                                  order='C', last_axis=-1, last_size=None)
    print(repeat(plan.fft, (a, out, cp.cuda.cufft.CUFFT_FORWARD), n_repeat=100))
    plan.fft(a, out, cp.cuda.cufft.CUFFT_FORWARD)
    if t != 'E':
        out_np = np.fft.fftn(cp.asnumpy(a), axes=(-2,-1))
        #print(t, 'ok' if cp.allclose(out, out_np, rtol=1E-3) else 'not ok')
    else:
        a_np = cp.asnumpy(a).astype(np.float32)  # upcast
        a_np = a_np.view(np.complex64)
        out_np = np.fft.fftn(a_np, axes=(-2,-1))
        out_np = np.ascontiguousarray(out_np).astype(np.complex64)  # downcast
        out_np = out_np.view(np.float32)
        out_np = out_np.astype(np.float16)
        ##print(t, 'ok' if cp.allclose(out, out_np, atol=1E-2) else 'not ok')
        # don't worry about accuracy for now, as we probably lost a lot of precision during casting
        print(t, 'ok' if cp.mean(cp.abs(out - cp.asarray(out_np))) < 0.1 else 'not ok')

Output (CUDA 10.2 + GTX 2080 Ti):

fft                 :    CPU:    5.013 us   +/- 0.214 (min:    4.811 / max:    6.541) us     GPU-0: 4290.127 us   +/-18.198 (min: 4240.768 / max: 4329.792) us
fft                 :    CPU:    5.066 us   +/- 0.243 (min:    4.805 / max:    6.668) us     GPU-0:17339.548 us   +/-27.822 (min:17324.127 / max:17452.127) us
fft                 :    CPU:    4.930 us   +/- 0.197 (min:    4.679 / max:    6.180) us     GPU-0: 2192.620 us   +/- 9.862 (min: 2165.984 / max: 2204.448) us
E ok

cc: @carterbox @grlee77

@carterbox
Contributor

carterbox commented Dec 4, 2020

GTX 1050 Ti
Compute Capability: 61
CUDA Version: 11.1

<class 'numpy.complex128'>
CPU:        27.964 us   +/-15.416 (min:   12.679 / max:   66.799) us 
GPU-0: 098,351.780 us   +/-18.624 (min:98321.053 / max:98400.253) us

<class 'numpy.complex64'>
CPU:        14.488 us   +/- 7.084 (min:   11.268 / max:   54.220) us
GPU-0: 022,074.731 us   +/-45.726 (min:22031.136 / max:22353.920) us

<class 'numpy.complex32'>
CPU:        29.874 us   +/-12.739 (min:   14.001 / max:   80.358) us
GPU-0: 182,748.333 us   +/-13.548 (min:182727.615 / max:182788.101) us
Quadro P4000
Compute Capability: 61
CUDA Version: 11.0

<class 'numpy.complex128'>
CPU:       11.993 us   +/-  9.071 (min:    8.901 / max:   77.487) us
GPU-0: 44,388.763 us   +/-342.440 (min:44291.073 / max:46947.712) us

<class 'numpy.complex64'>
CPU:       10.256 us   +/-  2.494 (min:    8.689 / max:   28.082) us
GPU-0: 10,651.921 us   +/- 20.315 (min:10634.816 / max:10778.272) us

<class 'numpy.complex32'>
CPU:       11.335 us   +/-  6.400 (min:    9.096 / max:   66.276) us
GPU-0: 82,250.319 us   +/- 70.232 (min:82195.068 / max:82420.227) us
RTX 2080 Ti
Compute Capability: 75
CUDA VERSION: 11.0

<class 'numpy.complex128'>
CPU:       13.620 us   +/- 10.708 (min:    8.535 / max:   88.484) us
GPU-0: 17,430.332 us   +/-703.331 (min:17186.144 / max:24399.967) us

<class 'numpy.complex64'>
CPU:        9.553 us   +/- 1.079 (min:     8.302 / max:   15.529) us
GPU-0:  4,647.044 us   +/-58.877 (min:  4426.432 / max: 4726.304) us

<class 'numpy.complex32'>
CPU:       11.576 us   +/- 1.813 (min:    10.241 / max:   20.834) us
GPU-0:  2,436.826 us   +/-73.310 (min:  2236.416 / max: 2549.760) us

@leofang
Member Author

leofang commented Dec 4, 2020

Thanks for the quick tests, @carterbox! Which CUDA version are you on? I tested on CUDA 10.2 + GTX 2080 Ti.

@carterbox
Contributor

It seems that 16-bit floating point support was added in Pascal (6.x), but from my tests it doesn't perform well until Turing (7.5).

The speed-up for our FFT test is only 2x compared with 32-bit floats, whereas the speed-up from 64-bit to 32-bit floats is 4x.

@carterbox
Contributor

I've only read the introduction of this paper by Ho and Wong (2017). It seems that 2x is the maximum expected speed-up, and it requires that pairs of 16-bit floats be processed with the same instruction, because 16-bit values go through the existing 32-bit compute units. This maps well onto complex types, whose operations are inherently paired.

However, it does not explain why I observed slow-downs on Pascal. Maybe there is a missing compile flag?
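To illustrate why complex types pair so naturally, here is a minimal NumPy sketch (names and values are illustrative, not from the PR): a "complex32" number stored as interleaved (re, im) float16 pairs, where each arithmetic step operates on both halves of a pair at once, the same pairing a half2-style instruction would exploit on the GPU.

```python
import numpy as np

# Two "complex32" numbers each, stored as interleaved (re, im) float16 pairs:
# a = [1+2j, 3-1j], b = [0.5+0.5j, 1+1j]
a = np.array([1.0, 2.0, 3.0, -1.0], dtype=np.float16)
b = np.array([0.5, 0.5, 1.0, 1.0], dtype=np.float16)

def c32_mul(x, y):
    """Pairwise complex multiply on interleaved float16 (re, im) pairs."""
    xr, xi = x[0::2], x[1::2]
    yr, yi = y[0::2], y[1::2]
    out = np.empty_like(x)
    out[0::2] = xr * yr - xi * yi  # real parts, computed pairwise
    out[1::2] = xr * yi + xi * yr  # imaginary parts, computed pairwise
    return out

c = c32_mul(a, b)
print(c)  # -> [-0.5  1.5  4.   2. ], i.e. [-0.5+1.5j, 4+2j]
```

Every multiply touches a (re, im) pair together, which is why 2x is the ceiling: the fp16 pair occupies the slot one fp32 value would have used.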

@leofang
Member Author

leofang commented Dec 5, 2020

CUDA 10.2 + V100 (CC 7.0):

fft (complex128)   :    CPU:    6.660 us   +/- 0.238 (min:    6.340 / max:    8.180) us     GPU-0: 7260.723 us   +/-41.022 (min: 7254.016 / max: 7668.736) us
fft (complex64)    :    CPU:    6.555 us   +/- 0.236 (min:    6.160 / max:    7.890) us     GPU-0: 3639.777 us   +/- 2.703 (min: 3634.176 / max: 3647.488) us
fft (complex32)    :    CPU:    6.445 us   +/- 0.223 (min:    6.190 / max:    8.060) us     GPU-0: 2087.393 us   +/- 3.978 (min: 2077.696 / max: 2098.176) us

@leofang
Member Author

leofang commented Dec 5, 2020

No, I don't think there's any compile flag missing. Everything used here is just the host API, and we simply link the Python module to cuFFT. My guess is that Pascal is simply not well optimized (despite having the capability for a potential 2x speedup), but I need to check further. Note that on V100 the speedup is less than 2x for complex64 -> complex32, but that's probably because the workload is too small for the hardware capacity (e.g. I got only 2x, not 4x, for complex128 -> complex64).

@leofang
Member Author

leofang commented Dec 10, 2020

@takagi Question: performance issues aside (which I already asked an NVIDIA friend to look into), is the current design acceptable? As mentioned in #4406, I prefer to switch to cufftXtMakePlanMany only when needed (e.g. when dealing with oversized arrays or exotic dtypes), as it's unclear to me whether it incurs additional overhead compared to plans from cufftMakePlanMany. Therefore, it's isolated in a new XtPlanNd cdef class with slightly different constructor arguments.

If the design is OK, I'll add a few tests for XtPlanNd. We currently don't test Plan1d or PlanNd directly, but that's because they are fully covered by the high-level tests. Given that it'd take a while to fully resolve #4406, I'd like to have a few simple tests around.

Thanks!

@leofang
Member Author

leofang commented Dec 14, 2020

@carterbox: @maxpkatz figured out the reason for the poor performance you observed: unlike CC 6.0 (e.g. P100), CC 6.1/6.2 does not have accelerated fp16 hardware support. According to the specs I found online for the 1080 Ti and Quadro P4000 (oddly I can't find any detailed spec on NVIDIA's website...😞), they have a 1:64 fp16:fp32 throughput ratio, so your tests actually indicate that the problem size saturated your GPUs for fp16, as the slowdowns were far less than 64x.

1080 Ti: https://www.techpowerup.com/gpu-specs/geforce-gtx-1080-ti.c2877
P4000: https://www.techpowerup.com/gpu-specs/quadro-p4000.c2930
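As a quick sanity check on that claim (values copied from the GTX 1050 Ti benchmark earlier in this thread; the 64x figure assumes the 1:64 fp16:fp32 throughput ratio holds and the kernels are compute-bound):

```python
# GPU mean times from the GTX 1050 Ti run above, in microseconds.
t_complex64 = 22074.731   # complex64 FFT
t_complex32 = 182748.333  # "complex32" (fp16 pairs) FFT

slowdown = t_complex32 / t_complex64
# If both runs were fully compute-bound at peak throughput, the slowdown
# could approach 64x; the observed ~8x means the fp32 run left the GPU
# far from saturated at this problem size.
print(f"observed fp16 slowdown: {slowdown:.1f}x (vs. a 64x theoretical ceiling)")
```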

@carterbox
Contributor

carterbox commented Dec 14, 2020

Thanks for linking that database; I wouldn't have thought to look there. I'm also confused about why NVIDIA would release hardware that doesn't do fp16 at least as fast as fp32, i.e. 1:1.

This is also a good resource because fp16 support isn't uniformly getting better with newer cards, e.g. the 3090 is only 1:1 whereas the 2080 is 2:1.

@leofang
Member Author

leofang commented Dec 15, 2020

This is also a good resource because fp16 support isn't uniformly getting better with newer cards, e.g. the 3090 is only 1:1 whereas the 2080 is 2:1.

This is an interesting question and I don't know 😄 But if the throughput is not improved, at least we'd save some memory...
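For a rough sense of that memory saving on the benchmark shape used above (a sketch; "complex32" here means the interleaved float16-pair emulation, since NumPy has no native complex32 dtype):

```python
import numpy as np

shape = (1024, 256, 256)
n = int(np.prod(shape))

bytes_complex64 = n * np.dtype(np.complex64).itemsize    # 8 bytes per element
bytes_complex32 = n * 2 * np.dtype(np.float16).itemsize  # two fp16 = 4 bytes

print(bytes_complex64 // 2**20, "MiB vs", bytes_complex32 // 2**20, "MiB")
# -> 512 MiB vs 256 MiB
```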

@takagi
Contributor

takagi commented Dec 28, 2020

Sorry, I missed your comment, @leofang. Yes, using cufftMakePlanMany only when it is needed also looks reasonable to me.

@leofang
Member Author

leofang commented Dec 28, 2020

Yes, using cufftMakePlanMany only when it is needed also looks reasonable to me.

Thanks, @takagi! I suppose you meant cufftXtMakePlanMany 😁 OK, I'll then proceed to add a few simple tests.

@leofang leofang changed the title [WIP] Add cufftXtMakePlanMany and cufftXtExec Add cufftXtMakePlanMany and cufftXtExec Dec 28, 2020
@leofang leofang marked this pull request as ready for review December 28, 2020 17:02
@leofang
Member Author

leofang commented Dec 28, 2020

@takagi Tests added, PTAL.

@leofang
Member Author

leofang commented Dec 28, 2020

Jenkins, test this please

@chainer-ci
Member

Jenkins CI test (for commit d3e221e, target branch master) succeeded!

@leofang
Member Author

leofang commented Dec 28, 2020

Jenkins, test this please

@chainer-ci
Member

Jenkins CI test (for commit 3a3cca7, target branch master) succeeded!

@takagi takagi added this to the v9.0.0b2 milestone Jan 7, 2021
@takagi
Copy link
Contributor

takagi commented Jan 7, 2021

LGTM! Thanks!

@takagi takagi merged commit c08ae1f into cupy:master Jan 7, 2021
@leofang leofang deleted the cufftXt branch January 7, 2021 02:45
