Add cufftXtMakePlanMany and cufftXtExec #4407
Conversation
Thanks for the quick tests, @carterbox! Which CUDA version are you on? I tested it on CUDA 10.2 + RTX 2080 Ti.
It seems that 16-bit floating point was added in Pascal (CC 6.x), but from my tests it doesn't perform well until Turing (CC 7.5). The speed-up in our FFT test is only 2x compared with 32-bit floats, whereas the speed-up from 64-bit to 32-bit floats is 4x.
I've only read the introduction of this paper by Ho and Wong (2017). It seems that 2x is the maximum expected speed-up, and it requires that pairs of 16-bit floats be processed with the same instruction, because 16-bit floats are handled by the existing 32-bit compute units. This is very amenable to complex types, whose operations are inherently paired. However, it does not explain why I observed slow-downs on Pascal. Maybe there is a missing compile flag?
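
The pairing point can be illustrated on the CPU with NumPy (an illustration only; on the GPU the pairing happens in half2-style instructions): a complex array is stored as interleaved (real, imag) scalar pairs, so every complex value naturally supplies two same-width scalars to process together.

```python
import numpy as np

# A complex64 array is stored as interleaved (real, imag) float32 pairs.
# Reinterpreting the bytes (no copy) exposes the pairing -- exactly the
# layout that paired 16-bit instructions want to consume.
z = np.array([1 + 2j, 3 + 4j], dtype=np.complex64)
pairs = z.view(np.float32)  # reinterpret, no data movement
print(pairs)  # [1. 2. 3. 4.]
```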
CUDA 10.2 + V100 (CC 7.0):
No, I don't think there's any compile flag missing. Everything used here is just the host API, and we simply link the Python module to cuFFT. I guess it's more that Pascal is not well optimized (despite having the capability for a potential 2x speedup), but I need to check further. Note that on V100 the speedup is less than 2x for complex64 -> complex32, but that's probably because the workload is too small compared to the hardware capacity (e.g., I got only 2x, not 4x, for complex128 -> complex64).
@takagi Question: performance issue aside (which I already bugged an NVIDIA friend to look into), is the current design acceptable? As mentioned in #4406, I prefer to switch to using it there. If the design is OK, I'll add a few tests. Thanks!
@carterbox: @maxpkatz figured out the reason for the poor performance you observed: unlike CC 6.0 (e.g. P100), CC 6.1/6.2 do not have accelerated fp16 hardware support. According to the specs I found online for the 1080 Ti and Quadro P4000 (oddly I can't find any detailed spec on NVIDIA's website... 😞), they have a 1:64 fp16:fp32 throughput ratio, so your tests actually indicated that the problem size saturated your GPUs for fp16, as the slowdowns were far less than 64x. 1080 Ti: https://www.techpowerup.com/gpu-specs/geforce-gtx-1080-ti.c2877
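
A back-of-the-envelope sketch of that 1:64 argument (the peak figure is approximate, taken from the techpowerup page linked above):

```python
# Back-of-the-envelope check of the 1:64 fp16:fp32 argument.
fp32_peak_tflops = 11.3                    # ~1080 Ti peak fp32 (approximate)
fp16_peak_tflops = fp32_peak_tflops / 64   # 1:64 ratio on CC 6.1

# If both the fp32 and fp16 runs were purely compute-bound, switching to
# fp16 would slow the FFT down by the full throughput ratio:
compute_bound_slowdown = fp32_peak_tflops / fp16_peak_tflops
print(compute_bound_slowdown)  # 64.0

# Observed slowdowns far below 64x mean the runs were nowhere near this
# compute-bound worst case on these cards.
```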
Thanks for linking that database; I wouldn't have thought to look there. I'm also confused about why NVIDIA would release hardware that doesn't do fp16 at least as fast as fp32 (i.e. 1:1). It's also a good resource because it shows fp16 support isn't uniformly improving on newer cards, e.g. the 3090 is only 1:1 whereas the 2080 is 2:1.
This is an interesting question and I don't know 😄 But if the throughput is not improved, at least we'd save some memory... |
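
The memory-savings point is easy to quantify: halving the scalar width halves the array footprint regardless of throughput. A NumPy sketch (NumPy has no complex32, so plain float16 vs. float32 stands in here):

```python
import numpy as np

n = 1_000_000
a32 = np.zeros(n, dtype=np.float32)  # 4 bytes per element
a16 = np.zeros(n, dtype=np.float16)  # 2 bytes per element

# Same element count, half the bytes per element -> half the memory,
# even if the compute throughput doesn't improve at all.
print(a32.nbytes)  # 4000000
print(a16.nbytes)  # 2000000
print(a32.nbytes // a16.nbytes)  # 2
```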
Sorry, I dropped your comment, @leofang. Yes, using it sounds good.
Thanks, @takagi! I suppose you meant
@takagi Tests added, PTAL. |
Jenkins, test this please |
Jenkins CI test (for commit d3e221e, target branch master) succeeded! |
Jenkins, test this please |
Jenkins CI test (for commit 3a3cca7, target branch master) succeeded! |
LGTM! Thanks! |
Part of #4406 and #3370.
Work in progress. Will update the PR description later.

UPDATE: This PR wraps `cufftXtMakePlanMany` and `cufftXtExec` in a new kind of cuFFT plan, `XtPlanNd`, so that it can be constructed manually. The purpose of supporting `cufftXtMakePlanMany` is twofold. See the discussion in this PR (and also in #4406) for why we don't integrate it with the existing `PlanNd`. In the next PR I'll try to integrate `XtPlanNd` with the high-level APIs to address #4406. This PR does not work on HIP.
===End of UPDATE===
A very preliminary test shows it's possible to do half-precision complex FFT (with the time reduced by half)!
Output (CUDA 10.2 + RTX 2080 Ti):
cc: @carterbox @grlee77