Issue
CuPy streams do not appear to support concurrent GPU execution of cp.linalg.svd. (This is my first time using CuPy; I am trying to run concurrent SVDs on the GPU over a stack of matrices.)
Copying @mrocklin and @seibert, as they seem to have spent a lot of time on similar issues.
Related Links
For background on this, see:
Conditions:
CuPy Version : 7.2.0
CUDA Root : C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0
CUDA Build Version : 10010
CUDA Driver Version : 10010
CUDA Runtime Version : 10010
cuBLAS Version : 10200
cuFFT Version : 10010
cuRAND Version : 10010
cuSOLVER Version : (10, 1, 0)
cuSPARSE Version : 10010
NVRTC Version : (10, 1)
cuDNN Build Version : 7605
cuDNN Version : 7605
NCCL Build Version : None
NCCL Runtime Version : None
Code to reproduce [edit: fixed a bug and a couple of bad comments]
import time

import cupy as cp
import numpy as np
import dask


def many_svd_np_vs_cp():
    device = cp.cuda.Device()
    N = 16    # number of desired SVDs, grouped
    M = 1024  # size of each matrix, for SVD (M x M)
    A = np.asarray(np.random.randn(N, M, M), dtype=np.float32)

    # ----- Prime the pump, to eliminate CUDA startup overhead in the timings. -----
    A_gpu = cp.asarray(A)
    for i in range(16):
        sg = cp.linalg.svd(A_gpu[0], compute_uv=False)
    time.sleep(0.25)  # to separate this in nvvp

    # ----- Grouped SVDs in numpy -----
    tm = time.time()
    s_npall = np.linalg.svd(A, compute_uv=False)  # shape (N, M) = (16, 1024)
    elaps = time.time() - tm
    print('%20s: elaps=%f' % ('Numpy', elaps))

    # ----- Cupy-Loop: grouped SVDs in cupy -----
    sg_all = cp.asarray([])
    tm = time.time()
    for i in range(A_gpu.shape[0]):
        sg = cp.linalg.svd(A_gpu[i], compute_uv=False)
        sg_all = cp.concatenate((sg_all, sg), axis=0)  # grows to length N*M = 16384, but that's OK
    s_cpall = cp.asnumpy(sg_all)
    elaps = time.time() - tm
    print('%20s: elaps=%f' % ('Cupy-Loop', elaps))
    time.sleep(0.20)

    # ----- Cupy-ListComp: is a list comprehension faster? -----
    tm = time.time()
    sg_all = [cp.linalg.svd(A_gpu[i], compute_uv=False) for i in range(A_gpu.shape[0])]
    s_cpall = cp.asnumpy(sg_all)
    elaps = time.time() - tm
    print('%20s: elaps=%f' % ('Cupy-ListComp', elaps))
    time.sleep(0.20)

    # ----- Cupy-Dask-Delayed: try dask.delayed for parallelism/concurrency -----
    # TODO: not currently trying to retrieve the results with this example.
    tm = time.time()
    tasks = [dask.delayed(cp.linalg.svd)(A_gpu[i], compute_uv=False) for i in range(A_gpu.shape[0])]
    tasks_list = dask.delayed(list(tasks))
    res = dask.compute(tasks_list)  # does return all N singular-value vectors
    device.synchronize()
    elaps = time.time() - tm
    print('%20s: elaps=%f' % ('Cupy-Dask-Delayed', elaps))
    time.sleep(0.20)

    # ----- Cupy-Streams: try cupy streams for parallelism/concurrency -----
    # TODO: not currently trying to retrieve the results with this example.
    map_streams = [cp.cuda.stream.Stream() for i in range(N)]
    tm = time.time()  # BUG: was start_time = time.time()
    for i, stream in enumerate(map_streams):
        with stream:
            sg = cp.linalg.svd(A_gpu[i], compute_uv=False)
            # This is a little worse:
            # C_gpu = cp.asarray(np.random.randn(M, M), dtype=np.float32)
            # sg = cp.linalg.svd(C_gpu, compute_uv=False)
    device.synchronize()
    elaps = time.time() - tm
    print('%20s: elaps=%f' % ('Cupy-Streams', elaps))


if __name__ == "__main__":
    many_svd_np_vs_cp()
Output:
Numpy: elaps=2.181430
Cupy-Loop: elaps=1.396355
Cupy-ListComp: elaps=1.467271
Cupy-Dask-Delayed: elaps=1.206578
Cupy-Streams: elaps=~1.6 (was 3.104342 with bug)
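For context on the Numpy row: np.linalg.svd broadcasts over leading axes, so the single stacked call computes exactly the per-matrix results that the Cupy-Loop variant assembles one at a time. A minimal CPU-only sketch of that equivalence (sizes shrunk from the reproduction code above):

```python
import numpy as np

# Small stand-in for the (N, M, M) stack in the reproduction code.
N, M = 4, 32
A = np.asarray(np.random.randn(N, M, M), dtype=np.float32)

# Stacked call: np.linalg.svd broadcasts over the leading axis.
s_stacked = np.linalg.svd(A, compute_uv=False)  # shape (N, M)

# Per-matrix loop, mirroring the Cupy-Loop variant.
s_loop = np.stack([np.linalg.svd(A[i], compute_uv=False) for i in range(N)])

assert s_stacked.shape == (N, M)
assert np.allclose(s_stacked, s_loop, rtol=1e-4)
```

This is the batched behavior I was hoping to reproduce concurrently on the GPU.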
NVVP:
Running in NVVP, here are two examples of what the stream launches look like, for (N, M) = (64, 256) and (N, M) = (16, 1024), where N is the number of SVDs/streams and each matrix is M x M. Each SVD takes 23 ms (or 116 ms in the second case), which is clearly enough time for them to be launched concurrently. Using C_gpu instead of A_gpu[i] makes no difference.
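For reference, here is the Cupy-Dask-Delayed pattern from the reproduction code restated with NumPy, so it runs without a GPU (a sketch only; it shows the task structure, not the GPU behavior). Dask's default scheduler for delayed is threaded, and NumPy releases the GIL inside the LAPACK call, so the per-matrix SVDs can overlap on the host side:

```python
import numpy as np
import dask

N, M = 4, 32  # shrunk from the (16, 1024) case in the reproduction code
A = np.asarray(np.random.randn(N, M, M), dtype=np.float32)

# One delayed SVD per matrix, gathered into a single delayed list,
# mirroring the Cupy-Dask-Delayed variant above.
tasks = [dask.delayed(np.linalg.svd)(A[i], compute_uv=False) for i in range(N)]
(res,) = dask.compute(dask.delayed(list(tasks)))

assert len(res) == N and all(s.shape == (M,) for s in res)
```

With CuPy in place of NumPy, this is the structure that produced the Cupy-Dask-Delayed timing above; the open question is why the stream variant does not overlap similarly on the device.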