
CuPy streams do not support concurrent GPU operation (cp.linalg.svd) #3174

@drcdr

Description

Issue
CuPy streams do not support concurrent GPU operation for cp.linalg.svd. (This is my first time using CuPy; I'm trying to run concurrent SVDs on the GPU over a stack of matrices.)

Copying @mrocklin and @seibert, as they seem to have spent a lot of time on similar issues.

Related Links
For background on this, see:

Conditions:
CuPy Version : 7.2.0
CUDA Root : C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0
CUDA Build Version : 10010
CUDA Driver Version : 10010
CUDA Runtime Version : 10010
cuBLAS Version : 10200
cuFFT Version : 10010
cuRAND Version : 10010
cuSOLVER Version : (10, 1, 0)
cuSPARSE Version : 10010
NVRTC Version : (10, 1)
cuDNN Build Version : 7605
cuDNN Version : 7605
NCCL Build Version : None
NCCL Runtime Version : None

Code to reproduce [edit: fixed a bug and a couple of bad comments]

import time
import cupy as cp
import numpy as np
import dask
import dask.array as da

def many_svd_np_vs_cp():
    device = cp.cuda.Device()

    N = 16      # number of desired SVDs, grouped.
    M = 1024  # size of each matrix, for SVD (MxM)
    A = np.asarray(np.random.randn(N, M, M), dtype=np.float32)
    
    # ----- Prime the pump, to eliminate one-time CUDA overhead in the timings. -----
    A_gpu = cp.asarray(A)
    for i in range(16):
        sg = cp.linalg.svd(A_gpu[0], compute_uv=False)
    time.sleep(0.25)  # to separate this section in nvvp

    # ----- Grouped SVDs in numpy ----- 
    tm = time.time()
    s_npall = np.linalg.svd(A, compute_uv=False)  # batched SVD: shape (N, M) = (16, 1024)
    elaps = time.time() - tm
    print('%20s: elaps=%f' % ('Numpy', elaps))

    # ----- Cupy-Loop: grouped SVDs in cupy ----- 
    sg_all = cp.asarray([], dtype=np.float32)  # match dtype so concatenate stays float32
    tm = time.time()
    for i in range(A_gpu.shape[0]):
        sg = cp.linalg.svd(A_gpu[i], compute_uv=False)
        sg_all = cp.concatenate((sg_all, sg), axis=0)  # flat array of N*M singular values, but that's OK
    s_cpall = cp.asnumpy(sg_all)
    elaps = time.time() - tm
    print('%20s: elaps=%f' % ('Cupy-Loop', elaps))
    time.sleep(0.20)

    # ----- Cupy-ListComp: is a list comprehension faster? -----
    tm = time.time()
    sg_all = [cp.linalg.svd(A_gpu[i], compute_uv=False) for i in range(A_gpu.shape[0])]
    s_cpall = cp.asnumpy(cp.stack(sg_all))  # stack the list of 1-D results into (N, M) before the transfer
    elaps = time.time() - tm
    print('%20s: elaps=%f' % ('Cupy-ListComp', elaps))
    time.sleep(0.20)

    # ----- Cupy-Dask-Delayed: try using Dask.Delayed for parallelism/concurrency -----
    # TODO: results are not copied back to the host in this example.
    tm = time.time()
    tasks = [ dask.delayed(cp.linalg.svd)(A_gpu[i], compute_uv=False) for i in range(A_gpu.shape[0])]
    tasks_list = dask.delayed( list(tasks) )
    res = dask.compute(tasks_list)  # does return the N length-M singular-value arrays
    device.synchronize()
    elaps = time.time() - tm
    print('%20s: elaps=%f' % ('Cupy-Dask-Delayed', elaps))
    time.sleep(0.20)

    # ----- Cupy-Streams: try cupy streams for parallelism/concurrency -----
    # TODO: results are not copied back to the host in this example.
    device = cp.cuda.Device()
    map_streams = [cp.cuda.stream.Stream() for i in range(N)]
    tm = time.time()  # [edit] fixed: was start_time = time.time(), so elaps below used a stale tm
    for i, stream in enumerate(map_streams):
        with stream:
            sg = cp.linalg.svd(A_gpu[i], compute_uv=False)
            # This is a little worse:
            # C_gpu = cp.asarray(np.random.randn(M, M), dtype=np.float32) 
            # sg = cp.linalg.svd(C_gpu, compute_uv=False)
    device.synchronize()
    elaps = time.time() - tm
    print('%20s: elaps=%f' % ('Cupy-Streams', elaps))

if __name__ == "__main__":
    many_svd_np_vs_cp()
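
Timing note: the CuPy timings above rely on the implicit synchronization in cp.asnumpy() or on device.synchronize(). Below is a minimal sketch of an alternative, using CUDA events (cp.cuda.Event and cp.cuda.get_elapsed_time) to time just the GPU section; this is illustrative only and is not what produced the numbers below:

import cupy as cp
import numpy as np

A_gpu = cp.asarray(np.random.randn(16, 1024, 1024).astype(np.float32))

start, end = cp.cuda.Event(), cp.cuda.Event()
start.record()
res = [cp.linalg.svd(A_gpu[i], compute_uv=False) for i in range(A_gpu.shape[0])]
end.record()
end.synchronize()  # block until all GPU work queued before 'end' has completed
print('GPU time: %.3f ms' % cp.cuda.get_elapsed_time(start, end))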


Output:

               Numpy: elaps=2.181430
           Cupy-Loop: elaps=1.396355
       Cupy-ListComp: elaps=1.467271
   Cupy-Dask-Delayed: elaps=1.206578
        Cupy-Streams: elaps=~1.6 (was 3.104342 with bug)

NVVP:
Running in NVVP, here are two examples of what the stream launches look like, for (N, M) = (64, 256) and (N, M) = (16, 1024), where N is the number of SVDs/streams and each matrix is M x M. Each SVD takes 23 ms (116 ms in the second case), which is clearly long enough for concurrent launches to overlap. Using C_gpu instead of A_gpu[i] doesn't make a difference. A stream sanity check with plain elementwise kernels is sketched after the screenshot.

[NVVP screenshot: nvvp_cupy_streams]
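
To isolate whether the serialization is specific to the cuSOLVER SVD path rather than to CuPy streams in general, a sanity check along these lines could be profiled (a sketch, not something I have run here): plain elementwise kernels hold no cuSOLVER handle, so if streams are working at all, these launches should overlap in nvvp.

import cupy as cp

xs = [cp.random.randn(4096, 4096, dtype=cp.float32) for _ in range(4)]
streams = [cp.cuda.Stream(non_blocking=True) for _ in range(4)]

for x, s in zip(xs, streams):
    with s:
        for _ in range(50):   # enough work per stream for overlap to be visible
            x = cp.tanh(x)    # pure elementwise kernel, no library handle involved

cp.cuda.Device().synchronize()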
