
CuPy streams do not support concurrent GPU operation (cp.linalg.svd) #3174

@drcdr

Description

Issue
CuPy streams do not support concurrent GPU operation for cp.linalg.svd. (This is my first time using CuPy; I'm trying to run concurrent SVDs on the GPU over a stack of matrices.)

Copying @mrocklin and @seibert, as they seem to have spent a lot of time on similar issues.

Related Links
For background on this, see:

Conditions:
CuPy Version : 7.2.0
CUDA Root : C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0
CUDA Build Version : 10010
CUDA Driver Version : 10010
CUDA Runtime Version : 10010
cuBLAS Version : 10200
cuFFT Version : 10010
cuRAND Version : 10010
cuSOLVER Version : (10, 1, 0)
cuSPARSE Version : 10010
NVRTC Version : (10, 1)
cuDNN Build Version : 7605
cuDNN Version : 7605
NCCL Build Version : None
NCCL Runtime Version : None

Code to reproduce [edit: fixed a bug and a couple of bad comments]

import time
import cupy as cp
import numpy as np
import dask
import dask.array as da

def many_svd_np_vs_cp():
    device = cp.cuda.Device()

    N = 16      # number of desired SVDs, grouped.
    M = 1024  # size of each matrix, for SVD (MxM)
    A = np.asarray(np.random.randn(N, M, M), dtype=np.float32)
    
    # ----- Prime the pump, to eliminate one-time CUDA overhead in the timings. -----
    A_gpu = cp.asarray(A)
    for i in range(16):
        sg = cp.linalg.svd(A_gpu[0], compute_uv=False)
    time.sleep(0.25)  # to separate this section in nvvp

    # ----- Grouped SVDs in numpy ----- 
    tm = time.time()
    s_npall = np.linalg.svd(A, compute_uv=False)  # batched SVD: shape (N, M) = (16, 1024)
    elaps = time.time() - tm
    print('%20s: elaps=%f' % ('Numpy', elaps))

    # ----- Cupy-Loop: grouped SVDs in cupy ----- 
    sg_all = cp.asarray([], dtype=np.float32)  # match dtype so concatenate stays float32
    tm = time.time()
    for i in range(A_gpu.shape[0]):
        sg = cp.linalg.svd(A_gpu[i], compute_uv=False)
        sg_all = cp.concatenate((sg_all, sg), axis=0)  # flat array of N*M singular values, but that's OK
    s_cpall = cp.asnumpy(sg_all)
    elaps = time.time() - tm
    print('%20s: elaps=%f' % ('Cupy-Loop', elaps))
    time.sleep(0.20)

    # ----- Cupy-ListComp: is a list comprehension faster? -----
    tm = time.time()
    sg_all = [cp.linalg.svd(A_gpu[i], compute_uv=False) for i in range(A_gpu.shape[0])]
    s_cpall = cp.asnumpy(cp.stack(sg_all))  # stack the list of 1-D results into (N, M) before the transfer
    elaps = time.time() - tm
    print('%20s: elaps=%f' % ('Cupy-ListComp', elaps))
    time.sleep(0.20)

    # ----- Cupy-Dask-Delayed: try using Dask.Delayed for parallelism/concurrency -----
    # TODO: results are not copied back to the host in this example.
    tm = time.time()
    tasks = [ dask.delayed(cp.linalg.svd)(A_gpu[i], compute_uv=False) for i in range(A_gpu.shape[0])]
    tasks_list = dask.delayed( list(tasks) )
    res = dask.compute(tasks_list)  # does return the N length-M singular-value arrays
    device.synchronize()
    elaps = time.time() - tm
    print('%20s: elaps=%f' % ('Cupy-Dask-Delayed', elaps))
    time.sleep(0.20)

    # ----- Cupy-Streams: try cupy streams for parallelism/concurrency -----
    # TODO: results are not copied back to the host in this example.
    device = cp.cuda.Device()
    map_streams = [cp.cuda.stream.Stream() for i in range(N)]
    tm = time.time()  # [edit] fixed: was start_time = time.time(), so elaps below used a stale tm
    for i, stream in enumerate(map_streams):
        with stream:
            sg = cp.linalg.svd(A_gpu[i], compute_uv=False)
            # This is a little worse:
            # C_gpu = cp.asarray(np.random.randn(M, M), dtype=np.float32) 
            # sg = cp.linalg.svd(C_gpu, compute_uv=False)
    device.synchronize()
    elaps = time.time() - tm
    print('%20s: elaps=%f' % ('Cupy-Streams', elaps))

if __name__ == "__main__":
    many_svd_np_vs_cp()
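
Timing note: the CuPy timings above rely on the implicit synchronization in cp.asnumpy() or on device.synchronize(). Below is a minimal sketch of an alternative, using CUDA events (cp.cuda.Event and cp.cuda.get_elapsed_time) to time just the GPU section; this is illustrative only and is not what produced the numbers below:

import cupy as cp
import numpy as np

A_gpu = cp.asarray(np.random.randn(16, 1024, 1024).astype(np.float32))

start, end = cp.cuda.Event(), cp.cuda.Event()
start.record()
res = [cp.linalg.svd(A_gpu[i], compute_uv=False) for i in range(A_gpu.shape[0])]
end.record()
end.synchronize()  # block until all GPU work queued before 'end' has completed
print('GPU time: %.3f ms' % cp.cuda.get_elapsed_time(start, end))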


Output:

               Numpy: elaps=2.181430
           Cupy-Loop: elaps=1.396355
       Cupy-ListComp: elaps=1.467271
   Cupy-Dask-Delayed: elaps=1.206578
        Cupy-Streams: elaps=~1.6 (was 3.104342 with bug)

NVVP:
Running in NVVP, here are two examples of what the stream launches look like, for (N, M) = (64, 256) and (N, M) = (16, 1024), where N is the number of SVDs/streams and each matrix is M x M. Each SVD takes 23 ms (116 ms in the second case), which is clearly long enough for concurrent launches to overlap. Using C_gpu instead of A_gpu[i] doesn't make a difference. A stream sanity check with plain elementwise kernels is sketched after the screenshot.

[NVVP screenshot: nvvp_cupy_streams]
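
To isolate whether the serialization is specific to the cuSOLVER SVD path rather than to CuPy streams in general, a sanity check along these lines could be profiled (a sketch, not something I have run here): plain elementwise kernels hold no cuSOLVER handle, so if streams are working at all, these launches should overlap in nvvp.

import cupy as cp

xs = [cp.random.randn(4096, 4096, dtype=cp.float32) for _ in range(4)]
streams = [cp.cuda.Stream(non_blocking=True) for _ in range(4)]

for x, s in zip(xs, streams):
    with s:
        for _ in range(50):   # enough work per stream for overlap to be visible
            x = cp.tanh(x)    # pure elementwise kernel, no library handle involved

cp.cuda.Device().synchronize()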
