Calling multiple cupy functions in multiple CPU threads results in poor performance #4040

@mrocklin

Description

Given the following computation:

import dask.array as da
import cupy
import dask
import numpy as np

arrays = [da.from_delayed(dask.delayed(cupy.random.random)((5000, 5000)),
                          shape=(5000, 5000), dtype='float64')
          for _ in range(100)]
x = da.concatenate(arrays)

This runs faster in a single CPU thread than in many:

da.sin(x).sum().compute(scheduler='single-threaded')  # 150ms
da.sin(x).sum().compute(scheduler='threading')        # 350ms
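One caveat when timing GPU work: CuPy kernel launches are asynchronous, so a wall-clock measurement that doesn't synchronize the device can under- or over-count. Below is a minimal timing helper (a sketch; the `timed` name is mine, not from the issue) that synchronizes CuPy's default stream before and after the call, falling back to plain timing when CuPy isn't available:

```python
import time

try:
    import cupy
except Exception:
    # No GPU / no CuPy: skip synchronization so the sketch still runs on CPU.
    cupy = None

def timed(fn, *args, **kwargs):
    """Wall-clock a call, synchronizing the GPU before and after so that
    asynchronous kernel launches are actually included in the measurement."""
    if cupy is not None:
        cupy.cuda.Stream.null.synchronize()  # drain the default stream
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    if cupy is not None:
        cupy.cuda.Stream.null.synchronize()  # wait for queued kernels
    return result, time.perf_counter() - start

# Usage, assuming `x` from the snippet above:
# result, seconds = timed(da.sin(x).sum().compute, scheduler='single-threaded')
```

`.compute()` already blocks until dask's result is materialized as a concrete array, so the second synchronize is mostly a belt-and-braces guard; it matters more when timing raw CuPy calls.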

This isn't super-surprising. We're probably over-saturating some resource on the GPU that wasn't designed for concurrency. That said, I'm curious what the best way to handle operations like this would be. I do have a few GPUs on this box, and each of these tasks is probably too small to saturate one of them.
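Since several GPUs are available, one direction worth exploring is spreading the per-task allocations across devices rather than piling every thread onto device 0. Here is a rough sketch: `round_robin_devices` and `make_array_on_device` are hypothetical helpers of my own, while `cupy.cuda.Device` and `cupy.cuda.runtime.getDeviceCount` are real CuPy APIs. A NumPy fallback keeps the sketch runnable on machines without a GPU:

```python
try:
    import cupy
    n_devices = cupy.cuda.runtime.getDeviceCount()
except Exception:
    cupy = None  # no GPU available; pretend there is one CPU "device"
    n_devices = 1

def round_robin_devices(n_tasks, n_devices):
    """Assign a device index to each task, cycling through the GPUs."""
    return [i % n_devices for i in range(n_tasks)]

def make_array_on_device(device_id, shape=(5000, 5000)):
    """Allocate a random array on the given GPU (NumPy fallback on CPU)."""
    if cupy is None:
        import numpy as np
        return np.random.random(shape)
    with cupy.cuda.Device(device_id):  # make device_id current for this block
        return cupy.random.random(shape)

# e.g. spread the 100 chunks from the example over the available GPUs:
assignments = round_robin_devices(n_tasks=100, n_devices=n_devices)
```

This only addresses placement, not scheduling; a fuller answer would also pin each dask worker thread to a device (the approach projects like dask-cuda take, with one worker process per GPU).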

@seibert @sklam I'd be curious to get your take on handling performance here. Can you recommend reading that explains how I should think about concurrency on top of these things?
