Given the following computation:
```python
import dask
import dask.array as da
import cupy

# Build 100 lazy 5000x5000 random blocks on the GPU and stack them
arrays = [da.from_delayed(dask.delayed(cupy.random.random)((5000, 5000)),
                          shape=(5000, 5000),
                          dtype='float64')  # cupy.random.random returns float64
          for _ in range(100)]
x = da.concatenate(arrays)
```
This is faster when run in one CPU thread instead of many
```python
da.sin(x).sum().compute(scheduler='single-threaded')  # 150 ms
da.sin(x).sum().compute(scheduler='threading')        # 350 ms
```
This isn't super-surprising. We're probably over-saturating some resource on the GPU that wasn't designed for concurrency. That being said, I'm curious what the best way to handle operations like this is. I do have a few GPUs on this box, and probably each of these tasks is too small to saturate one of them.
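One option for the multi-GPU case would be to route each per-chunk task to a device, e.g. round-robin over device ids, with roughly one worker thread per GPU so no single device is oversubscribed. Below is a minimal sketch of that routing with a plain thread pool; `fake_gpu_sin` is a stand-in for the real kernel, which with CuPy would run the work under `cupy.cuda.Device(dev)`:

```python
import itertools
from concurrent.futures import ThreadPoolExecutor

N_GPUS = 4  # assumption: number of GPUs on the box
device_ids = itertools.cycle(range(N_GPUS))  # round-robin device picker

def fake_gpu_sin(chunk_id, dev):
    # Stand-in for:
    #     with cupy.cuda.Device(dev):
    #         return cupy.sin(chunk)
    # Here we only record which device each chunk was routed to.
    return (chunk_id, dev)

# One thread per GPU: enough to keep every device busy,
# not enough to over-saturate any single one.
with ThreadPoolExecutor(max_workers=N_GPUS) as pool:
    futures = [pool.submit(fake_gpu_sin, i, next(device_ids))
               for i in range(8)]
    placements = [f.result() for f in futures]

print(placements)  # chunk i lands on device i % N_GPUS
```

This is only the placement logic; getting the resulting device arrays back through a single graph would still need care (e.g. moving results to one device before the final reduction).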
@seibert @sklam I'd be curious to get your take on handling performance here. Can you recommend reading that explains how I should think about concurrency on top of these things?