Calling multiple cupy functions in multiple CPU threads results in poor performance #4040

@mrocklin

Description

Given the following computation:

import dask.array as da
import cupy
import dask
import numpy as np

arrays = [da.from_delayed(dask.delayed(cupy.random.random)((5000, 5000)),
                          shape=(5000, 5000), dtype='float64')
          for _ in range(100)]
x = da.concatenate(arrays)

This runs faster in a single CPU thread than in many:

da.sin(x).sum().compute(scheduler='single-threaded')  # 150ms
da.sin(x).sum().compute(scheduler='threading')        # 350ms
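One caveat when timing GPU work: CuPy kernel launches are asynchronous, so a wall-clock measurement that doesn't synchronize the device can under- or over-count. Below is a minimal timing helper (a sketch; the `timed` name is mine, not from the issue) that synchronizes CuPy's default stream before and after the call, falling back to plain timing when CuPy isn't available:

```python
import time

try:
    import cupy
except Exception:
    # No GPU / no CuPy: skip synchronization so the sketch still runs on CPU.
    cupy = None

def timed(fn, *args, **kwargs):
    """Wall-clock a call, synchronizing the GPU before and after so that
    asynchronous kernel launches are actually included in the measurement."""
    if cupy is not None:
        cupy.cuda.Stream.null.synchronize()  # drain the default stream
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    if cupy is not None:
        cupy.cuda.Stream.null.synchronize()  # wait for queued kernels
    return result, time.perf_counter() - start

# Usage, assuming `x` from the snippet above:
# result, seconds = timed(da.sin(x).sum().compute, scheduler='single-threaded')
```

`.compute()` already blocks until dask's result is materialized as a concrete array, so the second synchronize is mostly a belt-and-braces guard; it matters more when timing raw CuPy calls.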

This isn't super-surprising. We're probably over-saturating some resource on the GPU that wasn't designed for concurrency. That said, I'm curious what the best way to handle operations like this would be. I do have a few GPUs on this box, and each of these tasks is probably too small to saturate one of them.
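Since several GPUs are available, one direction worth exploring is spreading the per-task allocations across devices rather than piling every thread onto device 0. Here is a rough sketch: `round_robin_devices` and `make_array_on_device` are hypothetical helpers of my own, while `cupy.cuda.Device` and `cupy.cuda.runtime.getDeviceCount` are real CuPy APIs. A NumPy fallback keeps the sketch runnable on machines without a GPU:

```python
try:
    import cupy
    n_devices = cupy.cuda.runtime.getDeviceCount()
except Exception:
    cupy = None  # no GPU available; pretend there is one CPU "device"
    n_devices = 1

def round_robin_devices(n_tasks, n_devices):
    """Assign a device index to each task, cycling through the GPUs."""
    return [i % n_devices for i in range(n_tasks)]

def make_array_on_device(device_id, shape=(5000, 5000)):
    """Allocate a random array on the given GPU (NumPy fallback on CPU)."""
    if cupy is None:
        import numpy as np
        return np.random.random(shape)
    with cupy.cuda.Device(device_id):  # make device_id current for this block
        return cupy.random.random(shape)

# e.g. spread the 100 chunks from the example over the available GPUs:
assignments = round_robin_devices(n_tasks=100, n_devices=n_devices)
```

This only addresses placement, not scheduling; a fuller answer would also pin each dask worker thread to a device (the approach projects like dask-cuda take, with one worker process per GPU).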

@seibert @sklam I'd be curious to get your take on handling performance here. Can you recommend reading that explains how I should think about concurrency on top of these things?
