When using CuPy with the memory pool from a multi-threaded app, it sometimes fails to launch a kernel (CUDADriverError: CUDA_ERROR_INVALID_CONTEXT: invalid device context). I think this is because the CUDA Driver API (used to launch the kernel) is called without a context established on the host thread.
Here is a simple script to reproduce it:
import chainer  # Enable memory pool; without this line the issue does not reproduce.
import cupy
import threading

def run(size):
    # Uncomment the following line to explicitly establish a CUDA context
    # on the current host thread:
    # cupy.cuda.runtime.free(0)
    print(cupy.arange(size, dtype=int))

size = 1024

# Run in the main thread; this is OK.
# CuPy mallocs memory via the Runtime API, then launches the kernel with the Driver API.
run(size)

# Run in another thread; this fails.
# The spawned thread tries to launch the kernel without establishing a context,
# because the Runtime API is not used (the memory block acquired in the
# previous run is reused from the pool).
t = threading.Thread(target=run, args=(size,))
t.start()
t.join()
As commented in the above code, I could work around the problem by calling a harmless Runtime API function, e.g., cupy.cuda.runtime.free(0), to explicitly establish a context on the host thread.
It would be great if CuPy could take care of this use case, but documenting the behavior may be enough.
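For illustration, here is a minimal sketch of how a library could guarantee the initialization runs once per host thread, using threading.local. The names establish_context, ensure_context, and launch_kernel are hypothetical stand-ins (establish_context represents the harmless Runtime API call such as cupy.cuda.runtime.free(0)); this is not CuPy's actual internal API.

```python
import threading

_tls = threading.local()
init_count = 0  # For illustration only: counts per-thread initializations.

def establish_context():
    # Hypothetical stand-in for a harmless CUDA Runtime API call
    # (e.g. cupy.cuda.runtime.free(0)) that establishes the context.
    global init_count
    init_count += 1

def ensure_context():
    # Run the context initialization at most once per host thread.
    if not getattr(_tls, "initialized", False):
        establish_context()
        _tls.initialized = True

def launch_kernel():
    ensure_context()
    # ... the Driver API kernel launch would follow here ...

# Two calls on the same thread initialize only once; a new thread
# triggers its own initialization.
launch_kernel()
launch_kernel()
t = threading.Thread(target=launch_kernel)
t.start()
t.join()
print(init_count)  # 2: once for the main thread, once for the worker.
```

A guard like this at every kernel-launch entry point would make the pool-reuse path safe without requiring users to call a Runtime API function themselves.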