This bug is essentially the same as the previously-closed bug 12455. That bug was closed because it was fixed in 1.7.0... except we now have at least two ways to repro it in 1.8.2. :(
Basic summary of the bug: if a gRPC channel is (1) used at least once and (2) still in scope when you call fork(), and the subprocess tries to open a gRPC channel as well, then all RPCs on the subprocess's channel hang forever. Apparently, some global state is leaking across the fork() boundary in a bad way.
**What version of gRPC and what language are you using?**
gRPC 1.8.2, Python 3.6
**What operating system (Linux, Windows, …) and version?**
Repro'ed on OS X 10.13.2
**What runtime / compiler are you using (e.g. Python version or version of gcc)?**
Python 3.6.2
**What did you do?**
Here is a minimal repro using the GCP datastore client:
```python
import multiprocessing

from google.cloud import datastore


def causeTrouble(where: str):
    client = datastore.Client(project='dev-storage-humu', namespace='aquarium')
    client.get(client.key('c', 'aquarium'))
    # The call to get() hangs forever; this line is never reached.
    print('OK')


if __name__ == '__main__':
    # Create a datastore client and do an RPC on it.
    client = datastore.Client(project='dev-storage-humu', namespace='aquarium')
    client.get(client.key('c', 'aquarium'))

    # Kick off a child process while the first client is still in scope.
    process = multiprocessing.Process(target=causeTrouble,
                                      args=['child process'])
    process.start()
```
- If you change this so that the main process only creates the client and never calls `client.get()` before forking, it works fine.
- If you change this so that, instead of creating the client and calling `client.get()` directly, main calls `causeTrouble()` before forking, it also works fine. (NB: CPython's reference counting collects the client as soon as `causeTrouble()` returns, so the difference between the two variants is that the client is no longer alive at fork time.)
- In another server, I worked around this by calling `subprocess.run()` instead of `multiprocessing.Process()`; since `subprocess.run()` does a fork() + exec(), no state survives across the process boundary. However, exec() strictly limits communication with the parent process, basically making all of the multiprocessing library unusable.
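The fork() + exec() workaround in the last bullet can be sketched as follows (the `-c` command line here is illustrative; the real server would exec its own worker script):

```python
# subprocess.run() forks and then exec()s a fresh interpreter, so no
# gRPC state from the parent survives into the child. The trade-off is
# that communication with the child is limited to argv, environment
# variables, pipes, and exit codes.
import subprocess
import sys

result = subprocess.run(
    [sys.executable, '-c', 'print("child ran in a clean interpreter")'],
    capture_output=True,
    text=True,
)
print(result.stdout.strip())  # the child's output, read back over a pipe
```

This sidesteps the bug entirely, at the cost of losing multiprocessing's shared queues, pipes, and pickled arguments.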
If you kill it while it's hanging, you get this stack trace:
```
Traceback (most recent call last):
  File "/Users/zunger/.pyenv/versions/3.6.2/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/Users/zunger/.pyenv/versions/3.6.2/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "minimal_repro.py", line 8, in causeTrouble
    client.get(client.key('c', 'aquarium'))
  File "/Users/zunger/src/humu/servers/squeegee/server/.build/venv/lib/python3.6/site-packages/google/cloud/datastore/client.py", line 309, in get
    deferred=deferred, transaction=transaction)
  File "/Users/zunger/src/humu/servers/squeegee/server/.build/venv/lib/python3.6/site-packages/google/cloud/datastore/client.py", line 356, in get_multi
    transaction_id=transaction and transaction.id,
  File "/Users/zunger/src/humu/servers/squeegee/server/.build/venv/lib/python3.6/site-packages/google/cloud/datastore/client.py", line 138, in _extended_lookup
    project, read_options, key_pbs)
  File "/Users/zunger/src/humu/servers/squeegee/server/.build/venv/lib/python3.6/site-packages/google/cloud/datastore/_gax.py", line 115, in lookup
    return super(GAPICDatastoreAPI, self).lookup(*args, **kwargs)
  File "/Users/zunger/src/humu/servers/squeegee/server/.build/venv/lib/python3.6/site-packages/google/cloud/gapic/datastore/v1/datastore_client.py", line 204, in lookup
    return self._lookup(request, options)
  File "/Users/zunger/src/humu/servers/squeegee/server/.build/venv/lib/python3.6/site-packages/google/gax/api_callable.py", line 452, in inner
    return api_caller(api_call, this_settings, request)
  File "/Users/zunger/src/humu/servers/squeegee/server/.build/venv/lib/python3.6/site-packages/google/gax/api_callable.py", line 438, in base_caller
    return api_call(*args)
  File "/Users/zunger/src/humu/servers/squeegee/server/.build/venv/lib/python3.6/site-packages/google/gax/api_callable.py", line 376, in inner
    return a_func(*args, **kwargs)
  File "/Users/zunger/src/humu/servers/squeegee/server/.build/venv/lib/python3.6/site-packages/google/gax/retry.py", line 121, in inner
    return to_call(*args)
  File "/Users/zunger/src/humu/servers/squeegee/server/.build/venv/lib/python3.6/site-packages/google/gax/retry.py", line 68, in inner
    return a_func(*updated_args, **kwargs)
  File "/Users/zunger/src/humu/servers/squeegee/server/.build/venv/lib/python3.6/site-packages/grpc/_channel.py", line 484, in __call__
    credentials)
  File "/Users/zunger/src/humu/servers/squeegee/server/.build/venv/lib/python3.6/site-packages/grpc/_channel.py", line 478, in _blocking
    _handle_event(completion_queue.poll(), state,
  File "src/python/grpcio/grpc/_cython/_cygrpc/completion_queue.pyx.pxi", line 100, in grpc._cython.cygrpc.CompletionQueue.poll
```
On b/12455, @katbusch reported that simply importing certain libraries prior to the fork (e.g. `from google.cloud import bigquery`) was sufficient to trigger this bug; presumably they do enough gRPC initialization at import time.
That makes workarounds that pre-fork a pile of worker processes and pass them jobs as needed harder, though still possible as a short-term measure.
**Anything else we should know about your project / environment?**
The use case that necessitates this: we have a server (which uses GCP features extensively, e.g. for storage) that needs to fork off subprocesses in which to run long-running operations; it's basically an analysis pipeline manager. Since Python relies heavily on fork() for parallelization (thanks to the GIL, its threads provide no real parallelism), this is effectively the only approach available.