Client-side Python fork support can hang on Python 3.7 #18075

@ericgribkoff

#16513 adds fork-without-exec tests for Python clients using GRPC_ENABLE_FORK_SUPPORT=1 and the multiprocessing library. The tests are flaky on Kokoro under Python 3.7, and the failures appear to be caused by hanging child processes.

I have been unable to reproduce the problem locally, so I could not attach a debugger to determine the exact cause. Based on a series of trials on Kokoro, however, it looks like the child's post-fork Python handler deadlocks either when obtaining the GIL or at some point before it begins to clean up inherited channels. I determined this by inserting calls to C++'s abort() or Python's os._exit at various points within the post-fork handler. Aborting the process before obtaining the GIL deterministically avoided the deadlock, but even simply exiting the process while holding the GIL (before doing anything potentially suspicious, such as obtaining a condition variable's lock) could still result in a hung child process.

The deadlock appears on Python 3.7 but not on Python 2, and the hang does not appear to result from an error in our fork handlers (such as trying to obtain a lock that was held by another thread pre-fork). I therefore suspect that the issue comes from our pre-fork handlers only pausing active gRPC Python threads in block_if_fork_in_progress, where threads block on a condition variable until the fork completes. My guess is that these threads, since they are abruptly terminated at fork, are still holding some internal Python lock or otherwise leave the interpreter's state in a broken condition post-fork. I couldn't find a corresponding known issue on the Python issue tracker, nor have I confirmed this to be the case, but it seems the most likely cause at this point. It's also not clear that this would even indicate a bug in Python, as the documentation is unclear as to whether forking with living (but blocked) threads via the multiprocessing module is intended to be safe.
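
For readers unfamiliar with the pattern, here is a minimal sketch of threads pausing on a condition variable while a fork is in progress. It is not gRPC Python's actual code: the ForkGate class and its before_fork/after_fork_in_parent methods are hypothetical names used for illustration, and only the name block_if_fork_in_progress is taken from the description above.

```python
import threading


class ForkGate:
    """Hypothetical helper: pauses worker threads while a fork is in progress."""

    def __init__(self):
        self._cond = threading.Condition()
        self._fork_in_progress = False

    def block_if_fork_in_progress(self):
        # Worker threads call this at safe points; they wait here
        # for as long as the fork flag is set.
        with self._cond:
            while self._fork_in_progress:
                self._cond.wait()

    def before_fork(self):
        # Pre-fork hook (assumed): mark the fork as started so that
        # workers pause the next time they reach the gate.
        with self._cond:
            self._fork_in_progress = True

    def after_fork_in_parent(self):
        # Post-fork hook in the parent (assumed): clear the flag and
        # wake every paused worker.
        with self._cond:
            self._fork_in_progress = False
            self._cond.notify_all()
```

Note that the paused threads are still alive at the moment of fork and simply do not exist in the child, which is the situation the hypothesis above is concerned with.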

One option to both test the above hypothesis and hopefully fix the underlying issue would be to change gRPC Python's fork handlers to actually stop (join) its threads pre-fork and resume them post-fork. This is the behavior of gRPC core's fork handlers. It is a somewhat invasive change because, unlike core, gRPC Python doesn't have any executor/queue concept where a thread's ongoing work can be placed while the thread itself is joined and then re-created in the parent's post-fork handler. A lighter-weight approach that may also suffice is to have the threads simply release the GIL prior to fork, block, then reacquire the GIL after fork in the parent.
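
A rough sketch of the join-and-resume idea, assuming a simple task queue were available to hold work while the thread is stopped. RestartableWorker, stop_for_fork, and resume_after_fork are hypothetical names, not existing gRPC Python APIs.

```python
import queue
import threading

_STOP = object()  # sentinel telling the worker thread to exit


class RestartableWorker:
    """Hypothetical worker whose thread can be joined before a fork."""

    def __init__(self):
        self._tasks = queue.Queue()
        self._thread = None
        self.results = []

    def start(self):
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, fn):
        # Work stays in the queue even when no thread is running.
        self._tasks.put(fn)

    def _run(self):
        while True:
            task = self._tasks.get()
            if task is _STOP:
                return
            self.results.append(task())

    def stop_for_fork(self):
        # Pre-fork: work queued ahead of the sentinel is finished first,
        # then the thread exits and is joined, so no thread is alive at fork.
        self._tasks.put(_STOP)
        self._thread.join()
        self._thread = None

    def resume_after_fork(self):
        # Post-fork in the parent: re-create the thread; anything submitted
        # while it was stopped is picked up from the queue.
        self.start()
```

In this sketch the queue is what makes the restart possible, which is the piece the paragraph above notes is missing from gRPC Python today.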

The spawn start method added to multiprocessing in Python 3.4 may also offer Python 3.7 users an alternative to the current fork support via fork handlers, one that would work around the above issue entirely.
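
As an illustration of the spawn alternative, here is a minimal sketch using only the standard library. The child starts a fresh interpreter instead of inheriting the parent's threads and locks; child_task, make_spawn_context, and run_in_spawned_child are hypothetical names, and no gRPC calls are shown.

```python
import multiprocessing


def child_task(x):
    # Placeholder for work that would create its own gRPC channel
    # inside the child rather than inheriting one from the parent.
    return x * 2


def make_spawn_context():
    # A context object selects the start method without changing
    # the process-wide default.
    return multiprocessing.get_context("spawn")


def run_in_spawned_child(x):
    ctx = make_spawn_context()
    with ctx.Pool(processes=1) as pool:
        return pool.apply(child_task, (x,))


# Usage: with spawn, the entry point must be guarded so that the child
# can import this module without re-running it.
# if __name__ == "__main__":
#     print(run_in_spawned_child(21))
```

The trade-off is that a spawned child shares no state with the parent, so anything it needs must be picklable and passed explicitly.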
