[python] aio: fix race condition causing asyncio.run() to hang forever during the shutdown process #40989

kevin85421 wants to merge 1 commit into grpc:master
I'm not sure who the best person is to review this PR, but I checked the label lang/Python, and I guess it's @sergiitk or @sreenithi? This issue happens frequently in some environments. Thank you!

@kevin85421 added this to the gRPC Python triage meeting agenda.

@sergiitk Thank you! Feel free to let me know which information you need. I can also attend a community sync to share some context if it helps.

@sergiitk I am not familiar with gRPC CI, but the CI failures seem to be unrelated to this PR (they appear to involve ObjC and C/C++). Is there anything I can do to get this PR merged? Thanks!

@kevin85421 I haven't gotten around to reviewing the PR yet. I'll deflake the CI.
@sergiitk Thanks! I am happy to have a call with you and your team to explain the PR if that helps the review.

@sergiitk @asheshvidyut Happy Thanksgiving! Is there any update on this PR? I am happy to have a call with you and your team to explain it if that helps the review. Thanks!

@asheshvidyut Thank you for the review! Is this PR ready to merge?

cc @sergiitk @asheshvidyut Sorry to bother you. Is there anything I can do to get it merged? Thanks!

cc @asheshvidyut @sergiitk - Sorry to bother you again. I believe this issue blocks many users, most of whom can't identify the root cause or work around it to unblock themselves as I did. I hope you can reply and let me know how to move this PR forward. It is very frustrating for new contributors to receive no feedback from maintainers; it hinders their ability to participate in the community. cc top 10 contributors for gRPC in 2025: @ctiller @tanvi-jagtap @markdroth @yashykt @veblush @drfloob @ac-patel @rishesh007 @XuanWang-Amos @Vignesh2208

@kevin85421 Please don't do that. Tagging unrelated contributors won't get this merged faster. We have limited resources and a process for reviewing code contributions.
Hi @sergiitk, I have maintained and contributed to multiple open-source projects. I have also mentored over 20 beginners, helping them become experienced open-source contributors, including committers and maintainers, and later hired some of them. I am therefore very familiar with how OSS communities operate and understand that OSS teams are typically understaffed. I tagged others because I became very frustrated after multiple follow-ups (#40989 (comment), #40989 (comment), #40989 (comment), #40989 (comment), #40989 (comment), #40989 (comment)) without receiving a reply about how to move this forward. I'm not here to fight. I contribute to open-source projects because I'm eager to learn new technologies and connect with people worldwide. Could you share how I can proceed next time to contribute effectively? Thanks.
It's alright. Nobody is here to fight. I'm just saying that tagging unrelated people is not productive.

That said, I can't seem to reproduce the deadlock. Here's what I tried: https://github.com/sergiitk/grpc-py-repro/tree/main/40989-aio-deadlock

From the review perspective, the change makes sense. I'm running some internal tests; if they pass, I'll get this merged. However, the repo is locked now due to high test flakiness (unrelated to this change). I don't know when it gets unlocked, but I expect this PR to be merged before the holiday season.

Just reproduced it after several re-runs, though it's inconsistent on my side. (edit: cat correct task)
sergiitk left a comment:

Thank you for the contribution
@sergiitk Thanks!

The issue can be consistently reproduced in some more stressful real-world situations. The good news is that the `write` syscall also monitors signals, so sending the process a signal can be used as a workaround to prevent the hang.
@kevin85421 Good news: internal tests passed, and the repo is being unlocked now. I'm hoping to get this merged today.

Makes sense. My point was that I wasn't able to reproduce it consistently on my machine to verify the fix, and the test run wasn't exactly fast. I ended up setting my repro to re-run overnight 20 times, and the fixed version hasn't locked up once. More importantly, I ran your change against the full Google codebase that depends on gRPC, which is millions of targets, each of which may contain hundreds of tests. Again, thank you for the contribution, and for looking into this problem in such detail. Though the change is just a couple of lines, root-causing and debugging this probably wasn't easy. I needed extra review time for this exact reason. I wish
This is merged now.

It's impressive that gRPC has such comprehensive test coverage. Thanks @sergiitk @asheshvidyut for the review!
…ver during the shutdown process (grpc#40989)

# Root cause

* gRPC AIO creates a Unix domain socket pair; the current thread passes the read socket to the event loop for reading, while the write socket is passed to a thread that polls for events and writes a byte into the socket.
* During the shutdown process, however, the event loop stops reading the read socket without closing it before the polling thread receives the final event that tells it to exit.
* The shutdown process hangs if (1) the event loop stops reading the read socket before the polling thread receives the final event, and (2) the polling thread is stuck in the `write` syscall.
* The `write` syscall can get stuck in [sock_alloc_send_pskb](https://elixir.bootlin.com/linux/v5.15/source/net/core/sock.c#L2463) when there is not enough socket buffer space for the write socket. The polling thread then hangs in `write` and cannot reach the next iteration to retrieve the final event. Because the event loop no longer reads the read socket, the allocatable buffer space for the write socket never grows, so the current thread hangs while waiting for the polling thread to `join()`.
* `asyncio` shuts down the default executor (`ThreadPoolExecutor`) when `asyncio.run(...)` finishes, so it hangs because some threads can't join.

# Reproduction

* Step 0: Reduce the socket buffer sizes to increase the probability of reproducing the issue.

  ```sh
  sysctl -w net.core.rmem_default=8192
  sysctl -w net.core.wmem_default=8192
  ```

* Step 1: Manually change `unistd.write(fd, b'1', 1)` to `unistd.write(fd, b'1' * 4096, 4096)`. The goal is to make writes (4096 bytes per write) faster than reads (1 byte per read), filling the write buffer nearly full.
  https://github.com/grpc/grpc/blob/8e67cb088d3709ae74c1ff31d1655bea6c2b86c0/src/python/grpcio/grpc/_cython/_cygrpc/aio/completion_queue.pyx.pxi#L31

* Step 2: Create an `aio.insecure_channel` and use it to send 100 requests with at most 10 in-flight requests. After all requests finish, the shutdown process is triggered, and it is highly likely to hang if Steps 0 and 1 were followed correctly. In my case, the reproduction script reproduces the issue 10 out of 10 times.
* Step 3: If it hangs, check the following:
  * `ss -xpnm state connected | grep $PID` => you will find two sockets that belong to the same socket pair; one has non-zero bytes in its read buffer and the other has non-zero bytes in its write buffer. In addition, the write buffer should be close to `net.core.wmem_default`.
  * Check the stack of the `_poller_thread` by running `cat /proc/$PID/task/$TID/stack`. The thread is stuck in `sock_alloc_send_pskb` because there is not enough buffer space to finish the `write` syscall.
  * Use GDB to find the `_poller_thread` and make sure it's stuck in `write()`, then print its `$rdi` to confirm that the FD is the one with the non-zero write buffer.

# Test

Follow Steps 0, 1, and 2 in the 'Reproduction' section with this PR applied. It does not hang in 10 out of 10 runs.

Closes grpc#40989

COPYBARA_INTEGRATE_REVIEW=grpc#40989 from kevin85421:asyncio-hang ff74508
PiperOrigin-RevId: 846425459