
[python] aio: fix race condition causing asyncio.run() to hang forever during the shutdown process #40989

Closed

kevin85421 wants to merge 1 commit into grpc:master from kevin85421:asyncio-hang

Conversation

@kevin85421
Contributor

@kevin85421 kevin85421 commented Oct 31, 2025

Root cause

  • gRPC AIO creates a Unix domain socket pair, and the current thread passes the read socket to the event loop for reading, while the write socket is passed to a thread for polling events and writing a byte into the socket.
  • However, during the shutdown process, the event loop stops reading the read socket without closing it before the polling thread receives the final event to exit the thread.
  • The shutdown process will hang if (1) the event loop stops reading the read socket before the polling thread receives the final event to exit the thread, and (2) the polling thread is stuck in the write syscall (a standalone sketch of this deadlock follows this list).
    • The write syscall may get stuck in sock_alloc_send_pskb (https://elixir.bootlin.com/linux/v5.15/source/net/core/sock.c#L2463) when there is not enough socket buffer space for the write socket. Hence, the polling thread hangs in write and cannot continue to the next iteration to retrieve the final event. Because the event loop no longer reads the read socket, buffer space for the write socket is never freed. Therefore, the current thread hangs while waiting for the polling thread to join().
  • asyncio will shut down the default executor (ThreadPoolExecutor) when asyncio.run(...) finishes. Hence, the program hangs because some threads can't join.
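
The same deadlock can be demonstrated outside of gRPC. Below is a minimal standalone sketch (not gRPC code, and not the code touched by this PR) that recreates the mechanism with a plain socket.socketpair(): a background thread keeps writing while the reading side stops draining, so the writer blocks inside the send/write syscall and join() never completes. The 8192-byte buffer and 4096-byte writes mirror Steps 0 and 1 of the reproduction below.

```python
# deadlock_sketch.py -- standalone illustration of the race described above; not gRPC code.
import socket
import threading

read_sock, write_sock = socket.socketpair()
# Shrink the send buffer so it fills quickly (mirrors Step 0 of the reproduction).
write_sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 8192)

def poller():
    # Like the polling thread: keeps writing, and blocks in send() once the buffer
    # is full because the other end is no longer being read.
    while True:
        write_sock.send(b"1" * 4096)

t = threading.Thread(target=poller, daemon=True)
t.start()

read_sock.recv(1)   # the "event loop" reads once...
# ...and then stops reading without closing read_sock -- the shutdown race.
t.join(timeout=5)   # a real join() with no timeout would block here forever
print("poller still alive:", t.is_alive())  # expected: True, stuck in send()
```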

Reproduction

  • Step 0: Reduce the socket buffer size to increase the probability of reproducing the issue.

    sysctl -w net.core.rmem_default=8192
    sysctl -w net.core.rmem_default=8192
  • Step 1: Manually update `unistd.write(fd, b'1', 1)` to `unistd.write(fd, b'1' * 4096, 4096)` in completion_queue.pyx.pxi (https://github.com/grpc/grpc/blob/8e67cb088d3709ae74c1ff31d1655bea6c2b86c0/src/python/grpcio/grpc/_cython/_cygrpc/aio/completion_queue.pyx.pxi#L31). The goal is to make writes (4096 bytes per write) faster than reads (1 byte per read), thereby filling the write buffer nearly to capacity.

  • Step 2: Create an `aio.insecure_channel` and use it to send 100 requests with at most 10 in-flight requests. After all requests finish, the shutdown process will be triggered, and it's highly likely to hang if you follow Steps 0 and 1 correctly. My reproduction script reproduces the issue 10 out of 10 times (a sketch of such a script follows these steps).

  • Step 3: If it hangs, check the following information:

    • `ss -xpnm state connected | grep $PID` => You will find two sockets that belong to the same socket pair: one has non-zero bytes in its read buffer and the other has non-zero bytes in its write buffer. In addition, the write buffer should be close to `net.core.rmem_default`.
    • Check the stack of the `_poller_thread` by running `cat /proc/$PID/task/$TID/stack`. The thread is stuck at `sock_alloc_send_pskb` because there is not enough buffer space to finish the `write` syscall.
    • Use GDB to find the `_poller_thread` and make sure it's stuck at `write()`, then print its `$rdi` to confirm that the FD is the one with a non-zero write buffer.
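
For Step 2, a driver could look like the sketch below. This is not the exact reproduction script from this PR; the server address localhost:50051 and the method path /demo.Echo/Echo are placeholders, and it assumes a unary echo-style server that accepts raw bytes is already running.

```python
# repro_sketch.py -- hypothetical Step 2 driver: 100 requests, at most 10 in flight.
import asyncio
import grpc

async def main():
    async with grpc.aio.insecure_channel("localhost:50051") as channel:
        # With no (de)serializers, requests and responses are passed as raw bytes.
        echo = channel.unary_unary("/demo.Echo/Echo")
        sem = asyncio.Semaphore(10)  # cap the number of in-flight requests at 10

        async def one_request(i: int) -> None:
            async with sem:
                await echo(f"request-{i}".encode())

        await asyncio.gather(*(one_request(i) for i in range(100)))
    # Channel shutdown starts when the `async with` block exits; with Steps 0 and 1
    # applied, this is where asyncio.run() can hang while the poller thread is joined.

if __name__ == "__main__":
    asyncio.run(main())
```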

Test

Follow Steps 0, 1, and 2 in the 'Reproduction' section with this PR applied. It doesn't hang in 10 out of 10 runs.

@kevin85421
Contributor Author

kevin85421 commented Oct 31, 2025

I'm not sure who the best person is to review this PR, but I checked the label lang/Python, and I guess it's @sergiitk or @sreenithi? This issue happens frequently in some environments. Thank you!

@sergiitk
Member

sergiitk commented Nov 3, 2025

@kevin85421 added this to gRPC Python triage meeting agenda.

@kevin85421
Contributor Author

@sergiitk Thank you! Feel free to let me know which information you need. I can also attend a community sync to share some context if it helps.

@sergiitk sergiitk self-assigned this Nov 5, 2025
@sergiitk sergiitk added the kokoro:run and release notes: yes labels and removed the kokoro:run label Nov 5, 2025
@kevin85421
Contributor Author

@sergiitk I am not familiar with the gRPC CI, but the CI failures appear to be unrelated to this PR (they look related to ObjC and C/C++). Is there anything I can do to get this PR merged? Thanks!

@sergiitk
Member

@kevin85421 I haven't gotten around to reviewing the PR yet. I'll deflake the CI.

@kevin85421
Contributor Author

@sergiitk thanks! I am happy to have a call with you and your team to explain the PR if this helps the PR review. Thanks!

@sergiitk sergiitk requested review from asheshvidyut and removed request for sreenithi November 26, 2025 16:57
@kevin85421
Contributor Author

@sergiitk @asheshvidyut Happy Thanksgiving! Is there any update on this PR? I am happy to have a call with you and your team to explain the PR if this helps the PR review. Thanks!

@kevin85421
Contributor Author

@asheshvidyut thank you for the review! Is this PR ready to merge?

@kevin85421
Contributor Author

cc @sergiitk @asheshvidyut sorry to bother. Is there anything I can do to get it merged? Thanks!

@kevin85421
Contributor Author

cc @asheshvidyut @sergiitk - sorry to bother again. I believe this issue blocks many users, but most can't identify the root cause or work around it to unblock themselves as I did. I hope you can reply and let me know how to move this PR forward.

It’s very frustrating for new contributors who receive no feedback from maintainers, hindering their ability to participate in the community.

cc top 10 contributors for gRPC in 2025. @ctiller @tanvi-jagtap @markdroth @yashykt @veblush @drfloob @ac-patel @rishesh007 @XuanWang-Amos @Vignesh2208

@sergiitk
Member

@kevin85421 Please don't do that. Tagging unrelated contributors won't make it merge faster. We have limited resources and a process for reviewing code contributions.
Your PR needs 2 approvals from the gRPC Python team to get merged. We didn't get to your PR during previous triage meetings, but today we did. I will review it now.

@sergiitk sergiitk changed the title [python] aio: race condition causes asyncio.run() to hang forever during the shutdown process [python] aio: fix race condition causing asyncio.run() to hang forever during the shutdown process Dec 18, 2025
@kevin85421
Contributor Author

kevin85421 commented Dec 18, 2025

> @kevin85421 Please don't do that. Tagging unrelated contributors won't make it merge faster. We have limited resources and a process for reviewing code contributions. Your PR needs 2 approvals from the gRPC Python team to get merged. We didn't get to your PR during previous triage meetings, but today we did. I will review it now.

Hi @sergiitk,

I have maintained and contributed to multiple open-source projects. I have also mentored over 20 beginners, helping them become experienced open-source software contributors, including committers and maintainers, and later hired some of them.

Therefore, I am very familiar with how OSS communities operate and understand that OSS teams are typically understaffed. I tagged others because I became very frustrated after multiple follow-ups (#40989 (comment), #40989 (comment), #40989 (comment), #40989 (comment), #40989 (comment), #40989 (comment)) without receiving a reply about how to move this forward.

I’m not here to fight. I contribute to open-source projects because I’m eager to learn new technologies and connect with people worldwide. Could you share how I can proceed next time to contribute effectively? Thanks.

sergiitk added a commit to sergiitk/grpc-py-repro that referenced this pull request Dec 18, 2025
…yncio.run() to hang forever during the shutdown process
@sergiitk
Member

It's alright. Nobody is here to fight. Just saying tagging unrelated people is not productive.

That said, I can't seem to reproduce the deadlock. Here's what I tried: https://github.com/sergiitk/grpc-py-repro/tree/main/40989-aio-deadlock

From the review perspective, the change makes sense. I'm running some internal tests. If they pass, I'll get this merged. However, the repo is locked now due to high test flakiness (unrelated to this change). Don't know when it gets unlocked, but I expect this PR to get merged before the holiday season.

@sergiitk
Member

sergiitk commented Dec 18, 2025

Just reproduced it after several re-runs. Though it's inconsistent on my side.

$ ss -xpnm state connected | grep 139574
u_str ESTAB 0      0      * 516168  * 516167 users:(("python3",pid=139574,fd=5))  skmem:(r0,rb8192,t0,tb8192,f0,w0,o0,bl0,d0)
u_str ESTAB 0      11264  * 503168  * 503167 users:(("python3",pid=139574,fd=9))  skmem:(r0,rb8192,t11264,tb8192,f0,w0,o0,bl0,d0)
u_str ESTAB 8188   0      * 503167  * 503168 users:(("python3",pid=139574,fd=8))  skmem:(r0,rb8192,t0,tb8192,f0,w0,o0,bl0,d0)
u_str ESTAB 0      0      * 516167  * 516168 users:(("python3",pid=139574,fd=4))  skmem:(r0,rb8192,t0,tb8192,f0,w0,o0,bl0,d0)
sudo cat /proc/139574/task/139578/stack
[<0>] sock_alloc_send_pskb+0x168/0x240
[<0>] unix_stream_sendmsg+0x167/0x6a0
[<0>] bpf_trampoline_6442564013+0xc1/0x16b
[<0>] unix_stream_sendmsg+0x9/0x6a0
[<0>] sock_write_iter+0x18e/0x1a0
[<0>] vfs_write+0x3b4/0x450
[<0>] ksys_write+0xbe/0xe0
[<0>] bpf_trampoline_6442502961+0x71/0x11b
[<0>] __x64_sys_write+0x9/0x20
[<0>] do_syscall_64+0x84/0x320
[<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e

edit: cat correct task

Member

@sergiitk sergiitk left a comment

Thank you for the contribution

@kevin85421
Contributor Author

@sergiitk thanks!

> Just reproduced it after several re-runs. Though it's inconsistent on my side.

The issue can be consistently reproduced in some more stressful real-world situations. The good thing is that the blocked write syscall can be interrupted by signals; therefore, sending the process a signal can be used as a workaround to break the hang.

@sergiitk
Member

@kevin85421 Good news. Internal tests passed, the repo is being unlocked now. I'm hoping to get this merged today.

> The issue can be consistently reproduced in some more stressful real-world situations.

Makes sense. My point was that I wasn't able to reproduce it consistently on my machine to verify the fix. And the test run wasn't exactly fast. I ended up re-running my repro 20 times overnight, and the fixed version hasn't locked up once.

But more importantly, I ran your change against the full Google codebase that depends on gRPC, which is millions of targets, each of which may contain hundreds of tests.

Again, thank you for the contribution, and for looking into this problem in such detail. Though the change is just a couple of lines, root-causing and debugging this probably wasn't easy. I needed extra review time for this exact reason. I wish sock_alloc_send_pskb (and sock.c in general) were documented better; that would've made things simpler. I found the sk_buff documentation, and it ended up being helpful to a degree: https://docs.kernel.org/networking/skbuff.html.

@sergiitk
Member

This is merged now.

@kevin85421
Contributor Author

> which is millions of targets, each of which may contain hundreds of tests.

It's impressive that gRPC has such comprehensive test coverage.

> This is merged now.

Thanks @sergiitk @asheshvidyut for the review!

asheshvidyut pushed a commit to asheshvidyut/grpc that referenced this pull request Dec 29, 2025
…ver during the shutdown process (grpc#40989)

Labels

kokoro:run lang/Python release notes: yes

5 participants