
[python] aio: fix race condition causing asyncio.run() to hang forever during the shutdown process #40989

Closed

kevin85421 wants to merge 1 commit into grpc:master from kevin85421:asyncio-hang

Conversation

@kevin85421
Contributor

@kevin85421 kevin85421 commented Oct 31, 2025

Root cause

  • gRPC AIO creates a Unix domain socket pair, and the current thread passes the read socket to the event loop for reading, while the write socket is passed to a thread for polling events and writing a byte into the socket.
  • However, during the shutdown process, the event loop stops reading the read socket without closing it before the polling thread receives the final event to exit the thread.
  • The shutdown process will hang if (1) the event loop stops reading the read socket before the polling thread receives the final event to exit the thread, and (2) the polling thread is stuck in the write syscall (a standalone sketch of this deadlock follows this list).
    • The write syscall may get stuck in sock_alloc_send_pskb (https://elixir.bootlin.com/linux/v5.15/source/net/core/sock.c#L2463) when there is not enough socket buffer space for the write socket. Hence, the polling thread hangs in write and cannot continue to the next iteration to retrieve the final event. Because the event loop no longer reads the read socket, buffer space for the write socket is never freed. Therefore, the current thread hangs while waiting for the polling thread to join().
  • asyncio will shut down the default executor (ThreadPoolExecutor) when asyncio.run(...) finishes. Hence, the program hangs because some threads can't join.
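
The same deadlock can be demonstrated outside of gRPC. Below is a minimal standalone sketch (not gRPC code, and not the code touched by this PR) that recreates the mechanism with a plain socket.socketpair(): a background thread keeps writing while the reading side stops draining, so the writer blocks inside the send/write syscall and join() never completes. The 8192-byte buffer and 4096-byte writes mirror Steps 0 and 1 of the reproduction below.

```python
# deadlock_sketch.py -- standalone illustration of the race described above; not gRPC code.
import socket
import threading

read_sock, write_sock = socket.socketpair()
# Shrink the send buffer so it fills quickly (mirrors Step 0 of the reproduction).
write_sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 8192)

def poller():
    # Like the polling thread: keeps writing, and blocks in send() once the buffer
    # is full because the other end is no longer being read.
    while True:
        write_sock.send(b"1" * 4096)

t = threading.Thread(target=poller, daemon=True)
t.start()

read_sock.recv(1)   # the "event loop" reads once...
# ...and then stops reading without closing read_sock -- the shutdown race.
t.join(timeout=5)   # a real join() with no timeout would block here forever
print("poller still alive:", t.is_alive())  # expected: True, stuck in send()
```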

Reproduction

  • Step 0: Reduce the socket buffer size to increase the probability of reproducing the issue.

    sysctl -w net.core.rmem_default=8192
    sysctl -w net.core.rmem_default=8192
  • Step 1: Manually update `unistd.write(fd, b'1', 1)` to `unistd.write(fd, b'1' * 4096, 4096)` in completion_queue.pyx.pxi (https://github.com/grpc/grpc/blob/8e67cb088d3709ae74c1ff31d1655bea6c2b86c0/src/python/grpcio/grpc/_cython/_cygrpc/aio/completion_queue.pyx.pxi#L31). The goal is to make writes (4096 bytes per write) faster than reads (1 byte per read), thereby filling the write buffer nearly to capacity.

  • Step 2: Create an `aio.insecure_channel` and use it to send 100 requests with at most 10 in-flight requests. After all requests finish, the shutdown process will be triggered, and it's highly likely to hang if you follow Steps 0 and 1 correctly. My reproduction script reproduces the issue 10 out of 10 times (a sketch of such a script follows these steps).

  • Step 3: If it hangs, check the following information:

    • `ss -xpnm state connected | grep $PID` => You will find two sockets that belong to the same socket pair: one has non-zero bytes in its read buffer and the other has non-zero bytes in its write buffer. In addition, the write buffer should be close to `net.core.rmem_default`.
    • Check the stack of the `_poller_thread` by running `cat /proc/$PID/task/$TID/stack`. The thread is stuck at `sock_alloc_send_pskb` because there is not enough buffer space to finish the `write` syscall.
    • Use GDB to find the `_poller_thread` and make sure it's stuck at `write()`, then print its `$rdi` to confirm that the FD is the one with a non-zero write buffer.
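
For Step 2, a driver could look like the sketch below. This is not the exact reproduction script from this PR; the server address localhost:50051 and the method path /demo.Echo/Echo are placeholders, and it assumes a unary echo-style server that accepts raw bytes is already running.

```python
# repro_sketch.py -- hypothetical Step 2 driver: 100 requests, at most 10 in flight.
import asyncio
import grpc

async def main():
    async with grpc.aio.insecure_channel("localhost:50051") as channel:
        # With no (de)serializers, requests and responses are passed as raw bytes.
        echo = channel.unary_unary("/demo.Echo/Echo")
        sem = asyncio.Semaphore(10)  # cap the number of in-flight requests at 10

        async def one_request(i: int) -> None:
            async with sem:
                await echo(f"request-{i}".encode())

        await asyncio.gather(*(one_request(i) for i in range(100)))
    # Channel shutdown starts when the `async with` block exits; with Steps 0 and 1
    # applied, this is where asyncio.run() can hang while the poller thread is joined.

if __name__ == "__main__":
    asyncio.run(main())
```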

Test

Follow Steps 0, 1, and 2 in the 'Reproduction' section with this PR applied. It doesn't hang in 10 out of 10 runs.

@kevin85421
Contributor Author

kevin85421 commented Oct 31, 2025

I'm not sure who the best person is to review this PR, but I checked the label lang/Python, and I guess it's @sergiitk or @sreenithi? This issue happens frequently in some environments. Thank you!

@sergiitk
Member

sergiitk commented Nov 3, 2025

@kevin85421 added this to gRPC Python triage meeting agenda.

@kevin85421
Contributor Author

@sergiitk Thank you! Feel free to let me know which information you need. I can also attend a community sync to share some context if it helps.

@sergiitk sergiitk self-assigned this Nov 5, 2025
@sergiitk sergiitk added the kokoro:run and release notes: yes labels and removed the kokoro:run label Nov 5, 2025
@kevin85421
Contributor Author

@sergiitk I am not familiar with the gRPC CI, but the CI failures appear to be unrelated to this PR (they look related to ObjC and C/C++). Is there anything I can do to get this PR merged? Thanks!

@sergiitk
Member

@kevin85421 I haven't gotten around to reviewing the PR yet. I'll deflake the CI.

@kevin85421
Contributor Author

@sergiitk thanks! I am happy to have a call with you and your team to explain the PR if this helps the PR review. Thanks!

@sergiitk sergiitk requested review from asheshvidyut and removed request for sreenithi November 26, 2025 16:57
@kevin85421
Contributor Author

@sergiitk @asheshvidyut Happy Thanksgiving! Is there any update on this PR? I am happy to have a call with you and your team to explain the PR if this helps the PR review. Thanks!

@kevin85421
Contributor Author

@asheshvidyut thank you for the review! Is this PR ready to merge?

@kevin85421
Contributor Author

cc @sergiitk @asheshvidyut sorry to bother. Is there anything I can do to get it merged? Thanks!

@kevin85421
Contributor Author

cc @asheshvidyut @sergiitk - sorry to bother again. I believe this issue blocks many users, but most can't identify the root cause or work around it to unblock themselves as I did. I hope you can reply and let me know how to move this PR forward.

It’s very frustrating for new contributors who receive no feedback from maintainers, hindering their ability to participate in the community.

cc top 10 contributors for gRPC in 2025. @ctiller @tanvi-jagtap @markdroth @yashykt @veblush @drfloob @ac-patel @rishesh007 @XuanWang-Amos @Vignesh2208

@sergiitk
Member

@kevin85421 Please don't do that. Tagging unrelated contributors won't make it merge faster. We have limited resources and a process for reviewing code contributions.
Your PR needs 2 approvals from the gRPC Python team to get merged. We didn't get to your PR during previous triage meetings, but today we did. I will review it now.

@sergiitk sergiitk changed the title [python] aio: race condition causes asyncio.run() to hang forever during the shutdown process [python] aio: fix race condition causing asyncio.run() to hang forever during the shutdown process Dec 18, 2025
@kevin85421
Contributor Author

kevin85421 commented Dec 18, 2025

> @kevin85421 Please don't do that. Tagging unrelated contributors won't make it merge faster. We have limited resources and a process for reviewing code contributions. Your PR needs 2 approvals from the gRPC Python team to get merged. We didn't get to your PR during previous triage meetings, but today we did. I will review it now.

Hi @sergiitk,

I have maintained and contributed to multiple open-source projects. I have also mentored over 20 beginners, helping them become experienced open-source software contributors, including committers and maintainers, and later hired some of them.

Therefore, I am very familiar with how OSS communities operate and understand that OSS teams are typically understaffed. I tagged others because I became very frustrated after multiple follow-ups (#40989 (comment), #40989 (comment), #40989 (comment), #40989 (comment), #40989 (comment), #40989 (comment)) without receiving a reply about how to move this forward.

I’m not here to fight. I contribute to open-source projects because I’m eager to learn new technologies and connect with people worldwide. Could you share how I can proceed next time to contribute effectively? Thanks.

sergiitk added a commit to sergiitk/grpc-py-repro that referenced this pull request Dec 18, 2025
…yncio.run() to hang forever during the shutdown process
@sergiitk
Member

It's alright. Nobody is here to fight. Just saying tagging unrelated people is not productive.

That said, I can't seem to reproduce the deadlock. Here's what I tried: https://github.com/sergiitk/grpc-py-repro/tree/main/40989-aio-deadlock

From the review perspective, the change makes sense. I'm running some internal tests. If they pass, I'll get this merged. However, the repo is locked now due to high test flakiness (unrelated to this change). Don't know when it gets unlocked, but I expect this PR to get merged before the holiday season.

@sergiitk
Member

sergiitk commented Dec 18, 2025

Just reproduced it after several re-runs. Though it's inconsistent on my side.

$ ss -xpnm state connected | grep 139574
u_str ESTAB 0      0      * 516168  * 516167 users:(("python3",pid=139574,fd=5))  skmem:(r0,rb8192,t0,tb8192,f0,w0,o0,bl0,d0)
u_str ESTAB 0      11264  * 503168  * 503167 users:(("python3",pid=139574,fd=9))  skmem:(r0,rb8192,t11264,tb8192,f0,w0,o0,bl0,d0)
u_str ESTAB 8188   0      * 503167  * 503168 users:(("python3",pid=139574,fd=8))  skmem:(r0,rb8192,t0,tb8192,f0,w0,o0,bl0,d0)
u_str ESTAB 0      0      * 516167  * 516168 users:(("python3",pid=139574,fd=4))  skmem:(r0,rb8192,t0,tb8192,f0,w0,o0,bl0,d0)
sudo cat /proc/139574/task/139578/stack
[<0>] sock_alloc_send_pskb+0x168/0x240
[<0>] unix_stream_sendmsg+0x167/0x6a0
[<0>] bpf_trampoline_6442564013+0xc1/0x16b
[<0>] unix_stream_sendmsg+0x9/0x6a0
[<0>] sock_write_iter+0x18e/0x1a0
[<0>] vfs_write+0x3b4/0x450
[<0>] ksys_write+0xbe/0xe0
[<0>] bpf_trampoline_6442502961+0x71/0x11b
[<0>] __x64_sys_write+0x9/0x20
[<0>] do_syscall_64+0x84/0x320
[<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e

edit: cat correct task

Member

@sergiitk sergiitk left a comment

Thank you for the contribution

@kevin85421
Contributor Author

@sergiitk thanks!

> Just reproduced it after several re-runs. Though it's inconsistent on my side.

The issue can be consistently reproduced in some more stressful real-world situations. The good thing is that the blocked write syscall can be interrupted by signals; therefore, sending the process a signal can be used as a workaround to break the hang.

@sergiitk
Member

@kevin85421 Good news. Internal tests passed, the repo is being unlocked now. I'm hoping to get this merged today.

> The issue can be consistently reproduced in some more stressful real-world situations.

Makes sense. My point was that I wasn't able to reproduce it consistently on my machine to verify the fix. And the test run wasn't exactly fast. I ended up re-running my repro 20 times overnight, and the fixed version hasn't locked up once.

But more importantly, I ran your change against the full Google codebase that depends on gRPC, which is millions of targets, each of which may contain hundreds of tests.

Again, thank you for the contribution, and for looking into this problem in such detail. Though the change is just a couple of lines, root-causing and debugging this probably wasn't easy. I needed extra review time for this exact reason. I wish sock_alloc_send_pskb (and sock.c in general) were documented better; that would've made things simpler. I found the sk_buff documentation, and it ended up being helpful to a degree: https://docs.kernel.org/networking/skbuff.html.

@sergiitk
Member

This is merged now.

@kevin85421
Contributor Author

> which is millions of targets, each of which may contain hundreds of tests.

It's impressive that gRPC has such comprehensive test coverage.

> This is merged now.

Thanks @sergiitk @asheshvidyut for the review!

asheshvidyut pushed a commit to asheshvidyut/grpc that referenced this pull request Dec 29, 2025
…ver during the shutdown process (grpc#40989)

Labels

kokoro:run lang/Python release notes: yes

5 participants