Sporadic connection failures due to 'ping timeout' in grpcio==1.68.0 and newer. #39113
Description
The Python gRPC release grpcio==1.68.0 introduces a regression that I can consistently reproduce in an Apache Beam pipeline running on Cloud Dataflow.
In the reproduction, I have two processes: a Python process (client) and a C++ process (server), each running in its own Docker container. The processes establish bidirectional streaming RPC channels with each other, and both run on the same VM, connecting to a localhost:someport address.
The client process writes ~15-50 GB of data over the network to GCS in a separate thread, while the connection channels with the server are owned by other threads.
If the amount of data written in the side thread crosses a certain threshold (between 10 and 15 GB), the gRPC connections between client and server start to terminate with errors like:
UNKNOWN:Error received from peer ipv6:%5B::1%5D:12371 {created_time:"2024-12-03T13:53:05.992753213+00:00", grpc_status:14, grpc_message:"ping timeout"
Mitigation:
Set the environment variable GRPC_EXPERIMENTS="-event_engine_client" in the environment of the Python process, or downgrade to an earlier version of grpc. We are sticking with "grpc<1.66.0" in Apache Beam for now and do not see this error there.
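If the variable cannot be set from outside the process (for example in the container environment), it can also be set from Python. Here is a minimal sketch, not a verified fix. The helper name `disable_experiment` is hypothetical. It assumes that GRPC_EXPERIMENTS is a comma-separated list in which a leading "-" disables an experiment, and that the variable has to be in place before `grpc` is first imported.

```python
import os


def disable_experiment(name, env=None):
    """Add "-<name>" to GRPC_EXPERIMENTS, keeping any other entries.

    Hypothetical helper. Assumes GRPC_EXPERIMENTS is a comma-separated
    list where a leading "-" turns an experiment off.
    """
    env = os.environ if env is None else env
    entries = [e.strip() for e in env.get("GRPC_EXPERIMENTS", "").split(",")]
    # Drop empty entries and any earlier setting for the same experiment.
    entries = [e for e in entries if e and e.lstrip("-") != name]
    entries.append("-" + name)
    env["GRPC_EXPERIMENTS"] = ",".join(entries)
    return env["GRPC_EXPERIMENTS"]


# Assumption: this must run before the first `import grpc` in the process,
# so that the setting is already present when the library initializes.
disable_experiment("event_engine_client")

# import grpc  # import only after the variable has been set
```

Merging with the existing value rather than overwriting it keeps any other experiment settings that are already configured for the process.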
cc: @drfloob, @yashykt, @XuanWang-Amos, who have started investigating this and might be able to add details and/or a root cause once more information becomes available.
What operating system (Linux, Windows,...) and version?
Linux.
Reproducible on grpcio==1.68.0 and newer, including the current latest version (grpcio==1.71.0).
What runtime / compiler are you using (e.g. python version or version of gcc)
Python 3.10