
Sporadic connection failures due to 'ping timeout' in grpcio==1.68.0 and newer. #39113

@tvalentyn

The Python gRPC release grpcio==1.68.0 introduces a regression that I can consistently reproduce in an Apache Beam pipeline running on Cloud Dataflow.

In the reproduction, I have two processes: a Python process (client) and a C++ process (server), each running in its own Docker container. The processes establish bidirectional streaming RPC channels with each other, and both run on the same VM, connecting to a localhost:someport address.

The client process writes ~15-50 GB of data over the network to GCS in a separate thread, while the connection channels with the server are owned by other threads.

If the amount of data written in the side thread crosses a certain threshold (between 10 and 15 GB), the gRPC connections between client and server start to terminate with errors like:

UNKNOWN:Error received from peer ipv6:%!B(MISSING)::1%!D(MISSING):12371 {created_time:\"2024-12-03T13:53:05.992753213+00:00\", grpc_status:14, grpc_message:"ping timeout"

Mitigation:

Set the environment variable GRPC_EXPERIMENTS="-event_engine_client" in the environment of the Python process, or downgrade to an earlier version of grpc. We are sticking with "grpc<1.66.0" in Apache Beam for now and do not reproduce this error with that pin.
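
For anyone who cannot easily change the container environment, here is a minimal sketch of applying the same setting from inside the Python process. It assumes the variable has to be in place before gRPC is first imported and initialized, and that GRPC_EXPERIMENTS is a comma-separated list, so existing entries are kept. The helper name `disable_event_engine_client` is made up for illustration and is not part of grpcio.

```python
import os


def disable_event_engine_client(env=os.environ):
    """Append "-event_engine_client" to GRPC_EXPERIMENTS, keeping other entries.

    Hypothetical helper for illustration; assumes a comma-separated list.
    """
    flag = "-event_engine_client"
    current = env.get("GRPC_EXPERIMENTS", "")
    entries = [item.strip() for item in current.split(",") if item.strip()]
    if flag not in entries:
        entries.append(flag)
    env["GRPC_EXPERIMENTS"] = ",".join(entries)
    return env["GRPC_EXPERIMENTS"]


# Call this at the very top of the entry point, before anything imports grpc.
disable_event_engine_client()

# import grpc  # import only after the variable has been set
```

Setting the variable in the container or launcher environment, as described above, is still the simpler option when it is available, since it does not depend on import order.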

cc: @drfloob, @yashykt, @XuanWang-Amos, who have started investigating this and might be able to add details and/or a root cause once more information becomes available.

What operating system (Linux, Windows,...) and version?

Linux.

Reproducible on grpcio==1.68.0 and newer, including the current latest version (grpcio==1.71.0).
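
A rough way to check whether an environment falls in the range reported here is to compare the installed grpcio version against 1.68. This is only a sketch: the helper name `in_reported_range` is hypothetical, only the major and minor components are compared, and there is no upper bound because a fixed version is not known at the time of writing.

```python
from importlib import metadata


def in_reported_range(version_string):
    """Return True if the version is 1.68 or newer (major.minor only)."""
    major, minor = version_string.split(".")[:2]
    return (int(major), int(minor)) >= (1, 68)


try:
    installed = metadata.version("grpcio")
    print(f"grpcio {installed}: in reported range = {in_reported_range(installed)}")
except metadata.PackageNotFoundError:
    print("grpcio is not installed in this environment")
```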

What runtime / compiler are you using (e.g. python version or version of gcc)

Python 3.10
