Failed CI report
*** [err]: diskless slow replicas drop during rdb pipe in tests/integration/replication.tcl
rdb child didn't terminate
Summary
During diskless replication, if any single replica cannot accept a write (TCP send buffer full / EAGAIN), the master stops reading the RDB pipe entirely, stalling data delivery to all replicas — including fast ones that are ready to receive data.
The cause of the failure is similar to #14946: the socket buffer fills up more easily.
Root Cause
In rdbPipeReadHandler, the master reads from the child's RDB pipe and writes to all replica sockets in a loop. When connWrite to any replica returns a partial write (socket send buffer full), the handler:
- Installs a per-replica rdbPipeWriteHandler and increments rdb_pipe_numconns_writing
- Removes the pipe read event via aeDeleteFileEvent(server.el, server.rdb_pipe_read, AE_READABLE), stopping all pipe reads
The pipe read event is only re-enabled when all pending write handlers complete (rdb_pipe_numconns_writing == 0), meaning the slowest replica dictates the throughput for all replicas.
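The gating described above can be illustrated with a small toy model. This is only a sketch, not the actual Redis code: toy_master, toy_pipe_read, toy_replica_writable and all field names are hypothetical, and sockets are reduced to a "free bytes in the send buffer" counter. It shows how one partial write disables pipe reads for every replica until the last pending writer finishes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_REPLICAS 8

/* Toy stand-in for the master's state; all names here are hypothetical. */
typedef struct {
    size_t sndbuf_free[MAX_REPLICAS]; /* free bytes in each replica's send buffer */
    size_t pending[MAX_REPLICAS];     /* bytes of the current chunk not yet written */
    int numreplicas;
    int numconns_writing;             /* models rdb_pipe_numconns_writing */
    bool pipe_read_enabled;           /* models the AE_READABLE event on the pipe */
} toy_master;

/* Models one pass of the read handler: fan one chunk out to every replica. */
void toy_pipe_read(toy_master *m, size_t chunk) {
    assert(m->pipe_read_enabled && m->numconns_writing == 0);
    for (int i = 0; i < m->numreplicas; i++) {
        size_t n = chunk < m->sndbuf_free[i] ? chunk : m->sndbuf_free[i];
        m->sndbuf_free[i] -= n;
        m->pending[i] = chunk - n;
        if (m->pending[i] > 0) m->numconns_writing++; /* partial write */
    }
    /* A single partial write is enough to stop reading the pipe for everyone. */
    if (m->numconns_writing > 0) m->pipe_read_enabled = false;
}

/* Models replica i freeing `drained` bytes, then its write handler firing. */
void toy_replica_writable(toy_master *m, int i, size_t drained) {
    m->sndbuf_free[i] += drained;
    if (m->pending[i] == 0) return;
    size_t n = m->pending[i] < m->sndbuf_free[i] ? m->pending[i] : m->sndbuf_free[i];
    m->sndbuf_free[i] -= n;
    m->pending[i] -= n;
    /* Pipe reads resume only when the last pending writer has finished. */
    if (m->pending[i] == 0 && --m->numconns_writing == 0)
        m->pipe_read_enabled = true;
}
```

In this model, a fast replica with plenty of buffer space still receives nothing new while pipe_read_enabled is false, which is the starvation described in this report.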
Observed Behavior
With one slow replica (consuming at ~290 KB/s due to key-load-delay):
- Master bursts ~1.3 MB of RDB data until the slow replica's socket send buffer fills
- rdbPipeReadHandler disables the pipe read event
- All replicas starve for 4–5 seconds while the slow replica drains its buffer
- Cycle repeats: burst → stall → burst → stall
As a result, synchronization between the master and all of its replicas becomes very slow.
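As a rough consistency check on the numbers above, the stall length can be estimated from the burst size and the slow replica's consumption rate. The sketch below assumes the stall lasts about as long as the slow replica needs to consume one burst, which is a simplification (it ignores partial drains and kernel buffering details). stall_seconds is a hypothetical helper, not part of Redis.

```c
#include <assert.h>

/* Rough estimate (assumption): seconds the slow replica needs to
 * consume one burst at its observed consumption rate. */
double stall_seconds(double burst_bytes, double drain_bytes_per_sec) {
    assert(drain_bytes_per_sec > 0.0);
    return burst_bytes / drain_bytes_per_sec;
}
```

Calling stall_seconds(1.3 * 1024 * 1024, 290.0 * 1024) gives a value between 4 and 5 seconds, which is in line with the 4–5 second stalls observed.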