[TEST] diskless $all_drop replicas drop during rdb pipe test failed #14983

Description

@sundb

Failed CI report

*** [err]: diskless slow replicas drop during rdb pipe in tests/integration/replication.tcl
rdb child didn't terminate

Summary

During diskless replication, if any single replica cannot accept a write (TCP send buffer full / EAGAIN), the master stops reading the RDB pipe entirely, stalling data delivery to all replicas — including fast ones that are ready to receive data.

The cause of the failure is similar to #14946: the socket send buffer fills up more easily.

Root Cause

In rdbPipeReadHandler, the master reads from the child's RDB pipe and writes to all replica sockets in a loop. When connWrite to any replica returns a partial write (socket send buffer full), the handler:

  1. Installs a per-replica rdbPipeWriteHandler and increments rdb_pipe_numconns_writing
  2. Removes the pipe read event via aeDeleteFileEvent(server.el, server.rdb_pipe_read, AE_READABLE), stopping all pipe reads

The pipe read event is only re-enabled when all pending write handlers complete (rdb_pipe_numconns_writing == 0), meaning the slowest replica dictates the throughput for all replicas.
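To illustrate the gating described above, here is a toy model in Python. It is not Redis code: the `Replica` class, the tick-based loop, and all sizes and rates are hypothetical, chosen only to show how a single pending writer blocks pipe reads for everyone. The only thing taken from the issue is the rule that the pipe is read again only when the count of pending writers (named `numconns_writing` here, after rdb_pipe_numconns_writing) is zero.

```python
from dataclasses import dataclass

KIB = 1024

@dataclass
class Replica:
    drain_rate: int    # bytes the replica consumes per tick (hypothetical)
    sndbuf_cap: int    # socket send buffer capacity in bytes (hypothetical)
    buffered: int = 0  # bytes currently held in the send buffer
    pending: int = 0   # bytes of the current chunk the socket has not accepted
    received: int = 0  # bytes the replica has consumed so far

def simulate(replicas, total_bytes, chunk=16 * KIB, max_ticks=100_000):
    """Return, per replica, the tick at which it has consumed total_bytes."""
    finish = [None] * len(replicas)
    read_from_pipe = 0
    ticks = 0
    while any(f is None for f in finish) and ticks < max_ticks:
        ticks += 1
        # Each replica drains its own socket buffer at its own rate.
        for i, r in enumerate(replicas):
            n = min(r.drain_rate, r.buffered)
            r.buffered -= n
            r.received += n
            if finish[i] is None and r.received >= total_bytes:
                finish[i] = ticks
        # Per-replica "write handlers": retry the unsent part of the chunk.
        for r in replicas:
            n = min(r.pending, r.sndbuf_cap - r.buffered)
            r.buffered += n
            r.pending -= n
        # The pipe is read only while no replica has a pending write.
        numconns_writing = sum(1 for r in replicas if r.pending)
        while numconns_writing == 0 and read_from_pipe < total_bytes:
            size = min(chunk, total_bytes - read_from_pipe)
            read_from_pipe += size
            for r in replicas:
                n = min(size, r.sndbuf_cap - r.buffered)  # partial write
                r.buffered += n
                r.pending = size - n
            numconns_writing = sum(1 for r in replicas if r.pending)
    return finish

def fast():
    return Replica(drain_rate=256 * KIB, sndbuf_cap=256 * KIB)

def slow():
    return Replica(drain_rate=16 * KIB, sndbuf_cap=256 * KIB)

TOTAL = 1024 * KIB
alone = simulate([fast()], TOTAL)          # fast replica on its own
mixed = simulate([fast(), slow()], TOTAL)  # fast replica sharing the pipe
print("fast replica alone finishes at tick", alone[0])
print("fast replica with a slow peer finishes at tick", mixed[0])
```

In this model the fast replica finishes much later once a slow peer is attached, even though its own socket is almost always writable, because every chunk has to wait for the slow replica to accept the previous one.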

Observed Behavior

With one slow replica (consuming at ~290 KB/s due to key-load-delay):

  • Master bursts ~1.3 MB of RDB data until the slow replica's socket send buffer fills
  • rdbPipeReadHandler disables the pipe read event
  • All replicas starve for 4–5 seconds while the slow replica drains its buffer
  • Cycle repeats: burst → stall → burst → stall

Ultimately, this makes synchronization between the master and all of its replicas very slow.
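As a rough sanity check on the numbers above, the stall length is about what you would expect if the slow replica has to consume one full burst before the master reads the pipe again. This is a back-of-the-envelope estimate under that assumption, not a measurement; the decimal and binary unit interpretations are both tried because the report does not say which one applies.

```python
# Figures from the report: ~1.3 MB burst, slow replica consuming ~290 KB/s.
BURST_MB = 1.3
SLOW_RATE_KB_PER_S = 290

# Assumption: the stall lasts roughly as long as the slow replica needs
# to consume one burst. Try decimal (1 MB = 1000 KB) and binary (1024 KB).
stall_decimal = BURST_MB * 1000 / SLOW_RATE_KB_PER_S
stall_binary = BURST_MB * 1024 / SLOW_RATE_KB_PER_S

print(f"estimated stall: {stall_decimal:.1f}s to {stall_binary:.1f}s")
```

Both estimates come out at about 4.5 seconds, which is consistent with the observed 4–5 second stalls.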
