Skip to content

Fix diskless replicas drop during rdb pipe test#15131

Merged
sundb merged 17 commits into
redis:unstablefrom
vitahlin:fix-rep
May 19, 2026
Merged

Fix diskless replicas drop during rdb pipe test#15131
sundb merged 17 commits into
redis:unstablefrom
vitahlin:fix-rep

Conversation

@vitahlin

@vitahlin vitahlin commented Apr 28, 2026

Copy link
Copy Markdown
Contributor

This PR is based on: valkey-io/valkey#3511
Close #14983

Summary

During diskless replication, if any single replica cannot accept a write (TCP send buffer full / EAGAIN), the master stops reading the RDB pipe entirely, stalling data delivery to all replicas — including fast ones that are ready to receive data.

The failure reason is similar to #14946, the socket buffer is more easy to fill.

Root Cause

In rdbPipeReadHandler, the master reads from the child's RDB pipe and writes to all replica sockets in a loop. When connWrite to any replica returns a partial write (socket send buffer full), the handler:

  1. Installs a per-replica rdbPipeWriteHandler and increments rdb_pipe_numconns_writing
  2. Removes the pipe read event via aeDeleteFileEvent(server.el, server.rdb_pipe_read, AE_READABLE), stopping all pipe reads

The pipe read event is only re-enabled when all pending write handlers complete (rdb_pipe_numconns_writing == 0), meaning the slowest replica dictates the throughput for all replicas.

Observed Behavior

With one slow replica (consuming at ~290 KB/s due to key-load-delay):

  • Master bursts ~1.3 MB of RDB data until the slow replica's socket send buffer fills
  • rdbPipeReadHandler disables the pipe read event
  • All replicas starve for 4–5 seconds while the slow replica drains its buffer
  • Cycle repeats: burst → stall → burst → stall

Ultimately, it leads to a very slow synchronization process of the entire master and replica.

Changes

  1. Skip the entire diskless replicas drop during rdb pipe test under Valgrind to avoid timing flakiness on slow env.

  2. Move start_server inside the foreach all_drop loop so each subcase gets a fresh master instead of sharing state across subcases.

  3. For no / slow / fast / all subcases, replica 0 runs with key-load-delay 500, which combined with the blocked-writer TCP back-pressure can stall the RDB-saving child indefinitely; shrink the dataset to ~40 MB so the transfer still exercises the blocked-writer path but completes in reasonable time instead of hanging on the TCP deadlock.
    For the timeout subcase, replica 0 does not run with key-load-delay 500, so to avoid the TCP deadlock we still reduce the dataset somewhat, but keep it larger than the other subcases. Otherwise the kernel TCP send buffer can absorb the whole RDB, and we'd miss the repl_last_partial_write != 0 "(full sync)" timeout path and only hit the "(streaming sync)" path instead.

  4. For the all subcase, set rdb-key-save-delay 1000 on the master so the RDB child keeps generating data while both replicas are killed, ensuring the last-replica-drop path is exercised rather than racing with normal completion.

  5. Move the slow-replica pause_process() so it happens only in the timeout subcase, not after killing replicas, so Redis observes the disconnect promptly in non-timeout flows.

  6. In the timeout subcase, set repl-timeout 2, wait inline for *Disconnecting timedout replica (full sync)*, then restore repl-timeout 60 so the remaining replica can finish the streamed RDB.

Co-authored-by: Sarthak Aggarwal sarthagg@amazon.com


Note

Low Risk
Low risk: changes are confined to tests/integration/replication.tcl and primarily adjust timing, payload size, and assertions to reduce flakiness on slow runners.

Overview
Stabilizes the diskless * replicas drop during rdb pipe integration test to reduce timing/log-order flakiness on slow (notably TLS) runners.

The test now varies RDB size per subcase (larger for timeout, smaller otherwise), resets repl-timeout/rdb-key-save-delay at the start of each subcase to avoid config leakage, and changes the timeout path to explicitly wait for the "(full sync)" timeout log before restoring a safe timeout. It also makes slow-reader behavior deterministic via key-load-delay (instead of oversized datasets), increases budgets for child-exit and replica-ONLINE waits (especially for the no case), fixes the CPU-usage assertion case label (none -> no), and skips this whole test block under Valgrind.

Reviewed by Cursor Bugbot for commit 55ba990. Bugbot is set up for automated code reviews on this repo. Configure here.

@augmentcode

augmentcode Bot commented Apr 28, 2026

Copy link
Copy Markdown
🤖 Augment PR Summary

Summary: This PR hardens the integration test for “diskless replicas drop during rdb pipe” by reducing timing sensitivity on slow (notably TLS) CI runners.

Changes:

  • Reduces the populated dataset size while keeping the transfer large enough to exercise the blocked-writer diskless sync path.
  • Uses key-load-delay for the “no”/“fast” subcases and limits SIGSTOP/SIGCONT usage to the subcases that actually terminate or time out a replica.
  • Handles races where the diskless pipe can complete before the killed replica is processed, and adds a connected_slaves-based fallback for the “slow” case.
  • Makes the “timeout” subcase wait for the timeout-disconnect log and restores repl-timeout, also resetting it at the start of each subcase.

Technical Notes: Extends wait budgets where needed and increases the ONLINE wait specifically for the “no” subcase to reduce flaky ordering/timing failures.

🤖 Was this summary useful? React with 👍 or 👎

@augmentcode augmentcode Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review completed. No suggestions at this time.

Comment augment review to trigger a new review at any time.

Co-authored-by: Sarthak Aggarwal <sarthagg@amazon.com>
Co-authored-by: Sarthak Aggarwal <25262500+sarthakaggarwal97@users.noreply.github.com>
Made-with: Cursor
Comment thread tests/integration/replication.tcl Outdated
if {$all_drop == "all"} {
wait_for_log_messages -2 {"*Diskless rdb transfer, last replica dropped, killing fork child*"} $loglines 1 1
if {[catch {
wait_for_log_messages -2 {"*Diskless rdb transfer, last replica dropped, killing fork child*"} $loglines 1 1

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the purpose of the test was to make sure that code path (last replica drops) is covered.
if we let the test pass even if the flow isn't reached, we're uncovered.
am i missing something?

p.s. wait_for_log_messages can wait for one of a list of messages, i.e. it takes a list, no need for the catch chain.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, the catch chain defeats the point of this subcase — sorry about that.
The fallback was masking a flake caused by the smaller RDB sometimes finishing before both kills landed. Latest commit drops the chain and sets rdb-key-save-delay 1000 on the master (only for the all subcase) to keep the RDB child producing long enough for both kills to win the race. The subcase now only passes if last replica dropped, killing fork child is actually logged.
Also noted on wait_for_log_messages taking a list — thanks!

Comment thread tests/integration/replication.tcl Outdated
Comment on lines +1026 to +1028
wait_for_log_messages -2 {"*Diskless rdb transfer, done reading from pipe, 2 replicas still up*"} $loglines 1 1
wait_for_condition 100 100 {
[s -2 connected_slaves] == 1

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't understand. how come the original code expected just one replica to be up in the "fast" case?
i'm probably forgetting something about this test.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're not missing anything — the original 1 replicas still up is correct.
I had relaxed it as a flake workaround when the smaller RDB made the kill-vs-EOF window too tight.
Latest commit reverts to the original strict assertion and bumps key-load-delay to 500 µs × 4000 keys so the killed replica's death is observed before EOF deterministically.

@vitahlin

vitahlin commented May 2, 2026

Copy link
Copy Markdown
Contributor Author

Running CI with --loops 200 --single integration/replication to ensure there are no more flakes. Latest run: https://github.com/vitahlin/redis/actions/runs/25255897551

@oranagra oranagra left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. assuming it still achieves it's purpose. (re-tested).

@@ -923,7 +928,24 @@ start_server {tags {"repl external:skip tsan:skip"} overrides {save ""}} {
set loglines [count_log_lines -2]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how much this would help, but $loglines is currently defined far from where it's first used, and it appears before [s -2 rdb_bgsave_in_progress] == 0. Maybe try moving it to after that point.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks I'll try it out locally

@vitahlin

vitahlin commented May 6, 2026

Copy link
Copy Markdown
Contributor Author

Still seeing flaky failures with --loops 200. I'll investigate further.

@vitahlin

vitahlin commented May 8, 2026

Copy link
Copy Markdown
Contributor Author
Updated the diskless RDB pipe test to make the timing deterministic while keeping the assertions strict.

In particular, the all case still requires the last replica dropped, killing fork child path, and the test no longer accepts the normal EOF path as an alternative. I also avoided SIGKILL so Valgrind runs can terminate cleanly.

branch command result
PR --loops 50 --single integration/replication https://github.com/vitahlin/redis/actions/runs/25534989931
PR --single integration/replication https://github.com/vitahlin/redis/actions/runs/25532447736
unstable --single integration/replication https://github.com/vitahlin/redis/actions/runs/25534550026
unstable --loops 50 --single integration/replication https://github.com/vitahlin/redis/actions/runs/25536526951

The daily PR result is based on commit d55872e. The two latest commits, 15da4e9 and 1595824, only remove my temporary test file and update comments.

@vitahlin

vitahlin commented May 8, 2026

Copy link
Copy Markdown
Contributor Author

We now have quite a few scenario-specific adjustments, so the test looks a bit longer. Since this is a test, I think it is acceptable to keep it this way.

Alternatively, would it be cleaner to split the different scenarios, such as no and fast, into separate helper paths/functions?

@sundb

sundb commented May 8, 2026

Copy link
Copy Markdown
Collaborator

We now have quite a few scenario-specific adjustments, so the test looks a bit longer. Since this is a test, I think it is acceptable to keep it this way.

Alternatively, would it be cleaner to split the different scenarios, such as no and fast, into separate helper paths/functions?

It's enough to keep the current.

@vitahlin vitahlin requested a review from oranagra May 8, 2026 09:39
Comment thread tests/integration/replication.tcl Outdated
Comment on lines +931 to +942
if {$all_drop == "all" || $all_drop == "slow"} {
# Give the master enough time to process replica
# disconnects before the RDB child reaches EOF,
# without increasing the SIGSTOP backlog window.
$master config set rdb-key-save-delay 2000
} elseif {$all_drop == "timeout"} {
# Keep the RDB child generating data long enough for
# replicas to time out before the pipe
# reaches EOF, so drop subcases cover their intended
# pipe paths instead of racing with normal completion.
$master config set rdb-key-save-delay 1000
}

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel this fix is incorrect. This test is verifying the behavior of the replica. We should not slow down the speed at which the master fills data into the replica.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense. Perhaps reverting to 844028f is the better approach here.

@vitahlin

Copy link
Copy Markdown
Contributor Author

While this PR reduces the failure rate compared to the unstable branch, flaky test failures may still occur occasionally.

@sundb

sundb commented May 18, 2026

Copy link
Copy Markdown
Collaborator

@oranagra I increased some thresholds and disabled valgrind from running. It seems to be more stable now. Please check if there are any missing.

@sundb

sundb commented May 19, 2026

Copy link
Copy Markdown
Collaborator

Fully CI for this PR just with replication test: https://github.com/redis/redis/actions/runs/26084314194

@sundb

sundb commented May 19, 2026

Copy link
Copy Markdown
Collaborator

@vitahlin updated the top comment, pls take a look, thx.

@sundb sundb changed the title Fix test diskless no replicas drop during rdb pipe Fix diskless replicas drop during rdb pipe test May 19, 2026
@vitahlin

Copy link
Copy Markdown
Contributor Author

@vitahlin updated the top comment, pls take a look, thx.

LGTM, thanks

@sundb sundb merged commit 3189614 into redis:unstable May 19, 2026
18 checks passed
@vitahlin vitahlin deleted the fix-rep branch May 20, 2026 14:58
sundb added a commit to sundb/redis that referenced this pull request Jun 4, 2026
This PR is based on: valkey-io/valkey#3511
Close redis#14983

## Summary

During diskless replication, if **any single replica** cannot accept a
write (TCP send buffer full / `EAGAIN`), the master stops reading the
RDB pipe entirely, stalling data delivery to **all** replicas —
including fast ones that are ready to receive data.

The failure reason is similar to
redis#14946, the socket buffer is more
easy to fill.

## Root Cause

In `rdbPipeReadHandler`, the master reads from the child's RDB pipe and
writes to all replica sockets in a loop. When `connWrite` to any replica
returns a partial write (socket send buffer full), the handler:

1. Installs a per-replica `rdbPipeWriteHandler` and increments
`rdb_pipe_numconns_writing`
2. **Removes the pipe read event** via `aeDeleteFileEvent(server.el,
server.rdb_pipe_read, AE_READABLE)`, stopping all pipe reads

The pipe read event is only re-enabled when **all** pending write
handlers complete (`rdb_pipe_numconns_writing == 0`), meaning the
**slowest replica dictates the throughput for all replicas**.

## Observed Behavior

With one slow replica (consuming at ~290 KB/s due to `key-load-delay`):

- Master bursts ~1.3 MB of RDB data until the slow replica's socket send
buffer fills
- `rdbPipeReadHandler` disables the pipe read event
- **All replicas starve for 4–5 seconds** while the slow replica drains
its buffer
- Cycle repeats: burst → stall → burst → stall

Ultimately, it leads to a very slow synchronization process of the
entire master and replica.

### Changes

1. Skip the entire `diskless replicas drop during rdb pipe` test under
Valgrind to avoid timing flakiness on slow env.

2. Move `start_server` inside the `foreach all_drop` loop so each
subcase gets a fresh master instead of sharing state across subcases.

3. For `no / slow / fast / all` subcases, replica 0 runs with
`key-load-delay 500`, which combined with the blocked-writer TCP
back-pressure can stall the RDB-saving child indefinitely; shrink the
dataset to ~40 MB so the transfer still exercises the blocked-writer
path but completes in reasonable time instead of hanging on the TCP
deadlock.
For the timeout subcase, replica 0 does not run with `key-load-delay
500`, so to avoid the TCP deadlock we still reduce the dataset somewhat,
but keep it larger than the other subcases. Otherwise the kernel TCP
send buffer can absorb the whole RDB, and we'd miss the
repl_last_partial_write != 0 "(full sync)" timeout path and only hit the
"(streaming sync)" path instead.

5. For the `all` subcase, set `rdb-key-save-delay 1000` on the master so
the RDB child keeps generating data while both replicas are killed,
ensuring the last-replica-drop path is exercised rather than racing with
normal completion.

6. Move the slow-replica `pause_process()` so it happens only in the
timeout subcase, not after killing replicas, so Redis observes the
disconnect promptly in non-timeout flows.

7. In the timeout subcase, set `repl-timeout` 2, wait inline for
`*Disconnecting timedout replica (full sync)*`, then restore
`repl-timeout` 60 so the remaining replica can finish the streamed RDB.

---------

Co-authored-by: Sarthak Aggarwal <sarthagg@amazon.com>
Co-authored-by: debing.sun <debing.sun@redis.com>
(cherry picked from commit 3189614)
sundb added a commit to sundb/redis that referenced this pull request Jun 4, 2026
This PR is based on: valkey-io/valkey#3511
Close redis#14983

During diskless replication, if **any single replica** cannot accept a
write (TCP send buffer full / `EAGAIN`), the master stops reading the
RDB pipe entirely, stalling data delivery to **all** replicas —
including fast ones that are ready to receive data.

The failure reason is similar to
redis#14946, the socket buffer is more
easy to fill.

In `rdbPipeReadHandler`, the master reads from the child's RDB pipe and
writes to all replica sockets in a loop. When `connWrite` to any replica
returns a partial write (socket send buffer full), the handler:

1. Installs a per-replica `rdbPipeWriteHandler` and increments
`rdb_pipe_numconns_writing`
2. **Removes the pipe read event** via `aeDeleteFileEvent(server.el,
server.rdb_pipe_read, AE_READABLE)`, stopping all pipe reads

The pipe read event is only re-enabled when **all** pending write
handlers complete (`rdb_pipe_numconns_writing == 0`), meaning the
**slowest replica dictates the throughput for all replicas**.

With one slow replica (consuming at ~290 KB/s due to `key-load-delay`):

- Master bursts ~1.3 MB of RDB data until the slow replica's socket send
buffer fills
- `rdbPipeReadHandler` disables the pipe read event
- **All replicas starve for 4–5 seconds** while the slow replica drains
its buffer
- Cycle repeats: burst → stall → burst → stall

Ultimately, it leads to a very slow synchronization process of the
entire master and replica.

1. Skip the entire `diskless replicas drop during rdb pipe` test under
Valgrind to avoid timing flakiness on slow env.

2. Move `start_server` inside the `foreach all_drop` loop so each
subcase gets a fresh master instead of sharing state across subcases.

3. For `no / slow / fast / all` subcases, replica 0 runs with
`key-load-delay 500`, which combined with the blocked-writer TCP
back-pressure can stall the RDB-saving child indefinitely; shrink the
dataset to ~40 MB so the transfer still exercises the blocked-writer
path but completes in reasonable time instead of hanging on the TCP
deadlock.
For the timeout subcase, replica 0 does not run with `key-load-delay
500`, so to avoid the TCP deadlock we still reduce the dataset somewhat,
but keep it larger than the other subcases. Otherwise the kernel TCP
send buffer can absorb the whole RDB, and we'd miss the
repl_last_partial_write != 0 "(full sync)" timeout path and only hit the
"(streaming sync)" path instead.

5. For the `all` subcase, set `rdb-key-save-delay 1000` on the master so
the RDB child keeps generating data while both replicas are killed,
ensuring the last-replica-drop path is exercised rather than racing with
normal completion.

6. Move the slow-replica `pause_process()` so it happens only in the
timeout subcase, not after killing replicas, so Redis observes the
disconnect promptly in non-timeout flows.

7. In the timeout subcase, set `repl-timeout` 2, wait inline for
`*Disconnecting timedout replica (full sync)*`, then restore
`repl-timeout` 60 so the remaining replica can finish the streamed RDB.

---------

Co-authored-by: Sarthak Aggarwal <sarthagg@amazon.com>
Co-authored-by: debing.sun <debing.sun@redis.com>
(cherry picked from commit 3189614)
sundb added a commit to sundb/redis that referenced this pull request Jun 4, 2026
This PR is based on: valkey-io/valkey#3511
Close redis#14983

During diskless replication, if **any single replica** cannot accept a
write (TCP send buffer full / `EAGAIN`), the master stops reading the
RDB pipe entirely, stalling data delivery to **all** replicas —
including fast ones that are ready to receive data.

The failure reason is similar to
redis#14946, the socket buffer is more
easy to fill.

In `rdbPipeReadHandler`, the master reads from the child's RDB pipe and
writes to all replica sockets in a loop. When `connWrite` to any replica
returns a partial write (socket send buffer full), the handler:

1. Installs a per-replica `rdbPipeWriteHandler` and increments
`rdb_pipe_numconns_writing`
2. **Removes the pipe read event** via `aeDeleteFileEvent(server.el,
server.rdb_pipe_read, AE_READABLE)`, stopping all pipe reads

The pipe read event is only re-enabled when **all** pending write
handlers complete (`rdb_pipe_numconns_writing == 0`), meaning the
**slowest replica dictates the throughput for all replicas**.

With one slow replica (consuming at ~290 KB/s due to `key-load-delay`):

- Master bursts ~1.3 MB of RDB data until the slow replica's socket send
buffer fills
- `rdbPipeReadHandler` disables the pipe read event
- **All replicas starve for 4–5 seconds** while the slow replica drains
its buffer
- Cycle repeats: burst → stall → burst → stall

Ultimately, it leads to a very slow synchronization process of the
entire master and replica.

1. Skip the entire `diskless replicas drop during rdb pipe` test under
Valgrind to avoid timing flakiness on slow env.

2. Move `start_server` inside the `foreach all_drop` loop so each
subcase gets a fresh master instead of sharing state across subcases.

3. For `no / slow / fast / all` subcases, replica 0 runs with
`key-load-delay 500`, which combined with the blocked-writer TCP
back-pressure can stall the RDB-saving child indefinitely; shrink the
dataset to ~40 MB so the transfer still exercises the blocked-writer
path but completes in reasonable time instead of hanging on the TCP
deadlock.
For the timeout subcase, replica 0 does not run with `key-load-delay
500`, so to avoid the TCP deadlock we still reduce the dataset somewhat,
but keep it larger than the other subcases. Otherwise the kernel TCP
send buffer can absorb the whole RDB, and we'd miss the
repl_last_partial_write != 0 "(full sync)" timeout path and only hit the
"(streaming sync)" path instead.

5. For the `all` subcase, set `rdb-key-save-delay 1000` on the master so
the RDB child keeps generating data while both replicas are killed,
ensuring the last-replica-drop path is exercised rather than racing with
normal completion.

6. Move the slow-replica `pause_process()` so it happens only in the
timeout subcase, not after killing replicas, so Redis observes the
disconnect promptly in non-timeout flows.

7. In the timeout subcase, set `repl-timeout` 2, wait inline for
`*Disconnecting timedout replica (full sync)*`, then restore
`repl-timeout` 60 so the remaining replica can finish the streamed RDB.

---------

Co-authored-by: Sarthak Aggarwal <sarthagg@amazon.com>
Co-authored-by: debing.sun <debing.sun@redis.com>
(cherry picked from commit 3189614)
sundb added a commit that referenced this pull request Jun 4, 2026
This PR is based on: valkey-io/valkey#3511
Close #14983

## Summary

During diskless replication, if **any single replica** cannot accept a
write (TCP send buffer full / `EAGAIN`), the master stops reading the
RDB pipe entirely, stalling data delivery to **all** replicas —
including fast ones that are ready to receive data.

The failure reason is similar to
#14946, the socket buffer is more
easy to fill.

## Root Cause

In `rdbPipeReadHandler`, the master reads from the child's RDB pipe and
writes to all replica sockets in a loop. When `connWrite` to any replica
returns a partial write (socket send buffer full), the handler:

1. Installs a per-replica `rdbPipeWriteHandler` and increments
`rdb_pipe_numconns_writing`
2. **Removes the pipe read event** via `aeDeleteFileEvent(server.el,
server.rdb_pipe_read, AE_READABLE)`, stopping all pipe reads

The pipe read event is only re-enabled when **all** pending write
handlers complete (`rdb_pipe_numconns_writing == 0`), meaning the
**slowest replica dictates the throughput for all replicas**.

## Observed Behavior

With one slow replica (consuming at ~290 KB/s due to `key-load-delay`):

- Master bursts ~1.3 MB of RDB data until the slow replica's socket send
buffer fills
- `rdbPipeReadHandler` disables the pipe read event
- **All replicas starve for 4–5 seconds** while the slow replica drains
its buffer
- Cycle repeats: burst → stall → burst → stall

Ultimately, it leads to a very slow synchronization process of the
entire master and replica.

### Changes

1. Skip the entire `diskless replicas drop during rdb pipe` test under
Valgrind to avoid timing flakiness on slow env.

2. Move `start_server` inside the `foreach all_drop` loop so each
subcase gets a fresh master instead of sharing state across subcases.

3. For `no / slow / fast / all` subcases, replica 0 runs with
`key-load-delay 500`, which combined with the blocked-writer TCP
back-pressure can stall the RDB-saving child indefinitely; shrink the
dataset to ~40 MB so the transfer still exercises the blocked-writer
path but completes in reasonable time instead of hanging on the TCP
deadlock.
For the timeout subcase, replica 0 does not run with `key-load-delay
500`, so to avoid the TCP deadlock we still reduce the dataset somewhat,
but keep it larger than the other subcases. Otherwise the kernel TCP
send buffer can absorb the whole RDB, and we'd miss the
repl_last_partial_write != 0 "(full sync)" timeout path and only hit the
"(streaming sync)" path instead.

5. For the `all` subcase, set `rdb-key-save-delay 1000` on the master so
the RDB child keeps generating data while both replicas are killed,
ensuring the last-replica-drop path is exercised rather than racing with
normal completion.

6. Move the slow-replica `pause_process()` so it happens only in the
timeout subcase, not after killing replicas, so Redis observes the
disconnect promptly in non-timeout flows.

7. In the timeout subcase, set `repl-timeout` 2, wait inline for
`*Disconnecting timedout replica (full sync)*`, then restore
`repl-timeout` 60 so the remaining replica can finish the streamed RDB.

---------

Co-authored-by: Sarthak Aggarwal <sarthagg@amazon.com>
Co-authored-by: debing.sun <debing.sun@redis.com>
(cherry picked from commit 3189614)
@sundb sundb moved this from Todo to Done in Redis 8.6 Backport Jun 4, 2026
sundb added a commit that referenced this pull request Jun 4, 2026
This PR is based on: valkey-io/valkey#3511
Close #14983

During diskless replication, if **any single replica** cannot accept a
write (TCP send buffer full / `EAGAIN`), the master stops reading the
RDB pipe entirely, stalling data delivery to **all** replicas —
including fast ones that are ready to receive data.

The failure reason is similar to
#14946, the socket buffer is more
easy to fill.

In `rdbPipeReadHandler`, the master reads from the child's RDB pipe and
writes to all replica sockets in a loop. When `connWrite` to any replica
returns a partial write (socket send buffer full), the handler:

1. Installs a per-replica `rdbPipeWriteHandler` and increments
`rdb_pipe_numconns_writing`
2. **Removes the pipe read event** via `aeDeleteFileEvent(server.el,
server.rdb_pipe_read, AE_READABLE)`, stopping all pipe reads

The pipe read event is only re-enabled when **all** pending write
handlers complete (`rdb_pipe_numconns_writing == 0`), meaning the
**slowest replica dictates the throughput for all replicas**.

With one slow replica (consuming at ~290 KB/s due to `key-load-delay`):

- Master bursts ~1.3 MB of RDB data until the slow replica's socket send
buffer fills
- `rdbPipeReadHandler` disables the pipe read event
- **All replicas starve for 4–5 seconds** while the slow replica drains
its buffer
- Cycle repeats: burst → stall → burst → stall

Ultimately, it leads to a very slow synchronization process of the
entire master and replica.

1. Skip the entire `diskless replicas drop during rdb pipe` test under
Valgrind to avoid timing flakiness on slow env.

2. Move `start_server` inside the `foreach all_drop` loop so each
subcase gets a fresh master instead of sharing state across subcases.

3. For `no / slow / fast / all` subcases, replica 0 runs with
`key-load-delay 500`, which combined with the blocked-writer TCP
back-pressure can stall the RDB-saving child indefinitely; shrink the
dataset to ~40 MB so the transfer still exercises the blocked-writer
path but completes in reasonable time instead of hanging on the TCP
deadlock.
For the timeout subcase, replica 0 does not run with `key-load-delay
500`, so to avoid the TCP deadlock we still reduce the dataset somewhat,
but keep it larger than the other subcases. Otherwise the kernel TCP
send buffer can absorb the whole RDB, and we'd miss the
repl_last_partial_write != 0 "(full sync)" timeout path and only hit the
"(streaming sync)" path instead.

5. For the `all` subcase, set `rdb-key-save-delay 1000` on the master so
the RDB child keeps generating data while both replicas are killed,
ensuring the last-replica-drop path is exercised rather than racing with
normal completion.

6. Move the slow-replica `pause_process()` so it happens only in the
timeout subcase, not after killing replicas, so Redis observes the
disconnect promptly in non-timeout flows.

7. In the timeout subcase, set `repl-timeout` 2, wait inline for
`*Disconnecting timedout replica (full sync)*`, then restore
`repl-timeout` 60 so the remaining replica can finish the streamed RDB.

---------

Co-authored-by: Sarthak Aggarwal <sarthagg@amazon.com>
Co-authored-by: debing.sun <debing.sun@redis.com>
(cherry picked from commit 3189614)
@sundb sundb moved this from Todo to Done in Redis 8.4 Backport Jun 4, 2026
sundb added a commit that referenced this pull request Jun 4, 2026
This PR is based on: valkey-io/valkey#3511
Close #14983

During diskless replication, if **any single replica** cannot accept a
write (TCP send buffer full / `EAGAIN`), the master stops reading the
RDB pipe entirely, stalling data delivery to **all** replicas —
including fast ones that are ready to receive data.

The failure reason is similar to
#14946, the socket buffer is more
easy to fill.

In `rdbPipeReadHandler`, the master reads from the child's RDB pipe and
writes to all replica sockets in a loop. When `connWrite` to any replica
returns a partial write (socket send buffer full), the handler:

1. Installs a per-replica `rdbPipeWriteHandler` and increments
`rdb_pipe_numconns_writing`
2. **Removes the pipe read event** via `aeDeleteFileEvent(server.el,
server.rdb_pipe_read, AE_READABLE)`, stopping all pipe reads

The pipe read event is only re-enabled when **all** pending write
handlers complete (`rdb_pipe_numconns_writing == 0`), meaning the
**slowest replica dictates the throughput for all replicas**.

With one slow replica (consuming at ~290 KB/s due to `key-load-delay`):

- Master bursts ~1.3 MB of RDB data until the slow replica's socket send
buffer fills
- `rdbPipeReadHandler` disables the pipe read event
- **All replicas starve for 4–5 seconds** while the slow replica drains
its buffer
- Cycle repeats: burst → stall → burst → stall

Ultimately, it leads to a very slow synchronization process of the
entire master and replica.

1. Skip the entire `diskless replicas drop during rdb pipe` test under
Valgrind to avoid timing flakiness on slow env.

2. Move `start_server` inside the `foreach all_drop` loop so each
subcase gets a fresh master instead of sharing state across subcases.

3. For `no / slow / fast / all` subcases, replica 0 runs with
`key-load-delay 500`, which combined with the blocked-writer TCP
back-pressure can stall the RDB-saving child indefinitely; shrink the
dataset to ~40 MB so the transfer still exercises the blocked-writer
path but completes in reasonable time instead of hanging on the TCP
deadlock.
For the timeout subcase, replica 0 does not run with `key-load-delay
500`, so to avoid the TCP deadlock we still reduce the dataset somewhat,
but keep it larger than the other subcases. Otherwise the kernel TCP
send buffer can absorb the whole RDB, and we'd miss the
repl_last_partial_write != 0 "(full sync)" timeout path and only hit the
"(streaming sync)" path instead.

5. For the `all` subcase, set `rdb-key-save-delay 1000` on the master so
the RDB child keeps generating data while both replicas are killed,
ensuring the last-replica-drop path is exercised rather than racing with
normal completion.

6. Move the slow-replica `pause_process()` so it happens only in the
timeout subcase, not after killing replicas, so Redis observes the
disconnect promptly in non-timeout flows.

7. In the timeout subcase, set `repl-timeout` 2, wait inline for
`*Disconnecting timedout replica (full sync)*`, then restore
`repl-timeout` 60 so the remaining replica can finish the streamed RDB.

---------

Co-authored-by: Sarthak Aggarwal <sarthagg@amazon.com>
Co-authored-by: debing.sun <debing.sun@redis.com>
(cherry picked from commit 3189614)
@sundb sundb moved this from Todo to Done in Redis 8.2 Backport Jun 4, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Development

Successfully merging this pull request may close these issues.

[TEST] diskless $all_drop replicas drop during rdb pipe test failed

4 participants