Skip to content

Stabilize diskless no-drop replication test#3511

Merged
zuiderkwast merged 7 commits into
valkey-io:unstablefrom
sarthakaggarwal97:daily-repl-rdb-child-timeout-20260414
Apr 21, 2026
Merged

Stabilize diskless no-drop replication test#3511
zuiderkwast merged 7 commits into
valkey-io:unstablefrom
sarthakaggarwal97:daily-repl-rdb-child-timeout-20260414

Conversation

@sarthakaggarwal97

@sarthakaggarwal97 sarthakaggarwal97 commented Apr 15, 2026

Copy link
Copy Markdown
Contributor

This deflakes all variants of diskless replicas drop during rdb pipe.

The main issue turned out to be that the test was too sensitive to timing and log ordering under TLS, not that the core behavior was wrong. This keeps the same five subcases (no, slow, fast, all, timeout) but makes them much less CI-fragile.

CI passes 200 times: https://github.com/sarthakaggarwal97/valkey/actions/runs/24547258515

@codecov

codecov Bot commented Apr 15, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.38%. Comparing base (6444717) to head (fe1d7e6).
⚠️ Report is 2 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #3511      +/-   ##
============================================
- Coverage     76.58%   76.38%   -0.20%     
============================================
  Files           159      159              
  Lines         80019    80019              
============================================
- Hits          61283    61125     -158     
- Misses        18736    18894     +158     

see 28 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@Nikhil-Manglore

Nikhil-Manglore commented Apr 15, 2026

Copy link
Copy Markdown
Member

Did this test fail recently, I know viktor attempted to fix it in this PR: #3461

Edit: I just saw the comments on the PR, looks like it did fail again recently.

@zuiderkwast zuiderkwast left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good. Only the "no replicas drop" case is covered, so the other cases can still have timing issues, such as "fast" and "slow" cases? I see there are some special cases for the other cases, for example for "timeout" there is pause_process as well. Do you have a full picture?

Comment thread tests/integration/replication.tcl Outdated
Comment thread tests/integration/replication.tcl Outdated
Comment thread tests/integration/replication.tcl Outdated
@sarthakaggarwal97 sarthakaggarwal97 force-pushed the daily-repl-rdb-child-timeout-20260414 branch from 14afd0b to 5a24b23 Compare April 16, 2026 16:45
@sarthakaggarwal97

sarthakaggarwal97 commented Apr 16, 2026

Copy link
Copy Markdown
Contributor Author

@zuiderkwast 100 runs for all the variants passed: https://github.com/sarthakaggarwal97/valkey/actions/runs/24521848934

@sarthakaggarwal97

sarthakaggarwal97 commented Apr 17, 2026

Copy link
Copy Markdown
Contributor Author

@zuiderkwast I think it still deflakes the tests a lot, but I am afraid out of 500, I still see 1-5 flaky runs.

@sarthakaggarwal97 sarthakaggarwal97 force-pushed the daily-repl-rdb-child-timeout-20260414 branch from 56ae8dc to 1e467b0 Compare April 17, 2026 04:04
@sarthakaggarwal97

Copy link
Copy Markdown
Contributor Author

This version is quite stable. Not seeing failures anymore.

@madolson madolson added the run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP) label Apr 17, 2026

@zuiderkwast zuiderkwast left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good! Sorry for the delay.

Comment thread tests/integration/replication.tcl
sarthakaggarwal97 and others added 7 commits April 21, 2026 09:54
Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
Signed-off-by: Sarthak Aggarwal <25262500+sarthakaggarwal97@users.noreply.github.com>
Signed-off-by: Sarthak Aggarwal <25262500+sarthakaggarwal97@users.noreply.github.com>
The 'slow' subcase fallback was searching for a log message matching
'*Connection with replica client id * lost.*' but the actual server log
format is 'Connection with replica <host>:<port> lost.' — there is no
'client id' in the message. Fix the glob to '*Connection with replica * lost.*'.

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
On fast ARM CI runners the RDB transfer completes before both replicas
are killed, so the primary logs '2 replicas still up' instead of
'last replica dropped' or '1 replicas still up'. Add a nested catch
fallback to accept all three possible outcomes.

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
@sarthakaggarwal97 sarthakaggarwal97 force-pushed the daily-repl-rdb-child-timeout-20260414 branch from 9a86d16 to fe1d7e6 Compare April 21, 2026 16:58

@Nikhil-Manglore Nikhil-Manglore left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@sarthakaggarwal97

sarthakaggarwal97 commented Apr 21, 2026

Copy link
Copy Markdown
Contributor Author

The test failures are unrelated to this change. Related to #2936

@zuiderkwast zuiderkwast merged commit 03c2d4c into valkey-io:unstable Apr 21, 2026
63 of 66 checks passed
sarthakaggarwal97 added a commit to sarthakaggarwal97/valkey that referenced this pull request Apr 23, 2026
This deflakes all variants of `diskless replicas drop during rdb pipe`.

The main issue turned out to be that the test was too sensitive to
timing and log ordering under TLS, not that the core behavior was wrong.
This keeps the same five subcases (no, slow, fast, all, timeout) but
makes them much less CI-fragile.

CI passes 200 times:
https://github.com/sarthakaggarwal97/valkey/actions/runs/24547258515

---------

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
Signed-off-by: Sarthak Aggarwal <25262500+sarthakaggarwal97@users.noreply.github.com>
Co-authored-by: Sarthak Aggarwal <25262500+sarthakaggarwal97@users.noreply.github.com>
sarthakaggarwal97 added a commit to sarthakaggarwal97/valkey that referenced this pull request Apr 27, 2026
This deflakes all variants of `diskless replicas drop during rdb pipe`.

The main issue turned out to be that the test was too sensitive to
timing and log ordering under TLS, not that the core behavior was wrong.
This keeps the same five subcases (no, slow, fast, all, timeout) but
makes them much less CI-fragile.

CI passes 200 times:
https://github.com/sarthakaggarwal97/valkey/actions/runs/24547258515

---------

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
Signed-off-by: Sarthak Aggarwal <25262500+sarthakaggarwal97@users.noreply.github.com>
Co-authored-by: Sarthak Aggarwal <25262500+sarthakaggarwal97@users.noreply.github.com>
(cherry picked from commit 2b8df83)
sarthakaggarwal97 added a commit to sarthakaggarwal97/valkey that referenced this pull request May 3, 2026
This deflakes all variants of `diskless replicas drop during rdb pipe`.

The main issue turned out to be that the test was too sensitive to
timing and log ordering under TLS, not that the core behavior was wrong.
This keeps the same five subcases (no, slow, fast, all, timeout) but
makes them much less CI-fragile.

CI passes 200 times:
https://github.com/sarthakaggarwal97/valkey/actions/runs/24547258515

---------

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
Signed-off-by: Sarthak Aggarwal <25262500+sarthakaggarwal97@users.noreply.github.com>
Co-authored-by: Sarthak Aggarwal <25262500+sarthakaggarwal97@users.noreply.github.com>
(cherry picked from commit 2b8df83)
(cherry picked from commit 59f5375)
madolson pushed a commit that referenced this pull request May 6, 2026
This deflakes all variants of `diskless replicas drop during rdb pipe`.

The main issue turned out to be that the test was too sensitive to
timing and log ordering under TLS, not that the core behavior was wrong.
This keeps the same five subcases (no, slow, fast, all, timeout) but
makes them much less CI-fragile.

CI passes 200 times:
https://github.com/sarthakaggarwal97/valkey/actions/runs/24547258515

---------

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
Signed-off-by: Sarthak Aggarwal <25262500+sarthakaggarwal97@users.noreply.github.com>
Co-authored-by: Sarthak Aggarwal <25262500+sarthakaggarwal97@users.noreply.github.com>
(cherry picked from commit 2b8df83)
(cherry picked from commit b051910)
madolson pushed a commit that referenced this pull request May 6, 2026
This deflakes all variants of `diskless replicas drop during rdb pipe`.

The main issue turned out to be that the test was too sensitive to
timing and log ordering under TLS, not that the core behavior was wrong.
This keeps the same five subcases (no, slow, fast, all, timeout) but
makes them much less CI-fragile.

CI passes 200 times:
https://github.com/sarthakaggarwal97/valkey/actions/runs/24547258515

---------

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
Signed-off-by: Sarthak Aggarwal <25262500+sarthakaggarwal97@users.noreply.github.com>
Co-authored-by: Sarthak Aggarwal <25262500+sarthakaggarwal97@users.noreply.github.com>
(cherry picked from commit 2b8df83)
(cherry picked from commit 59f5375)
sundb added a commit to redis/redis that referenced this pull request May 19, 2026
This PR is based on: valkey-io/valkey#3511
Close #14983

## Summary

During diskless replication, if **any single replica** cannot accept a
write (TCP send buffer full / `EAGAIN`), the master stops reading the
RDB pipe entirely, stalling data delivery to **all** replicas —
including fast ones that are ready to receive data.

The failure reason is similar to
#14946, the socket buffer is more
easy to fill.

## Root Cause

In `rdbPipeReadHandler`, the master reads from the child's RDB pipe and
writes to all replica sockets in a loop. When `connWrite` to any replica
returns a partial write (socket send buffer full), the handler:

1. Installs a per-replica `rdbPipeWriteHandler` and increments
`rdb_pipe_numconns_writing`
2. **Removes the pipe read event** via `aeDeleteFileEvent(server.el,
server.rdb_pipe_read, AE_READABLE)`, stopping all pipe reads

The pipe read event is only re-enabled when **all** pending write
handlers complete (`rdb_pipe_numconns_writing == 0`), meaning the
**slowest replica dictates the throughput for all replicas**.

## Observed Behavior

With one slow replica (consuming at ~290 KB/s due to `key-load-delay`):

- Master bursts ~1.3 MB of RDB data until the slow replica's socket send
buffer fills
- `rdbPipeReadHandler` disables the pipe read event
- **All replicas starve for 4–5 seconds** while the slow replica drains
its buffer
- Cycle repeats: burst → stall → burst → stall

Ultimately, it leads to a very slow synchronization process of the
entire master and replica.

### Changes

1. Skip the entire `diskless replicas drop during rdb pipe` test under
Valgrind to avoid timing flakiness on slow env.

2. Move `start_server` inside the `foreach all_drop` loop so each
subcase gets a fresh master instead of sharing state across subcases.

3. For `no / slow / fast / all` subcases, replica 0 runs with
`key-load-delay 500`, which combined with the blocked-writer TCP
back-pressure can stall the RDB-saving child indefinitely; shrink the
dataset to ~40 MB so the transfer still exercises the blocked-writer
path but completes in reasonable time instead of hanging on the TCP
deadlock.
For the timeout subcase, replica 0 does not run with `key-load-delay
500`, so to avoid the TCP deadlock we still reduce the dataset somewhat,
but keep it larger than the other subcases. Otherwise the kernel TCP
send buffer can absorb the whole RDB, and we'd miss the
repl_last_partial_write != 0 "(full sync)" timeout path and only hit the
"(streaming sync)" path instead.

5. For the `all` subcase, set `rdb-key-save-delay 1000` on the master so
the RDB child keeps generating data while both replicas are killed,
ensuring the last-replica-drop path is exercised rather than racing with
normal completion.

6. Move the slow-replica `pause_process()` so it happens only in the
timeout subcase, not after killing replicas, so Redis observes the
disconnect promptly in non-timeout flows.

7. In the timeout subcase, set `repl-timeout` 2, wait inline for
`*Disconnecting timedout replica (full sync)*`, then restore
`repl-timeout` 60 so the remaining replica can finish the streamed RDB.

---------

Co-authored-by: Sarthak Aggarwal <sarthagg@amazon.com>  
Co-authored-by: debing.sun <debing.sun@redis.com>
zuiderkwast pushed a commit that referenced this pull request Jun 2, 2026
This deflakes all variants of `diskless replicas drop during rdb pipe`.

The main issue turned out to be that the test was too sensitive to
timing and log ordering under TLS, not that the core behavior was wrong.
This keeps the same five subcases (no, slow, fast, all, timeout) but
makes them much less CI-fragile.

CI passes 200 times:
https://github.com/sarthakaggarwal97/valkey/actions/runs/24547258515

---------

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
Signed-off-by: Sarthak Aggarwal <25262500+sarthakaggarwal97@users.noreply.github.com>
Co-authored-by: Sarthak Aggarwal <25262500+sarthakaggarwal97@users.noreply.github.com>
@github-project-automation github-project-automation Bot moved this to To be backported in Valkey 8.1 Jun 2, 2026
@zuiderkwast zuiderkwast moved this from To be backported to 8.1.8 (WIP) in Valkey 8.1 Jun 2, 2026
@github-project-automation github-project-automation Bot moved this to To be backported in Valkey 9.0 Jun 2, 2026
@github-project-automation github-project-automation Bot moved this to To be backported in Valkey 8.0 Jun 2, 2026
@github-project-automation github-project-automation Bot moved this to To be backported in Valkey 7.2 Jun 2, 2026
@zuiderkwast zuiderkwast moved this from To be backported to Done in Valkey 7.2 Jun 2, 2026
@zuiderkwast zuiderkwast moved this from To be backported to Done in Valkey 8.0 Jun 2, 2026
zuiderkwast pushed a commit that referenced this pull request Jun 2, 2026
This deflakes all variants of `diskless replicas drop during rdb pipe`.

The main issue turned out to be that the test was too sensitive to
timing and log ordering under TLS, not that the core behavior was wrong.
This keeps the same five subcases (no, slow, fast, all, timeout) but
makes them much less CI-fragile.

CI passes 200 times:
https://github.com/sarthakaggarwal97/valkey/actions/runs/24547258515

---------

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
Signed-off-by: Sarthak Aggarwal <25262500+sarthakaggarwal97@users.noreply.github.com>
Co-authored-by: Sarthak Aggarwal <25262500+sarthakaggarwal97@users.noreply.github.com>
sundb added a commit to sundb/redis that referenced this pull request Jun 4, 2026
This PR is based on: valkey-io/valkey#3511
Close redis#14983

## Summary

During diskless replication, if **any single replica** cannot accept a
write (TCP send buffer full / `EAGAIN`), the master stops reading the
RDB pipe entirely, stalling data delivery to **all** replicas —
including fast ones that are ready to receive data.

The failure reason is similar to
redis#14946, the socket buffer is more
easy to fill.

## Root Cause

In `rdbPipeReadHandler`, the master reads from the child's RDB pipe and
writes to all replica sockets in a loop. When `connWrite` to any replica
returns a partial write (socket send buffer full), the handler:

1. Installs a per-replica `rdbPipeWriteHandler` and increments
`rdb_pipe_numconns_writing`
2. **Removes the pipe read event** via `aeDeleteFileEvent(server.el,
server.rdb_pipe_read, AE_READABLE)`, stopping all pipe reads

The pipe read event is only re-enabled when **all** pending write
handlers complete (`rdb_pipe_numconns_writing == 0`), meaning the
**slowest replica dictates the throughput for all replicas**.

## Observed Behavior

With one slow replica (consuming at ~290 KB/s due to `key-load-delay`):

- Master bursts ~1.3 MB of RDB data until the slow replica's socket send
buffer fills
- `rdbPipeReadHandler` disables the pipe read event
- **All replicas starve for 4–5 seconds** while the slow replica drains
its buffer
- Cycle repeats: burst → stall → burst → stall

Ultimately, it leads to a very slow synchronization process of the
entire master and replica.

### Changes

1. Skip the entire `diskless replicas drop during rdb pipe` test under
Valgrind to avoid timing flakiness on slow env.

2. Move `start_server` inside the `foreach all_drop` loop so each
subcase gets a fresh master instead of sharing state across subcases.

3. For `no / slow / fast / all` subcases, replica 0 runs with
`key-load-delay 500`, which combined with the blocked-writer TCP
back-pressure can stall the RDB-saving child indefinitely; shrink the
dataset to ~40 MB so the transfer still exercises the blocked-writer
path but completes in reasonable time instead of hanging on the TCP
deadlock.
For the timeout subcase, replica 0 does not run with `key-load-delay
500`, so to avoid the TCP deadlock we still reduce the dataset somewhat,
but keep it larger than the other subcases. Otherwise the kernel TCP
send buffer can absorb the whole RDB, and we'd miss the
repl_last_partial_write != 0 "(full sync)" timeout path and only hit the
"(streaming sync)" path instead.

5. For the `all` subcase, set `rdb-key-save-delay 1000` on the master so
the RDB child keeps generating data while both replicas are killed,
ensuring the last-replica-drop path is exercised rather than racing with
normal completion.

6. Move the slow-replica `pause_process()` so it happens only in the
timeout subcase, not after killing replicas, so Redis observes the
disconnect promptly in non-timeout flows.

7. In the timeout subcase, set `repl-timeout` 2, wait inline for
`*Disconnecting timedout replica (full sync)*`, then restore
`repl-timeout` 60 so the remaining replica can finish the streamed RDB.

---------

Co-authored-by: Sarthak Aggarwal <sarthagg@amazon.com>
Co-authored-by: debing.sun <debing.sun@redis.com>
(cherry picked from commit 3189614)
sundb added a commit to sundb/redis that referenced this pull request Jun 4, 2026
This PR is based on: valkey-io/valkey#3511
Close redis#14983

During diskless replication, if **any single replica** cannot accept a
write (TCP send buffer full / `EAGAIN`), the master stops reading the
RDB pipe entirely, stalling data delivery to **all** replicas —
including fast ones that are ready to receive data.

The failure reason is similar to
redis#14946, the socket buffer is more
easy to fill.

In `rdbPipeReadHandler`, the master reads from the child's RDB pipe and
writes to all replica sockets in a loop. When `connWrite` to any replica
returns a partial write (socket send buffer full), the handler:

1. Installs a per-replica `rdbPipeWriteHandler` and increments
`rdb_pipe_numconns_writing`
2. **Removes the pipe read event** via `aeDeleteFileEvent(server.el,
server.rdb_pipe_read, AE_READABLE)`, stopping all pipe reads

The pipe read event is only re-enabled when **all** pending write
handlers complete (`rdb_pipe_numconns_writing == 0`), meaning the
**slowest replica dictates the throughput for all replicas**.

With one slow replica (consuming at ~290 KB/s due to `key-load-delay`):

- Master bursts ~1.3 MB of RDB data until the slow replica's socket send
buffer fills
- `rdbPipeReadHandler` disables the pipe read event
- **All replicas starve for 4–5 seconds** while the slow replica drains
its buffer
- Cycle repeats: burst → stall → burst → stall

Ultimately, it leads to a very slow synchronization process of the
entire master and replica.

1. Skip the entire `diskless replicas drop during rdb pipe` test under
Valgrind to avoid timing flakiness on slow env.

2. Move `start_server` inside the `foreach all_drop` loop so each
subcase gets a fresh master instead of sharing state across subcases.

3. For `no / slow / fast / all` subcases, replica 0 runs with
`key-load-delay 500`, which combined with the blocked-writer TCP
back-pressure can stall the RDB-saving child indefinitely; shrink the
dataset to ~40 MB so the transfer still exercises the blocked-writer
path but completes in reasonable time instead of hanging on the TCP
deadlock.
For the timeout subcase, replica 0 does not run with `key-load-delay
500`, so to avoid the TCP deadlock we still reduce the dataset somewhat,
but keep it larger than the other subcases. Otherwise the kernel TCP
send buffer can absorb the whole RDB, and we'd miss the
repl_last_partial_write != 0 "(full sync)" timeout path and only hit the
"(streaming sync)" path instead.

5. For the `all` subcase, set `rdb-key-save-delay 1000` on the master so
the RDB child keeps generating data while both replicas are killed,
ensuring the last-replica-drop path is exercised rather than racing with
normal completion.

6. Move the slow-replica `pause_process()` so it happens only in the
timeout subcase, not after killing replicas, so Redis observes the
disconnect promptly in non-timeout flows.

7. In the timeout subcase, set `repl-timeout` 2, wait inline for
`*Disconnecting timedout replica (full sync)*`, then restore
`repl-timeout` 60 so the remaining replica can finish the streamed RDB.

---------

Co-authored-by: Sarthak Aggarwal <sarthagg@amazon.com>
Co-authored-by: debing.sun <debing.sun@redis.com>
(cherry picked from commit 3189614)
sundb added a commit to sundb/redis that referenced this pull request Jun 4, 2026
This PR is based on: valkey-io/valkey#3511
Close redis#14983

During diskless replication, if **any single replica** cannot accept a
write (TCP send buffer full / `EAGAIN`), the master stops reading the
RDB pipe entirely, stalling data delivery to **all** replicas —
including fast ones that are ready to receive data.

The failure reason is similar to
redis#14946, the socket buffer is more
easy to fill.

In `rdbPipeReadHandler`, the master reads from the child's RDB pipe and
writes to all replica sockets in a loop. When `connWrite` to any replica
returns a partial write (socket send buffer full), the handler:

1. Installs a per-replica `rdbPipeWriteHandler` and increments
`rdb_pipe_numconns_writing`
2. **Removes the pipe read event** via `aeDeleteFileEvent(server.el,
server.rdb_pipe_read, AE_READABLE)`, stopping all pipe reads

The pipe read event is only re-enabled when **all** pending write
handlers complete (`rdb_pipe_numconns_writing == 0`), meaning the
**slowest replica dictates the throughput for all replicas**.

With one slow replica (consuming at ~290 KB/s due to `key-load-delay`):

- Master bursts ~1.3 MB of RDB data until the slow replica's socket send
buffer fills
- `rdbPipeReadHandler` disables the pipe read event
- **All replicas starve for 4–5 seconds** while the slow replica drains
its buffer
- Cycle repeats: burst → stall → burst → stall

Ultimately, it leads to a very slow synchronization process of the
entire master and replica.

1. Skip the entire `diskless replicas drop during rdb pipe` test under
Valgrind to avoid timing flakiness on slow env.

2. Move `start_server` inside the `foreach all_drop` loop so each
subcase gets a fresh master instead of sharing state across subcases.

3. For `no / slow / fast / all` subcases, replica 0 runs with
`key-load-delay 500`, which combined with the blocked-writer TCP
back-pressure can stall the RDB-saving child indefinitely; shrink the
dataset to ~40 MB so the transfer still exercises the blocked-writer
path but completes in reasonable time instead of hanging on the TCP
deadlock.
For the timeout subcase, replica 0 does not run with `key-load-delay
500`, so to avoid the TCP deadlock we still reduce the dataset somewhat,
but keep it larger than the other subcases. Otherwise the kernel TCP
send buffer can absorb the whole RDB, and we'd miss the
repl_last_partial_write != 0 "(full sync)" timeout path and only hit the
"(streaming sync)" path instead.

5. For the `all` subcase, set `rdb-key-save-delay 1000` on the master so
the RDB child keeps generating data while both replicas are killed,
ensuring the last-replica-drop path is exercised rather than racing with
normal completion.

6. Move the slow-replica `pause_process()` so it happens only in the
timeout subcase, not after killing replicas, so Redis observes the
disconnect promptly in non-timeout flows.

7. In the timeout subcase, set `repl-timeout` 2, wait inline for
`*Disconnecting timedout replica (full sync)*`, then restore
`repl-timeout` 60 so the remaining replica can finish the streamed RDB.

---------

Co-authored-by: Sarthak Aggarwal <sarthagg@amazon.com>
Co-authored-by: debing.sun <debing.sun@redis.com>
(cherry picked from commit 3189614)
sundb added a commit to redis/redis that referenced this pull request Jun 4, 2026
This PR is based on: valkey-io/valkey#3511
Close #14983

## Summary

During diskless replication, if **any single replica** cannot accept a
write (TCP send buffer full / `EAGAIN`), the master stops reading the
RDB pipe entirely, stalling data delivery to **all** replicas —
including fast ones that are ready to receive data.

The failure reason is similar to
#14946, the socket buffer is more
easy to fill.

## Root Cause

In `rdbPipeReadHandler`, the master reads from the child's RDB pipe and
writes to all replica sockets in a loop. When `connWrite` to any replica
returns a partial write (socket send buffer full), the handler:

1. Installs a per-replica `rdbPipeWriteHandler` and increments
`rdb_pipe_numconns_writing`
2. **Removes the pipe read event** via `aeDeleteFileEvent(server.el,
server.rdb_pipe_read, AE_READABLE)`, stopping all pipe reads

The pipe read event is only re-enabled when **all** pending write
handlers complete (`rdb_pipe_numconns_writing == 0`), meaning the
**slowest replica dictates the throughput for all replicas**.

## Observed Behavior

With one slow replica (consuming at ~290 KB/s due to `key-load-delay`):

- Master bursts ~1.3 MB of RDB data until the slow replica's socket send
buffer fills
- `rdbPipeReadHandler` disables the pipe read event
- **All replicas starve for 4–5 seconds** while the slow replica drains
its buffer
- Cycle repeats: burst → stall → burst → stall

Ultimately, it leads to a very slow synchronization process of the
entire master and replica.

### Changes

1. Skip the entire `diskless replicas drop during rdb pipe` test under
Valgrind to avoid timing flakiness on slow env.

2. Move `start_server` inside the `foreach all_drop` loop so each
subcase gets a fresh master instead of sharing state across subcases.

3. For `no / slow / fast / all` subcases, replica 0 runs with
`key-load-delay 500`, which combined with the blocked-writer TCP
back-pressure can stall the RDB-saving child indefinitely; shrink the
dataset to ~40 MB so the transfer still exercises the blocked-writer
path but completes in reasonable time instead of hanging on the TCP
deadlock.
For the timeout subcase, replica 0 does not run with `key-load-delay
500`, so to avoid the TCP deadlock we still reduce the dataset somewhat,
but keep it larger than the other subcases. Otherwise the kernel TCP
send buffer can absorb the whole RDB, and we'd miss the
repl_last_partial_write != 0 "(full sync)" timeout path and only hit the
"(streaming sync)" path instead.

5. For the `all` subcase, set `rdb-key-save-delay 1000` on the master so
the RDB child keeps generating data while both replicas are killed,
ensuring the last-replica-drop path is exercised rather than racing with
normal completion.

6. Move the slow-replica `pause_process()` so it happens only in the
timeout subcase, not after killing replicas, so Redis observes the
disconnect promptly in non-timeout flows.

7. In the timeout subcase, set `repl-timeout` 2, wait inline for
`*Disconnecting timedout replica (full sync)*`, then restore
`repl-timeout` 60 so the remaining replica can finish the streamed RDB.

---------

Co-authored-by: Sarthak Aggarwal <sarthagg@amazon.com>
Co-authored-by: debing.sun <debing.sun@redis.com>
(cherry picked from commit 3189614)
sundb added a commit to redis/redis that referenced this pull request Jun 4, 2026
This PR is based on: valkey-io/valkey#3511
Close #14983

During diskless replication, if **any single replica** cannot accept a
write (TCP send buffer full / `EAGAIN`), the master stops reading the
RDB pipe entirely, stalling data delivery to **all** replicas —
including fast ones that are ready to receive data.

The failure reason is similar to
#14946, the socket buffer is more
easy to fill.

In `rdbPipeReadHandler`, the master reads from the child's RDB pipe and
writes to all replica sockets in a loop. When `connWrite` to any replica
returns a partial write (socket send buffer full), the handler:

1. Installs a per-replica `rdbPipeWriteHandler` and increments
`rdb_pipe_numconns_writing`
2. **Removes the pipe read event** via `aeDeleteFileEvent(server.el,
server.rdb_pipe_read, AE_READABLE)`, stopping all pipe reads

The pipe read event is only re-enabled when **all** pending write
handlers complete (`rdb_pipe_numconns_writing == 0`), meaning the
**slowest replica dictates the throughput for all replicas**.

With one slow replica (consuming at ~290 KB/s due to `key-load-delay`):

- Master bursts ~1.3 MB of RDB data until the slow replica's socket send
buffer fills
- `rdbPipeReadHandler` disables the pipe read event
- **All replicas starve for 4–5 seconds** while the slow replica drains
its buffer
- Cycle repeats: burst → stall → burst → stall

Ultimately, it leads to a very slow synchronization process of the
entire master and replica.

1. Skip the entire `diskless replicas drop during rdb pipe` test under
Valgrind to avoid timing flakiness on slow env.

2. Move `start_server` inside the `foreach all_drop` loop so each
subcase gets a fresh master instead of sharing state across subcases.

3. For `no / slow / fast / all` subcases, replica 0 runs with
`key-load-delay 500`, which combined with the blocked-writer TCP
back-pressure can stall the RDB-saving child indefinitely; shrink the
dataset to ~40 MB so the transfer still exercises the blocked-writer
path but completes in reasonable time instead of hanging on the TCP
deadlock.
For the timeout subcase, replica 0 does not run with `key-load-delay
500`, so to avoid the TCP deadlock we still reduce the dataset somewhat,
but keep it larger than the other subcases. Otherwise the kernel TCP
send buffer can absorb the whole RDB, and we'd miss the
repl_last_partial_write != 0 "(full sync)" timeout path and only hit the
"(streaming sync)" path instead.

5. For the `all` subcase, set `rdb-key-save-delay 1000` on the master so
the RDB child keeps generating data while both replicas are killed,
ensuring the last-replica-drop path is exercised rather than racing with
normal completion.

6. Move the slow-replica `pause_process()` so it happens only in the
timeout subcase, not after killing replicas, so Redis observes the
disconnect promptly in non-timeout flows.

7. In the timeout subcase, set `repl-timeout` 2, wait inline for
`*Disconnecting timedout replica (full sync)*`, then restore
`repl-timeout` 60 so the remaining replica can finish the streamed RDB.

---------

Co-authored-by: Sarthak Aggarwal <sarthagg@amazon.com>
Co-authored-by: debing.sun <debing.sun@redis.com>
(cherry picked from commit 3189614)
sundb added a commit to redis/redis that referenced this pull request Jun 4, 2026
This PR is based on: valkey-io/valkey#3511
Close #14983

During diskless replication, if **any single replica** cannot accept a
write (TCP send buffer full / `EAGAIN`), the master stops reading the
RDB pipe entirely, stalling data delivery to **all** replicas —
including fast ones that are ready to receive data.

The failure reason is similar to
#14946, the socket buffer is more
easy to fill.

In `rdbPipeReadHandler`, the master reads from the child's RDB pipe and
writes to all replica sockets in a loop. When `connWrite` to any replica
returns a partial write (socket send buffer full), the handler:

1. Installs a per-replica `rdbPipeWriteHandler` and increments
`rdb_pipe_numconns_writing`
2. **Removes the pipe read event** via `aeDeleteFileEvent(server.el,
server.rdb_pipe_read, AE_READABLE)`, stopping all pipe reads

The pipe read event is only re-enabled when **all** pending write
handlers complete (`rdb_pipe_numconns_writing == 0`), meaning the
**slowest replica dictates the throughput for all replicas**.

With one slow replica (consuming at ~290 KB/s due to `key-load-delay`):

- Master bursts ~1.3 MB of RDB data until the slow replica's socket send
buffer fills
- `rdbPipeReadHandler` disables the pipe read event
- **All replicas starve for 4–5 seconds** while the slow replica drains
its buffer
- Cycle repeats: burst → stall → burst → stall

Ultimately, it leads to a very slow synchronization process of the
entire master and replica.

1. Skip the entire `diskless replicas drop during rdb pipe` test under
Valgrind to avoid timing flakiness on slow env.

2. Move `start_server` inside the `foreach all_drop` loop so each
subcase gets a fresh master instead of sharing state across subcases.

3. For `no / slow / fast / all` subcases, replica 0 runs with
`key-load-delay 500`, which combined with the blocked-writer TCP
back-pressure can stall the RDB-saving child indefinitely; shrink the
dataset to ~40 MB so the transfer still exercises the blocked-writer
path but completes in reasonable time instead of hanging on the TCP
deadlock.
For the timeout subcase, replica 0 does not run with `key-load-delay
500`, so to avoid the TCP deadlock we still reduce the dataset somewhat,
but keep it larger than the other subcases. Otherwise the kernel TCP
send buffer can absorb the whole RDB, and we'd miss the
repl_last_partial_write != 0 "(full sync)" timeout path and only hit the
"(streaming sync)" path instead.

5. For the `all` subcase, set `rdb-key-save-delay 1000` on the master so
the RDB child keeps generating data while both replicas are killed,
ensuring the last-replica-drop path is exercised rather than racing with
normal completion.

6. Move the slow-replica `pause_process()` so it happens only in the
timeout subcase, not after killing replicas, so Redis observes the
disconnect promptly in non-timeout flows.

7. In the timeout subcase, set `repl-timeout` 2, wait inline for
`*Disconnecting timedout replica (full sync)*`, then restore
`repl-timeout` 60 so the remaining replica can finish the streamed RDB.

---------

Co-authored-by: Sarthak Aggarwal <sarthagg@amazon.com>
Co-authored-by: debing.sun <debing.sun@redis.com>
(cherry picked from commit 3189614)
valkeyrie-ops Bot pushed a commit that referenced this pull request Jun 9, 2026
This deflakes all variants of `diskless replicas drop during rdb pipe`.

The main issue turned out to be that the test was too sensitive to
timing and log ordering under TLS, not that the core behavior was wrong.
This keeps the same five subcases (no, slow, fast, all, timeout) but
makes them much less CI-fragile.

CI passes 200 times:
https://github.com/sarthakaggarwal97/valkey/actions/runs/24547258515

---------

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
Signed-off-by: Sarthak Aggarwal <25262500+sarthakaggarwal97@users.noreply.github.com>
Co-authored-by: Sarthak Aggarwal <25262500+sarthakaggarwal97@users.noreply.github.com>
valkeyrie-ops Bot pushed a commit that referenced this pull request Jun 10, 2026
This deflakes all variants of `diskless replicas drop during rdb pipe`.

The main issue turned out to be that the test was too sensitive to
timing and log ordering under TLS, not that the core behavior was wrong.
This keeps the same five subcases (no, slow, fast, all, timeout) but
makes them much less CI-fragile.

CI passes 200 times:
https://github.com/sarthakaggarwal97/valkey/actions/runs/24547258515

---------

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
Signed-off-by: Sarthak Aggarwal <25262500+sarthakaggarwal97@users.noreply.github.com>
Co-authored-by: Sarthak Aggarwal <25262500+sarthakaggarwal97@users.noreply.github.com>
roshkhatri added a commit to roshkhatri/valkey that referenced this pull request Jun 30, 2026
Compression finishes the diskless transfer too fast to catch a replica
mid-transfer, so the diskless "slow" pipe-drop subcase cannot exercise its
intended scenario under compression (same as the fast/timeout subcases,
already excluded). Drop "slow" from the compressed matrix and revert the
wait-budget bump (50 100 -> 1 1), restoring the upstream valkey-io#3511 fallback,
since that branch no longer runs under compression. The mid-transfer drop
stays covered by the normal uncompressed job.

Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP)

Projects

Status: Done
Status: Done
Status: 8.1.8
Status: To be backported

Development

Successfully merging this pull request may close these issues.

4 participants