Skip to content

Test tcp deadlock fixes#14946

Merged
sundb merged 4 commits into
redis:unstablefrom
sundb:ci-test-0328
Mar 30, 2026
Merged

Test tcp deadlock fixes#14946
sundb merged 4 commits into
redis:unstablefrom
sundb:ci-test-0328

Conversation

@sundb

@sundb sundb commented Mar 29, 2026

Copy link
Copy Markdown
Collaborator

This fix follows #14667 and #14886

Several tests pipelined large numbers of commands on deferring clients without draining replies. That can fill buffers and stall progress.

Fix by draining replies every 500 pipelined requests to avoid TCP stalls.


Note

Low Risk
Low risk: changes are limited to test/helpers scripts, adjusting pipelining/read patterns to avoid socket buffer stalls without affecting production code paths.

Overview
Prevents test hangs caused by pipelining large numbers of commands on deferring clients without draining replies.

Updates several Tcl tests/helpers to periodically read/discard replies (generally every 500 requests), adds final reply-draining in gen_write_load, and tweaks batching parameters in memefficiency.tcl to reduce buffer pressure during large write bursts.

Written by Cursor Bugbot for commit 0192dbf. This will update automatically on new commits. Configure here.

@augmentcode

augmentcode Bot commented Mar 29, 2026

Copy link
Copy Markdown
🤖 Augment PR Summary

Summary: This PR updates several Tcl tests/helpers that use deferred (pipelined) clients to prevent TCP stalls/deadlocks caused by accumulating unread replies.

Changes:

  • Drain deferred-client replies periodically (every 500 pipelined requests) in the diskless-load SWAPDB cluster test.
  • Update gen_write_load to read deferred replies for setup commands and to periodically drain replies during the write loop.
  • Adjust the memory-efficiency/defrag tests to discard replies more frequently (smaller batching / lower discard thresholds) to keep buffers bounded.
  • Update a set-type unit test to periodically read replies while issuing many SREM commands in deferred mode.

Technical Notes: These changes aim to keep server/client output buffers from filling when many commands are pipelined without reading replies.

🤖 Was this summary useful? React with 👍 or 👎

@augmentcode augmentcode Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review completed. 2 suggestions posted.

Fix All in Augment

Comment augment review to trigger a new review at any time.

Comment thread tests/cluster/tests/17-diskless-load-swapdb.tcl Outdated
Comment thread tests/unit/memefficiency.tcl Outdated
Comment thread tests/unit/memefficiency.tcl Outdated
@sundb

sundb commented Mar 29, 2026

Copy link
Copy Markdown
Collaborator Author

@sundb sundb requested a review from oranagra March 29, 2026 13:48

@oranagra oranagra left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lgtm, other then the comments about the generic workload.
Maybe im missing something (AFK)

set r [redis $host $port 1 $tls]
$r client setname LOAD_HANDLER
catch {$r select 9} ;# select 9 will fail in cluster mode
$r read

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fail to see where we set the client to deferring mode.
But also, this change doesn't match the pr description (reading responces on deferring clients more often).
I think changing this workload from synchronous to pipeline can cause some issues and I'd rather have a separate workload that uses pipeline, and each test should choose which one to used (iirc we have such workload in ROF)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fail to see where we set the client to deferring mode.

when creating the client, the 3th argument is 1, so it's a pipeline workload.

proc redis {{server 127.0.0.1} {port 6379} {defer 0} {tls 0} {tlsoptions {}} {readraw 0}}

proc gen_write_load {host port seconds tls {key ""} {size 0} {sleep 0}} {
    set start_time [clock seconds]
    set r [redis $host $port 1 $tls]

But also, this change doesn't match the pr description (reading responces on deferring clients more often).

Since this read will only be done once and may fail, if we don't call read here, it will be fail in the subsequent read.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i didn't see a change in that line, and no read calls in the file, so i assumed it was non-deferring.
so this workload was always using pipeline, and just didn't read any response, ever?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i didn't see a change in that line, and no read calls in the file, so i assumed it was non-deferring. so this workload was always using pipeline, and just didn't read any response, ever?

yes, didn't read anything.

}

incr count
if {$count % 500 == 0} {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we break, we leave some responces unread

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done with ec71b0b, ignore the wrong commit title.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

realize that we have exit 0 instead of break, so we don't need to read the unread replies.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so don't we wanna change it to break?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i guess you didn't realize it was an individual process and it exited when time was up.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i did, but break would also exit, and more cleanly (reading the remaining responses).

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done with 0192dbf

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

Comment thread tests/helpers/gen_write_load.tcl
Co-authored-by: oranagra <oran@redislabs.com>
@sundb

sundb commented Mar 30, 2026

Copy link
Copy Markdown
Collaborator Author

@sundb sundb merged commit a6a27f5 into redis:unstable Mar 30, 2026
18 checks passed
@github-project-automation github-project-automation Bot moved this from Todo to Done in Redis 8.8 Mar 30, 2026
@sundb sundb moved this from Todo to pending in Redis 8.6 Backport Mar 30, 2026
pierluigilenoci pushed a commit to pierluigilenoci/redis that referenced this pull request Apr 16, 2026
This fix follows redis#14667 and redis#14886

Several tests pipelined large numbers of commands on deferring clients
without draining replies. That can fill buffers and stall progress.

Fix by draining replies every 500 pipelined requests to avoid TCP
stalls.

---------

Co-authored-by: oranagra <oran@redislabs.com>
@sundb sundb moved this from Todo to pending in Redis 8.4 Backport May 5, 2026
sundb added a commit to sundb/redis that referenced this pull request May 12, 2026
This fix follows redis#14667 and redis#14886

Several tests pipelined large numbers of commands on deferring clients
without draining replies. That can fill buffers and stall progress.

Fix by draining replies every 500 pipelined requests to avoid TCP
stalls.

---------

Co-authored-by: oranagra <oran@redislabs.com>
sundb added a commit to sundb/redis that referenced this pull request May 12, 2026
This fix follows redis#14667 and redis#14886

Several tests pipelined large numbers of commands on deferring clients
without draining replies. That can fill buffers and stall progress.

Fix by draining replies every 500 pipelined requests to avoid TCP
stalls.

---------

Co-authored-by: oranagra <oran@redislabs.com>
sundb added a commit to sundb/redis that referenced this pull request May 13, 2026
This fix follows redis#14667 and redis#14886

Several tests pipelined large numbers of commands on deferring clients
without draining replies. That can fill buffers and stall progress.

Fix by draining replies every 500 pipelined requests to avoid TCP
stalls.

---------

Co-authored-by: oranagra <oran@redislabs.com>
sundb added a commit to sundb/redis that referenced this pull request May 13, 2026
This fix follows redis#14667 and redis#14886

Several tests pipelined large numbers of commands on deferring clients
without draining replies. That can fill buffers and stall progress.

Fix by draining replies every 500 pipelined requests to avoid TCP
stalls.

---------

Co-authored-by: oranagra <oran@redislabs.com>
sundb added a commit to sundb/redis that referenced this pull request May 13, 2026
This fix follows redis#14667 and redis#14886

Several tests pipelined large numbers of commands on deferring clients
without draining replies. That can fill buffers and stall progress.

Fix by draining replies every 500 pipelined requests to avoid TCP
stalls.

---------

Co-authored-by: oranagra <oran@redislabs.com>
sundb added a commit to sundb/redis that referenced this pull request May 13, 2026
This fix follows redis#14667 and redis#14886

Several tests pipelined large numbers of commands on deferring clients
without draining replies. That can fill buffers and stall progress.

Fix by draining replies every 500 pipelined requests to avoid TCP
stalls.

---------

Co-authored-by: oranagra <oran@redislabs.com>
sundb added a commit to sundb/redis that referenced this pull request May 13, 2026
This fix follows redis#14667 and redis#14886

Several tests pipelined large numbers of commands on deferring clients
without draining replies. That can fill buffers and stall progress.

Fix by draining replies every 500 pipelined requests to avoid TCP
stalls.

---------

Co-authored-by: oranagra <oran@redislabs.com>
sundb added a commit to sundb/redis that referenced this pull request May 13, 2026
This fix follows redis#14667 and redis#14886

Several tests pipelined large numbers of commands on deferring clients
without draining replies. That can fill buffers and stall progress.

Fix by draining replies every 500 pipelined requests to avoid TCP
stalls.

---------

Co-authored-by: oranagra <oran@redislabs.com>
sundb added a commit that referenced this pull request May 13, 2026
This fix follows #14667 and #14886

Several tests pipelined large numbers of commands on deferring clients
without draining replies. That can fill buffers and stall progress.

Fix by draining replies every 500 pipelined requests to avoid TCP
stalls.

---------

Co-authored-by: oranagra <oran@redislabs.com>
sundb added a commit that referenced this pull request May 13, 2026
This fix follows #14667 and #14886

Several tests pipelined large numbers of commands on deferring clients
without draining replies. That can fill buffers and stall progress.

Fix by draining replies every 500 pipelined requests to avoid TCP
stalls.

---------

Co-authored-by: oranagra <oran@redislabs.com>
sundb added a commit to sundb/redis that referenced this pull request May 13, 2026
This fix follows redis#14667 and redis#14886

Several tests pipelined large numbers of commands on deferring clients
without draining replies. That can fill buffers and stall progress.

Fix by draining replies every 500 pipelined requests to avoid TCP
stalls.

---------

Co-authored-by: oranagra <oran@redislabs.com>
sundb added a commit that referenced this pull request May 13, 2026
This fix follows #14667 and #14886

Several tests pipelined large numbers of commands on deferring clients
without draining replies. That can fill buffers and stall progress.

Fix by draining replies every 500 pipelined requests to avoid TCP
stalls.

---------

Co-authored-by: oranagra <oran@redislabs.com>
sundb added a commit that referenced this pull request May 14, 2026
This fix follows #14667 and #14886

Several tests pipelined large numbers of commands on deferring clients
without draining replies. That can fill buffers and stall progress.

Fix by draining replies every 500 pipelined requests to avoid TCP
stalls.

---------

Co-authored-by: oranagra <oran@redislabs.com>
sundb added a commit that referenced this pull request May 14, 2026
This fix follows #14667 and #14886

Several tests pipelined large numbers of commands on deferring clients
without draining replies. That can fill buffers and stall progress.

Fix by draining replies every 500 pipelined requests to avoid TCP
stalls.

---------

Co-authored-by: oranagra <oran@redislabs.com>
sundb added a commit that referenced this pull request May 19, 2026
This PR is based on: valkey-io/valkey#3511
Close #14983

## Summary

During diskless replication, if **any single replica** cannot accept a
write (TCP send buffer full / `EAGAIN`), the master stops reading the
RDB pipe entirely, stalling data delivery to **all** replicas —
including fast ones that are ready to receive data.

The failure reason is similar to
#14946, the socket buffer is more
easy to fill.

## Root Cause

In `rdbPipeReadHandler`, the master reads from the child's RDB pipe and
writes to all replica sockets in a loop. When `connWrite` to any replica
returns a partial write (socket send buffer full), the handler:

1. Installs a per-replica `rdbPipeWriteHandler` and increments
`rdb_pipe_numconns_writing`
2. **Removes the pipe read event** via `aeDeleteFileEvent(server.el,
server.rdb_pipe_read, AE_READABLE)`, stopping all pipe reads

The pipe read event is only re-enabled when **all** pending write
handlers complete (`rdb_pipe_numconns_writing == 0`), meaning the
**slowest replica dictates the throughput for all replicas**.

## Observed Behavior

With one slow replica (consuming at ~290 KB/s due to `key-load-delay`):

- Master bursts ~1.3 MB of RDB data until the slow replica's socket send
buffer fills
- `rdbPipeReadHandler` disables the pipe read event
- **All replicas starve for 4–5 seconds** while the slow replica drains
its buffer
- Cycle repeats: burst → stall → burst → stall

Ultimately, it leads to a very slow synchronization process of the
entire master and replica.

### Changes

1. Skip the entire `diskless replicas drop during rdb pipe` test under
Valgrind to avoid timing flakiness on slow env.

2. Move `start_server` inside the `foreach all_drop` loop so each
subcase gets a fresh master instead of sharing state across subcases.

3. For `no / slow / fast / all` subcases, replica 0 runs with
`key-load-delay 500`, which combined with the blocked-writer TCP
back-pressure can stall the RDB-saving child indefinitely; shrink the
dataset to ~40 MB so the transfer still exercises the blocked-writer
path but completes in reasonable time instead of hanging on the TCP
deadlock.
For the timeout subcase, replica 0 does not run with `key-load-delay
500`, so to avoid the TCP deadlock we still reduce the dataset somewhat,
but keep it larger than the other subcases. Otherwise the kernel TCP
send buffer can absorb the whole RDB, and we'd miss the
repl_last_partial_write != 0 "(full sync)" timeout path and only hit the
"(streaming sync)" path instead.

5. For the `all` subcase, set `rdb-key-save-delay 1000` on the master so
the RDB child keeps generating data while both replicas are killed,
ensuring the last-replica-drop path is exercised rather than racing with
normal completion.

6. Move the slow-replica `pause_process()` so it happens only in the
timeout subcase, not after killing replicas, so Redis observes the
disconnect promptly in non-timeout flows.

7. In the timeout subcase, set `repl-timeout` 2, wait inline for
`*Disconnecting timedout replica (full sync)*`, then restore
`repl-timeout` 60 so the remaining replica can finish the streamed RDB.

---------

Co-authored-by: Sarthak Aggarwal <sarthagg@amazon.com>  
Co-authored-by: debing.sun <debing.sun@redis.com>
@YaacovHazan YaacovHazan moved this from pending to Done in Redis 8.2 Backport Jun 3, 2026
sundb added a commit to sundb/redis that referenced this pull request Jun 4, 2026
This PR is based on: valkey-io/valkey#3511
Close redis#14983

## Summary

During diskless replication, if **any single replica** cannot accept a
write (TCP send buffer full / `EAGAIN`), the master stops reading the
RDB pipe entirely, stalling data delivery to **all** replicas —
including fast ones that are ready to receive data.

The failure reason is similar to
redis#14946, the socket buffer is more
easy to fill.

## Root Cause

In `rdbPipeReadHandler`, the master reads from the child's RDB pipe and
writes to all replica sockets in a loop. When `connWrite` to any replica
returns a partial write (socket send buffer full), the handler:

1. Installs a per-replica `rdbPipeWriteHandler` and increments
`rdb_pipe_numconns_writing`
2. **Removes the pipe read event** via `aeDeleteFileEvent(server.el,
server.rdb_pipe_read, AE_READABLE)`, stopping all pipe reads

The pipe read event is only re-enabled when **all** pending write
handlers complete (`rdb_pipe_numconns_writing == 0`), meaning the
**slowest replica dictates the throughput for all replicas**.

## Observed Behavior

With one slow replica (consuming at ~290 KB/s due to `key-load-delay`):

- Master bursts ~1.3 MB of RDB data until the slow replica's socket send
buffer fills
- `rdbPipeReadHandler` disables the pipe read event
- **All replicas starve for 4–5 seconds** while the slow replica drains
its buffer
- Cycle repeats: burst → stall → burst → stall

Ultimately, it leads to a very slow synchronization process of the
entire master and replica.

### Changes

1. Skip the entire `diskless replicas drop during rdb pipe` test under
Valgrind to avoid timing flakiness on slow env.

2. Move `start_server` inside the `foreach all_drop` loop so each
subcase gets a fresh master instead of sharing state across subcases.

3. For `no / slow / fast / all` subcases, replica 0 runs with
`key-load-delay 500`, which combined with the blocked-writer TCP
back-pressure can stall the RDB-saving child indefinitely; shrink the
dataset to ~40 MB so the transfer still exercises the blocked-writer
path but completes in reasonable time instead of hanging on the TCP
deadlock.
For the timeout subcase, replica 0 does not run with `key-load-delay
500`, so to avoid the TCP deadlock we still reduce the dataset somewhat,
but keep it larger than the other subcases. Otherwise the kernel TCP
send buffer can absorb the whole RDB, and we'd miss the
repl_last_partial_write != 0 "(full sync)" timeout path and only hit the
"(streaming sync)" path instead.

5. For the `all` subcase, set `rdb-key-save-delay 1000` on the master so
the RDB child keeps generating data while both replicas are killed,
ensuring the last-replica-drop path is exercised rather than racing with
normal completion.

6. Move the slow-replica `pause_process()` so it happens only in the
timeout subcase, not after killing replicas, so Redis observes the
disconnect promptly in non-timeout flows.

7. In the timeout subcase, set `repl-timeout` 2, wait inline for
`*Disconnecting timedout replica (full sync)*`, then restore
`repl-timeout` 60 so the remaining replica can finish the streamed RDB.

---------

Co-authored-by: Sarthak Aggarwal <sarthagg@amazon.com>
Co-authored-by: debing.sun <debing.sun@redis.com>
(cherry picked from commit 3189614)
sundb added a commit to sundb/redis that referenced this pull request Jun 4, 2026
This PR is based on: valkey-io/valkey#3511
Close redis#14983

During diskless replication, if **any single replica** cannot accept a
write (TCP send buffer full / `EAGAIN`), the master stops reading the
RDB pipe entirely, stalling data delivery to **all** replicas —
including fast ones that are ready to receive data.

The failure reason is similar to
redis#14946, the socket buffer is more
easy to fill.

In `rdbPipeReadHandler`, the master reads from the child's RDB pipe and
writes to all replica sockets in a loop. When `connWrite` to any replica
returns a partial write (socket send buffer full), the handler:

1. Installs a per-replica `rdbPipeWriteHandler` and increments
`rdb_pipe_numconns_writing`
2. **Removes the pipe read event** via `aeDeleteFileEvent(server.el,
server.rdb_pipe_read, AE_READABLE)`, stopping all pipe reads

The pipe read event is only re-enabled when **all** pending write
handlers complete (`rdb_pipe_numconns_writing == 0`), meaning the
**slowest replica dictates the throughput for all replicas**.

With one slow replica (consuming at ~290 KB/s due to `key-load-delay`):

- Master bursts ~1.3 MB of RDB data until the slow replica's socket send
buffer fills
- `rdbPipeReadHandler` disables the pipe read event
- **All replicas starve for 4–5 seconds** while the slow replica drains
its buffer
- Cycle repeats: burst → stall → burst → stall

Ultimately, it leads to a very slow synchronization process of the
entire master and replica.

1. Skip the entire `diskless replicas drop during rdb pipe` test under
Valgrind to avoid timing flakiness on slow env.

2. Move `start_server` inside the `foreach all_drop` loop so each
subcase gets a fresh master instead of sharing state across subcases.

3. For `no / slow / fast / all` subcases, replica 0 runs with
`key-load-delay 500`, which combined with the blocked-writer TCP
back-pressure can stall the RDB-saving child indefinitely; shrink the
dataset to ~40 MB so the transfer still exercises the blocked-writer
path but completes in reasonable time instead of hanging on the TCP
deadlock.
For the timeout subcase, replica 0 does not run with `key-load-delay
500`, so to avoid the TCP deadlock we still reduce the dataset somewhat,
but keep it larger than the other subcases. Otherwise the kernel TCP
send buffer can absorb the whole RDB, and we'd miss the
repl_last_partial_write != 0 "(full sync)" timeout path and only hit the
"(streaming sync)" path instead.

5. For the `all` subcase, set `rdb-key-save-delay 1000` on the master so
the RDB child keeps generating data while both replicas are killed,
ensuring the last-replica-drop path is exercised rather than racing with
normal completion.

6. Move the slow-replica `pause_process()` so it happens only in the
timeout subcase, not after killing replicas, so Redis observes the
disconnect promptly in non-timeout flows.

7. In the timeout subcase, set `repl-timeout` 2, wait inline for
`*Disconnecting timedout replica (full sync)*`, then restore
`repl-timeout` 60 so the remaining replica can finish the streamed RDB.

---------

Co-authored-by: Sarthak Aggarwal <sarthagg@amazon.com>
Co-authored-by: debing.sun <debing.sun@redis.com>
(cherry picked from commit 3189614)
sundb added a commit to sundb/redis that referenced this pull request Jun 4, 2026
This PR is based on: valkey-io/valkey#3511
Close redis#14983

During diskless replication, if **any single replica** cannot accept a
write (TCP send buffer full / `EAGAIN`), the master stops reading the
RDB pipe entirely, stalling data delivery to **all** replicas —
including fast ones that are ready to receive data.

The failure reason is similar to
redis#14946, the socket buffer is more
easy to fill.

In `rdbPipeReadHandler`, the master reads from the child's RDB pipe and
writes to all replica sockets in a loop. When `connWrite` to any replica
returns a partial write (socket send buffer full), the handler:

1. Installs a per-replica `rdbPipeWriteHandler` and increments
`rdb_pipe_numconns_writing`
2. **Removes the pipe read event** via `aeDeleteFileEvent(server.el,
server.rdb_pipe_read, AE_READABLE)`, stopping all pipe reads

The pipe read event is only re-enabled when **all** pending write
handlers complete (`rdb_pipe_numconns_writing == 0`), meaning the
**slowest replica dictates the throughput for all replicas**.

With one slow replica (consuming at ~290 KB/s due to `key-load-delay`):

- Master bursts ~1.3 MB of RDB data until the slow replica's socket send
buffer fills
- `rdbPipeReadHandler` disables the pipe read event
- **All replicas starve for 4–5 seconds** while the slow replica drains
its buffer
- Cycle repeats: burst → stall → burst → stall

Ultimately, it leads to a very slow synchronization process of the
entire master and replica.

1. Skip the entire `diskless replicas drop during rdb pipe` test under
Valgrind to avoid timing flakiness on slow env.

2. Move `start_server` inside the `foreach all_drop` loop so each
subcase gets a fresh master instead of sharing state across subcases.

3. For `no / slow / fast / all` subcases, replica 0 runs with
`key-load-delay 500`, which combined with the blocked-writer TCP
back-pressure can stall the RDB-saving child indefinitely; shrink the
dataset to ~40 MB so the transfer still exercises the blocked-writer
path but completes in reasonable time instead of hanging on the TCP
deadlock.
For the timeout subcase, replica 0 does not run with `key-load-delay
500`, so to avoid the TCP deadlock we still reduce the dataset somewhat,
but keep it larger than the other subcases. Otherwise the kernel TCP
send buffer can absorb the whole RDB, and we'd miss the
repl_last_partial_write != 0 "(full sync)" timeout path and only hit the
"(streaming sync)" path instead.

5. For the `all` subcase, set `rdb-key-save-delay 1000` on the master so
the RDB child keeps generating data while both replicas are killed,
ensuring the last-replica-drop path is exercised rather than racing with
normal completion.

6. Move the slow-replica `pause_process()` so it happens only in the
timeout subcase, not after killing replicas, so Redis observes the
disconnect promptly in non-timeout flows.

7. In the timeout subcase, set `repl-timeout` 2, wait inline for
`*Disconnecting timedout replica (full sync)*`, then restore
`repl-timeout` 60 so the remaining replica can finish the streamed RDB.

---------

Co-authored-by: Sarthak Aggarwal <sarthagg@amazon.com>
Co-authored-by: debing.sun <debing.sun@redis.com>
(cherry picked from commit 3189614)
@sundb sundb moved this from pending to Done in Redis 8.4 Backport Jun 4, 2026
@sundb sundb moved this from pending to Done in Redis 8.6 Backport Jun 4, 2026
sundb added a commit that referenced this pull request Jun 4, 2026
This PR is based on: valkey-io/valkey#3511
Close #14983

## Summary

During diskless replication, if **any single replica** cannot accept a
write (TCP send buffer full / `EAGAIN`), the master stops reading the
RDB pipe entirely, stalling data delivery to **all** replicas —
including fast ones that are ready to receive data.

The failure reason is similar to
#14946, the socket buffer is more
easy to fill.

## Root Cause

In `rdbPipeReadHandler`, the master reads from the child's RDB pipe and
writes to all replica sockets in a loop. When `connWrite` to any replica
returns a partial write (socket send buffer full), the handler:

1. Installs a per-replica `rdbPipeWriteHandler` and increments
`rdb_pipe_numconns_writing`
2. **Removes the pipe read event** via `aeDeleteFileEvent(server.el,
server.rdb_pipe_read, AE_READABLE)`, stopping all pipe reads

The pipe read event is only re-enabled when **all** pending write
handlers complete (`rdb_pipe_numconns_writing == 0`), meaning the
**slowest replica dictates the throughput for all replicas**.

## Observed Behavior

With one slow replica (consuming at ~290 KB/s due to `key-load-delay`):

- Master bursts ~1.3 MB of RDB data until the slow replica's socket send
buffer fills
- `rdbPipeReadHandler` disables the pipe read event
- **All replicas starve for 4–5 seconds** while the slow replica drains
its buffer
- Cycle repeats: burst → stall → burst → stall

Ultimately, it leads to a very slow synchronization process of the
entire master and replica.

### Changes

1. Skip the entire `diskless replicas drop during rdb pipe` test under
Valgrind to avoid timing flakiness on slow env.

2. Move `start_server` inside the `foreach all_drop` loop so each
subcase gets a fresh master instead of sharing state across subcases.

3. For `no / slow / fast / all` subcases, replica 0 runs with
`key-load-delay 500`, which combined with the blocked-writer TCP
back-pressure can stall the RDB-saving child indefinitely; shrink the
dataset to ~40 MB so the transfer still exercises the blocked-writer
path but completes in reasonable time instead of hanging on the TCP
deadlock.
For the timeout subcase, replica 0 does not run with `key-load-delay
500`, so to avoid the TCP deadlock we still reduce the dataset somewhat,
but keep it larger than the other subcases. Otherwise the kernel TCP
send buffer can absorb the whole RDB, and we'd miss the
repl_last_partial_write != 0 "(full sync)" timeout path and only hit the
"(streaming sync)" path instead.

5. For the `all` subcase, set `rdb-key-save-delay 1000` on the master so
the RDB child keeps generating data while both replicas are killed,
ensuring the last-replica-drop path is exercised rather than racing with
normal completion.

6. Move the slow-replica `pause_process()` so it happens only in the
timeout subcase, not after killing replicas, so Redis observes the
disconnect promptly in non-timeout flows.

7. In the timeout subcase, set `repl-timeout` 2, wait inline for
`*Disconnecting timedout replica (full sync)*`, then restore
`repl-timeout` 60 so the remaining replica can finish the streamed RDB.

---------

Co-authored-by: Sarthak Aggarwal <sarthagg@amazon.com>
Co-authored-by: debing.sun <debing.sun@redis.com>
(cherry picked from commit 3189614)
sundb added a commit that referenced this pull request Jun 4, 2026
This PR is based on: valkey-io/valkey#3511
Close #14983

During diskless replication, if **any single replica** cannot accept a
write (TCP send buffer full / `EAGAIN`), the master stops reading the
RDB pipe entirely, stalling data delivery to **all** replicas —
including fast ones that are ready to receive data.

The failure reason is similar to
#14946, the socket buffer is more
easy to fill.

In `rdbPipeReadHandler`, the master reads from the child's RDB pipe and
writes to all replica sockets in a loop. When `connWrite` to any replica
returns a partial write (socket send buffer full), the handler:

1. Installs a per-replica `rdbPipeWriteHandler` and increments
`rdb_pipe_numconns_writing`
2. **Removes the pipe read event** via `aeDeleteFileEvent(server.el,
server.rdb_pipe_read, AE_READABLE)`, stopping all pipe reads

The pipe read event is only re-enabled when **all** pending write
handlers complete (`rdb_pipe_numconns_writing == 0`), meaning the
**slowest replica dictates the throughput for all replicas**.

With one slow replica (consuming at ~290 KB/s due to `key-load-delay`):

- Master bursts ~1.3 MB of RDB data until the slow replica's socket send
buffer fills
- `rdbPipeReadHandler` disables the pipe read event
- **All replicas starve for 4–5 seconds** while the slow replica drains
its buffer
- Cycle repeats: burst → stall → burst → stall

Ultimately, it leads to a very slow synchronization process of the
entire master and replica.

1. Skip the entire `diskless replicas drop during rdb pipe` test under
Valgrind to avoid timing flakiness on slow env.

2. Move `start_server` inside the `foreach all_drop` loop so each
subcase gets a fresh master instead of sharing state across subcases.

3. For `no / slow / fast / all` subcases, replica 0 runs with
`key-load-delay 500`, which combined with the blocked-writer TCP
back-pressure can stall the RDB-saving child indefinitely; shrink the
dataset to ~40 MB so the transfer still exercises the blocked-writer
path but completes in reasonable time instead of hanging on the TCP
deadlock.
For the timeout subcase, replica 0 does not run with `key-load-delay
500`, so to avoid the TCP deadlock we still reduce the dataset somewhat,
but keep it larger than the other subcases. Otherwise the kernel TCP
send buffer can absorb the whole RDB, and we'd miss the
repl_last_partial_write != 0 "(full sync)" timeout path and only hit the
"(streaming sync)" path instead.

5. For the `all` subcase, set `rdb-key-save-delay 1000` on the master so
the RDB child keeps generating data while both replicas are killed,
ensuring the last-replica-drop path is exercised rather than racing with
normal completion.

6. Move the slow-replica `pause_process()` so it happens only in the
timeout subcase, not after killing replicas, so Redis observes the
disconnect promptly in non-timeout flows.

7. In the timeout subcase, set `repl-timeout` 2, wait inline for
`*Disconnecting timedout replica (full sync)*`, then restore
`repl-timeout` 60 so the remaining replica can finish the streamed RDB.

---------

Co-authored-by: Sarthak Aggarwal <sarthagg@amazon.com>
Co-authored-by: debing.sun <debing.sun@redis.com>
(cherry picked from commit 3189614)
sundb added a commit that referenced this pull request Jun 4, 2026
This PR is based on: valkey-io/valkey#3511
Close #14983

During diskless replication, if **any single replica** cannot accept a
write (TCP send buffer full / `EAGAIN`), the master stops reading the
RDB pipe entirely, stalling data delivery to **all** replicas —
including fast ones that are ready to receive data.

The failure reason is similar to
#14946, the socket buffer is more
easy to fill.

In `rdbPipeReadHandler`, the master reads from the child's RDB pipe and
writes to all replica sockets in a loop. When `connWrite` to any replica
returns a partial write (socket send buffer full), the handler:

1. Installs a per-replica `rdbPipeWriteHandler` and increments
`rdb_pipe_numconns_writing`
2. **Removes the pipe read event** via `aeDeleteFileEvent(server.el,
server.rdb_pipe_read, AE_READABLE)`, stopping all pipe reads

The pipe read event is only re-enabled when **all** pending write
handlers complete (`rdb_pipe_numconns_writing == 0`), meaning the
**slowest replica dictates the throughput for all replicas**.

With one slow replica (consuming at ~290 KB/s due to `key-load-delay`):

- Master bursts ~1.3 MB of RDB data until the slow replica's socket send
buffer fills
- `rdbPipeReadHandler` disables the pipe read event
- **All replicas starve for 4–5 seconds** while the slow replica drains
its buffer
- Cycle repeats: burst → stall → burst → stall

Ultimately, it leads to a very slow synchronization process of the
entire master and replica.

1. Skip the entire `diskless replicas drop during rdb pipe` test under
Valgrind to avoid timing flakiness on slow env.

2. Move `start_server` inside the `foreach all_drop` loop so each
subcase gets a fresh master instead of sharing state across subcases.

3. For `no / slow / fast / all` subcases, replica 0 runs with
`key-load-delay 500`, which combined with the blocked-writer TCP
back-pressure can stall the RDB-saving child indefinitely; shrink the
dataset to ~40 MB so the transfer still exercises the blocked-writer
path but completes in reasonable time instead of hanging on the TCP
deadlock.
For the timeout subcase, replica 0 does not run with `key-load-delay
500`, so to avoid the TCP deadlock we still reduce the dataset somewhat,
but keep it larger than the other subcases. Otherwise the kernel TCP
send buffer can absorb the whole RDB, and we'd miss the
repl_last_partial_write != 0 "(full sync)" timeout path and only hit the
"(streaming sync)" path instead.

5. For the `all` subcase, set `rdb-key-save-delay 1000` on the master so
the RDB child keeps generating data while both replicas are killed,
ensuring the last-replica-drop path is exercised rather than racing with
normal completion.

6. Move the slow-replica `pause_process()` so it happens only in the
timeout subcase, not after killing replicas, so Redis observes the
disconnect promptly in non-timeout flows.

7. In the timeout subcase, set `repl-timeout` 2, wait inline for
`*Disconnecting timedout replica (full sync)*`, then restore
`repl-timeout` 60 so the remaining replica can finish the streamed RDB.

---------

Co-authored-by: Sarthak Aggarwal <sarthagg@amazon.com>
Co-authored-by: debing.sun <debing.sun@redis.com>
(cherry picked from commit 3189614)
@sundb sundb deleted the ci-test-0328 branch June 29, 2026 06:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

3 participants