Test tcp deadlock fixes #14667

sundb merged 2 commits into redis:unstable from antirez:test-tcp-deadlock-fixes on Jan 7, 2026

antirez (Contributor) commented Jan 6, 2026

Disclaimer: this patch was created with the help of AI

My experience with the Redis test not passing on older hardware didn't stop with the other PR opened for the same problem. There was another deadlock happening when a test was writing a lot of commands without reading the replies back, and the cause seems related to something such tests have in common: they create a deferred client (one that does not read replies at all unless asked to) and flood the server with 1 million requests without reading anything back. This results in a networking issue where the TCP socket stops accepting more data, and the test hangs forever.

Reading those replies from time to time allows the test to run on such older hardware.

Ping @oranagra, who introduced at least one of the bulk-write tests. AFAIK changing the test in this way causes no problem, since the slave buffer is going to be filled anyway. But better to be sure that writing all that data without reading anything back was not intentional for some reason I can't see.

IMPORTANT NOTE: I am NOT sure at all that the TCP socket senses congestion on one side and also stops the other side, but anyway this fix works well and is likely a good idea in general. At the same time, I doubt there is a pending bug in Redis that makes it hang if the output buffer is too large or if we flood the system with too many commands without reading anything back. So the actual cause remains unclear. I remember that Redis, when the output limit is reached, could kill the client, but would not lower the priority of command processing. Maybe Oran knows more about this.

LLM commit message.

The test "slave buffer are counted correctly" was hanging indefinitely on slow machines. The test sends 1M pipelined commands without reading responses, which triggers a TCP-level deadlock.

Root cause: When the test client sends commands without reading responses:

  1. Server processes commands and sends responses
  2. Client's TCP receive buffer fills (client not reading)
  3. Server's TCP send buffer fills
  4. Packets get dropped due to buffer pressure
  5. TCP congestion control interprets this as network congestion
  6. cwnd (congestion window) drops to 1, RTO increases exponentially
  7. After multiple backoffs, RTO reaches ~100 seconds
  8. Connection becomes effectively frozen

This was confirmed by examining TCP socket state showing cwnd:1, backoff:9, rto:102912ms, and rwnd_limited:100% on the client side.
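
A minimal sketch of the buffer-filling part of the sequence above (steps 2 and 3), assuming a local `socket.socketpair()` rather than a real TCP connection to Redis. It does not reproduce the cwnd or RTO behavior; it only shows that a writer whose peer never reads eventually cannot queue more data, and that it can again once the peer drains. The helper names `fill_until_blocked` and `drain` are hypothetical.

```python
import socket


def fill_until_blocked(sender: socket.socket, chunk: bytes = b"x" * 4096) -> int:
    """Write to a non-blocking socket until the kernel refuses more data.

    Returns the number of bytes accepted before the send buffer was full.
    """
    sender.setblocking(False)
    sent = 0
    try:
        while True:
            sent += sender.send(chunk)
    except BlockingIOError:
        # The kernel buffers are full: a blocking writer would hang here.
        return sent


def drain(receiver: socket.socket, limit: int) -> int:
    """Read up to `limit` already-queued bytes without blocking."""
    receiver.setblocking(False)
    got = 0
    try:
        while got < limit:
            data = receiver.recv(min(65536, limit - got))
            if not data:
                break
            got += len(data)
    except BlockingIOError:
        pass  # nothing more is queued right now
    return got


if __name__ == "__main__":
    writer, reader = socket.socketpair()
    queued = fill_until_blocked(writer)
    print("writer blocked after", queued, "bytes")
    freed = drain(reader, queued)
    more = fill_until_blocked(writer)
    print("after draining", freed, "bytes the writer accepted", more, "more")
    writer.close()
    reader.close()
```

The exact byte counts depend on the operating system's buffer sizes, so none are shown here.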

The fix interleaves reads with writes by processing responses every 10,000 commands. This prevents TCP buffers from filling to the point where congestion control triggers the pathological backoff behavior.

The test still validates the same functionality (slave buffer memory accounting) since the measurement happens after all commands complete.

🤖 Generated with Claude Code
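
The interleaving described above can be sketched as follows. The Redis test suite itself is written in Tcl; this Python version is only an illustration, and `FakeDeferringClient` and `pipeline_with_drain` are hypothetical names, not part of Redis or of the patch. The fake client keeps replies in memory so that the bound on unread replies can be observed without a server.

```python
from collections import deque


class FakeDeferringClient:
    """Hypothetical stand-in for a deferring client.

    send() queues one reply that stays unread until read() is called.
    """

    def __init__(self):
        self.pending = deque()
        self.max_pending = 0  # high-water mark of unread replies

    def send(self, *args):
        self.pending.append("OK")
        self.max_pending = max(self.max_pending, len(self.pending))

    def read(self):
        return self.pending.popleft()


def pipeline_with_drain(client, commands, drain_every=10_000):
    """Send every command, reading replies back after each `drain_every` sends."""
    replies = []
    unread = 0
    for cmd in commands:
        client.send(*cmd)
        unread += 1
        if unread == drain_every:
            replies.extend(client.read() for _ in range(unread))
            unread = 0
    # Collect the replies of the final, possibly partial, batch.
    replies.extend(client.read() for _ in range(unread))
    return replies


if __name__ == "__main__":
    client = FakeDeferringClient()
    cmds = (("SET", f"key:{i}", "v") for i in range(25_000))
    replies = pipeline_with_drain(client, cmds)
    print(len(replies), client.max_pending)  # prints: 25000 10000
```

With this shape the number of unread replies never exceeds `drain_every`, while every command is still sent before any measurement that follows the loop.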

antirez and others added 2 commits January 6, 2026 17:23
The test "slave buffer are counted correctly" was hanging indefinitely
on slow machines. The test sends 1M pipelined commands without reading
responses, which triggers a TCP-level deadlock.

Root cause: When the test client sends commands without reading responses:
1. Server processes commands and sends responses
2. Client's TCP receive buffer fills (client not reading)
3. Server's TCP send buffer fills
4. Packets get dropped due to buffer pressure
5. TCP congestion control interprets this as network congestion
6. cwnd (congestion window) drops to 1, RTO increases exponentially
7. After multiple backoffs, RTO reaches ~100 seconds
8. Connection becomes effectively frozen

This was confirmed by examining TCP socket state showing cwnd:1,
backoff:9, rto:102912ms, and rwnd_limited:100% on the client side.

The fix interleaves reads with writes by processing responses every
10,000 commands. This prevents TCP buffers from filling to the point
where congestion control triggers the pathological backoff behavior.

The test still validates the same functionality (slave buffer memory
accounting) since the measurement happens after all commands complete.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The "Active defrag eval scripts" and "Active defrag pubsub" tests were
hanging on slow machines due to the same TCP congestion control issue
fixed in the maxmemory test.

These tests send 50,000-100,000 pipelined commands without reading
responses, which causes TCP buffers to fill. When buffer pressure causes
packet drops, TCP congestion control misinterprets this as network
congestion and throttles the connection (cwnd drops to 1, RTO grows
exponentially), effectively freezing the connection.

The fix interleaves reads with writes by processing responses every
1,000 commands, preventing TCP buffers from filling to the point where
congestion control triggers pathological backoff behavior.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
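
As a rough worked example of why the batch sizes matter, the sketch below bounds the reply bytes that can sit unread at any moment. It assumes every reply is the 5-byte RESP simple string `+OK\r\n`, which is an assumption for illustration only (the actual tests may receive other replies), and `max_unread_bytes` is a hypothetical helper.

```python
OK_REPLY = b"+OK\r\n"  # RESP simple-string reply, 5 bytes (assumed reply type)


def max_unread_bytes(total_commands, drain_every=None, reply_size=len(OK_REPLY)):
    """Upper bound on reply bytes left unread at any moment.

    drain_every=None models a test that never reads until all commands are sent.
    """
    if drain_every is None:
        batch = total_commands
    else:
        batch = min(drain_every, total_commands)
    return batch * reply_size


if __name__ == "__main__":
    print(max_unread_bytes(1_000_000))          # 5000000: no draining
    print(max_unread_bytes(1_000_000, 10_000))  # 50000: drain every 10,000
    print(max_unread_bytes(100_000, 1_000))     # 5000: drain every 1,000
```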

oranagra (Member) left a comment

LGTM.
We ran into that problem recently in other tests and improved them, e.g. #14217 and #14231.
Each hardware setup, or GitHub Actions on a different fork, exposes the issue in a different area.
The only reason these tests use pipelining is to speed up loading a massive amount of data; if TCP and Redis can't swallow all of it, we had better read responses periodically.
@sundb FYI

@sundb sundb merged commit 154fdce into redis:unstable Jan 7, 2026
18 checks passed
CodeClaper added a commit to CodeClaper/redis that referenced this pull request Jan 8, 2026
sundb added a commit that referenced this pull request Mar 30, 2026
This fix follows #14667 and #14886

Several tests pipelined large numbers of commands on deferring clients
without draining replies. That can fill buffers and stall progress.

Fix by draining replies every 500 pipelined requests to avoid TCP
stalls.

---------

Co-authored-by: oranagra <oran@redislabs.com>
qiongtubao pushed a commit to ctripcorp/Redis-On-Rocks that referenced this pull request Apr 3, 2026
pierluigilenoci pushed a commit to pierluigilenoci/redis that referenced this pull request Apr 16, 2026
sundb added a commit to sundb/redis that referenced this pull request May 8, 2026
EvanMGates pushed a commit to liftoffio/redis that referenced this pull request May 11, 2026
sundb added a commit to dannysheyn/redis that referenced this pull request May 12, 2026
sundb pushed a commit to sundb/redis that referenced this pull request May 12, 2026
sundb added a commit to sundb/redis that referenced this pull request May 12, 2026
sundb pushed a commit to sundb/redis that referenced this pull request May 12, 2026
sundb pushed a commit to sundb/redis that referenced this pull request May 12, 2026
sundb added a commit to sundb/redis that referenced this pull request May 12, 2026
sundb pushed a commit to sundb/redis that referenced this pull request May 13, 2026
(cherry picked from commit 154fdce)
sundb pushed a commit to sundb/redis that referenced this pull request May 13, 2026
(cherry picked from commit 154fdce)
sundb pushed a commit to sundb/redis that referenced this pull request May 13, 2026
(cherry picked from commit 154fdce)
sundb pushed a commit to sundb/redis that referenced this pull request May 13, 2026
**Disclaimer: this patch was created with the help of AI**

My experience with the Redis test not passing on older hardware didn't
stop just with the other PR opened with the same problem. There was
another deadlock happening when the test was writing a lot of commands
without reading it back, and the cause seems related to the fact that
such tests have something in common. They create a deferred client (that
does not read replies at all, if not asked to), flood the server with 1
million of requests without reading anything back. This results in a
networking issue where the TCP socket stops accepting more data, and the
test hangs forever.

To read those replies from time to time allows to run the test on such
older hardware.

Ping oranagra that introduced at least one of the bulk writes tests.
AFAIK there is no problem in the test, if we change it in this way,
since the slave buffer is going to be filled anyway. But better to be
sure that it was not intentional to write all those data without reading
back for some reason I can't see.

IMPORTANT NOTE: **I am NOT sure at all** that the TCP socket senses
congestion in one side and also stops the other side, but anyway this
fix works well and is likely a good idea in general. At the same time, I
doubt there is a pending bug in Redis that makes it hang if the output
buffer is too large, or we are flooding the system with too many
commands without reading anything back. So the actual cause remains
cloudy. I remember that Redis, when the output limit is reached, could
kill the client, and not lower the priority of command processing. Maybe
Oran knows more about this.

The test "slave buffer are counted correctly" was hanging indefinitely
on slow machines. The test sends 1M pipelined commands without reading
responses, which triggers a TCP-level deadlock.

Root cause: When the test client sends commands without reading
responses:
1. Server processes commands and sends responses
2. Client's TCP receive buffer fills (client not reading)
3. Server's TCP send buffer fills
4. Packets get dropped due to buffer pressure
5. TCP congestion control interprets this as network congestion
6. cwnd (congestion window) drops to 1, RTO increases exponentially
7. After multiple backoffs, RTO reaches ~100 seconds
8. Connection becomes effectively frozen

This was confirmed by examining TCP socket state showing cwnd:1,
backoff:9, rto:102912ms, and rwnd_limited:100% on the client side.
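
As a rough consistency check on those numbers, the sketch below assumes the
reported `rto` already includes one doubling per retransmission backoff.
That is an assumption about how the socket tool reports the value, not
something stated in this PR. The function names are hypothetical:

```python
def implied_base_rto_ms(rto_ms: float, backoff: int) -> float:
    """Undo `backoff` doublings to recover the un-backed-off RTO (assumed model)."""
    return rto_ms / (2 ** backoff)


def rto_schedule_ms(base_rto_ms: float, backoffs: int) -> list[float]:
    """RTO after 0..backoffs doublings, i.e. plain exponential backoff."""
    return [base_rto_ms * (2 ** k) for k in range(backoffs + 1)]


base = implied_base_rto_ms(102912, 9)
print(base)                        # 201.0
print(rto_schedule_ms(base, 9)[-1])  # 102912.0
```

Under that assumption, nine doublings of a roughly 200 ms timer are enough
to reach the ~100 second RTO mentioned in the list above.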

The fix interleaves reads with writes by processing responses every
10,000 commands. This prevents TCP buffers from filling to the point
where congestion control triggers the pathological backoff behavior.
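
The interleaving pattern can be illustrated outside the Redis test suite
with a minimal Python sketch. It is not the actual patch (the real tests
are written in Tcl). The toy server, the `PING\n` request format, and the
names `toy_server`, `read_replies` and `pipeline_with_drain` are all
assumptions made for illustration:

```python
import socket
import threading

REPLY = b"+OK\r\n"


def toy_server(conn: socket.socket) -> None:
    """Stand-in server: sends one fixed reply per newline-terminated request."""
    with conn:
        while True:
            data = conn.recv(65536)
            if not data:
                break
            requests = data.count(b"\n")
            if requests:
                conn.sendall(REPLY * requests)


def read_replies(sock: socket.socket, count: int) -> int:
    """Block until exactly `count` fixed-size replies have been read."""
    remaining = count * len(REPLY)
    while remaining > 0:
        data = sock.recv(remaining)
        if not data:
            raise ConnectionError("peer closed before all replies arrived")
        remaining -= len(data)
    return count


def pipeline_with_drain(total: int, drain_every: int) -> int:
    """Send `total` requests, draining replies every `drain_every` requests."""
    client, server = socket.socketpair()
    worker = threading.Thread(target=toy_server, args=(server,), daemon=True)
    worker.start()
    replies = 0
    pending = 0
    with client:
        for _ in range(total):
            client.sendall(b"PING\n")
            pending += 1
            if pending == drain_every:
                # Bound the amount of unread reply data sitting in buffers.
                replies += read_replies(client, pending)
                pending = 0
        # Drain whatever is left from the final partial batch.
        replies += read_replies(client, pending)
    worker.join(timeout=5)
    return replies
```

With this structure the amount of unread reply data is capped at
`drain_every` replies, regardless of how many requests are sent in total.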

The test still validates the same functionality (slave buffer memory
accounting) since the measurement happens after all commands complete.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
(cherry picked from commit 154fdce)
sundb pushed a commit to sundb/redis that referenced this pull request May 13, 2026
sundb pushed a commit to sundb/redis that referenced this pull request May 13, 2026
sundb added a commit to sundb/redis that referenced this pull request May 13, 2026
This fix follows redis#14667 and redis#14886

Several tests pipelined large numbers of commands on deferring clients
without draining replies. That can fill buffers and stall progress.

Fix by draining replies every 500 pipelined requests to avoid TCP
stalls.

---------

Co-authored-by: oranagra <oran@redislabs.com>
sundb added a commit to sundb/redis that referenced this pull request May 13, 2026
sundb added a commit to sundb/redis that referenced this pull request May 13, 2026
sundb added a commit to sundb/redis that referenced this pull request May 13, 2026
sundb added a commit to sundb/redis that referenced this pull request May 13, 2026
sundb added a commit to sundb/redis that referenced this pull request May 13, 2026
sundb added a commit that referenced this pull request May 13, 2026
sundb pushed a commit that referenced this pull request May 13, 2026
sundb added a commit that referenced this pull request May 13, 2026
sundb added a commit to sundb/redis that referenced this pull request May 13, 2026
sundb pushed a commit that referenced this pull request May 13, 2026
sundb added a commit that referenced this pull request May 13, 2026
sundb pushed a commit that referenced this pull request May 14, 2026
sundb added a commit that referenced this pull request May 14, 2026
sundb pushed a commit that referenced this pull request May 14, 2026
sundb added a commit that referenced this pull request May 14, 2026