Test tcp deadlock fixes #14667
Merged
The test "slave buffer are counted correctly" was hanging indefinitely on slow machines. The test sends 1M pipelined commands without reading responses, which triggers a TCP-level deadlock.

Root cause: When the test client sends commands without reading responses:

1. Server processes commands and sends responses
2. Client's TCP receive buffer fills (client not reading)
3. Server's TCP send buffer fills
4. Packets get dropped due to buffer pressure
5. TCP congestion control interprets this as network congestion
6. cwnd (congestion window) drops to 1, RTO increases exponentially
7. After multiple backoffs, RTO reaches ~100 seconds
8. Connection becomes effectively frozen

This was confirmed by examining TCP socket state showing cwnd:1, backoff:9, rto:102912ms, and rwnd_limited:100% on the client side.

The fix interleaves reads with writes by processing responses every 10,000 commands. This prevents TCP buffers from filling to the point where congestion control triggers the pathological backoff behavior. The test still validates the same functionality (slave buffer memory accounting) since the measurement happens after all commands complete.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
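The quoted socket state (backoff:9, rto:102912ms) is consistent with a timeout that doubles on every backoff. The sketch below only checks that arithmetic. The 201 ms base is back-derived from the quoted figures and does not appear in the PR, the function names are hypothetical, and whether the kernel reports the value exactly this way is an assumption.

```python
from typing import List, Optional


def backed_off_rto_ms(base_rto_ms: int, backoffs: int,
                      cap_ms: Optional[int] = None) -> int:
    """Timeout after `backoffs` doublings, optionally clamped to `cap_ms`.

    `cap_ms` is a hypothetical knob for illustration; real stacks clamp the
    timeout to an implementation-defined maximum.
    """
    rto = base_rto_ms * (2 ** backoffs)
    return rto if cap_ms is None else min(rto, cap_ms)


def backoff_schedule_ms(base_rto_ms: int, backoffs: int,
                        cap_ms: Optional[int] = None) -> List[int]:
    """Timeouts for 0..backoffs doublings, in order."""
    return [backed_off_rto_ms(base_rto_ms, n, cap_ms)
            for n in range(backoffs + 1)]


if __name__ == "__main__":
    # Assumed base of 201 ms: 201 * 2**9 == 102912, the value quoted above.
    for n, rto in enumerate(backoff_schedule_ms(201, 9)):
        print(f"backoff {n}: {rto} ms")
```

Under that assumption, nine doublings turn a sub-second timeout into roughly 100 seconds between attempts, which is why the connection looks frozen rather than merely slow.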
The "Active defrag eval scripts" and "Active defrag pubsub" tests were hanging on slow machines due to the same TCP congestion control issue fixed in the maxmemory test.

These tests send 50,000-100,000 pipelined commands without reading responses, which causes TCP buffers to fill. When buffer pressure causes packet drops, TCP congestion control misinterprets this as network congestion and throttles the connection (cwnd drops to 1, RTO grows exponentially), effectively freezing the connection.

The fix interleaves reads with writes by processing responses every 1,000 commands, preventing TCP buffers from filling to the point where congestion control triggers pathological backoff behavior.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
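The interleaving described in the two commit messages above can be sketched as follows. This is a minimal illustration in Python rather than the Tcl used by the Redis test suite, and it is not the code from this PR: `pipeline_with_drain` and `FakeConnection` are hypothetical names, and the fake connection is an in-memory stand-in that only tracks how many replies are left unread.

```python
from collections import deque
from typing import Callable, Deque, Iterable, List


def pipeline_with_drain(send: Callable[[str], None],
                        read_reply: Callable[[], str],
                        commands: Iterable[str],
                        drain_every: int = 1000) -> List[str]:
    """Pipeline commands, reading replies back every `drain_every` sends."""
    if drain_every <= 0:
        raise ValueError("drain_every must be positive")
    replies: List[str] = []
    pending = 0  # commands sent whose replies have not been read yet
    for command in commands:
        send(command)
        pending += 1
        if pending == drain_every:
            # Drain the whole batch before sending more.
            for _ in range(pending):
                replies.append(read_reply())
            pending = 0
    # Read whatever is left from the final, partial batch.
    for _ in range(pending):
        replies.append(read_reply())
    return replies


class FakeConnection:
    """Hypothetical in-memory stand-in for a deferring client."""

    def __init__(self) -> None:
        self.unread: Deque[str] = deque()
        self.max_unread = 0  # high-water mark of unread replies

    def send(self, command: str) -> None:
        self.unread.append("OK:" + command)
        self.max_unread = max(self.max_unread, len(self.unread))

    def read_reply(self) -> str:
        return self.unread.popleft()


if __name__ == "__main__":
    conn = FakeConnection()
    cmds = [f"SET key{i} value" for i in range(2500)]
    out = pipeline_with_drain(conn.send, conn.read_reply, cmds)
    print(len(out), conn.max_unread)
```

The point of the design is that the unread backlog is bounded by `drain_every` instead of by the total number of commands, while most of the pipelining speedup is kept.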
oranagra
approved these changes
Jan 6, 2026
oranagra
left a comment
Member
LGTM.
We ran into that problem recently in other tests and improved them, e.g. #14217 and #14231.
Each hardware setup, or GH Actions on a different fork, exposes the issue in a different area.
The only reason these tests use pipelining is to speed up loading a massive amount of data; if TCP and Redis can't swallow all of it, we'd better read responses periodically.
@sundb FYI
sundb
approved these changes
Jan 7, 2026
CodeClaper
added a commit
to CodeClaper/redis
that referenced
this pull request
Jan 8, 2026
This reverts commit 154fdce.
This was referenced Mar 26, 2026
sundb
added a commit
that referenced
this pull request
Mar 30, 2026
This fix follows #14667 and #14886.

Several tests pipelined large numbers of commands on deferring clients without draining replies. That can fill buffers and stall progress.

Fix by draining replies every 500 pipelined requests to avoid TCP stalls.

---------

Co-authored-by: oranagra <oran@redislabs.com>
qiongtubao
pushed a commit
to ctripcorp/Redis-On-Rocks
that referenced
this pull request
Apr 3, 2026
**Disclaimer: this patch was created with the help of AI**

My experience with the Redis test not passing on older hardware didn't stop with the other PR opened for the same problem. There was another deadlock happening when the test wrote a lot of commands without reading the replies back, and the cause seems related to something these tests have in common: they create a deferred client (which does not read replies at all unless asked to) and flood the server with 1 million requests without reading anything back. This results in a networking issue where the TCP socket stops accepting more data, and the test hangs forever. Reading those replies from time to time allows the test to run on such older hardware.

Ping oranagra, who introduced at least one of the bulk-write tests. AFAIK there is no problem with changing the test this way, since the slave buffer is going to be filled anyway. But better to be sure that writing all that data without reading back was not intentional for some reason I can't see.

IMPORTANT NOTE: **I am NOT sure at all** that the TCP socket senses congestion on one side and also stops the other side, but anyway this fix works well and is likely a good idea in general. At the same time, I doubt there is a pending bug in Redis that makes it hang if the output buffer is too large, or if we are flooding the system with too many commands without reading anything back. So the actual cause remains cloudy. I remember that Redis, when the output limit is reached, could kill the client, and not lower the priority of command processing. Maybe Oran knows more about this.

The test "slave buffer are counted correctly" was hanging indefinitely on slow machines. The test sends 1M pipelined commands without reading responses, which triggers a TCP-level deadlock.

Root cause: When the test client sends commands without reading responses:

1. Server processes commands and sends responses
2. Client's TCP receive buffer fills (client not reading)
3. Server's TCP send buffer fills
4. Packets get dropped due to buffer pressure
5. TCP congestion control interprets this as network congestion
6. cwnd (congestion window) drops to 1, RTO increases exponentially
7. After multiple backoffs, RTO reaches ~100 seconds
8. Connection becomes effectively frozen

This was confirmed by examining TCP socket state showing cwnd:1, backoff:9, rto:102912ms, and rwnd_limited:100% on the client side.

The fix interleaves reads with writes by processing responses every 10,000 commands. This prevents TCP buffers from filling to the point where congestion control triggers the pathological backoff behavior. The test still validates the same functionality (slave buffer memory accounting) since the measurement happens after all commands complete.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
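The part of this description where "the socket stops accepting more data" can be reproduced in isolation. The sketch below is a hedged illustration, not taken from the PR: it uses Python's `socket.socketpair()` (a local socket pair, not necessarily TCP) and a hypothetical helper name, and it shows only the buffer-full condition when the peer never reads. It does not reproduce the cwnd/RTO backoff reported above.

```python
import socket
from typing import Tuple


def bytes_until_send_blocks(chunk_size: int = 4096,
                            limit: int = 64 * 1024 * 1024) -> Tuple[int, bool]:
    """Write to a peer that never reads until the kernel refuses more data.

    Returns (bytes_accepted, blocked). `limit` is a safety bound so the loop
    always terminates even if the buffers turn out to be very large.
    """
    writer, reader = socket.socketpair()
    try:
        writer.setblocking(False)  # fail fast instead of hanging like the test
        chunk = b"x" * chunk_size
        sent = 0
        while sent < limit:
            try:
                sent += writer.send(chunk)
            except BlockingIOError:
                # Kernel buffers are full because `reader` never calls recv().
                return sent, True
        return sent, False
    finally:
        writer.close()
        reader.close()


if __name__ == "__main__":
    accepted, blocked = bytes_until_send_blocks()
    print(f"accepted {accepted} bytes, blocked={blocked}")
```

With a blocking socket the same write would simply wait, which matches the hang seen in the tests; reading on the other side from time to time frees buffer space and lets the writer continue.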
pierluigilenoci
pushed a commit
to pierluigilenoci/redis
that referenced
this pull request
Apr 16, 2026
sundb
added a commit
to sundb/redis
that referenced
this pull request
May 8, 2026
EvanMGates
pushed a commit
to liftoffio/redis
that referenced
this pull request
May 11, 2026
sundb
added a commit
to dannysheyn/redis
that referenced
this pull request
May 12, 2026
sundb
pushed a commit
to sundb/redis
that referenced
this pull request
May 12, 2026
sundb
added a commit
to sundb/redis
that referenced
this pull request
May 12, 2026
sundb
pushed a commit
to sundb/redis
that referenced
this pull request
May 12, 2026
sundb
pushed a commit
to sundb/redis
that referenced
this pull request
May 12, 2026
sundb
added a commit
to sundb/redis
that referenced
this pull request
May 12, 2026
sundb
pushed a commit
to sundb/redis
that referenced
this pull request
May 13, 2026
(cherry picked from commit 154fdce)
sundb
pushed a commit
to sundb/redis
that referenced
this pull request
May 13, 2026
(cherry picked from commit 154fdce)
sundb
pushed a commit
to sundb/redis
that referenced
this pull request
May 13, 2026
(cherry picked from commit 154fdce)
sundb
pushed a commit
to sundb/redis
that referenced
this pull request
May 13, 2026
(cherry picked from commit 154fdce)
sundb
pushed a commit
to sundb/redis
that referenced
this pull request
May 13, 2026
(cherry picked from commit 154fdce)
sundb added a commit to sundb/redis that referenced this pull request on May 13, 2026

This fix follows redis#14667 and redis#14886.

Several tests pipelined large numbers of commands on deferring clients without draining replies. That can fill buffers and stall progress.

Fix by draining replies every 500 pipelined requests to avoid TCP stalls.

---------

Co-authored-by: oranagra <oran@redislabs.com>
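To see why periodic draining bounds the amount of unread data, here is a small model in Python. It is a simplification and not code from the Redis test suite: `max_outstanding` is a hypothetical helper that counts replies rather than bytes and assumes each drain reads every pending reply.

```python
from typing import Optional


def max_outstanding(total: int, drain_every: Optional[int]) -> int:
    """Peak number of unread replies while sending `total` commands.

    If drain_every is None the client never reads; otherwise it reads all
    pending replies after every `drain_every` commands.
    """
    pending = 0
    peak = 0
    for sent in range(1, total + 1):
        pending += 1                      # one more reply waiting to be read
        peak = max(peak, pending)
        if drain_every and sent % drain_every == 0:
            pending = 0                   # drain everything received so far
    return peak


if __name__ == "__main__":
    print(max_outstanding(100_000, None))
    print(max_outstanding(100_000, 500))
```

In this model the backlog never exceeds the drain interval, whereas without draining it grows with the total number of commands sent.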
Disclaimer: this patch was created with the help of AI
My experience with Redis tests not passing on older hardware didn't stop with the other PR opened for the same problem. There was another deadlock when a test wrote a lot of commands without reading the replies back, and the cause seems related to something these tests have in common: they create a deferred client (one that does not read replies unless asked to) and flood the server with 1 million requests without reading anything back. This results in a networking issue where the TCP socket stops accepting more data, and the test hangs forever.
Reading those replies from time to time allows the test to run on such older hardware.
Ping @oranagra, who introduced at least one of the bulk-write tests. AFAIK the test is not affected if we change it this way, since the slave buffer is going to be filled anyway. But it is better to be sure that writing all that data without reading back was not intentional for some reason I can't see.
IMPORTANT NOTE: I am NOT sure at all that the TCP socket senses congestion on one side and also stops the other side, but this fix works well and is likely a good idea in general. At the same time, I doubt there is a pending bug in Redis that makes it hang if the output buffer is too large, or if we flood the system with too many commands without reading anything back. So the actual cause remains cloudy. I remember that Redis, when the output limit is reached, could kill the client, and not lower the priority of command processing. Maybe Oran knows more about this.
LLM commit message.
The test "slave buffer are counted correctly" was hanging indefinitely on slow machines. The test sends 1M pipelined commands without reading responses, which triggers a TCP-level deadlock.
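Root cause: When the test client sends commands without reading responses:
1. Server processes commands and sends responses
2. Client's TCP receive buffer fills (client not reading)
3. Server's TCP send buffer fills
4. Packets get dropped due to buffer pressure
5. TCP congestion control interprets this as network congestion
6. cwnd (congestion window) drops to 1, RTO increases exponentially
7. After multiple backoffs, RTO reaches ~100 seconds
8. Connection becomes effectively frozen
This was confirmed by examining TCP socket state showing cwnd:1, backoff:9, rto:102912ms, and rwnd_limited:100% on the client side.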
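A check like this could be automated. The sketch below is an assumption-laden illustration: it parses `name:value` tokens exactly as they are written in the sentence above (the PR does not say which tool produced them or what its full output looks like), and the "stalled" thresholds in `looks_stalled` are hypothetical.

```python
import re
from typing import Dict

# Matches tokens such as "cwnd:1", "rto:102912ms" or "rwnd_limited:100%".
FIELD = re.compile(r"(\w+):(\d+(?:\.\d+)?)(ms|%)?")


def parse_tcp_state(text: str) -> Dict[str, float]:
    """Extract numeric name:value fields from a socket-state string."""
    return {name: float(value) for name, value, _unit in FIELD.findall(text)}


def looks_stalled(state: Dict[str, float],
                  rto_threshold_ms: float = 60_000.0) -> bool:
    """Hypothetical heuristic: tiny congestion window plus a very long RTO."""
    return (state.get("cwnd", float("inf")) <= 1
            and state.get("rto", 0.0) >= rto_threshold_ms)


if __name__ == "__main__":
    sample = "cwnd:1, backoff:9, rto:102912ms, and rwnd_limited:100%"
    state = parse_tcp_state(sample)
    print(state, looks_stalled(state))
```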
The fix interleaves reads with writes by processing responses every 10,000 commands. This prevents TCP buffers from filling to the point where congestion control triggers the pathological backoff behavior.
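The interleaving pattern itself can be sketched in Python. This is not the Tcl change made in the PR; it is a minimal illustration under stated assumptions: a toy in-process server that answers every line with `+OK`, a local `socket.socketpair()` instead of a real TCP connection (so it shows the read/write pattern, not the stall), and the hypothetical helper names `serve`, `pipeline_with_drain`, and `demo`.

```python
import socket
import threading


def serve(conn: socket.socket) -> None:
    """Toy server: reply +OK to every newline-terminated command."""
    with conn, conn.makefile("rb") as reader:
        for _line in reader:
            conn.sendall(b"+OK\r\n")


def pipeline_with_drain(conn: socket.socket, total: int, drain_every: int) -> int:
    """Send `total` commands, reading pending replies every `drain_every`."""
    replies = 0
    pending = 0
    with conn.makefile("rb") as reader:
        def drain(count: int) -> int:
            for _ in range(count):
                if not reader.readline():
                    raise ConnectionError("server closed the connection")
            return count

        for _ in range(total):
            conn.sendall(b"PING\r\n")
            pending += 1
            if pending == drain_every:
                replies += drain(pending)   # interleave reads with writes
                pending = 0
        replies += drain(pending)           # read whatever is left at the end
    return replies


def demo(total: int = 5000, drain_every: int = 500) -> int:
    client, server = socket.socketpair()
    threading.Thread(target=serve, args=(server,), daemon=True).start()
    try:
        return pipeline_with_drain(client, total, drain_every)
    finally:
        client.close()


if __name__ == "__main__":
    print(demo())
```

The final drain matters for the same reason the commit message gives: any measurement taken afterwards should happen only once all commands have been processed.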
The test still validates the same functionality (slave buffer memory accounting) since the measurement happens after all commands complete.
🤖 Generated with Claude Code