Test fix potential TCP deadlock in Active defrag IDMP streams test#14886
Conversation
🤖 Augment PR SummarySummary: Reduces flakiness in the “Active defrag IDMP streams” unit test (notably under TLS) by allowing more time for active defrag to complete. Changes: Increases the defrag-stop wait budget and improves the timeout diagnostics by tracking and reporting observed fragmentation ratios. 🤖 Was this summary useful? React with 👍 or 👎 |
| puts [r memory malloc-stats] | ||
| if {$expect_frag != 0} { | ||
| fail "defrag didn't stop or failed to achieve expected frag ratio ([s allocator_frag_ratio] > $expect_frag)" | ||
| fail "defrag didn't stop or failed to achieve expected frag ratio ($last_frag > $expect_frag) and final_frag=[s allocator_frag_ratio]" |
There was a problem hiding this comment.
The failure string prints ($last_frag > $expect_frag) even when the timeout is actually due to active_defrag_running staying nonzero, so it can still produce confusing output like 1.06 > 1.1. Since last_running is captured, consider including it (or otherwise avoiding implying the > relation holds) to make failures unambiguous.
Severity: low
🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.
|
|
||
| # wait for the active defrag to stop working | ||
| wait_for_defrag_stop 500 100 1.1 | ||
| wait_for_defrag_stop 1000 100 1.1 |
There was a problem hiding this comment.
Bumping this to wait_for_defrag_stop 1000 100 1.1 raises the worst-case wait to ~100s, so a real defrag stall will hold CI much longer before failing. Worth confirming that longer failure latency is acceptable for this test suite.
Severity: low
🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.
|
|
||
| # wait for the active defrag to stop working | ||
| wait_for_defrag_stop 500 100 1.1 | ||
| wait_for_defrag_stop 1000 100 1.1 |
There was a problem hiding this comment.
I don't think it's a timing issue, 50s is enough.
One possibility is that all the slabs have reached equilibrium, making it impossible to determine whether there is frag or not.
Can you provide the output of the logs and the output of memory malloc-stats when the test fails.
There was a problem hiding this comment.
Got it. I'll run the tests locally to verify if there are any issues.
|
i reproduced this issue in my local, i think it's similar to #14667 diff --git a/tests/unit/memefficiency.tcl b/tests/unit/memefficiency.tcl
index 123bf3751..5ed615577 100644
--- a/tests/unit/memefficiency.tcl
+++ b/tests/unit/memefficiency.tcl
@@ -602,15 +602,25 @@ run_solo {defrag} {
# Populate memory with interleaving IDMP stream-key pattern of same size
set dummy_iid "[string repeat x 400]"
set rd [redis_deferring_client]
+ set batch_size 10000
for {set j 0} {$j < $n} {incr j} {
set producer_id "producer[expr {$j % 10}]"
set iid "$dummy_iid[format "%06d" $j]"
$rd xadd idmpstream IDMP $producer_id $iid * field value
$rd set k$j $iid
+
+ if {($j + 1) % $batch_size == 0} {
+ for {set i 0} {$i < [expr {$batch_size * 2}]} {incr i} {
+ $rd read
+ }
+ }
}
- for {set j 0} {$j < [expr {$n * 2}]} {incr j} {
- $rd read ; # Discard replies
+ # Read remaining replies
+ set remaining [expr {($n % $batch_size) * 2}]
+ for {set j 0} {$j < $remaining} {incr j} {
+ $rd read
}
+
after 120 ;# serverCron only updates the info once in 100ms
if {$::verbose} {
puts "used [s allocator_allocated]" |
This reverts commit b3514f3.
|
@vitahlin please also update the top comment, thx. |
Cool, I've tested the suggested patch locally, and it successfully resolves the issue. The tests are now passing and the execution time is significantly faster than before. |
|
The CI failure is due to an existing bug in the current unstable branch: https://github.com/redis/redis/actions/runs/23570878820/job/68632915653 |
This fix follows #14667 and #14886 Several tests pipelined large numbers of commands on deferring clients without draining replies. That can fill buffers and stall progress. Fix by draining replies every 500 pipelined requests to avoid TCP stalls. --------- Co-authored-by: oranagra <oran@redislabs.com>
This fix follows redis#14667 and redis#14886 Several tests pipelined large numbers of commands on deferring clients without draining replies. That can fill buffers and stall progress. Fix by draining replies every 500 pipelined requests to avoid TCP stalls. --------- Co-authored-by: oranagra <oran@redislabs.com>
This fix follows redis#14667 and redis#14886 Several tests pipelined large numbers of commands on deferring clients without draining replies. That can fill buffers and stall progress. Fix by draining replies every 500 pipelined requests to avoid TCP stalls. --------- Co-authored-by: oranagra <oran@redislabs.com>
…14886) The IDMP streams defrag test sends all commands (100k) before reading any replies, which can cause TCP deadlock when buffers fill up. Fix by batching writes and reads (1000 iterations per batch), consistent with the approach already used in the script defrag test above.
…14886) The IDMP streams defrag test sends all commands (100k) before reading any replies, which can cause TCP deadlock when buffers fill up. Fix by batching writes and reads (1000 iterations per batch), consistent with the approach already used in the script defrag test above.
This fix follows redis#14667 and redis#14886 Several tests pipelined large numbers of commands on deferring clients without draining replies. That can fill buffers and stall progress. Fix by draining replies every 500 pipelined requests to avoid TCP stalls. --------- Co-authored-by: oranagra <oran@redislabs.com>
This fix follows redis#14667 and redis#14886 Several tests pipelined large numbers of commands on deferring clients without draining replies. That can fill buffers and stall progress. Fix by draining replies every 500 pipelined requests to avoid TCP stalls. --------- Co-authored-by: oranagra <oran@redislabs.com>
…14886) The IDMP streams defrag test sends all commands (100k) before reading any replies, which can cause TCP deadlock when buffers fill up. Fix by batching writes and reads (1000 iterations per batch), consistent with the approach already used in the script defrag test above.
…redis#14886)" This reverts commit 6f02f7f.
…14886) The IDMP streams defrag test sends all commands (100k) before reading any replies, which can cause TCP deadlock when buffers fill up. Fix by batching writes and reads (1000 iterations per batch), consistent with the approach already used in the script defrag test above.
This fix follows redis#14667 and redis#14886 Several tests pipelined large numbers of commands on deferring clients without draining replies. That can fill buffers and stall progress. Fix by draining replies every 500 pipelined requests to avoid TCP stalls. --------- Co-authored-by: oranagra <oran@redislabs.com>
This fix follows redis#14667 and redis#14886 Several tests pipelined large numbers of commands on deferring clients without draining replies. That can fill buffers and stall progress. Fix by draining replies every 500 pipelined requests to avoid TCP stalls. --------- Co-authored-by: oranagra <oran@redislabs.com>
This fix follows redis#14667 and redis#14886 Several tests pipelined large numbers of commands on deferring clients without draining replies. That can fill buffers and stall progress. Fix by draining replies every 500 pipelined requests to avoid TCP stalls. --------- Co-authored-by: oranagra <oran@redislabs.com>
This fix follows redis#14667 and redis#14886 Several tests pipelined large numbers of commands on deferring clients without draining replies. That can fill buffers and stall progress. Fix by draining replies every 500 pipelined requests to avoid TCP stalls. --------- Co-authored-by: oranagra <oran@redislabs.com>
This fix follows redis#14667 and redis#14886 Several tests pipelined large numbers of commands on deferring clients without draining replies. That can fill buffers and stall progress. Fix by draining replies every 500 pipelined requests to avoid TCP stalls. --------- Co-authored-by: oranagra <oran@redislabs.com>
…14886) The IDMP streams defrag test sends all commands (100k) before reading any replies, which can cause TCP deadlock when buffers fill up. Fix by batching writes and reads (1000 iterations per batch), consistent with the approach already used in the script defrag test above.
This fix follows redis#14667 and redis#14886 Several tests pipelined large numbers of commands on deferring clients without draining replies. That can fill buffers and stall progress. Fix by draining replies every 500 pipelined requests to avoid TCP stalls. --------- Co-authored-by: oranagra <oran@redislabs.com>
This fix follows redis#14667 and redis#14886 Several tests pipelined large numbers of commands on deferring clients without draining replies. That can fill buffers and stall progress. Fix by draining replies every 500 pipelined requests to avoid TCP stalls. --------- Co-authored-by: oranagra <oran@redislabs.com>
This fix follows redis#14667 and redis#14886 Several tests pipelined large numbers of commands on deferring clients without draining replies. That can fill buffers and stall progress. Fix by draining replies every 500 pipelined requests to avoid TCP stalls. --------- Co-authored-by: oranagra <oran@redislabs.com>
This fix follows #14667 and #14886 Several tests pipelined large numbers of commands on deferring clients without draining replies. That can fill buffers and stall progress. Fix by draining replies every 500 pipelined requests to avoid TCP stalls. --------- Co-authored-by: oranagra <oran@redislabs.com>
The IDMP streams defrag test sends all commands (100k) before reading any replies, which can cause TCP deadlock when buffers fill up. Fix by batching writes and reads (1000 iterations per batch), consistent with the approach already used in the script defrag test above.
This fix follows #14667 and #14886 Several tests pipelined large numbers of commands on deferring clients without draining replies. That can fill buffers and stall progress. Fix by draining replies every 500 pipelined requests to avoid TCP stalls. --------- Co-authored-by: oranagra <oran@redislabs.com>
This fix follows redis#14667 and redis#14886 Several tests pipelined large numbers of commands on deferring clients without draining replies. That can fill buffers and stall progress. Fix by draining replies every 500 pipelined requests to avoid TCP stalls. --------- Co-authored-by: oranagra <oran@redislabs.com>
This fix follows #14667 and #14886 Several tests pipelined large numbers of commands on deferring clients without draining replies. That can fill buffers and stall progress. Fix by draining replies every 500 pipelined requests to avoid TCP stalls. --------- Co-authored-by: oranagra <oran@redislabs.com>
This fix follows #14667 and #14886 Several tests pipelined large numbers of commands on deferring clients without draining replies. That can fill buffers and stall progress. Fix by draining replies every 500 pipelined requests to avoid TCP stalls. --------- Co-authored-by: oranagra <oran@redislabs.com>
This fix follows #14667 and #14886 Several tests pipelined large numbers of commands on deferring clients without draining replies. That can fill buffers and stall progress. Fix by draining replies every 500 pipelined requests to avoid TCP stalls. --------- Co-authored-by: oranagra <oran@redislabs.com>
### Issue The module datatype defrag test sends 20k commands through a deferred client before reading any replies. On slower CI environments this can cause replies to accumulate and fill TCP/socket buffers, leading to flaky `I/O error reading reply` failures. ### Change Fix by batching deferred writes and reply drains, following the same approach used in #14886.
Failed daily CI: https://github.com/redis/redis/actions/runs/22980491707/job/66718992772
The IDMP streams defrag test sends all commands (100k) before reading any replies, which can cause TCP deadlock when buffers fill up.
Fix by batching writes and reads (1000 iterations per batch), consistent with the approach already used in the script defrag test above.
Note
Low Risk
Low risk: changes are confined to a unit test and only adjust how replies are drained from a deferring client to avoid hangs in slow/TLS CI runs.
Overview
Improves the
Active defrag IDMP streamstest intests/unit/memefficiency.tclby batching reads while enqueueingXADD/SETcommands, preventing TCP/congestion deadlocks when many deferred replies accumulate.Replaces the single large reply-drain loop with periodic drains plus a final “remaining replies” drain, keeping test assertions and behavior the same while making CI runs less flaky.
Written by Cursor Bugbot for commit 9a12be9. This will update automatically on new commits. Configure here.