Test fix potential TCP deadlock in Active defrag IDMP streams test by vitahlin · Pull Request #14886 · redis/redis

vitahlin · 2026-03-12T15:27:58Z

Failed daily CI: https://github.com/redis/redis/actions/runs/22980491707/job/66718992772

The IDMP streams defrag test sends all commands (100k) before reading any replies, which can cause TCP deadlock when buffers fill up.

Fix by batching writes and reads (1000 iterations per batch), consistent with the approach already used in the script defrag test above.
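The stall described above can be illustrated with a toy model. This is a minimal sketch, not real TCP and not the test's code: `run_pipeline`, `request_capacity`, and `reply_capacity` are hypothetical names, and the capacities are counted in commands rather than bytes. It assumes the server can only process a request while there is room to queue its reply, and that the client can only send while the request buffer has room.

```python
def run_pipeline(total_cmds, request_capacity, reply_capacity, batch_size=None):
    """Return True if every reply is eventually read, False if the exchange stalls.

    Toy model (assumption): buffers hold a fixed number of commands/replies.
    """
    queued_requests = 0  # sent by the client, not yet processed by the server
    queued_replies = 0   # produced by the server, not yet read by the client
    sent = 0
    read = 0

    def server_step():
        # The server processes requests only while the reply buffer has room.
        nonlocal queued_requests, queued_replies
        n = min(queued_requests, reply_capacity - queued_replies)
        queued_requests -= n
        queued_replies += n

    def drain(count):
        # The client reads `count` replies, letting the server make progress.
        nonlocal queued_replies, read
        while count > 0:
            server_step()
            if queued_replies == 0:
                return False  # nothing left to read: would block forever
            take = min(queued_replies, count)
            queued_replies -= take
            read += take
            count -= take
        return True

    while sent < total_cmds:
        server_step()
        if queued_requests >= request_capacity:
            # Client is blocked writing, server is blocked replying,
            # and nobody is reading: the deadlock case.
            return False
        queued_requests += 1
        sent += 1
        if batch_size and sent % batch_size == 0:
            if not drain(batch_size):
                return False
    # Final drain of whatever has not been read yet.
    return drain(total_cmds - read)


if __name__ == "__main__":
    print(run_pipeline(100_000, 1000, 1000))                  # send everything first
    print(run_pipeline(100_000, 1000, 1000, batch_size=500))  # drain periodically
```

In this model, sending everything before reading stalls once both buffers are full, while draining every `batch_size` commands keeps the number of outstanding replies bounded.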

Note

Low Risk
Low risk: changes are confined to a unit test and only adjust how replies are drained from a deferring client to avoid hangs in slow/TLS CI runs.

Overview
Improves the Active defrag IDMP streams test in tests/unit/memefficiency.tcl by batching reads while enqueueing XADD/SET commands, preventing TCP/congestion deadlocks when many deferred replies accumulate.

Replaces the single large reply-drain loop with periodic drains plus a final “remaining replies” drain, keeping test assertions and behavior the same while making CI runs less flaky.

Written by Cursor Bugbot for commit 9a12be9.

augmentcode · 2026-03-12T15:30:17Z

🤖 Augment PR Summary

Summary: Reduces flakiness in the “Active defrag IDMP streams” unit test (notably under TLS) by allowing more time for active defrag to complete.

Changes: Increases the defrag-stop wait budget and improves the timeout diagnostics by tracking and reporting observed fragmentation ratios.


augmentcode

Review completed. 2 suggestions posted.


augmentcode · 2026-03-12T15:30:18Z

            puts [r memory malloc-stats]
            if {$expect_frag != 0} {
-                fail "defrag didn't stop or failed to achieve expected frag ratio ([s allocator_frag_ratio] > $expect_frag)"
+                fail "defrag didn't stop or failed to achieve expected frag ratio ($last_frag > $expect_frag) and final_frag=[s allocator_frag_ratio]"


The failure string prints ($last_frag > $expect_frag) even when the timeout is actually due to active_defrag_running staying nonzero, so it can still produce confusing output like 1.06 > 1.1. Since last_running is captured, consider including it (or otherwise avoiding implying the > relation holds) to make failures unambiguous.

Severity: low


augmentcode · 2026-03-12T15:30:18Z


                # wait for the active defrag to stop working
-                wait_for_defrag_stop 500 100 1.1
+                wait_for_defrag_stop 1000 100 1.1


Bumping this to wait_for_defrag_stop 1000 100 1.1 raises the worst-case wait to ~100s, so a real defrag stall will hold CI much longer before failing. Worth confirming that longer failure latency is acceptable for this test suite.

Severity: low
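The timing figures in this thread can be checked with simple arithmetic. This sketch assumes, as the "~100s" estimate above implies, that the first argument of wait_for_defrag_stop is the maximum number of polls and the second is the delay between polls in milliseconds. The helper name `worst_case_wait_seconds` is hypothetical.

```python
def worst_case_wait_seconds(max_tries, delay_ms):
    """Upper bound on the wait, ignoring the time each poll itself takes."""
    return max_tries * delay_ms / 1000


if __name__ == "__main__":
    print(worst_case_wait_seconds(500, 100))   # original budget
    print(worst_case_wait_seconds(1000, 100))  # proposed budget
```

Under that assumption the change doubles the time a genuinely stalled defrag would hold the job before failing.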


cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

sundb · 2026-03-13T01:55:50Z


                # wait for the active defrag to stop working
-                wait_for_defrag_stop 500 100 1.1
+                wait_for_defrag_stop 1000 100 1.1


I don't think it's a timing issue; 50s is enough.
One possibility is that all the slabs have reached equilibrium, making it impossible to determine whether there is fragmentation or not.
Can you provide the logs and the output of memory malloc-stats from a failing run?

vitahlin · 2026-03-15

Got it. I'll run the tests locally to verify if there are any issues.

sundb · 2026-03-26T03:53:46Z

I reproduced this issue locally; I think it's similar to #14667.
Please try the following patch:

diff --git a/tests/unit/memefficiency.tcl b/tests/unit/memefficiency.tcl
index 123bf3751..5ed615577 100644
--- a/tests/unit/memefficiency.tcl
+++ b/tests/unit/memefficiency.tcl
@@ -602,15 +602,25 @@ run_solo {defrag} {
             # Populate memory with interleaving IDMP stream-key pattern of same size
             set dummy_iid "[string repeat x 400]"
             set rd [redis_deferring_client]
+            set batch_size 10000
             for {set j 0} {$j < $n} {incr j} {
                 set producer_id "producer[expr {$j % 10}]"
                 set iid "$dummy_iid[format "%06d" $j]"
                 $rd xadd idmpstream IDMP $producer_id $iid * field value
                 $rd set k$j $iid
+                
+                if {($j + 1) % $batch_size == 0} {
+                    for {set i 0} {$i < [expr {$batch_size * 2}]} {incr i} {
+                        $rd read
+                    }
+                }
             }
-            for {set j 0} {$j < [expr {$n * 2}]} {incr j} {
-                $rd read ; # Discard replies
+            # Read remaining replies
+            set remaining [expr {($n % $batch_size) * 2}]
+            for {set j 0} {$j < $remaining} {incr j} {
+                $rd read
             }
+            
             after 120 ;# serverCron only updates the info once in 100ms
             if {$::verbose} {
                 puts "used [s allocator_allocated]"
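The reply accounting in the patch above can be sanity-checked outside Tcl. This is a minimal sketch that mirrors the patch's loop structure. `planned_reads` is a hypothetical helper, and it relies on each iteration queuing exactly two commands (one XADD and one SET), as in the diff.

```python
def planned_reads(n, batch_size):
    """Count the replies read inside the loop and in the final drain."""
    in_loop = 0
    for j in range(n):
        # Mirrors: if {($j + 1) % $batch_size == 0} { read batch_size * 2 replies }
        if (j + 1) % batch_size == 0:
            in_loop += batch_size * 2
    # Mirrors: set remaining [expr {($n % $batch_size) * 2}]
    remaining = (n % batch_size) * 2
    return in_loop, remaining


if __name__ == "__main__":
    for n, batch_size in [(100_000, 10_000), (25, 10), (5, 10)]:
        in_loop, remaining = planned_reads(n, batch_size)
        # Every command sent must have exactly one reply read.
        print(n, batch_size, in_loop, remaining, in_loop + remaining == 2 * n)
```

The total always equals `2 * n`, including when `n` is not a multiple of `batch_size`, so no replies are left unread and no extra read blocks.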


sundb · 2026-03-26T06:41:28Z

@vitahlin please also update the top comment, thx.

vitahlin · 2026-03-26T06:44:42Z

             # Populate memory with interleaving IDMP stream-key pattern of same size

Cool, I've tested the suggested patch locally, and it successfully resolves the issue. The tests are now passing and the execution time is significantly faster than before.

vitahlin · 2026-03-26T06:57:24Z

The CI failure is due to an existing bug in the current unstable branch: https://github.com/redis/redis/actions/runs/23570878820/job/68632915653

This fix follows #14667 and #14886.

Several tests pipelined large numbers of commands on deferring clients without draining replies. That can fill buffers and stall progress. Fix by draining replies every 500 pipelined requests to avoid TCP stalls.

Co-authored-by: oranagra <oran@redislabs.com>

### Issue

The module datatype defrag test sends 20k commands through a deferred client before reading any replies. On slower CI environments this can cause replies to accumulate and fill TCP/socket buffers, leading to flaky `I/O error reading reply` failures.

### Change

Fix by batching deferred writes and reply drains, following the same approach used in #14886.

flaky test Active defrag IDMP streams

b3514f3

augmentcode Bot reviewed Mar 12, 2026


cursor Bot reviewed Mar 12, 2026


Comment thread tests/unit/memefficiency.tcl Outdated

sundb reviewed Mar 13, 2026


vitahlin added 2 commits March 26, 2026 14:12

fx

863646d

Revert "flaky test Active defrag IDMP streams"

9a12be9

This reverts commit b3514f3.

sundb approved these changes Mar 26, 2026


vitahlin changed the title ~~Test flaky test Active defrag IDMP streams~~ Test fix potential TCP deadlock in Active defrag IDMP streams test Mar 26, 2026

sundb merged commit bbc0dcb into redis:unstable Mar 26, 2026
17 of 18 checks passed

vitahlin deleted the idmp-test branch March 26, 2026 07:00

sundb mentioned this pull request Mar 29, 2026

Test tcp deadlock fixes #14946

Merged

sundb added this to Redis 8.6 Backport Mar 30, 2026

github-project-automation Bot moved this to Todo in Redis 8.6 Backport Mar 30, 2026

sundb moved this from Todo to pending in Redis 8.6 Backport Mar 30, 2026

sundb added a commit to dannysheyn/redis that referenced this pull request May 12, 2026

Revert "Fix potential TCP deadlock in Active defrag IDMP streams test (…

41f35f4

…redis#14886)" This reverts commit 6f02f7f.

sundb mentioned this pull request May 12, 2026

Backport CI fix commits 8.6 #15192

Merged

vitahlin mentioned this pull request May 27, 2026

Fix potential TCP deadlock in module datatype defrag test #15274

Merged

sundb moved this from pending to Done in Redis 8.6 Backport Jun 4, 2026

sundb merged 3 commits into redis:unstable from vitahlin:idmp-test
