Skip to content

Test fix potential TCP deadlock in Active defrag IDMP streams test#14886

Merged
sundb merged 3 commits into
redis:unstablefrom
vitahlin:idmp-test
Mar 26, 2026
Merged

Test fix potential TCP deadlock in Active defrag IDMP streams test#14886
sundb merged 3 commits into
redis:unstablefrom
vitahlin:idmp-test

Conversation

@vitahlin

@vitahlin vitahlin commented Mar 12, 2026

Copy link
Copy Markdown
Contributor

Failed daily CI: https://github.com/redis/redis/actions/runs/22980491707/job/66718992772

The IDMP streams defrag test sends all commands (100k) before reading any replies, which can cause TCP deadlock when buffers fill up.

Fix by batching writes and reads (1000 iterations per batch), consistent with the approach already used in the script defrag test above.


Note

Low Risk
Low risk: changes are confined to a unit test and only adjust how replies are drained from a deferring client to avoid hangs in slow/TLS CI runs.

Overview
Improves the Active defrag IDMP streams test in tests/unit/memefficiency.tcl by batching reads while enqueueing XADD/SET commands, preventing TCP/congestion deadlocks when many deferred replies accumulate.

Replaces the single large reply-drain loop with periodic drains plus a final “remaining replies” drain, keeping test assertions and behavior the same while making CI runs less flaky.

Written by Cursor Bugbot for commit 9a12be9. This will update automatically on new commits. Configure here.

@augmentcode

augmentcode Bot commented Mar 12, 2026

Copy link
Copy Markdown
🤖 Augment PR Summary

Summary: Reduces flakiness in the “Active defrag IDMP streams” unit test (notably under TLS) by allowing more time for active defrag to complete.

Changes: Increases the defrag-stop wait budget and improves the timeout diagnostics by tracking and reporting observed fragmentation ratios.

🤖 Was this summary useful? React with 👍 or 👎

@augmentcode augmentcode Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review completed. 2 suggestions posted.

Fix All in Augment

Comment augment review to trigger a new review at any time.

Comment thread tests/unit/memefficiency.tcl Outdated
puts [r memory malloc-stats]
if {$expect_frag != 0} {
fail "defrag didn't stop or failed to achieve expected frag ratio ([s allocator_frag_ratio] > $expect_frag)"
fail "defrag didn't stop or failed to achieve expected frag ratio ($last_frag > $expect_frag) and final_frag=[s allocator_frag_ratio]"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The failure string prints ($last_frag > $expect_frag) even when the timeout is actually due to active_defrag_running staying nonzero, so it can still produce confusing output like 1.06 > 1.1. Since last_running is captured, consider including it (or otherwise avoiding implying the > relation holds) to make failures unambiguous.

Severity: low

Fix This in Augment

🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.

Comment thread tests/unit/memefficiency.tcl Outdated

# wait for the active defrag to stop working
wait_for_defrag_stop 500 100 1.1
wait_for_defrag_stop 1000 100 1.1

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bumping this to wait_for_defrag_stop 1000 100 1.1 raises the worst-case wait to ~100s, so a real defrag stall will hold CI much longer before failing. Worth confirming that longer failure latency is acceptable for this test suite.

Severity: low

Fix This in Augment

🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Comment thread tests/unit/memefficiency.tcl Outdated
Comment thread tests/unit/memefficiency.tcl Outdated

# wait for the active defrag to stop working
wait_for_defrag_stop 500 100 1.1
wait_for_defrag_stop 1000 100 1.1

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's a timing issue, 50s is enough.
One possibility is that all the slabs have reached equilibrium, making it impossible to determine whether there is frag or not.
Can you provide the output of the logs and the output of memory malloc-stats when the test fails.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. I'll run the tests locally to verify if there are any issues.

@sundb

sundb commented Mar 26, 2026

Copy link
Copy Markdown
Collaborator

i reproduced this issue in my local, i think it's similar to #14667
please try the follwing patch

diff --git a/tests/unit/memefficiency.tcl b/tests/unit/memefficiency.tcl
index 123bf3751..5ed615577 100644
--- a/tests/unit/memefficiency.tcl
+++ b/tests/unit/memefficiency.tcl
@@ -602,15 +602,25 @@ run_solo {defrag} {
             # Populate memory with interleaving IDMP stream-key pattern of same size
             set dummy_iid "[string repeat x 400]"
             set rd [redis_deferring_client]
+            set batch_size 10000
             for {set j 0} {$j < $n} {incr j} {
                 set producer_id "producer[expr {$j % 10}]"
                 set iid "$dummy_iid[format "%06d" $j]"
                 $rd xadd idmpstream IDMP $producer_id $iid * field value
                 $rd set k$j $iid
+                
+                if {($j + 1) % $batch_size == 0} {
+                    for {set i 0} {$i < [expr {$batch_size * 2}]} {incr i} {
+                        $rd read
+                    }
+                }
             }
-            for {set j 0} {$j < [expr {$n * 2}]} {incr j} {
-                $rd read ; # Discard replies
+            # Read remaining replies
+            set remaining [expr {($n % $batch_size) * 2}]
+            for {set j 0} {$j < $remaining} {incr j} {
+                $rd read
             }
+            
             after 120 ;# serverCron only updates the info once in 100ms
             if {$::verbose} {
                 puts "used [s allocator_allocated]"

@sundb

sundb commented Mar 26, 2026

Copy link
Copy Markdown
Collaborator

@vitahlin please also update the top comment, thx.

@vitahlin

Copy link
Copy Markdown
Contributor Author
             # Populate memory with interleaving IDMP stream-key pattern of same size

Cool, I've tested the suggested patch locally, and it successfully resolves the issue. The tests are now passing and the execution time is significantly faster than before.

@vitahlin

Copy link
Copy Markdown
Contributor Author

The CI failure is due to an existing bug in the current unstable branch: https://github.com/redis/redis/actions/runs/23570878820/job/68632915653

@vitahlin vitahlin changed the title Test flaky test Active defrag IDMP streams Test fix potential TCP deadlock in Active defrag IDMP streams test Mar 26, 2026
@sundb sundb merged commit bbc0dcb into redis:unstable Mar 26, 2026
17 of 18 checks passed
@vitahlin vitahlin deleted the idmp-test branch March 26, 2026 07:00
@sundb sundb mentioned this pull request Mar 29, 2026
sundb added a commit that referenced this pull request Mar 30, 2026
This fix follows #14667 and #14886

Several tests pipelined large numbers of commands on deferring clients
without draining replies. That can fill buffers and stall progress.

Fix by draining replies every 500 pipelined requests to avoid TCP
stalls.

---------

Co-authored-by: oranagra <oran@redislabs.com>
@sundb sundb moved this from Todo to pending in Redis 8.6 Backport Mar 30, 2026
pierluigilenoci pushed a commit to pierluigilenoci/redis that referenced this pull request Apr 16, 2026
This fix follows redis#14667 and redis#14886

Several tests pipelined large numbers of commands on deferring clients
without draining replies. That can fill buffers and stall progress.

Fix by draining replies every 500 pipelined requests to avoid TCP
stalls.

---------

Co-authored-by: oranagra <oran@redislabs.com>
sundb added a commit to sundb/redis that referenced this pull request May 8, 2026
This fix follows redis#14667 and redis#14886

Several tests pipelined large numbers of commands on deferring clients
without draining replies. That can fill buffers and stall progress.

Fix by draining replies every 500 pipelined requests to avoid TCP
stalls.

---------

Co-authored-by: oranagra <oran@redislabs.com>
sundb pushed a commit to sundb/redis that referenced this pull request May 8, 2026
…14886)

The IDMP streams defrag test sends all commands (100k) before reading
any replies, which can cause TCP deadlock when buffers fill up.

Fix by batching writes and reads (1000 iterations per batch), consistent
with the approach already used in the script defrag test above.
EvanMGates pushed a commit to liftoffio/redis that referenced this pull request May 11, 2026
…14886)

The IDMP streams defrag test sends all commands (100k) before reading
any replies, which can cause TCP deadlock when buffers fill up.

Fix by batching writes and reads (1000 iterations per batch), consistent
with the approach already used in the script defrag test above.
EvanMGates pushed a commit to liftoffio/redis that referenced this pull request May 11, 2026
This fix follows redis#14667 and redis#14886

Several tests pipelined large numbers of commands on deferring clients
without draining replies. That can fill buffers and stall progress.

Fix by draining replies every 500 pipelined requests to avoid TCP
stalls.

---------

Co-authored-by: oranagra <oran@redislabs.com>
sundb added a commit to dannysheyn/redis that referenced this pull request May 12, 2026
This fix follows redis#14667 and redis#14886

Several tests pipelined large numbers of commands on deferring clients
without draining replies. That can fill buffers and stall progress.

Fix by draining replies every 500 pipelined requests to avoid TCP
stalls.

---------

Co-authored-by: oranagra <oran@redislabs.com>
sundb pushed a commit to dannysheyn/redis that referenced this pull request May 12, 2026
…14886)

The IDMP streams defrag test sends all commands (100k) before reading
any replies, which can cause TCP deadlock when buffers fill up.

Fix by batching writes and reads (1000 iterations per batch), consistent
with the approach already used in the script defrag test above.
sundb added a commit to dannysheyn/redis that referenced this pull request May 12, 2026
sundb pushed a commit to sundb/redis that referenced this pull request May 12, 2026
…14886)

The IDMP streams defrag test sends all commands (100k) before reading
any replies, which can cause TCP deadlock when buffers fill up.

Fix by batching writes and reads (1000 iterations per batch), consistent
with the approach already used in the script defrag test above.
sundb added a commit to sundb/redis that referenced this pull request May 12, 2026
This fix follows redis#14667 and redis#14886

Several tests pipelined large numbers of commands on deferring clients
without draining replies. That can fill buffers and stall progress.

Fix by draining replies every 500 pipelined requests to avoid TCP
stalls.

---------

Co-authored-by: oranagra <oran@redislabs.com>
sundb added a commit to sundb/redis that referenced this pull request May 12, 2026
This fix follows redis#14667 and redis#14886

Several tests pipelined large numbers of commands on deferring clients
without draining replies. That can fill buffers and stall progress.

Fix by draining replies every 500 pipelined requests to avoid TCP
stalls.

---------

Co-authored-by: oranagra <oran@redislabs.com>
sundb added a commit to sundb/redis that referenced this pull request May 13, 2026
This fix follows redis#14667 and redis#14886

Several tests pipelined large numbers of commands on deferring clients
without draining replies. That can fill buffers and stall progress.

Fix by draining replies every 500 pipelined requests to avoid TCP
stalls.

---------

Co-authored-by: oranagra <oran@redislabs.com>
sundb added a commit to sundb/redis that referenced this pull request May 13, 2026
This fix follows redis#14667 and redis#14886

Several tests pipelined large numbers of commands on deferring clients
without draining replies. That can fill buffers and stall progress.

Fix by draining replies every 500 pipelined requests to avoid TCP
stalls.

---------

Co-authored-by: oranagra <oran@redislabs.com>
sundb added a commit to sundb/redis that referenced this pull request May 13, 2026
This fix follows redis#14667 and redis#14886

Several tests pipelined large numbers of commands on deferring clients
without draining replies. That can fill buffers and stall progress.

Fix by draining replies every 500 pipelined requests to avoid TCP
stalls.

---------

Co-authored-by: oranagra <oran@redislabs.com>
sundb pushed a commit to sundb/redis that referenced this pull request May 13, 2026
…14886)

The IDMP streams defrag test sends all commands (100k) before reading
any replies, which can cause TCP deadlock when buffers fill up.

Fix by batching writes and reads (1000 iterations per batch), consistent
with the approach already used in the script defrag test above.
sundb added a commit to sundb/redis that referenced this pull request May 13, 2026
This fix follows redis#14667 and redis#14886

Several tests pipelined large numbers of commands on deferring clients
without draining replies. That can fill buffers and stall progress.

Fix by draining replies every 500 pipelined requests to avoid TCP
stalls.

---------

Co-authored-by: oranagra <oran@redislabs.com>
sundb added a commit to sundb/redis that referenced this pull request May 13, 2026
This fix follows redis#14667 and redis#14886

Several tests pipelined large numbers of commands on deferring clients
without draining replies. That can fill buffers and stall progress.

Fix by draining replies every 500 pipelined requests to avoid TCP
stalls.

---------

Co-authored-by: oranagra <oran@redislabs.com>
sundb added a commit to sundb/redis that referenced this pull request May 13, 2026
This fix follows redis#14667 and redis#14886

Several tests pipelined large numbers of commands on deferring clients
without draining replies. That can fill buffers and stall progress.

Fix by draining replies every 500 pipelined requests to avoid TCP
stalls.

---------

Co-authored-by: oranagra <oran@redislabs.com>
sundb added a commit that referenced this pull request May 13, 2026
This fix follows #14667 and #14886

Several tests pipelined large numbers of commands on deferring clients
without draining replies. That can fill buffers and stall progress.

Fix by draining replies every 500 pipelined requests to avoid TCP
stalls.

---------

Co-authored-by: oranagra <oran@redislabs.com>
sundb pushed a commit that referenced this pull request May 13, 2026
The IDMP streams defrag test sends all commands (100k) before reading
any replies, which can cause TCP deadlock when buffers fill up.

Fix by batching writes and reads (1000 iterations per batch), consistent
with the approach already used in the script defrag test above.
sundb added a commit that referenced this pull request May 13, 2026
This fix follows #14667 and #14886

Several tests pipelined large numbers of commands on deferring clients
without draining replies. That can fill buffers and stall progress.

Fix by draining replies every 500 pipelined requests to avoid TCP
stalls.

---------

Co-authored-by: oranagra <oran@redislabs.com>
sundb added a commit to sundb/redis that referenced this pull request May 13, 2026
This fix follows redis#14667 and redis#14886

Several tests pipelined large numbers of commands on deferring clients
without draining replies. That can fill buffers and stall progress.

Fix by draining replies every 500 pipelined requests to avoid TCP
stalls.

---------

Co-authored-by: oranagra <oran@redislabs.com>
sundb added a commit that referenced this pull request May 13, 2026
This fix follows #14667 and #14886

Several tests pipelined large numbers of commands on deferring clients
without draining replies. That can fill buffers and stall progress.

Fix by draining replies every 500 pipelined requests to avoid TCP
stalls.

---------

Co-authored-by: oranagra <oran@redislabs.com>
sundb added a commit that referenced this pull request May 14, 2026
This fix follows #14667 and #14886

Several tests pipelined large numbers of commands on deferring clients
without draining replies. That can fill buffers and stall progress.

Fix by draining replies every 500 pipelined requests to avoid TCP
stalls.

---------

Co-authored-by: oranagra <oran@redislabs.com>
sundb added a commit that referenced this pull request May 14, 2026
This fix follows #14667 and #14886

Several tests pipelined large numbers of commands on deferring clients
without draining replies. That can fill buffers and stall progress.

Fix by draining replies every 500 pipelined requests to avoid TCP
stalls.

---------

Co-authored-by: oranagra <oran@redislabs.com>
sundb pushed a commit that referenced this pull request Jun 2, 2026
### Issue
The module datatype defrag test sends 20k commands through a deferred
client before reading any replies. On slower CI environments this can
cause replies to accumulate and fill TCP/socket buffers, leading to
flaky `I/O error reading reply` failures.

### Change

Fix by batching deferred writes and reply drains, following the same
approach used in #14886.
@sundb sundb moved this from pending to Done in Redis 8.6 Backport Jun 4, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Development

Successfully merging this pull request may close these issues.

2 participants