Fix some daily CI issues #14217

sundb · 2025-07-23T13:13:27Z

1) Fix the timeout of `Active defrag big keys: standalone`
[BUG] Frequent test TIMEOUTs in 7.2, 7.4 and 8.0 #14196
Using a pipe to write commands may cause the write to block if the read buffer becomes full.
2) Fix the failure of the `Main db not affected when fail to diskless load` test
https://github.com/redis/redis/actions/runs/16458252382/job/46525605155
If the master is killed in a slow environment, then once `cluster-node-timeout` (3s in our test) has elapsed, running keyspace commands on the replica returns a CLUSTERDOWN error.

FAILED: caught an error in the test CLUSTERDOWN The cluster is down
CLUSTERDOWN The cluster is down
    while executing
"$replica get $slot0_key"
    ("uplevel" body line 68)
    invoked from within
"uplevel 1 $code"
    (procedure "test" line 6)
    invoked from within
"test "Main db not affected when fail to diskless load" {
    set master [Rn 0]
    set replica [Rn 1]
    set master_id 0
    set replica_id 1

    $r..."
    (file "../tests/17-diskless-load-swapdb.tcl" line 17)
    invoked from within

3) Fix the failure of the `Test shutdown hook` test
https://github.com/sundb/redis/actions/runs/16475248579/job/46574965184
ASAN can intercept signals, so I guess that when we send SIGCONT after SIGTERM to kill the server, it might start doing some work again, causing the process to close very slowly.

Please note that the crash occurred under ASAN.

611065:M 24 Jul 2025 17:11:39.142 * User requested shutdown...
611065:M 24 Jul 2025 17:11:39.142 # <testhook> module-event-shutdown
611065:M 24 Jul 2025 17:11:39.142 * Removing the pid file.
611065:M 24 Jul 2025 17:11:39.142 * Removing the unix socket file.
611065:M 24 Jul 2025 17:11:39.142 # Redis is now ready to exit, bye bye...


=== REDIS BUG REPORT START: Cut & paste starting from here ===
611065:M 24 Jul 2025 17:11:52.474 # Redis 255.255.255 crashed by signal: 11, si_code: 0
611065:M 24 Jul 2025 17:11:52.474 # Accessing address: 0x3e800096438
611065:M 24 Jul 2025 17:11:52.474 # Killed by PID: 615480, UID: 1000
611065:M 24 Jul 2025 17:11:52.474 # Crashed running the instruction at: 0x7f66c1f0e7db

------ STACK TRACE ------
EIP:
/lib/x86_64-linux-gnu/libc.so.6(__sched_yield+0xb)[0x7f66c1f0e7db]

sundb · 2025-07-23T13:13:37Z

Full CI run: https://github.com/sundb/redis/actions/runs/16470296185

snyk-io · 2025-07-23T13:13:58Z

🎉 Snyk checks have passed. No issues have been found so far.

✅ security/snyk check is complete. No issues have been found. (View Details)

✅ license/snyk check is complete. No issues have been found. (View Details)

kaplanben · 2025-07-23T13:49:16Z

Checkmarx One – Scan Summary & Details – 1e69da7b-8f0d-4a9d-83d4-ce45330431a5

New Issues (14)

Checkmarx found the following issues in this Pull Request

Issue	Source File / Package	Checkmarx Insight
Buffer_Improper_Index_Access	/src/server.c: 1176	details The array index ClientsPeakMemOutput at /src/server.c in line 1176 is used to reference an index of a cell of the array ClientsPeakMemOutput at /s... `ID: GgeazR3MvmKzsrJAVgSUuiBGeMU%3D` Attack Vector
Buffer_Improper_Index_Access	/src/server.c: 1175	details The array index ClientsPeakMemInput at /src/server.c in line 1175 is used to reference an index of a cell of the array ClientsPeakMemInput at /src... `ID: udAjGVE9braRBX9rGBu24bIk31A%3D` Attack Vector
Buffer_Improper_Index_Access	/src/server.c: 7315	details The array index argv at /src/server.c in line 7315 is used to reference an index of a cell of the array s at /src/sds.h in line 75 in a way that ... `ID: Hy4boAzcPUf16Hnvb5oV11ktT64%3D` Attack Vector
Buffer_Improper_Index_Access	/src/server.c: 7315	details The array index argv at /src/server.c in line 7315 is used to reference an index of a cell of the array s at /src/sds.h in line 69 in a way that ... `ID: esJS1r1KprXVsVpubfWKIpmvlMA%3D` Attack Vector
Buffer_Improper_Index_Access	/src/server.c: 7315	details The array index argv at /src/server.c in line 7315 is used to reference an index of a cell of the array s at /src/sds.c in line 216 in a way that... `ID: 8%2Brnv7hEl03TWpL4kSiDgRhFLEg%3D` Attack Vector
Buffer_Improper_Index_Access	/src/server.c: 162	details The array index syslogLevelMap at /src/server.c in line 162 is used to reference an index of a cell of the array syslogLevelMap at /src/server.c ... `ID: 5IWtqSlxOwlydz4Iyi%2BMhJbqDeY%3D` Attack Vector
Buffer_Improper_Index_Access	/src/server.c: 1011	details The array index stat_clients_type_memory at /src/server.c in line 1011 is used to reference an index of a cell of the array stat_clients_type... `ID: OwtzfKoGd5j%2Fc9qaD6qR20CnpsI%3D` Attack Vector
Buffer_Overflow_Wrong_Buffer_Size	/src/redis-cli.c: 3676	details The buffer buf created in /src/redis-cli.c at line 3676 is written to a buffer in /deps/hiredis/sds.c at line 234 by hdrlen, but an error in cal... `ID: MqnP7QsVuVyOykiPzdkQCxsw7H4%3D` Attack Vector
Buffer_Overflow_Wrong_Buffer_Size	/deps/linenoise/linenoise.c: 1200	details The buffer buf created in /deps/linenoise/linenoise.c at line 1200 is written to a buffer in /deps/hiredis/sds.c at line 97 by sh, but an error i... `ID: L1dS1AWIau3DXfxVZMTussbT9bE%3D` Attack Vector
Buffer_Overflow_Wrong_Buffer_Size	/src/redis-cli.c: 3676	details The buffer buf created in /src/redis-cli.c at line 3676 is written to a buffer in /deps/hiredis/sds.c at line 234 by newsh, but an error in calc... `ID: MB1M%2FHPv8FdnbXxmZA9TAV1%2BzTs%3D` Attack Vector
Buffer_Overflow_Wrong_Buffer_Size	/src/redis-cli.c: 10588	details The buffer argv created in /src/redis-cli.c at line 10588 is written to a buffer in /deps/hiredis/sds.c at line 97 by sh, but an error in calcul... `ID: vC%2FFYG%2FXR4U4ciJLaAPiJS9TD80%3D` Attack Vector
Buffer_Overflow_Wrong_Buffer_Size	/deps/linenoise/linenoise.c: 1166	details The buffer fgetc created in /deps/linenoise/linenoise.c at line 1166 is written to a buffer in /deps/hiredis/sds.c at line 97 by sh, but an error... `ID: GntXTIkoxcin9A6wr5%2FfrOjRD7o%3D` Attack Vector
Divide_By_Zero	/modules/vector-sets/fastjson_test.c: 121	details The application performs an illegal operation in generate_random_string, in /modules/vector-sets/fastjson_test.c. In line 121, the program at... `ID: qiowoZ%2FDUFf8wA3ZCvKY8M0GHks%3D` Attack Vector
Divide_By_Zero	/src/redis-cli.c: 6037	details The application performs an illegal operation in clusterManagerNodeMasterRandom, in /src/redis-cli.c. In line 6050, the program attempts to divi... `ID: Ez8UONVHfwHV2ShayzJB8j%2B6jPI%3D` Attack Vector

Fixed Issues (4)

Great job! The following issues were fixed in this Pull Request

Severity	Issue	Source File / Package
	~~Buffer_Overflow_Wrong_Buffer_Size~~	/src/redis-cli.c: 3677
	~~Buffer_Overflow_Wrong_Buffer_Size~~	/src/redis-cli.c: 3677
	~~Buffer_Overflow_Wrong_Buffer_Size~~	/src/redis-cli.c: 3677
	~~Buffer_Overflow_Wrong_Buffer_Size~~	/src/redis-cli.c: 3677

Copilot

Pull Request Overview

This PR addresses timeout issues in the CI system by fixing two distinct problems: preventing write buffer blocking in memory efficiency tests and handling cluster down states in diskless load tests.

Modifies write operations to process replies in batches to prevent buffer overflow blocking
Adds error handling for CLUSTERDOWN errors when the cluster state is unstable
Introduces graceful handling of cluster timing issues during master failure scenarios
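
The batched reply processing mentioned above can be illustrated with a minimal sketch. It is written in Python rather than the test suite's Tcl so that it stays self-contained, and it is not the code from this PR: `FakeConnection`, `pipeline_in_batches`, and the batch size are hypothetical names and values chosen for illustration. The point is only that draining replies every `batch_size` commands bounds how many unread replies can pile up, which is what keeps a pipelined writer from blocking.

```python
from collections import deque


class FakeConnection:
    """Hypothetical stand-in for a pipelining client: one queued reply per command."""

    def __init__(self):
        self.pending = deque()   # replies written by the "server" but not yet read
        self.max_pending = 0     # high-water mark of unread replies

    def send(self, command):
        self.pending.append("OK:" + command)
        self.max_pending = max(self.max_pending, len(self.pending))

    def read_reply(self):
        return self.pending.popleft()


def pipeline_in_batches(conn, commands, batch_size):
    """Send commands, reading replies back after every batch_size sends."""
    replies = []
    outstanding = 0
    for command in commands:
        conn.send(command)
        outstanding += 1
        if outstanding == batch_size:
            # Drain now so unread replies never exceed batch_size.
            for _ in range(outstanding):
                replies.append(conn.read_reply())
            outstanding = 0
    # Drain whatever is left from the final, partial batch.
    for _ in range(outstanding):
        replies.append(conn.read_reply())
    return replies


conn = FakeConnection()
replies = pipeline_in_batches(conn, [f"SET k{i} v" for i in range(25)], batch_size=10)
```

With an unbatched loop the high-water mark would equal the total number of commands; here it is capped at the batch size.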

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File	Description
tests/unit/memefficiency.tcl	Implements batched reply processing to prevent write buffer blocking during large-scale operations
tests/cluster/tests/17-diskless-load-swapdb.tcl	Adds error handling for CLUSTERDOWN states during replica key validation
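
The CLUSTERDOWN handling described for the diskless-load test can be sketched as follows. This is an assumption about the general shape of such handling, not the change made in `17-diskless-load-swapdb.tcl`: `ReplyError` and `read_tolerating_clusterdown` are hypothetical names, and the sketch is in Python rather than Tcl. It tolerates only the CLUSTERDOWN error and lets every other error propagate.

```python
class ReplyError(Exception):
    """Hypothetical error type standing in for a Redis error reply."""


def read_tolerating_clusterdown(fetch):
    """Run fetch(); return (True, value), or (False, None) if the cluster is down.

    Any error other than CLUSTERDOWN is re-raised, so real failures still surface.
    """
    try:
        return True, fetch()
    except ReplyError as err:
        if str(err).startswith("CLUSTERDOWN"):
            return False, None
        raise


def cluster_is_down():
    raise ReplyError("CLUSTERDOWN The cluster is down")


ok_result = read_tolerating_clusterdown(lambda: "value-of-slot0-key")
down_result = read_tolerating_clusterdown(cluster_is_down)
```

A caller can then skip the value comparison when the second form is returned, instead of failing the whole test.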

Comments suppressed due to low confidence (1)

tests/unit/memefficiency.tcl:337

The variable name 'count' is used in multiple scopes within the same test. Consider using more descriptive names like 'hash_count', 'string_count', and 'del_count' to distinguish between the different counting contexts.

                set count 0

sundb · 2025-07-24T12:55:08Z

tests/unit/moduleapi/hooks.tcl

+                ! [is_alive [srv pid]]
+            } else {
+                fail "Replica server process didn't terminate"
+            }


@oranagra please take a look at this fix. A better way would be to increase the max wait for ASAN, just as we do for Valgrind.
However, I don't want to add a parameter for this minor fix, so I can only shut down the replica manually in advance.

redis/tests/support/server.tcl

Lines 100 to 105 in ecd5e63

catch {exec kill -SIGCONT $pid}
if {$::valgrind} {
    set max_wait 120000
} else {
    set max_wait 10000
}

i'm not sure i understand what makes the termination so slow specifically for this test? does it save a large RDB?

Even without a large RDB, this slowness occurs with ASAN.
I reproduced it locally using taskset -c 0: shutdown occasionally took about 10 seconds before the process finally exited.
However, I realized that I only allowed a 5-second threshold for the manual stop, whereas kill_server allows 10 seconds. Yet stopping it manually works, and I still need to confirm why.
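
The wait budget under discussion amounts to polling for process exit a bounded number of times. A minimal sketch of that pattern, assuming a tries-times-delay budget: `wait_for_condition` here is a hypothetical Python helper written for illustration, not the test suite's own Tcl procedure, and the fake `is_alive` function only simulates a process that exits on the third poll.

```python
import time


def wait_for_condition(max_tries, delay_s, condition, sleep=time.sleep):
    """Poll condition() up to max_tries times, sleeping delay_s between polls.

    Returns True as soon as the condition holds, False if the budget runs out.
    """
    for _ in range(max_tries):
        if condition():
            return True
        sleep(delay_s)
    return False


# Simulated process: reported alive on the first two polls, gone on the third.
alive_states = iter([True, True, False])


def is_alive():
    return next(alive_states)


terminated = wait_for_condition(max_tries=50, delay_s=0, condition=lambda: not is_alive())
never_exits = wait_for_condition(max_tries=5, delay_s=0, condition=lambda: False)
```

The total budget is `max_tries * delay_s`, so a 5-second threshold and a 10-second threshold differ only in those two arguments.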

sundb · 2025-07-25T06:22:29Z

tests/support/server.tcl

+    # Send SIGCONT before SIGTERM, otherwise shutdown may be slow with ASAN.
    catch {exec kill -SIGCONT $pid}
+    catch {exec kill $pid}


@oranagra continuing from #14217 (comment):
ASAN intercepts signals, so I guess that when we send SIGCONT after SIGTERM, the server might start doing some work again, causing the process to close very slowly.
I also saw what you said in #8552 (comment).
I think changing the order of the signals should have no side effects.

i don't think i meant that the order doesn't matter. i think i meant that sending SIGCONT first makes more sense.
but if she reproduced it and it was working then i'm fine.
i don't understand the ASAN issue, but this order makes more sense, and if it fixes the problem, then great.
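
The signal order agreed on above (SIGCONT first, then SIGTERM) can be shown with a small POSIX-only sketch in Python. It only demonstrates the ordering on an ordinary stopped child process. It does not reproduce the ASAN slowdown, which remains a guess in this thread. `resume_then_terminate` and the sleeping child are hypothetical and exist only for illustration.

```python
import signal
import subprocess
import sys


def resume_then_terminate(proc, timeout=10):
    """Send SIGCONT before SIGTERM so a stopped process is resumed first."""
    proc.send_signal(signal.SIGCONT)
    proc.send_signal(signal.SIGTERM)
    return proc.wait(timeout=timeout)


# Start a child that would otherwise sleep for a minute, then stop it,
# mimicking a server that was paused earlier in a test.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
child.send_signal(signal.SIGSTOP)

exit_code = resume_then_terminate(child)
```

On POSIX systems `Popen.wait` reports a child killed by a signal as the negative signal number, so `exit_code` equals `-signal.SIGTERM` here.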

sundb · 2025-07-28T01:45:43Z

Full CI run: https://github.com/redis/redis/actions/runs/16551840626/job/46807823448
Almost green, except for a known issue from rdbchannel; let's deal with it in another PR.

Fix daily CI timeout issues

8ce16e3

sundb force-pushed the stablize_ci branch from 3379e4e to 8ce16e3 Compare July 23, 2025 13:14

sundb requested review from Copilot and oranagra July 23, 2025 14:19

Copilot AI reviewed Jul 23, 2025

Update comment

5587292

oranagra approved these changes Jul 24, 2025

Fix Test shutdown hook failure

0b91c28

oranagra approved these changes Jul 24, 2025

sundb commented Jul 24, 2025

sundb mentioned this pull request Jul 24, 2025

Fix timing issue for correct replication disconnection time counters behavior test #14221

Merged

Send SIGCONT signal before SIGTERM signal to close server quickly with ASAN

70f6c39

sundb commented Jul 25, 2025

sundb added 4 commits July 25, 2025 14:40

Modify the using of tcl catch

9a710d8

Merge remote-tracking branch 'origin/unstable' into stablize_ci

486ac07

Fix timeout issue for `Test replicaof command while streaming repl buffer into the db` test

e6f8bd8

Revert "Fix timeout issue for `Test replicaof command while streaming repl buffer into the db` test"

53b2371

This reverts commit e6f8bd8. still fail

sundb changed the title ~~Fix daily CI timeout issues~~ Fix some daily CI issues Jul 28, 2025

sundb merged commit fe3f0aa into redis:unstable Jul 28, 2025
19 checks passed

sundb mentioned this pull request Jul 28, 2025

[BUG] Frequent test TIMEOUTs in 7.2, 7.4 and 8.0 #14196

Closed

sundb mentioned this pull request Jul 30, 2025

Fix timeout issues in memefficiency.tcl #14231

Merged

sundb added this to Redis 7.2 Backport and Redis 8.0 Backport Jul 30, 2025

github-project-automation bot moved this to Todo in Redis 7.2 Backport Jul 30, 2025

github-project-automation bot moved this to Todo in Redis 8.0 Backport Jul 30, 2025

sundb added a commit that referenced this pull request Jul 30, 2025

Fix timeout issues in memefficiency.tcl (#14231)

333f679

Follow #14217 Fix #14196 Fix two other issues that might cause timeouts due to command writing via pipe.

sundb deleted the stablize_ci branch August 7, 2025 02:07

sundb added this to Redis 7.4 Backport Aug 13, 2025

github-project-automation bot moved this to Todo in Redis 7.4 Backport Aug 13, 2025

sundb mentioned this pull request Aug 14, 2025

Fix missing expires_cursor check when existing defrag cycle #14270

Merged

YaacovHazan pushed a commit to YaacovHazan/redis that referenced this pull request Sep 29, 2025

Fix timeout issues in memefficiency.tcl (redis#14231)

1cfac5a

Follow redis#14217 Fix redis#14196 Fix two other issues that might cause timeouts due to command writing via pipe.

YaacovHazan pushed a commit to YaacovHazan/redis that referenced this pull request Sep 30, 2025

Fix timeout issues in memefficiency.tcl (redis#14231)

5e6a000

Follow redis#14217 Fix redis#14196 Fix two other issues that might cause timeouts due to command writing via pipe.

oranagra mentioned this pull request Jan 6, 2026

Test tcp deadlock fixes #14667

Merged
