Skip to content

Conversation

@sundb
Copy link
Collaborator

@sundb sundb commented Jul 23, 2025

  1. Fix the timeout of Active defrag big keys: standalone
    [BUG] Frequent test TIMEOUTs in 7.2, 7.4 and 8.0 #14196
    Using a pipe to write commands may cause the write to block if the read buffer becomes full.

  2. Fix the failure of Main db not affected when fail to diskless load test
    https://github.com/redis/redis/actions/runs/16458252382/job/46525605155
    If the master was killed in slow environment, then after cluster-node-timeout (3s in our test), running keyspace commands on the replica will get a CLUSTERDOWN error.

FAILED: caught an error in the test CLUSTERDOWN The cluster is down
CLUSTERDOWN The cluster is down
    while executing
"$replica get $slot0_key"
    ("uplevel" body line 68)
    invoked from within
"uplevel 1 $code"
    (procedure "test" line 6)
    invoked from within
"test "Main db not affected when fail to diskless load" {
    set master [Rn 0]
    set replica [Rn 1]
    set master_id 0
    set replica_id 1

    $r..."
    (file "../tests/17-diskless-load-swapdb.tcl" line 17)
    invoked from within
  1. Fix failure of Test shutdown hook test
    https://github.com/sundb/redis/actions/runs/16475248579/job/46574965184
    ASAN can intercept a signal, so I guess that when we send SIGCONT after SIGTERM to kill the server, it might start doing some work again, causing the process to close very slowly.

Please note that the crash occurred in ASAN

611065:M 24 Jul 2025 17:11:39.142 * User requested shutdown...
611065:M 24 Jul 2025 17:11:39.142 # <testhook> module-event-shutdown
611065:M 24 Jul 2025 17:11:39.142 * Removing the pid file.
611065:M 24 Jul 2025 17:11:39.142 * Removing the unix socket file.
611065:M 24 Jul 2025 17:11:39.142 # Redis is now ready to exit, bye bye...


=== REDIS BUG REPORT START: Cut & paste starting from here ===
611065:M 24 Jul 2025 17:11:52.474 # Redis 255.255.255 crashed by signal: 11, si_code: 0
611065:M 24 Jul 2025 17:11:52.474 # Accessing address: 0x3e800096438
611065:M 24 Jul 2025 17:11:52.474 # Killed by PID: 615480, UID: 1000
611065:M 24 Jul 2025 17:11:52.474 # Crashed running the instruction at: 0x7f66c1f0e7db

------ STACK TRACE ------
EIP:
/lib/x86_64-linux-gnu/libc.so.6(__sched_yield+0xb)[0x7f66c1f0e7db]

@sundb
Copy link
Collaborator Author

sundb commented Jul 23, 2025

@snyk-io
Copy link

snyk-io bot commented Jul 23, 2025

🎉 Snyk checks have passed. No issues have been found so far.

security/snyk check is complete. No issues have been found. (View Details)

license/snyk check is complete. No issues have been found. (View Details)

@kaplanben
Copy link

kaplanben commented Jul 23, 2025

Logo
Checkmarx One – Scan Summary & Details1e69da7b-8f0d-4a9d-83d4-ce45330431a5

New Issues (14)

Checkmarx found the following issues in this Pull Request

Severity Issue Source File / Package Checkmarx Insight
CRITICAL Buffer_Improper_Index_Access /src/server.c: 1176
detailsThe array index ClientsPeakMemOutput at /src/server.c in line 1176 is used to reference an index of a cell of the array ClientsPeakMemOutput at /s...
ID: GgeazR3MvmKzsrJAVgSUuiBGeMU%3D
Attack Vector
CRITICAL Buffer_Improper_Index_Access /src/server.c: 1175
detailsThe array index ClientsPeakMemInput at /src/server.c in line 1175 is used to reference an index of a cell of the array ClientsPeakMemInput at /src...
ID: udAjGVE9braRBX9rGBu24bIk31A%3D
Attack Vector
CRITICAL Buffer_Improper_Index_Access /src/server.c: 7315
detailsThe array index argv at /src/server.c in line 7315 is used to reference an index of a cell of the array s at /src/sds.h in line 75 in a way that ...
ID: Hy4boAzcPUf16Hnvb5oV11ktT64%3D
Attack Vector
CRITICAL Buffer_Improper_Index_Access /src/server.c: 7315
detailsThe array index argv at /src/server.c in line 7315 is used to reference an index of a cell of the array s at /src/sds.h in line 69 in a way that ...
ID: esJS1r1KprXVsVpubfWKIpmvlMA%3D
Attack Vector
CRITICAL Buffer_Improper_Index_Access /src/server.c: 7315
detailsThe array index argv at /src/server.c in line 7315 is used to reference an index of a cell of the array s at /src/sds.c in line 216 in a way that...
ID: 8%2Brnv7hEl03TWpL4kSiDgRhFLEg%3D
Attack Vector
CRITICAL Buffer_Improper_Index_Access /src/server.c: 162
detailsThe array index syslogLevelMap at /src/server.c in line 162 is used to reference an index of a cell of the array syslogLevelMap at /src/server.c ...
ID: 5IWtqSlxOwlydz4Iyi%2BMhJbqDeY%3D
Attack Vector
CRITICAL Buffer_Improper_Index_Access /src/server.c: 1011
detailsThe array index stat_clients_type_memory at /src/server.c in line 1011 is used to reference an index of a cell of the array stat_clients_type...
ID: OwtzfKoGd5j%2Fc9qaD6qR20CnpsI%3D
Attack Vector
CRITICAL Buffer_Overflow_Wrong_Buffer_Size /src/redis-cli.c: 3676
detailsThe buffer buf created in /src/redis-cli.c at line 3676 is written to a buffer in /deps/hiredis/sds.c at line 234 by hdrlen, but an error in cal...
ID: MqnP7QsVuVyOykiPzdkQCxsw7H4%3D
Attack Vector
CRITICAL Buffer_Overflow_Wrong_Buffer_Size /deps/linenoise/linenoise.c: 1200
detailsThe buffer buf created in /deps/linenoise/linenoise.c at line 1200 is written to a buffer in /deps/hiredis/sds.c at line 97 by sh, but an error i...
ID: L1dS1AWIau3DXfxVZMTussbT9bE%3D
Attack Vector
CRITICAL Buffer_Overflow_Wrong_Buffer_Size /src/redis-cli.c: 3676
detailsThe buffer buf created in /src/redis-cli.c at line 3676 is written to a buffer in /deps/hiredis/sds.c at line 234 by newsh, but an error in calc...
ID: MB1M%2FHPv8FdnbXxmZA9TAV1%2BzTs%3D
Attack Vector
CRITICAL Buffer_Overflow_Wrong_Buffer_Size /src/redis-cli.c: 10588
detailsThe buffer argv created in /src/redis-cli.c at line 10588 is written to a buffer in /deps/hiredis/sds.c at line 97 by sh, but an error in calcul...
ID: vC%2FFYG%2FXR4U4ciJLaAPiJS9TD80%3D
Attack Vector
CRITICAL Buffer_Overflow_Wrong_Buffer_Size /deps/linenoise/linenoise.c: 1166
detailsThe buffer fgetc created in /deps/linenoise/linenoise.c at line 1166 is written to a buffer in /deps/hiredis/sds.c at line 97 by sh, but an error...
ID: GntXTIkoxcin9A6wr5%2FfrOjRD7o%3D
Attack Vector
MEDIUM Divide_By_Zero /modules/vector-sets/fastjson_test.c: 121
detailsThe application performs an illegal operation in generate_random_string, in /modules/vector-sets/fastjson_test.c. In line 121, the program at...
ID: qiowoZ%2FDUFf8wA3ZCvKY8M0GHks%3D
Attack Vector
MEDIUM Divide_By_Zero /src/redis-cli.c: 6037
detailsThe application performs an illegal operation in clusterManagerNodeMasterRandom, in /src/redis-cli.c. In line 6050, the program attempts to divi...
ID: Ez8UONVHfwHV2ShayzJB8j%2B6jPI%3D
Attack Vector
Fixed Issues (4)

Great job! The following issues were fixed in this Pull Request

Severity Issue Source File / Package
CRITICAL Buffer_Overflow_Wrong_Buffer_Size /src/redis-cli.c: 3677
CRITICAL Buffer_Overflow_Wrong_Buffer_Size /src/redis-cli.c: 3677
CRITICAL Buffer_Overflow_Wrong_Buffer_Size /src/redis-cli.c: 3677
CRITICAL Buffer_Overflow_Wrong_Buffer_Size /src/redis-cli.c: 3677

@sundb sundb requested review from Copilot and oranagra July 23, 2025 14:19
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR addresses timeout issues in the CI system by fixing two distinct problems: preventing write buffer blocking in memory efficiency tests and handling cluster down states in diskless load tests.

  • Modifies write operations to process replies in batches to prevent buffer overflow blocking
  • Adds error handling for CLUSTERDOWN errors when the cluster state is unstable
  • Introduces graceful handling of cluster timing issues during master failure scenarios

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
tests/unit/memefficiency.tcl Implements batched reply processing to prevent write buffer blocking during large-scale operations
tests/cluster/tests/17-diskless-load-swapdb.tcl Adds error handling for CLUSTERDOWN states during replica key validation
Comments suppressed due to low confidence (1)

tests/unit/memefficiency.tcl:337

  • The variable name 'count' is used in multiple scopes within the same test. Consider using more descriptive names like 'hash_count', 'string_count', and 'del_count' to distinguish between the different counting contexts.
                set count 0

! [is_alive [srv pid]]
} else {
fail "Replica server process didn't terminate"
}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@oranagra please take a look this fix, a better way is to increase the wait num for ASAN, just like Valgrind.
However, I don't want to add a parameter to this minor fix, so I can only manually turn off replia in advance.

catch {exec kill -SIGCONT $pid}
if {$::valgrind} {
set max_wait 120000
} else {
set max_wait 10000
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm not sure i understand what makes the termination so slow specifically for this test? does it save a large RDB?

Copy link
Collaborator Author

@sundb sundb Jul 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

without a large RDB, this slowness occurred with ASAN.
I used taskset -c 0 locally to reproduce it, and it occasionally took 10 secs and can be closed.
But I realized that I only gave a 5 seconds threshold for manually stopping, but it was 10 seconds for kill_server. However, manually closing it works. I need to confirm why.

Comment on lines +99 to +101
# Send SIGCONT before SIGTERM, otherwise shutdown may be slow with ASAN.
catch {exec kill -SIGCONT $pid}
catch {exec kill $pid}
Copy link
Collaborator Author

@sundb sundb Jul 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@oranagra continue #14217 (comment)
ASAN intercept signal, so I guess that when we send SIGCONT after SIGTERM, it might start doing some work again, causing the process to close very slowly.
and I saw you said in #8552 (comment)
I think there should be no side effects in changing the order of signal.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't think i meant that the order doesn't matter. i think i meant that sending SIGCONT first makes more sense.
but if she reproduced it and it was working then i'm fine.
i don't understand the ASAN issue, but this order makes more sense, and if it fixes the problem, then great.

@sundb
Copy link
Collaborator Author

sundb commented Jul 28, 2025

fully CI: https://github.com/redis/redis/actions/runs/16551840626/job/46807823448
almost green, except for a known issue from rdbchannel, let's deal with it in another PR.

@sundb sundb changed the title Fix daily CI timeout issues Fix some daily CI issues Jul 28, 2025
@sundb sundb merged commit fe3f0aa into redis:unstable Jul 28, 2025
19 checks passed
miles-byted pushed a commit to miles-byted/redis that referenced this pull request Jul 29, 2025
1) Fix the timeout of `Active defrag big keys: standalone`
Using a pipe to write commands may cause the write to block if the read
buffer becomes full.

2) Fix the failure of `Main db not affected when fail to diskless load`
test
If the master was killed in slow environment, then after
`cluster-node-timeout` (3s in our test), running keyspace commands on
the replica will get a CLUSTERDOWN error.

3) Fix the failure of `Test shutdown hook` test
ASAN can intercept a signal, so I guess that when we send SIGCONT after
SIGTERM to kill the server, it might start doing some work again,
causing the process to close very slowly.
miles-byted pushed a commit to miles-byted/redis that referenced this pull request Jul 29, 2025
1) Fix the timeout of `Active defrag big keys: standalone`
Using a pipe to write commands may cause the write to block if the read
buffer becomes full.

2) Fix the failure of `Main db not affected when fail to diskless load`
test
If the master was killed in slow environment, then after
`cluster-node-timeout` (3s in our test), running keyspace commands on
the replica will get a CLUSTERDOWN error.

3) Fix the failure of `Test shutdown hook` test
ASAN can intercept a signal, so I guess that when we send SIGCONT after
SIGTERM to kill the server, it might start doing some work again,
causing the process to close very slowly.
sundb added a commit that referenced this pull request Jul 30, 2025
Follow #14217
Fix #14196

Fix two other issues that might cause timeouts due to command writing
via pipe.
@sundb sundb deleted the stablize_ci branch August 7, 2025 02:07
YaacovHazan pushed a commit to YaacovHazan/redis that referenced this pull request Sep 29, 2025
1) Fix the timeout of `Active defrag big keys: standalone`
Using a pipe to write commands may cause the write to block if the read
buffer becomes full.

2) Fix the failure of `Main db not affected when fail to diskless load`
test
If the master was killed in slow environment, then after
`cluster-node-timeout` (3s in our test), running keyspace commands on
the replica will get a CLUSTERDOWN error.

3) Fix the failure of `Test shutdown hook` test
ASAN can intercept a signal, so I guess that when we send SIGCONT after
SIGTERM to kill the server, it might start doing some work again,
causing the process to close very slowly.
YaacovHazan pushed a commit to YaacovHazan/redis that referenced this pull request Sep 29, 2025
Follow redis#14217
Fix redis#14196

Fix two other issues that might cause timeouts due to command writing
via pipe.
YaacovHazan pushed a commit to YaacovHazan/redis that referenced this pull request Sep 30, 2025
1) Fix the timeout of `Active defrag big keys: standalone`
Using a pipe to write commands may cause the write to block if the read
buffer becomes full.

2) Fix the failure of `Main db not affected when fail to diskless load`
test
If the master was killed in slow environment, then after
`cluster-node-timeout` (3s in our test), running keyspace commands on
the replica will get a CLUSTERDOWN error.

3) Fix the failure of `Test shutdown hook` test
ASAN can intercept a signal, so I guess that when we send SIGCONT after
SIGTERM to kill the server, it might start doing some work again,
causing the process to close very slowly.
YaacovHazan pushed a commit to YaacovHazan/redis that referenced this pull request Sep 30, 2025
Follow redis#14217
Fix redis#14196

Fix two other issues that might cause timeouts due to command writing
via pipe.
YaacovHazan pushed a commit to YaacovHazan/redis that referenced this pull request Sep 30, 2025
1) Fix the timeout of `Active defrag big keys: standalone`
Using a pipe to write commands may cause the write to block if the read
buffer becomes full.

2) Fix the failure of `Main db not affected when fail to diskless load`
test
If the master was killed in slow environment, then after
`cluster-node-timeout` (3s in our test), running keyspace commands on
the replica will get a CLUSTERDOWN error.

3) Fix the failure of `Test shutdown hook` test
ASAN can intercept a signal, so I guess that when we send SIGCONT after
SIGTERM to kill the server, it might start doing some work again,
causing the process to close very slowly.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Development

Successfully merging this pull request may close these issues.

3 participants