-
Notifications
You must be signed in to change notification settings - Fork 24.4k
Fix some daily CI issues #14217
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix some daily CI issues #14217
Conversation
🎉 Snyk checks have passed. No issues have been found so far.✅ security/snyk check is complete. No issues have been found. (View Details) ✅ license/snyk check is complete. No issues have been found. (View Details) |
|
New Issues (14)Checkmarx found the following issues in this Pull Request
Fixed Issues (4)Great job! The following issues were fixed in this Pull Request
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR addresses timeout issues in the CI system by fixing two distinct problems: preventing write buffer blocking in memory efficiency tests and handling cluster down states in diskless load tests.
- Modifies write operations to process replies in batches to prevent buffer overflow blocking
- Adds error handling for CLUSTERDOWN errors when the cluster state is unstable
- Introduces graceful handling of cluster timing issues during master failure scenarios
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| tests/unit/memefficiency.tcl | Implements batched reply processing to prevent write buffer blocking during large-scale operations |
| tests/cluster/tests/17-diskless-load-swapdb.tcl | Adds error handling for CLUSTERDOWN states during replica key validation |
Comments suppressed due to low confidence (1)
tests/unit/memefficiency.tcl:337
- The variable name 'count' is used in multiple scopes within the same test. Consider using more descriptive names like 'hash_count', 'string_count', and 'del_count' to distinguish between the different counting contexts.
set count 0
tests/unit/moduleapi/hooks.tcl
Outdated
| ! [is_alive [srv pid]] | ||
| } else { | ||
| fail "Replica server process didn't terminate" | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@oranagra please take a look this fix, a better way is to increase the wait num for ASAN, just like Valgrind.
However, I don't want to add a parameter to this minor fix, so I can only manually turn off replia in advance.
redis/tests/support/server.tcl
Lines 100 to 105 in ecd5e63
| catch {exec kill -SIGCONT $pid} | |
| if {$::valgrind} { | |
| set max_wait 120000 | |
| } else { | |
| set max_wait 10000 | |
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'm not sure i understand what makes the termination so slow specifically for this test? does it save a large RDB?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
without a large RDB, this slowness occurred with ASAN.
I used taskset -c 0 locally to reproduce it, and it occasionally took 10 secs and can be closed.
But I realized that I only gave a 5 seconds threshold for manually stopping, but it was 10 seconds for kill_server. However, manually closing it works. I need to confirm why.
| # Send SIGCONT before SIGTERM, otherwise shutdown may be slow with ASAN. | ||
| catch {exec kill -SIGCONT $pid} | ||
| catch {exec kill $pid} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@oranagra continue #14217 (comment)
ASAN intercept signal, so I guess that when we send SIGCONT after SIGTERM, it might start doing some work again, causing the process to close very slowly.
and I saw you said in #8552 (comment)
I think there should be no side effects in changing the order of signal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i don't think i meant that the order doesn't matter. i think i meant that sending SIGCONT first makes more sense.
but if she reproduced it and it was working then i'm fine.
i don't understand the ASAN issue, but this order makes more sense, and if it fixes the problem, then great.
…ffer into the db` test
… repl buffer into the db` test" This reverts commit e6f8bd8. still fail
|
fully CI: https://github.com/redis/redis/actions/runs/16551840626/job/46807823448 |
1) Fix the timeout of `Active defrag big keys: standalone` Using a pipe to write commands may cause the write to block if the read buffer becomes full. 2) Fix the failure of `Main db not affected when fail to diskless load` test If the master was killed in slow environment, then after `cluster-node-timeout` (3s in our test), running keyspace commands on the replica will get a CLUSTERDOWN error. 3) Fix the failure of `Test shutdown hook` test ASAN can intercept a signal, so I guess that when we send SIGCONT after SIGTERM to kill the server, it might start doing some work again, causing the process to close very slowly.
1) Fix the timeout of `Active defrag big keys: standalone` Using a pipe to write commands may cause the write to block if the read buffer becomes full. 2) Fix the failure of `Main db not affected when fail to diskless load` test If the master was killed in slow environment, then after `cluster-node-timeout` (3s in our test), running keyspace commands on the replica will get a CLUSTERDOWN error. 3) Fix the failure of `Test shutdown hook` test ASAN can intercept a signal, so I guess that when we send SIGCONT after SIGTERM to kill the server, it might start doing some work again, causing the process to close very slowly.
1) Fix the timeout of `Active defrag big keys: standalone` Using a pipe to write commands may cause the write to block if the read buffer becomes full. 2) Fix the failure of `Main db not affected when fail to diskless load` test If the master was killed in slow environment, then after `cluster-node-timeout` (3s in our test), running keyspace commands on the replica will get a CLUSTERDOWN error. 3) Fix the failure of `Test shutdown hook` test ASAN can intercept a signal, so I guess that when we send SIGCONT after SIGTERM to kill the server, it might start doing some work again, causing the process to close very slowly.
Follow redis#14217 Fix redis#14196 Fix two other issues that might cause timeouts due to command writing via pipe.
1) Fix the timeout of `Active defrag big keys: standalone` Using a pipe to write commands may cause the write to block if the read buffer becomes full. 2) Fix the failure of `Main db not affected when fail to diskless load` test If the master was killed in slow environment, then after `cluster-node-timeout` (3s in our test), running keyspace commands on the replica will get a CLUSTERDOWN error. 3) Fix the failure of `Test shutdown hook` test ASAN can intercept a signal, so I guess that when we send SIGCONT after SIGTERM to kill the server, it might start doing some work again, causing the process to close very slowly.
Follow redis#14217 Fix redis#14196 Fix two other issues that might cause timeouts due to command writing via pipe.
1) Fix the timeout of `Active defrag big keys: standalone` Using a pipe to write commands may cause the write to block if the read buffer becomes full. 2) Fix the failure of `Main db not affected when fail to diskless load` test If the master was killed in slow environment, then after `cluster-node-timeout` (3s in our test), running keyspace commands on the replica will get a CLUSTERDOWN error. 3) Fix the failure of `Test shutdown hook` test ASAN can intercept a signal, so I guess that when we send SIGCONT after SIGTERM to kill the server, it might start doing some work again, causing the process to close very slowly.



Fix the timeout of
Active defrag big keys: standalone[BUG] Frequent test TIMEOUTs in 7.2, 7.4 and 8.0 #14196
Using a pipe to write commands may cause the write to block if the read buffer becomes full.
Fix the failure of
Main db not affected when fail to diskless loadtesthttps://github.com/redis/redis/actions/runs/16458252382/job/46525605155
If the master was killed in slow environment, then after
cluster-node-timeout(3s in our test), running keyspace commands on the replica will get a CLUSTERDOWN error.Test shutdown hooktesthttps://github.com/sundb/redis/actions/runs/16475248579/job/46574965184
ASAN can intercept a signal, so I guess that when we send SIGCONT after SIGTERM to kill the server, it might start doing some work again, causing the process to close very slowly.
Please note that the crash occurred in ASAN