Migrate the remaining cluster tests to the new framework and remove legacy files (#2297)#3382
Conversation
1531b90 to
dd1b2d2
Compare
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## unstable #3382 +/- ##
============================================
+ Coverage 76.35% 76.38% +0.02%
============================================
Files 159 159
Lines 80054 80054
============================================
+ Hits 61125 61149 +24
+ Misses 18929 18905 -24 🚀 New features to boost your workflow:
|
|
The test-sanitizer-address CI job is failing with exit code 143 (runner OOM). ./runtest --accurate includes cluster tests by default (unit/cluster is part Should these long-running cluster tests be tagged to skip under ASAN, or is |
|
Hello, @hpatro. Can you take a look at my work? |
enjoy-binbin
left a comment
There was a problem hiding this comment.
Thank you. Sorry i am busy these days so i am not able to review everything. I took a look and would like to mention some things:
- I think we can remove all this
cluster testsin the CI yml files. Since now the normal./runtestitself is including all the tests (including all the cluster test). We no longer distinguish between them. - I think we can keep
runtest-cluster, and just change it to useruntest --cluster
36be3b4 to
7b7eccb
Compare
|
I've applied both of your suggestions:
All tests pass locally. Please take a look when you have time. Thank you! |
enjoy-binbin
left a comment
There was a problem hiding this comment.
Thanks you, overall LGTM.
|
Thank you for your review and lgtm!! |
|
#3382 (comment) |
sarthakaggarwal97
left a comment
There was a problem hiding this comment.
Mostly LGTM! I do want us to verify that we are not losing existing any CI coverage after this change.
@enjoy-binbin , @sarthakaggarwal97 The -cluster exclusion in ci.yml was added to avoid OOM in memory-intensive Would it make sense to remove -cluster from regular CI jobs and keep it only |
|
Yeah, it would be good to include the tests back in CI |
Move and convert 6 test files from tests/cluster/tests/ to tests/unit/cluster/: - 03-failover-loop.tcl → failover-loop.tcl - 04-resharding.tcl → resharding.tcl - 07-replica-migration.tcl → merged into replica-migration.tcl - 12-replica-migration-2.tcl → replica-migration-2.tcl - 12.1-replica-migration-3.tcl → replica-migration-3.tcl - 28-cluster-shards.tcl → merged into cluster-shards.tcl Convert old framework APIs (K, RI, get_instance_attrib, etc.) to new APIs (R, srv, cluster_get_myself, etc.). Add process_is_alive check in cluster_util.tcl to fix an exception caused by running ps on dead processes during failover tests. Restore tests/helpers/onlydots.tcl (previously in tests/cluster/tests/helpers/) which is needed by the resharding test to filter reshard progress output. Signed-off-by: Jun Yeong Kim <junyeonggim5@gmail.com>
c8e29e7 to
5a7b78c
Compare
I've added the cluster tests back to the CI. All tests are passing in my forked repository. |
…#2297) Remove the obsolete tests/cluster/ directory (run.tcl, cluster.tcl, includes/, and helpers/). Keep runtest-cluster as a thin wrapper that delegates to ./runtest --cluster for backward compatibility with scripts that still call it directly. Remove the explicit cluster test steps from daily.yml CI jobs. Since cluster tests are now part of ::all_tests in tests/unit/cluster/, the existing ./runtest step already covers them; a separate ./runtest --cluster step would run them twice. Skip cluster tests in jobs with constrained runner memory to prevent OOM: - ci.yml: test-sanitizer-address (cluster tests are still covered by other jobs) - daily.yml: test-sanitizer-address, test-sanitizer-undefined, test-sanitizer-force-defrag (ASAN/UBSAN overhead combined with the multi-node cluster setup exhausts the 16GB GitHub runner RAM) Add explicit cluster test steps to ci.yml for jobs with sufficient memory: - ci.yml: test-ubuntu-latest, test-ubuntu-latest-cmake-tls Signed-off-by: Jun Yeong Kim <junyeonggim5@gmail.com>
|
Have you identified which specific test case is triggering the OOM? If so, we could perhaps try to optimize the tests to consume less memory. I suppose it was the replica-migration could eat a lot of memory since they are creating a lot of nodes |
I mistakenly added cluster tests to the ASAN job thinking it was already Below url is recently failed in test-sanitizer in my local repo. https://github.com/junyeong0619/valkey/actions/runs/24769641258/job/72472463740 Thanks for your review! Please tell me anything if there is something to do more. |
|
Sorry i did not keep a close eye
Can you make this change? And then i can trigger the CI, i can take a look with the OOM issue, we should fix the OOM issue in the tests itself (Like we can try to mark the test with SLOW flag or large-memory flag to skip it, but first, let's refresh the PR) |
Signed-off-by: Binbin <binloveplay1314@qq.com>
|
I pushed a commit, let's see if the CI happy. |
|
Thanks for your help! |
Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
|
The CI is almost green, i trigger another full CI to do the last check (including valgrind): https://github.com/valkey-io/valkey/actions/runs/24870557949 @junyeong0619 Would you be able to update the top comment based on the latest changes? I've modified some of the code. |
Yes, I will rebase the change into my commit and apply about the changed contents. Thanks for your help again! |
|
No need to rebase the commits, you can just update the top comment in this PR. We are going wo squash merge the PR, so no need to worry about those tmp commits. |
@enjoy-binbin I've just updated the first comment of this PR. Is this what you meant? |
Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
|
@junyeong0619 Thank you, merged. |
Thanks for your review and help!! |
This change was introduced in #3382. This test is already very slow on its own. Under valgrind it gets slow enough that the per-node restart step lets primaries be marked FAIL and triggers failovers, after which "Verify slaves consistency" no longer holds since it assumes the original topology. It was never run under valgrind before and exercises nothing valgrind meaningfully covers, so just tag it valgrind:skip. Signed-off-by: Binbin <binloveplay1314@qq.com>
This change was introduced in #3382. This test is already very slow on its own. Under valgrind it gets slow enough that the per-node restart step lets primaries be marked FAIL and triggers failovers, after which "Verify slaves consistency" no longer holds since it assumes the original topology. It was never run under valgrind before and exercises nothing valgrind meaningfully covers, so just tag it valgrind:skip. Signed-off-by: Binbin <binloveplay1314@qq.com>
…egacy files (#2297) (#3382) Migrated the remaining cluster tests to tests/unit/cluster/ to use the same framework for all cluster tests. Cleaned up the obsolete cluster test framework files and updated the CI workflows to use the new unified test runner. Changes: Moved and mapped 6 test files: - 03-failover-loop.tcl → Merged into existing failover.tcl - 04-resharding.tcl → resharding.tcl - 12-replica-migration-2.tcl + 12.1-replica-migration-3.tcl → replica-migration-slow.tcl - 07-replica-migration.tcl → Merged into existing replica-migration.tcl - 28-cluster-shards.tcl → Merged into existing cluster-shards.tcl Other changes: - Converted old framework APIs (e.g., K, RI) to new framework APIs (e.g., R, srv) - Added process_is_alive check in cluster_util.tcl to fix an exception in failover tests caused by executing ps on dead processes - Heavy tests (resharding, replica-migration-slow) marked with slow tag and wrapped in run_solo to prevent resource contention in sanitizer environments - replica-migration-slow marked with valgrind:skip tag since it is very slow - Removed the entire tests/cluster/ directory including run.tcl, cluster.tcl, includes/, and helpers/ - Kept runtest-cluster as a wrapper script (exec ./runtest --cluster "$@") - Removed ./runtest-cluster calls from .github/workflows/daily.yml as cluster tests are now included in ./runtest Closes #2297. Signed-off-by: Jun Yeong Kim <junyeonggim5@gmail.com> Signed-off-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Binbin <binloveplay1314@qq.com>
This change was introduced in #3382. This test is already very slow on its own. Under valgrind it gets slow enough that the per-node restart step lets primaries be marked FAIL and triggers failovers, after which "Verify slaves consistency" no longer holds since it assumes the original topology. It was never run under valgrind before and exercises nothing valgrind meaningfully covers, so just tag it valgrind:skip. Signed-off-by: Binbin <binloveplay1314@qq.com>
Migrated the remaining cluster tests to tests/unit/cluster/ to use the same
framework for all cluster tests. Cleaned up the obsolete cluster test framework
files and updated the CI workflows to use the new unified test runner.
Changes:
Moved and mapped 6 test files:
replica-migration-slow.tcl
Other changes:
failover tests caused by executing ps on dead processes
wrapped in run_solo to prevent resource contention in sanitizer environments
includes/, and helpers/
tests are now included in ./runtest
Closes #2297.