Migrate the remaining cluster tests to the new framework and remove legacy files (#2297) by junyeong0619 · Pull Request #3382 · valkey-io/valkey

junyeong0619 · 2026-03-19T02:41:30Z

Migrated the remaining cluster tests to tests/unit/cluster/ to use the same
framework for all cluster tests. Cleaned up the obsolete cluster test framework
files and updated the CI workflows to use the new unified test runner.

Changes:
Moved and mapped 6 test files:

03-failover-loop.tcl → Merged into existing failover.tcl
04-resharding.tcl → resharding.tcl
12-replica-migration-2.tcl + 12.1-replica-migration-3.tcl →
replica-migration-slow.tcl
07-replica-migration.tcl → Merged into existing replica-migration.tcl
28-cluster-shards.tcl → Merged into existing cluster-shards.tcl

Other changes:

Converted old framework APIs (e.g., K, RI) to new framework APIs (e.g., R, srv)
Added process_is_alive check in cluster_util.tcl to fix an exception in
failover tests caused by executing ps on dead processes
Heavy tests (resharding, replica-migration-slow) marked with slow tag and
wrapped in run_solo to prevent resource contention in sanitizer environments
replica-migration-slow marked with valgrind:skip tag since it is very slow
Removed the entire tests/cluster/ directory including run.tcl, cluster.tcl,
includes/, and helpers/
Kept runtest-cluster as a wrapper script (exec ./runtest --cluster "$@")
Removed ./runtest-cluster calls from .github/workflows/daily.yml as cluster
tests are now included in ./runtest

Closes #2297.

codecov · 2026-03-19T09:26:54Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.38%. Comparing base (d2db0c2) to head (03c80fc).
⚠️ Report is 9 commits behind head on unstable.

Additional details and impacted files

@@             Coverage Diff              @@
##           unstable    #3382      +/-   ##
============================================
+ Coverage     76.35%   76.38%   +0.02%     
============================================
  Files           159      159              
  Lines         80054    80054              
============================================
+ Hits          61125    61149      +24     
+ Misses        18929    18905      -24

see 25 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

junyeong0619 · 2026-03-19T10:40:24Z

The test-sanitizer-address CI job is failing with exit code 143 (runner OOM).

./runtest --accurate includes cluster tests by default (unit/cluster is part
of ::all_tests), so the newly migrated tests run under ASAN with full memory
overhead. The migrated tests are relatively long-running locally
(replica-migration ~325s, replica-migration-2 ~208s, replica-migration-3
~198s), and under ASAN this appears to push the runner over its memory limit
around test 87/159.

Should these long-running cluster tests be tagged to skip under ASAN, or is
there a preferred way to handle this?

junyeong0619 · 2026-03-22T05:29:23Z

Hello, @hpatro. Can you take a look at my work?

enjoy-binbin

Thank you. Sorry i am busy these days so i am not able to review everything. I took a look and would like to mention some things:

I think we can remove all this cluster tests in the CI yml files. Since now the normal ./runtest itself is including all the tests (including all the cluster test). We no longer distinguish between them.
I think we can keep runtest-cluster, and just change it to use runtest --cluster

junyeong0619 · 2026-03-27T10:59:45Z

I've applied both of your suggestions:

Removed the explicit cluster test steps from the CI yml files.
Kept runtest-cluster as a thin wrapper that delegates to ./runtest --cluster.

All tests pass locally. Please take a look when you have time. Thank you!

enjoy-binbin

Thanks you, overall LGTM.

junyeong0619 · 2026-03-31T02:43:47Z

Thank you for your review and lgtm!!

enjoy-binbin · 2026-03-31T08:37:21Z

#3382 (comment)
@sarthakaggarwal97 would you able to take a look with this one? I still want that we can run all the cluster tests within the normal CI.

sarthakaggarwal97

Mostly LGTM! I do want us to verify that we are not losing existing any CI coverage after this change.

junyeong0619 · 2026-04-14T21:52:55Z

Mostly LGTM! I do want us to verify that we are not losing existing any CI coverage after this change.

@enjoy-binbin , @sarthakaggarwal97
I now realize the concern. The original setup ran
cluster tests separately via ./runtest-cluster, but after migration they're
part of ::all_tests, so ./runtest covers them automatically — no separate
step needed.

The -cluster exclusion in ci.yml was added to avoid OOM in memory-intensive
jobs (e.g. ASAN), but it ended up being applied to regular CI jobs too,
which actually reduces coverage compared to before.

Would it make sense to remove -cluster from regular CI jobs and keep it only
for memory-intensive jobs like test-sanitizer-address?

sarthakaggarwal97 · 2026-04-21T17:51:04Z

Yeah, it would be good to include the tests back in CI

Move and convert 6 test files from tests/cluster/tests/ to tests/unit/cluster/: - 03-failover-loop.tcl → failover-loop.tcl - 04-resharding.tcl → resharding.tcl - 07-replica-migration.tcl → merged into replica-migration.tcl - 12-replica-migration-2.tcl → replica-migration-2.tcl - 12.1-replica-migration-3.tcl → replica-migration-3.tcl - 28-cluster-shards.tcl → merged into cluster-shards.tcl Convert old framework APIs (K, RI, get_instance_attrib, etc.) to new APIs (R, srv, cluster_get_myself, etc.). Add process_is_alive check in cluster_util.tcl to fix an exception caused by running ps on dead processes during failover tests. Restore tests/helpers/onlydots.tcl (previously in tests/cluster/tests/helpers/) which is needed by the resharding test to filter reshard progress output. Signed-off-by: Jun Yeong Kim <junyeonggim5@gmail.com>

junyeong0619 · 2026-04-22T10:54:31Z

Yeah, it would be good to include the tests back in CI

I've added the cluster tests back to the CI. All tests are passing in my forked repository.
Please rerun the CI job in this PR.

…#2297) Remove the obsolete tests/cluster/ directory (run.tcl, cluster.tcl, includes/, and helpers/). Keep runtest-cluster as a thin wrapper that delegates to ./runtest --cluster for backward compatibility with scripts that still call it directly. Remove the explicit cluster test steps from daily.yml CI jobs. Since cluster tests are now part of ::all_tests in tests/unit/cluster/, the existing ./runtest step already covers them; a separate ./runtest --cluster step would run them twice. Skip cluster tests in jobs with constrained runner memory to prevent OOM: - ci.yml: test-sanitizer-address (cluster tests are still covered by other jobs) - daily.yml: test-sanitizer-address, test-sanitizer-undefined, test-sanitizer-force-defrag (ASAN/UBSAN overhead combined with the multi-node cluster setup exhausts the 16GB GitHub runner RAM) Add explicit cluster test steps to ci.yml for jobs with sufficient memory: - ci.yml: test-ubuntu-latest, test-ubuntu-latest-cmake-tls Signed-off-by: Jun Yeong Kim <junyeonggim5@gmail.com>

enjoy-binbin · 2026-04-23T04:42:25Z

Have you identified which specific test case is triggering the OOM? If so, we could perhaps try to optimize the tests to consume less memory. I suppose it was the replica-migration could eat a lot of memory since they are creating a lot of nodes

junyeong0619 · 2026-04-23T04:59:58Z

Have you identified which specific test case is triggering the OOM? If so, we could perhaps try to optimize the tests to consume less memory. I suppose it was the replica-migration could eat a lot of memory since they are creating a lot of nodes

I mistakenly added cluster tests to the ASAN job thinking it was already
there, but it turns out the failures weren't caused by that. Either way,
since the upstream ASAN job never ran cluster tests, we should keep it that
way.

Below url is recently failed in test-sanitizer in my local repo.

https://github.com/junyeong0619/valkey/actions/runs/24769641258/job/72472463740

Thanks for your review! Please tell me anything if there is something to do more.

enjoy-binbin · 2026-04-23T05:14:12Z

Sorry i did not keep a close eye

We should leave .github/workflows/ci.yml as it is, because ./runtest contains all of them.
For the .github/workflows/daily.yml, now we migrate the old ./runtest-cluster test, so we should remove all the ./runtest-cluster and do nothing, because ./runtest contains all the tests.

Can you make this change? And then i can trigger the CI, i can take a look with the OOM issue, we should fix the OOM issue in the tests itself (Like we can try to mark the test with SLOW flag or large-memory flag to skip it, but first, let's refresh the PR)

Signed-off-by: Binbin <binloveplay1314@qq.com>

enjoy-binbin · 2026-04-23T05:21:00Z

I pushed a commit, let's see if the CI happy.

junyeong0619 · 2026-04-23T05:25:32Z

Thanks for your help!

Signed-off-by: Binbin <binloveplay1314@qq.com>

enjoy-binbin · 2026-04-24T03:26:16Z

The CI is almost green, i trigger another full CI to do the last check (including valgrind): https://github.com/valkey-io/valkey/actions/runs/24870557949

@junyeong0619 Would you be able to update the top comment based on the latest changes? I've modified some of the code.

junyeong0619 · 2026-04-24T03:35:52Z

The CI is almost green, i trigger another full CI to do the last check (including valgrind): https://github.com/valkey-io/valkey/actions/runs/24870557949

@junyeong0619 Would you be able to update the top comment based on the latest changes? I've modified some of the code.

Yes, I will rebase the change into my commit and apply about the changed contents. Thanks for your help again!

enjoy-binbin · 2026-04-24T03:59:34Z

No need to rebase the commits, you can just update the top comment in this PR. We are going wo squash merge the PR, so no need to worry about those tmp commits.

junyeong0619 · 2026-04-24T04:31:09Z

Migrated the remaining cluster tests to tests/unit/cluster/ to use the same framework for all cluster tests. Cleaned up the obsolete cluster test framework files and updated the CI workflows to use the new unified test runner.

Changes:

Moved and mapped 6 test files:

03-failover-loop.tcl → Merged into existing failover.tcl

04-resharding.tcl → resharding.tcl

12-replica-migration-2.tcl + 12.1-replica-migration-3.tcl →
replica-migration-slow.tcl

07-replica-migration.tcl → Merged into existing replica-migration.tcl

28-cluster-shards.tcl → Merged into existing cluster-shards.tcl

Other changes:

Converted old framework APIs (e.g., K, RI) to new framework APIs (e.g., R,
srv)

Added process_is_alive check in cluster_util.tcl to fix an exception in
failover tests caused by executing ps on dead processes

Heavy tests (resharding, replica-migration-slow) marked with slow tag and
wrapped in run_solo to prevent resource contention in sanitizer environments

Removed the entire tests/cluster/ directory including run.tcl, cluster.tcl,
includes/, and helpers/

Kept runtest-cluster as a wrapper script (exec ./runtest --cluster "$@")

Removed ./runtest-cluster calls from .github/workflows/daily.yml as cluster
tests are now included in ./runtest

Closes #2297.

@enjoy-binbin I've just updated the first comment of this PR. Is this what you meant?

Signed-off-by: Binbin <binloveplay1314@qq.com>

enjoy-binbin · 2026-04-27T09:32:09Z

@junyeong0619 Thank you, merged.

junyeong0619 · 2026-04-27T09:33:13Z

@junyeong0619 Thank you, merged.

Thanks for your review and help!!

This change was introduced in #3382. This test is already very slow on its own. Under valgrind it gets slow enough that the per-node restart step lets primaries be marked FAIL and triggers failovers, after which "Verify slaves consistency" no longer holds since it assumes the original topology. It was never run under valgrind before and exercises nothing valgrind meaningfully covers, so just tag it valgrind:skip. Signed-off-by: Binbin <binloveplay1314@qq.com>

…egacy files (#2297) (#3382) Migrated the remaining cluster tests to tests/unit/cluster/ to use the same framework for all cluster tests. Cleaned up the obsolete cluster test framework files and updated the CI workflows to use the new unified test runner. Changes: Moved and mapped 6 test files: - 03-failover-loop.tcl → Merged into existing failover.tcl - 04-resharding.tcl → resharding.tcl - 12-replica-migration-2.tcl + 12.1-replica-migration-3.tcl → replica-migration-slow.tcl - 07-replica-migration.tcl → Merged into existing replica-migration.tcl - 28-cluster-shards.tcl → Merged into existing cluster-shards.tcl Other changes: - Converted old framework APIs (e.g., K, RI) to new framework APIs (e.g., R, srv) - Added process_is_alive check in cluster_util.tcl to fix an exception in failover tests caused by executing ps on dead processes - Heavy tests (resharding, replica-migration-slow) marked with slow tag and wrapped in run_solo to prevent resource contention in sanitizer environments - replica-migration-slow marked with valgrind:skip tag since it is very slow - Removed the entire tests/cluster/ directory including run.tcl, cluster.tcl, includes/, and helpers/ - Kept runtest-cluster as a wrapper script (exec ./runtest --cluster "$@") - Removed ./runtest-cluster calls from .github/workflows/daily.yml as cluster tests are now included in ./runtest Closes #2297. Signed-off-by: Jun Yeong Kim <junyeonggim5@gmail.com> Signed-off-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Binbin <binloveplay1314@qq.com>

This change was introduced in #3382. This test is already very slow on its own. Under valgrind it gets slow enough that the per-node restart step lets primaries be marked FAIL and triggers failovers, after which "Verify slaves consistency" no longer holds since it assumes the original topology. It was never run under valgrind before and exercises nothing valgrind meaningfully covers, so just tag it valgrind:skip. Signed-off-by: Binbin <binloveplay1314@qq.com>

github-actions Bot assigned junyeong0619 Mar 19, 2026

junyeong0619 force-pushed the migrate branch 3 times, most recently from 1531b90 to dd1b2d2 Compare March 19, 2026 06:27

enjoy-binbin reviewed Mar 26, 2026

View reviewed changes

junyeong0619 force-pushed the migrate branch 2 times, most recently from 36be3b4 to 7b7eccb Compare March 27, 2026 03:58

enjoy-binbin approved these changes Mar 31, 2026

View reviewed changes

Comment thread .github/workflows/ci.yml Outdated

sarthakaggarwal97 reviewed Apr 1, 2026

View reviewed changes

Comment thread .github/workflows/daily.yml

Comment thread .github/workflows/daily.yml

junyeong0619 force-pushed the migrate branch from 7b7eccb to 4fbc870 Compare April 2, 2026 18:51

sarthakaggarwal97 approved these changes Apr 14, 2026

View reviewed changes

junyeong0619 force-pushed the migrate branch from f17021c to 447e1a5 Compare April 21, 2026 22:48

junyeong0619 force-pushed the migrate branch 2 times, most recently from c8e29e7 to 5a7b78c Compare April 22, 2026 10:25

junyeong0619 force-pushed the migrate branch from 5a7b78c to 5f13515 Compare April 22, 2026 11:00

enjoy-binbin reviewed Apr 23, 2026

View reviewed changes

Comment thread .github/workflows/ci.yml Outdated

enjoy-binbin added 2 commits April 23, 2026 13:16

Merge branch 'unstable' into migrate

7e40aff

code view from binbin

12a9d5f

Signed-off-by: Binbin <binloveplay1314@qq.com>

enjoy-binbin added 4 commits April 23, 2026 19:47

run solo with cluster failover loop test

ba19c9d

Signed-off-by: Binbin <binloveplay1314@qq.com>

run solo with these cluster tests

897df39

Signed-off-by: Binbin <binloveplay1314@qq.com>

Update the test

f0bf7ab

Signed-off-by: Binbin <binloveplay1314@qq.com>

add runtest-cluster back

b7b3a7f

Signed-off-by: Binbin <binloveplay1314@qq.com>

enjoy-binbin added the run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP) label Apr 23, 2026

enjoy-binbin added 2 commits April 24, 2026 10:29

move failover-loop to failover

d324c82

Signed-off-by: Binbin <binloveplay1314@qq.com>

Merge branch 'unstable' into migrate

539fdaf

skip valgrind

03c80fc

Signed-off-by: Binbin <binloveplay1314@qq.com>

enjoy-binbin reviewed Apr 27, 2026

View reviewed changes

Comment thread tests/unit/cluster/replica-migration-slow.tcl Outdated

Apply suggestion from @enjoy-binbin

6a6ef6b

Signed-off-by: Binbin <binloveplay1314@qq.com>

enjoy-binbin merged commit ff80b2d into valkey-io:unstable Apr 27, 2026
2 checks passed

enjoy-binbin mentioned this pull request Apr 28, 2026

Skip cluster resharding test under valgrind #3574

Merged

This was referenced May 3, 2026

[Eval-v2] Fix GEOSEARCH BYPOLYGON leak on invalid COUNT sarthakaggarwal97/valkey#131

Closed

[Eval] Agent review bench sarthakaggarwal97/valkey#161

Closed

Uh oh!

Conversation

junyeong0619 commented Mar 19, 2026 • edited by enjoy-binbin Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov Bot commented Mar 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

junyeong0619 commented Mar 19, 2026

Uh oh!

junyeong0619 commented Mar 22, 2026

Uh oh!

enjoy-binbin left a comment

Choose a reason for hiding this comment

Uh oh!

junyeong0619 commented Mar 27, 2026

Uh oh!

enjoy-binbin left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

junyeong0619 commented Mar 31, 2026

Uh oh!

enjoy-binbin commented Mar 31, 2026

Uh oh!

Uh oh!

Uh oh!

sarthakaggarwal97 left a comment

Choose a reason for hiding this comment

Uh oh!

junyeong0619 commented Apr 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sarthakaggarwal97 commented Apr 21, 2026

Uh oh!

junyeong0619 commented Apr 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

enjoy-binbin commented Apr 23, 2026

Uh oh!

junyeong0619 commented Apr 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

enjoy-binbin commented Apr 23, 2026

Uh oh!

enjoy-binbin commented Apr 23, 2026

Uh oh!

junyeong0619 commented Apr 23, 2026

Uh oh!

enjoy-binbin commented Apr 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

junyeong0619 commented Apr 24, 2026

Uh oh!

enjoy-binbin commented Apr 24, 2026

Uh oh!

junyeong0619 commented Apr 24, 2026

Uh oh!

Uh oh!

Uh oh!

enjoy-binbin commented Apr 27, 2026

Uh oh!

junyeong0619 commented Apr 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

junyeong0619 commented Mar 19, 2026 •

edited by enjoy-binbin

Loading

codecov Bot commented Mar 19, 2026 •

edited

Loading

junyeong0619 commented Apr 14, 2026 •

edited

Loading

junyeong0619 commented Apr 22, 2026 •

edited

Loading

junyeong0619 commented Apr 23, 2026 •

edited

Loading

enjoy-binbin commented Apr 24, 2026 •

edited

Loading