Skip to content

Migrate the remaining cluster tests to the new framework and remove legacy files (#2297)#3382

Merged
enjoy-binbin merged 12 commits into
valkey-io:unstablefrom
junyeong0619:migrate
Apr 27, 2026
Merged

Migrate the remaining cluster tests to the new framework and remove legacy files (#2297)#3382
enjoy-binbin merged 12 commits into
valkey-io:unstablefrom
junyeong0619:migrate

Conversation

@junyeong0619

@junyeong0619 junyeong0619 commented Mar 19, 2026

Copy link
Copy Markdown
Contributor

Migrated the remaining cluster tests to tests/unit/cluster/ to use the same
framework for all cluster tests. Cleaned up the obsolete cluster test framework
files and updated the CI workflows to use the new unified test runner.

Changes:
Moved and mapped 6 test files:

  • 03-failover-loop.tcl → Merged into existing failover.tcl
  • 04-resharding.tcl → resharding.tcl
  • 12-replica-migration-2.tcl + 12.1-replica-migration-3.tcl →
    replica-migration-slow.tcl
  • 07-replica-migration.tcl → Merged into existing replica-migration.tcl
  • 28-cluster-shards.tcl → Merged into existing cluster-shards.tcl

Other changes:

  • Converted old framework APIs (e.g., K, RI) to new framework APIs (e.g., R, srv)
  • Added process_is_alive check in cluster_util.tcl to fix an exception in
    failover tests caused by executing ps on dead processes
  • Heavy tests (resharding, replica-migration-slow) marked with slow tag and
    wrapped in run_solo to prevent resource contention in sanitizer environments
  • replica-migration-slow marked with valgrind:skip tag since it is very slow
  • Removed the entire tests/cluster/ directory including run.tcl, cluster.tcl,
    includes/, and helpers/
  • Kept runtest-cluster as a wrapper script (exec ./runtest --cluster "$@")
  • Removed ./runtest-cluster calls from .github/workflows/daily.yml as cluster
    tests are now included in ./runtest

Closes #2297.

@junyeong0619 junyeong0619 force-pushed the migrate branch 3 times, most recently from 1531b90 to dd1b2d2 Compare March 19, 2026 06:27
@codecov

codecov Bot commented Mar 19, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.38%. Comparing base (d2db0c2) to head (03c80fc).
⚠️ Report is 9 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #3382      +/-   ##
============================================
+ Coverage     76.35%   76.38%   +0.02%     
============================================
  Files           159      159              
  Lines         80054    80054              
============================================
+ Hits          61125    61149      +24     
+ Misses        18929    18905      -24     

see 25 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@junyeong0619

Copy link
Copy Markdown
Contributor Author

The test-sanitizer-address CI job is failing with exit code 143 (runner OOM).

./runtest --accurate includes cluster tests by default (unit/cluster is part
of ::all_tests), so the newly migrated tests run under ASAN with full memory
overhead. The migrated tests are relatively long-running locally
(replica-migration ~325s, replica-migration-2 ~208s, replica-migration-3
~198s), and under ASAN this appears to push the runner over its memory limit
around test 87/159.

Should these long-running cluster tests be tagged to skip under ASAN, or is
there a preferred way to handle this?

@junyeong0619

Copy link
Copy Markdown
Contributor Author

Hello, @hpatro. Can you take a look at my work?

@enjoy-binbin enjoy-binbin left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you. Sorry i am busy these days so i am not able to review everything. I took a look and would like to mention some things:

  1. I think we can remove all this cluster tests in the CI yml files. Since now the normal ./runtest itself is including all the tests (including all the cluster test). We no longer distinguish between them.
  2. I think we can keep runtest-cluster, and just change it to use runtest --cluster

@junyeong0619 junyeong0619 force-pushed the migrate branch 2 times, most recently from 36be3b4 to 7b7eccb Compare March 27, 2026 03:58
@junyeong0619

Copy link
Copy Markdown
Contributor Author

I've applied both of your suggestions:

  1. Removed the explicit cluster test steps from the CI yml files.
  2. Kept runtest-cluster as a thin wrapper that delegates to ./runtest --cluster.

All tests pass locally. Please take a look when you have time. Thank you!

@enjoy-binbin enjoy-binbin left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks you, overall LGTM.

Comment thread .github/workflows/ci.yml Outdated
@junyeong0619

Copy link
Copy Markdown
Contributor Author

Thank you for your review and lgtm!!

@enjoy-binbin

Copy link
Copy Markdown
Member

#3382 (comment)
@sarthakaggarwal97 would you able to take a look with this one? I still want that we can run all the cluster tests within the normal CI.

Comment thread .github/workflows/daily.yml
Comment thread .github/workflows/daily.yml

@sarthakaggarwal97 sarthakaggarwal97 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly LGTM! I do want us to verify that we are not losing existing any CI coverage after this change.

@junyeong0619

junyeong0619 commented Apr 14, 2026

Copy link
Copy Markdown
Contributor Author

Mostly LGTM! I do want us to verify that we are not losing existing any CI coverage after this change.

@enjoy-binbin , @sarthakaggarwal97
I now realize the concern. The original setup ran
cluster tests separately via ./runtest-cluster, but after migration they're
part of ::all_tests, so ./runtest covers them automatically — no separate
step needed.

The -cluster exclusion in ci.yml was added to avoid OOM in memory-intensive
jobs (e.g. ASAN), but it ended up being applied to regular CI jobs too,
which actually reduces coverage compared to before.

Would it make sense to remove -cluster from regular CI jobs and keep it only
for memory-intensive jobs like test-sanitizer-address?

@sarthakaggarwal97

Copy link
Copy Markdown
Contributor

Yeah, it would be good to include the tests back in CI

Move and convert 6 test files from tests/cluster/tests/ to tests/unit/cluster/:

- 03-failover-loop.tcl         → failover-loop.tcl
- 04-resharding.tcl            → resharding.tcl
- 07-replica-migration.tcl     → merged into replica-migration.tcl
- 12-replica-migration-2.tcl   → replica-migration-2.tcl
- 12.1-replica-migration-3.tcl → replica-migration-3.tcl
- 28-cluster-shards.tcl        → merged into cluster-shards.tcl

Convert old framework APIs (K, RI, get_instance_attrib, etc.) to new APIs
(R, srv, cluster_get_myself, etc.).

Add process_is_alive check in cluster_util.tcl to fix an exception caused
by running ps on dead processes during failover tests.

Restore tests/helpers/onlydots.tcl (previously in tests/cluster/tests/helpers/)
which is needed by the resharding test to filter reshard progress output.

Signed-off-by: Jun Yeong Kim <junyeonggim5@gmail.com>
@junyeong0619 junyeong0619 force-pushed the migrate branch 2 times, most recently from c8e29e7 to 5a7b78c Compare April 22, 2026 10:25
@junyeong0619

junyeong0619 commented Apr 22, 2026

Copy link
Copy Markdown
Contributor Author

Yeah, it would be good to include the tests back in CI

I've added the cluster tests back to the CI. All tests are passing in my forked repository.
Please rerun the CI job in this PR.

…#2297)

Remove the obsolete tests/cluster/ directory (run.tcl, cluster.tcl,
includes/, and helpers/).

Keep runtest-cluster as a thin wrapper that delegates to
./runtest --cluster for backward compatibility with scripts that still
call it directly.

Remove the explicit cluster test steps from daily.yml CI jobs. Since
cluster tests are now part of ::all_tests in tests/unit/cluster/, the
existing ./runtest step already covers them; a separate ./runtest --cluster
step would run them twice.

Skip cluster tests in jobs with constrained runner memory to prevent OOM:
- ci.yml: test-sanitizer-address (cluster tests are still covered by other jobs)
- daily.yml: test-sanitizer-address, test-sanitizer-undefined,
  test-sanitizer-force-defrag (ASAN/UBSAN overhead combined with the
  multi-node cluster setup exhausts the 16GB GitHub runner RAM)

Add explicit cluster test steps to ci.yml for jobs with sufficient memory:
- ci.yml: test-ubuntu-latest, test-ubuntu-latest-cmake-tls

Signed-off-by: Jun Yeong Kim <junyeonggim5@gmail.com>
Comment thread .github/workflows/ci.yml Outdated
@enjoy-binbin

Copy link
Copy Markdown
Member

Have you identified which specific test case is triggering the OOM? If so, we could perhaps try to optimize the tests to consume less memory. I suppose it was the replica-migration could eat a lot of memory since they are creating a lot of nodes

@junyeong0619

junyeong0619 commented Apr 23, 2026

Copy link
Copy Markdown
Contributor Author

Have you identified which specific test case is triggering the OOM? If so, we could perhaps try to optimize the tests to consume less memory. I suppose it was the replica-migration could eat a lot of memory since they are creating a lot of nodes

I mistakenly added cluster tests to the ASAN job thinking it was already
there, but it turns out the failures weren't caused by that. Either way,
since the upstream ASAN job never ran cluster tests, we should keep it that
way.

Below url is recently failed in test-sanitizer in my local repo.

https://github.com/junyeong0619/valkey/actions/runs/24769641258/job/72472463740

Thanks for your review! Please tell me anything if there is something to do more.

@enjoy-binbin

Copy link
Copy Markdown
Member

Sorry i did not keep a close eye

  1. We should leave .github/workflows/ci.yml as it is, because ./runtest contains all of them.
  2. For the .github/workflows/daily.yml, now we migrate the old ./runtest-cluster test, so we should remove all the ./runtest-cluster and do nothing, because ./runtest contains all the tests.

Can you make this change? And then i can trigger the CI, i can take a look with the OOM issue, we should fix the OOM issue in the tests itself (Like we can try to mark the test with SLOW flag or large-memory flag to skip it, but first, let's refresh the PR)

Signed-off-by: Binbin <binloveplay1314@qq.com>
@enjoy-binbin

Copy link
Copy Markdown
Member

I pushed a commit, let's see if the CI happy.

@junyeong0619

Copy link
Copy Markdown
Contributor Author

Thanks for your help!

Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
@enjoy-binbin enjoy-binbin added the run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP) label Apr 23, 2026
@enjoy-binbin

enjoy-binbin commented Apr 24, 2026

Copy link
Copy Markdown
Member

The CI is almost green, i trigger another full CI to do the last check (including valgrind): https://github.com/valkey-io/valkey/actions/runs/24870557949

@junyeong0619 Would you be able to update the top comment based on the latest changes? I've modified some of the code.

@junyeong0619

Copy link
Copy Markdown
Contributor Author

The CI is almost green, i trigger another full CI to do the last check (including valgrind): https://github.com/valkey-io/valkey/actions/runs/24870557949

@junyeong0619 Would you be able to update the top comment based on the latest changes? I've modified some of the code.

Yes, I will rebase the change into my commit and apply about the changed contents. Thanks for your help again!

@enjoy-binbin

Copy link
Copy Markdown
Member

No need to rebase the commits, you can just update the top comment in this PR. We are going wo squash merge the PR, so no need to worry about those tmp commits.

@junyeong0619

Copy link
Copy Markdown
Contributor Author

Migrated the remaining cluster tests to tests/unit/cluster/ to use the same framework for all cluster tests. Cleaned up the obsolete cluster test framework files and updated the CI workflows to use the new unified test runner.

Changes:

Moved and mapped 6 test files:

  • 03-failover-loop.tcl → Merged into existing failover.tcl
  • 04-resharding.tcl → resharding.tcl
  • 12-replica-migration-2.tcl + 12.1-replica-migration-3.tcl →
    replica-migration-slow.tcl
  • 07-replica-migration.tcl → Merged into existing replica-migration.tcl
  • 28-cluster-shards.tcl → Merged into existing cluster-shards.tcl

Other changes:

  • Converted old framework APIs (e.g., K, RI) to new framework APIs (e.g., R,
    srv)
  • Added process_is_alive check in cluster_util.tcl to fix an exception in
    failover tests caused by executing ps on dead processes
  • Heavy tests (resharding, replica-migration-slow) marked with slow tag and
    wrapped in run_solo to prevent resource contention in sanitizer environments
  • Removed the entire tests/cluster/ directory including run.tcl, cluster.tcl,
    includes/, and helpers/
  • Kept runtest-cluster as a wrapper script (exec ./runtest --cluster "$@")
  • Removed ./runtest-cluster calls from .github/workflows/daily.yml as cluster
    tests are now included in ./runtest

Closes #2297.

@enjoy-binbin I've just updated the first comment of this PR. Is this what you meant?

Signed-off-by: Binbin <binloveplay1314@qq.com>
Comment thread tests/unit/cluster/replica-migration-slow.tcl Outdated
Signed-off-by: Binbin <binloveplay1314@qq.com>
@enjoy-binbin enjoy-binbin merged commit ff80b2d into valkey-io:unstable Apr 27, 2026
2 checks passed
@enjoy-binbin

Copy link
Copy Markdown
Member

@junyeong0619 Thank you, merged.

@junyeong0619

Copy link
Copy Markdown
Contributor Author

@junyeong0619 Thank you, merged.

Thanks for your review and help!!

enjoy-binbin added a commit that referenced this pull request Apr 29, 2026
This change was introduced in #3382. This test is already very slow on
its own. Under valgrind it gets slow enough that the per-node restart
step lets primaries be marked FAIL and triggers failovers, after which
"Verify slaves consistency" no longer holds since it assumes the original
topology.

It was never run under valgrind before and exercises nothing valgrind
meaningfully covers, so just tag it valgrind:skip.

Signed-off-by: Binbin <binloveplay1314@qq.com>
lucasyonge pushed a commit that referenced this pull request May 11, 2026
This change was introduced in #3382. This test is already very slow on
its own. Under valgrind it gets slow enough that the per-node restart
step lets primaries be marked FAIL and triggers failovers, after which
"Verify slaves consistency" no longer holds since it assumes the original
topology.

It was never run under valgrind before and exercises nothing valgrind
meaningfully covers, so just tag it valgrind:skip.

Signed-off-by: Binbin <binloveplay1314@qq.com>
lucasyonge pushed a commit that referenced this pull request May 12, 2026
…egacy files (#2297) (#3382)

Migrated the remaining cluster tests to tests/unit/cluster/ to use the same
framework for all cluster tests. Cleaned up the obsolete cluster test framework
files and updated the CI workflows to use the new unified test runner.

Changes:
  Moved and mapped 6 test files:
  - 03-failover-loop.tcl → Merged into existing failover.tcl
  - 04-resharding.tcl → resharding.tcl
  - 12-replica-migration-2.tcl + 12.1-replica-migration-3.tcl →
  replica-migration-slow.tcl
  - 07-replica-migration.tcl → Merged into existing replica-migration.tcl
  - 28-cluster-shards.tcl → Merged into existing cluster-shards.tcl

Other changes:
  - Converted old framework APIs (e.g., K, RI) to new framework APIs (e.g., R, srv)
  - Added process_is_alive check in cluster_util.tcl to fix an exception in
  failover tests caused by executing ps on dead processes
  - Heavy tests (resharding, replica-migration-slow) marked with slow tag and
  wrapped in run_solo to prevent resource contention in sanitizer environments
  - replica-migration-slow marked with valgrind:skip tag since it is very slow
  - Removed the entire tests/cluster/ directory including run.tcl, cluster.tcl,
  includes/, and helpers/
  - Kept runtest-cluster as a wrapper script (exec ./runtest --cluster "$@")
  - Removed ./runtest-cluster calls from .github/workflows/daily.yml as cluster
  tests are now included in ./runtest

Closes #2297.

Signed-off-by: Jun Yeong Kim <junyeonggim5@gmail.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
lucasyonge pushed a commit that referenced this pull request May 12, 2026
This change was introduced in #3382. This test is already very slow on
its own. Under valgrind it gets slow enough that the per-node restart
step lets primaries be marked FAIL and triggers failovers, after which
"Verify slaves consistency" no longer holds since it assumes the original
topology.

It was never run under valgrind before and exercises nothing valgrind
meaningfully covers, so just tag it valgrind:skip.

Signed-off-by: Binbin <binloveplay1314@qq.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Migrate tests under tests/cluster to tests/unit/cluster

3 participants