Avoid random() % 0 undefined behaviour when cluster-node-timeout < 30#3941
Conversation
Since valkey-io#2449 made the failover delay relative to node timeout (`delay = min(cluster-node-timeout / 30, 500)`), any cluster-node-timeout below 30, including the legal minimum 0 will collapses delay to zero, and `x % 0` is undefined behaviour. Signed-off-by: Binbin <binloveplay1314@qq.com>
📝 WalkthroughWalkthroughGuards the election scheduling random-delay against zero to avoid ChangesCluster Failover Election Delay Fix and Test Coverage
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Suggested reviewers
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/cluster_legacy.c (1)
5605-5816:⚠️ Potential issue | 🟠 Major | ⚡ Quick winPlease request
@core-teamarchitectural review before merge.This PR modifies
src/cluster_legacy.c, which falls under the mandatory architectural-review path pattern.As per coding guidelines, "
src/{cluster*.c,replication.c,rdb.c,aof.c}: Request@core-teamarchitectural review for changes to cluster*.c, replication.c, rdb.c, or aof.c."🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cluster_legacy.c` around lines 5605 - 5816, The change touches cluster logic in clusterHandleReplicaFailover in src/cluster_legacy.c which is in the mandatory architectural-review path; before merging, add a clear request for `@core-team` architectural review (e.g., add a comment in the PR description and/or tag `@core-team` in the GitHub review request) and ensure the PR title/body note mentions "architectural review required" per the guideline for src/cluster*.c files; do not merge until a core-team reviewer signs off.Source: Coding guidelines
🧹 Nitpick comments (1)
tests/unit/cluster/failover.tcl (1)
265-273: ⚡ Quick winGuarantee process resume even when the failover wait fails.
If
wait_for_conditionhitsfail, the paused primary may remain stopped becauseresume_processis after the failure path. Please make resume unconditional to keep test cleanup deterministic.Suggested hardening
proc test_small_timeout {timeout} { - test "Failover with cluster-node-time set to $timeout" { + test "Failover with cluster-node-timeout set to $timeout" { R 3 config set cluster-node-timeout $timeout - pause_process [srv 0 pid] - wait_for_condition 1000 50 { - [s -3 role] == "master" - } else { - fail "Failover did not happen" + set paused_pid [srv 0 pid] + pause_process $paused_pid + try { + wait_for_condition 1000 50 { + [s -3 role] == "master" + } else { + fail "Failover did not happen" + } + } finally { + resume_process $paused_pid } - - resume_process [srv 0 pid] } }As per coding guidelines, tests should ensure proper cleanup of resources and temporary files in tests.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/cluster/failover.tcl` around lines 265 - 273, The test pauses the primary with pause_process [srv 0 pid] but only resumes it after the wait_for_condition success path, so if wait_for_condition triggers fail the process stays paused; modify the block around wait_for_condition and fail to ensure resume_process [srv 0 pid] always runs (e.g. perform the resume before calling fail or wrap the wait in a catch/conditional that calls resume_process in both success and failure paths), referencing the wait_for_condition call, the pause_process invocation, and resume_process [srv 0 pid] so cleanup is unconditional.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@src/cluster_legacy.c`:
- Around line 5605-5816: The change touches cluster logic in
clusterHandleReplicaFailover in src/cluster_legacy.c which is in the mandatory
architectural-review path; before merging, add a clear request for `@core-team`
architectural review (e.g., add a comment in the PR description and/or tag
`@core-team` in the GitHub review request) and ensure the PR title/body note
mentions "architectural review required" per the guideline for src/cluster*.c
files; do not merge until a core-team reviewer signs off.
---
Nitpick comments:
In `@tests/unit/cluster/failover.tcl`:
- Around line 265-273: The test pauses the primary with pause_process [srv 0
pid] but only resumes it after the wait_for_condition success path, so if
wait_for_condition triggers fail the process stays paused; modify the block
around wait_for_condition and fail to ensure resume_process [srv 0 pid] always
runs (e.g. perform the resume before calling fail or wrap the wait in a
catch/conditional that calls resume_process in both success and failure paths),
referencing the wait_for_condition call, the pause_process invocation, and
resume_process [srv 0 pid] so cleanup is unconditional.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro Plus
Run ID: 4483be0f-36d0-4eb2-a92a-5bb8fa393a66
📒 Files selected for processing (2)
src/cluster_legacy.ctests/unit/cluster/failover.tcl
sarthakaggarwal97
left a comment
There was a problem hiding this comment.
Nice quick fix!
Co-authored-by: Binbin <binloveplay1314@qq.com> Signed-off-by: Binbin <binloveplay1314@qq.com>
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## unstable #3941 +/- ##
============================================
- Coverage 76.68% 76.56% -0.12%
============================================
Files 162 162
Lines 80731 80733 +2
============================================
- Hits 61910 61817 -93
- Misses 18821 18916 +95
🚀 New features to boost your workflow:
|
Co-authored-by: Binbin <binloveplay1314@qq.com> Signed-off-by: Binbin <binloveplay1314@qq.com>
…-io#3942, valkey-io#3897) These came in via the unstable merge in valkey-io#3853 but are unrelated to replication compression. Restore cluster_legacy.c, db.c, and failover.tcl to the rio-pr base so this review branch shows only the replication-compression delta. Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
…#3941) Since #2449 made the failover delay relative to cluster-node-timeout. Now delay = min(cluster-node-timeout / 30, 500), any cluster-node-timeout below 30, including the legal minimum 0 will collapses delay to zero, and `x % 0` is undefined behaviour. Signed-off-by: Binbin <binloveplay1314@qq.com>
# Backport sweep for 9.1 Automated cherry-picks from PRs marked "To be backported". ## Applied | Source PR | Title | Detail | |---|---|---| | #3743 | Fix buffered_reply assert in HFE commands with module keyspace notifications | cherry-picked in a prior sweep | | #3766 | Fix flaky block_keyspace_notification test for HGETDEL notify race | cherry-picked in a prior sweep | | #3800 | Fix heap-use-after-free in ACL LOAD when client free is deferred | cherry-picked in a prior sweep | | #3723 | Fix double-finish and RESP reply violation in cluster slot migration | cherry-picked in a prior sweep | | #3872 | Redacting customer information when hide_user_data_from_log is true in rdb.c, networking.c, debug.c and t_hash | cherry-picked in a prior sweep | | #3846 | Fix use-after-free in VM_RegisterClusterMessageReceiver | cherry-picked in a prior sweep | | #3806 | Add ALL_DBS flag to CLUSTER FLUSHSLOT for database-level ACL | cherry-picked in a prior sweep | | #3847 | Harden SENTINEL commands and config rewrite against control-character injection | | | #3801 | Validate every DB clause in COPY against ACL db= permissions | | | #3851 | Replace AUTOMATION_PAT with Valkeyrie Bot GitHub App token | | | #3848 | Fix cluster AUX-field control-character and delimiter injection | | | #3544 | Revert "IO-Threads redesign cleanup work (#3544)" | cherry-picked in a prior sweep | | #3888 | Report exact dbid for COPY in ACL LOG when db= access is denied | conflicts resolved by Claude Code | | #3942 | Fix shard_id format specifier in UPDATE message log | | | #3941 | Avoid random() % 0 undefined behaviour when cluster-node-timeout < 30 | | --- *Generated by valkey-ci-agent using Claude Code.* --------- Signed-off-by: Binbin <binloveplay1314@qq.com> Signed-off-by: Ran Shidlansik <ranshid@amazon.com> Signed-off-by: chx9 <lovelypiska@outlook.com> Signed-off-by: zackcam <zackcam@amazon.com> Signed-off-by: Eran Ifrah <eifrah@amazon.com> Signed-off-by: Jules Lasarte <lasartej@amazon.com> Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com> Signed-off-by: akash kumar <akumdev@amazon.com> Co-authored-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Ran Shidlansik <ranshid@amazon.com> Co-authored-by: lovelypiska <lovelypiska@outlook.com> Co-authored-by: zackcam <zackcam@amazon.com> Co-authored-by: eifrah-aws <eifrah@amazon.com> Co-authored-by: jjuleslasarte <jules.lasarte@gmail.com> Co-authored-by: Jules Lasarte <lasartej@amazon.com> Co-authored-by: Akash Kumar <45854686+akashkgit@users.noreply.github.com>
…#3941) Since #2449 made the failover delay relative to cluster-node-timeout. Now delay = min(cluster-node-timeout / 30, 500), any cluster-node-timeout below 30, including the legal minimum 0 will collapses delay to zero, and `x % 0` is undefined behaviour. Signed-off-by: Binbin <binloveplay1314@qq.com>
Since #2449 made the failover delay relative to cluster-node-timeout.
Now delay = min(cluster-node-timeout / 30, 500), any cluster-node-timeout
below 30, including the legal minimum 0 will collapses delay to zero,
and
x % 0is undefined behaviour.