Skip to content

Avoid random() % 0 undefined behaviour when cluster-node-timeout < 30#3941

Merged
enjoy-binbin merged 3 commits into
valkey-io:unstablefrom
enjoy-binbin:fix_delay0
Jun 9, 2026
Merged

Avoid random() % 0 undefined behaviour when cluster-node-timeout < 30#3941
enjoy-binbin merged 3 commits into
valkey-io:unstablefrom
enjoy-binbin:fix_delay0

Conversation

@enjoy-binbin

@enjoy-binbin enjoy-binbin commented Jun 9, 2026

Copy link
Copy Markdown
Member

Since #2449 made the failover delay relative to cluster-node-timeout.
Now delay = min(cluster-node-timeout / 30, 500), any cluster-node-timeout
below 30, including the legal minimum 0 will collapses delay to zero,
and x % 0 is undefined behaviour.

Since valkey-io#2449 made the failover delay relative to node timeout
(`delay = min(cluster-node-timeout / 30, 500)`), any cluster-node-timeout
below 30, including the legal minimum 0 will collapses delay to zero,
and `x % 0` is undefined behaviour.

Signed-off-by: Binbin <binloveplay1314@qq.com>
@coderabbitai

coderabbitai Bot commented Jun 9, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

Guards the election scheduling random-delay against zero to avoid random() % 0, and adds unit tests that pause the primary and assert replica promotion with cluster-node-timeout values 30 and 0.

Changes

Cluster Failover Election Delay Fix and Test Coverage

Layer / File(s) Summary
Fix election delay zero-check
src/cluster_legacy.c
Election scheduling delay computation is changed to delay + (delay ? random() % delay : 0) to avoid random() % delay when delay == 0.
Test failover under small timeout settings
tests/unit/cluster/failover.tcl
Adds test_small_timeout {timeout} which sets cluster-node-timeout, pauses the primary, waits for a replica to become master, then resumes the primary. The helper is invoked with timeouts 30 and 0.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

  • zuiderkwast
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title directly describes the main change: preventing undefined behavior from random() % 0 when cluster-node-timeout is below 30.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description check ✅ Passed The PR description accurately explains the bug (random() % 0 undefined behaviour) and references the related change (#2449) that introduced the issue.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/cluster_legacy.c (1)

5605-5816: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Please request @core-team architectural review before merge.

This PR modifies src/cluster_legacy.c, which falls under the mandatory architectural-review path pattern.

As per coding guidelines, "src/{cluster*.c,replication.c,rdb.c,aof.c}: Request @core-team architectural review for changes to cluster*.c, replication.c, rdb.c, or aof.c."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cluster_legacy.c` around lines 5605 - 5816, The change touches cluster
logic in clusterHandleReplicaFailover in src/cluster_legacy.c which is in the
mandatory architectural-review path; before merging, add a clear request for
`@core-team` architectural review (e.g., add a comment in the PR description
and/or tag `@core-team` in the GitHub review request) and ensure the PR title/body
note mentions "architectural review required" per the guideline for
src/cluster*.c files; do not merge until a core-team reviewer signs off.

Source: Coding guidelines

🧹 Nitpick comments (1)
tests/unit/cluster/failover.tcl (1)

265-273: ⚡ Quick win

Guarantee process resume even when the failover wait fails.

If wait_for_condition hits fail, the paused primary may remain stopped because resume_process is after the failure path. Please make resume unconditional to keep test cleanup deterministic.

Suggested hardening
 proc test_small_timeout {timeout} {
-    test "Failover with cluster-node-time set to $timeout" {
+    test "Failover with cluster-node-timeout set to $timeout" {
         R 3 config set cluster-node-timeout $timeout
 
-        pause_process [srv 0 pid]
-        wait_for_condition 1000 50 {
-            [s -3 role] == "master"
-        } else {
-            fail "Failover did not happen"
+        set paused_pid [srv 0 pid]
+        pause_process $paused_pid
+        try {
+            wait_for_condition 1000 50 {
+                [s -3 role] == "master"
+            } else {
+                fail "Failover did not happen"
+            }
+        } finally {
+            resume_process $paused_pid
         }
-
-        resume_process [srv 0 pid]
     }
 }

As per coding guidelines, tests should ensure proper cleanup of resources and temporary files in tests.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/cluster/failover.tcl` around lines 265 - 273, The test pauses the
primary with pause_process [srv 0 pid] but only resumes it after the
wait_for_condition success path, so if wait_for_condition triggers fail the
process stays paused; modify the block around wait_for_condition and fail to
ensure resume_process [srv 0 pid] always runs (e.g. perform the resume before
calling fail or wrap the wait in a catch/conditional that calls resume_process
in both success and failure paths), referencing the wait_for_condition call, the
pause_process invocation, and resume_process [srv 0 pid] so cleanup is
unconditional.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@src/cluster_legacy.c`:
- Around line 5605-5816: The change touches cluster logic in
clusterHandleReplicaFailover in src/cluster_legacy.c which is in the mandatory
architectural-review path; before merging, add a clear request for `@core-team`
architectural review (e.g., add a comment in the PR description and/or tag
`@core-team` in the GitHub review request) and ensure the PR title/body note
mentions "architectural review required" per the guideline for src/cluster*.c
files; do not merge until a core-team reviewer signs off.

---

Nitpick comments:
In `@tests/unit/cluster/failover.tcl`:
- Around line 265-273: The test pauses the primary with pause_process [srv 0
pid] but only resumes it after the wait_for_condition success path, so if
wait_for_condition triggers fail the process stays paused; modify the block
around wait_for_condition and fail to ensure resume_process [srv 0 pid] always
runs (e.g. perform the resume before calling fail or wrap the wait in a
catch/conditional that calls resume_process in both success and failure paths),
referencing the wait_for_condition call, the pause_process invocation, and
resume_process [srv 0 pid] so cleanup is unconditional.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 4483be0f-36d0-4eb2-a92a-5bb8fa393a66

📥 Commits

Reviewing files that changed from the base of the PR and between f7b236d and 0cf31cc.

📒 Files selected for processing (2)
  • src/cluster_legacy.c
  • tests/unit/cluster/failover.tcl

@enjoy-binbin enjoy-binbin changed the title Avoid random() % 0 UB when cluster-node-timeout < 30 Avoid random() % 0 undefined behaviour when cluster-node-timeout < 30 Jun 9, 2026

@sarthakaggarwal97 sarthakaggarwal97 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice quick fix!

Comment thread tests/unit/cluster/failover.tcl Outdated
Co-authored-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
@codecov

codecov Bot commented Jun 9, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.56%. Comparing base (f7b236d) to head (d430dc1).
⚠️ Report is 3 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #3941      +/-   ##
============================================
- Coverage     76.68%   76.56%   -0.12%     
============================================
  Files           162      162              
  Lines         80731    80733       +2     
============================================
- Hits          61910    61817      -93     
- Misses        18821    18916      +95     
Files with missing lines Coverage Δ
src/cluster_legacy.c 88.16% <100.00%> (-0.27%) ⬇️

... and 29 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Comment thread tests/unit/cluster/failover.tcl
Comment thread tests/unit/cluster/failover.tcl Outdated
Co-authored-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
@enjoy-binbin enjoy-binbin merged commit d0ffbab into valkey-io:unstable Jun 9, 2026
61 checks passed
@github-project-automation github-project-automation Bot moved this to To be backported in Valkey 9.0 Jun 9, 2026
@github-project-automation github-project-automation Bot moved this to To be backported in Valkey 9.1 Jun 9, 2026
@enjoy-binbin enjoy-binbin deleted the fix_delay0 branch June 9, 2026 16:09
sarthakaggarwal97 added a commit to sarthakaggarwal97/valkey that referenced this pull request Jun 10, 2026
…-io#3942, valkey-io#3897)

These came in via the unstable merge in valkey-io#3853 but are unrelated to
replication compression. Restore cluster_legacy.c, db.c, and
failover.tcl to the rio-pr base so this review branch shows only the
replication-compression delta.

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
valkeyrie-ops Bot pushed a commit that referenced this pull request Jun 10, 2026
…#3941)

Since #2449 made the failover delay relative to cluster-node-timeout.
Now delay = min(cluster-node-timeout / 30, 500), any cluster-node-timeout
below 30, including the legal minimum 0 will collapses delay to zero,
and `x % 0` is undefined behaviour.

Signed-off-by: Binbin <binloveplay1314@qq.com>
ranshid added a commit that referenced this pull request Jun 11, 2026
# Backport sweep for 9.1

Automated cherry-picks from PRs marked "To be backported".

## Applied

| Source PR | Title | Detail |
|---|---|---|
| #3743 | Fix buffered_reply assert in HFE commands with module keyspace
notifications | cherry-picked in a prior sweep |
| #3766 | Fix flaky block_keyspace_notification test for HGETDEL notify
race | cherry-picked in a prior sweep |
| #3800 | Fix heap-use-after-free in ACL LOAD when client free is
deferred | cherry-picked in a prior sweep |
| #3723 | Fix double-finish and RESP reply violation in cluster slot
migration | cherry-picked in a prior sweep |
| #3872 | Redacting customer information when hide_user_data_from_log is
true in rdb.c, networking.c, debug.c and t_hash | cherry-picked in a
prior sweep |
| #3846 | Fix use-after-free in VM_RegisterClusterMessageReceiver |
cherry-picked in a prior sweep |
| #3806 | Add ALL_DBS flag to CLUSTER FLUSHSLOT for database-level ACL |
cherry-picked in a prior sweep |
| #3847 | Harden SENTINEL commands and config rewrite against
control-character injection | |
| #3801 | Validate every DB clause in COPY against ACL db= permissions |
|
| #3851 | Replace AUTOMATION_PAT with Valkeyrie Bot GitHub App token | |
| #3848 | Fix cluster AUX-field control-character and delimiter
injection | |
| #3544 | Revert "IO-Threads redesign cleanup work (#3544)" |
cherry-picked in a prior sweep |
| #3888 | Report exact dbid for COPY in ACL LOG when db= access is
denied | conflicts resolved by Claude Code |
| #3942 | Fix shard_id format specifier in UPDATE message log |  |
| #3941 | Avoid random() % 0 undefined behaviour when
cluster-node-timeout < 30 | |

---
*Generated by valkey-ci-agent using Claude Code.*

---------

Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
Signed-off-by: chx9 <lovelypiska@outlook.com>
Signed-off-by: zackcam <zackcam@amazon.com>
Signed-off-by: Eran Ifrah <eifrah@amazon.com>
Signed-off-by: Jules Lasarte <lasartej@amazon.com>
Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com>
Signed-off-by: akash kumar <akumdev@amazon.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Ran Shidlansik <ranshid@amazon.com>
Co-authored-by: lovelypiska <lovelypiska@outlook.com>
Co-authored-by: zackcam <zackcam@amazon.com>
Co-authored-by: eifrah-aws <eifrah@amazon.com>
Co-authored-by: jjuleslasarte <jules.lasarte@gmail.com>
Co-authored-by: Jules Lasarte <lasartej@amazon.com>
Co-authored-by: Akash Kumar <45854686+akashkgit@users.noreply.github.com>
@sarthakaggarwal97 sarthakaggarwal97 moved this from To be backported to Done in Valkey 9.1 Jun 15, 2026
valkeyrie-ops Bot pushed a commit that referenced this pull request Jun 22, 2026
…#3941)

Since #2449 made the failover delay relative to cluster-node-timeout.
Now delay = min(cluster-node-timeout / 30, 500), any cluster-node-timeout
below 30, including the legal minimum 0 will collapses delay to zero,
and `x % 0` is undefined behaviour.

Signed-off-by: Binbin <binloveplay1314@qq.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: To be backported
Status: Done

Development

Successfully merging this pull request may close these issues.

3 participants