Skip to content

Fix double-finish and RESP reply violation in cluster slot migration#3723

Merged
enjoy-binbin merged 1 commit into
valkey-io:unstablefrom
chx9:fix/slot-migration-double-finish-and-reply-violation
May 21, 2026
Merged

Fix double-finish and RESP reply violation in cluster slot migration#3723
enjoy-binbin merged 1 commit into
valkey-io:unstablefrom
chx9:fix/slot-migration-double-finish-and-reply-violation

Conversation

@chx9

@chx9 chx9 commented May 15, 2026

Copy link
Copy Markdown
Contributor
  1. Fix double-call to finishSlotMigrationJob in clusterUpdateSlotImportsOnOwnershipChange(): when n == myself, the function was called once with 'assigned to myself' message, then fell through to call it again with 'no longer owned by source'. Added else branch to make the two error paths mutually exclusive.

  2. Fix RESP protocol violation in clusterCommandSyncSlotsFinish(): addReply(c, shared.ok) was sent unconditionally before validating the job name and state, so on error the client would receive both +OK and -ERR. Moved the OK reply after all validations pass.

Note that it is an internal command and the only clients sending it are primary to replica connections and fake AOF clients. Both of those turn off replies, so the reply doesn't actually get parsed and therefore the violation has no effect. But good to clean it up

1. Fix double-call to finishSlotMigrationJob in clusterUpdateSlotImportsOnOwnershipChange(): when n == myself, the function was called once with 'assigned to myself' message, then fell through to call it again with 'no longer owned by source'. Added else branch to make the two error paths mutually exclusive.

2. Fix RESP protocol violation in clusterCommandSyncSlotsFinish(): addReply(c, shared.ok) was sent unconditionally before validating the job name and state, so on error the client would receive both +OK and -ERR. Moved the OK reply after all validations pass.

Signed-off-by: chx9 <lovelypiska@outlook.com>
@coderabbitai

coderabbitai Bot commented May 15, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

This PR corrects error-handling behavior in cluster slot migration. Changes reorder when success replies are sent and improve error message clarity for slot import failures to distinguish between specific and general ownership scenarios.

Changes

Cluster slot migration error handling refinements

Layer / File(s) Summary
SyncSlotsFinish reply ordering
src/cluster_migrateslots.c
The OK reply is moved to occur after locating the migration job and validating the state argument, preventing the client from receiving success before validation errors can be detected.
Slot import failure message distinction
src/cluster_migrateslots.c
Failure handling is refined to distinguish between slots unexpectedly assigned to the current node during import versus the general case where slots are no longer owned by the source, with updated error messages for clarity.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Suggested reviewers

  • zuiderkwast
  • enjoy-binbin
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically describes the two main fixes in the changeset: preventing double-finish and fixing a RESP protocol violation in cluster slot migration.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description check ✅ Passed The pull request description clearly relates to the changeset, detailing specific bug fixes in cluster slot migration code with context about the issues being addressed.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/cluster_migrateslots.c (1)

938-945: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Request @core-team architectural review before merge.

This PR touches src/cluster_migrateslots.c; please request @core-team review per repository policy.

As per coding guidelines, src/{cluster*.c,replication.c,rdb.c,aof.c}: Request @core-team architectural review for changes to cluster*.c, replication.c, rdb.c, or aof.c.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cluster_migrateslots.c` around lines 938 - 945, This change modifies
src/cluster_migrateslots.c (affecting functions around finishSlotMigrationJob
and constants like SLOT_MIGRATION_JOB_FAILED) and per repo policy requires an
architectural review from `@core-team`; update the PR by explicitly requesting
`@core-team` review (e.g., add `@core-team` to reviewers or the PR description) and
mention that the change touches cluster_migrateslots.c/cluster*.c so an
architectural review is required before merge.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@src/cluster_migrateslots.c`:
- Around line 938-945: This change modifies src/cluster_migrateslots.c
(affecting functions around finishSlotMigrationJob and constants like
SLOT_MIGRATION_JOB_FAILED) and per repo policy requires an architectural review
from `@core-team`; update the PR by explicitly requesting `@core-team` review (e.g.,
add `@core-team` to reviewers or the PR description) and mention that the change
touches cluster_migrateslots.c/cluster*.c so an architectural review is required
before merge.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 88aa456c-0133-4c1b-b3f9-8e5b2b693e79

📥 Commits

Reviewing files that changed from the base of the PR and between 0321a69 and 831a4e8.

📒 Files selected for processing (1)
  • src/cluster_migrateslots.c

@codecov

codecov Bot commented May 15, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.67%. Comparing base (390a11c) to head (831a4e8).
⚠️ Report is 12 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #3723      +/-   ##
============================================
+ Coverage     76.63%   76.67%   +0.03%     
============================================
  Files           162      162              
  Lines         80662    80674      +12     
============================================
+ Hits          61818    61854      +36     
+ Misses        18844    18820      -24     
Files with missing lines Coverage Δ
src/cluster_migrateslots.c 92.14% <100.00%> (+0.15%) ⬆️

... and 20 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@enjoy-binbin enjoy-binbin left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch.

@enjoy-binbin enjoy-binbin requested a review from murphyjacob4 May 15, 2026 11:23
@enjoy-binbin enjoy-binbin moved this to To be backported in Valkey 9.0 May 15, 2026
@enjoy-binbin enjoy-binbin moved this to To be backported in Valkey 9.1 May 15, 2026
@murphyjacob4

Copy link
Copy Markdown
Contributor

Fix RESP protocol violation in clusterCommandSyncSlotsFinish()

Note that it is an internal command and the only clients sending it are primary to replica connections and fake AOF clients. Both of those turn off replies, so the reply doesn't actually get parsed and therefore the violation has no effect. But good to clean it up

@enjoy-binbin enjoy-binbin merged commit d778b37 into valkey-io:unstable May 21, 2026
63 checks passed
valkeyrie-ops Bot pushed a commit that referenced this pull request May 22, 2026
…3723)

1. Fix double-call to finishSlotMigrationJob in
clusterUpdateSlotImportsOnOwnershipChange(): when n == myself, the
function was called once with 'assigned to myself' message, then fell
through to call it again with 'no longer owned by source'. Added else
branch to make the two error paths mutually exclusive.

2. Fix RESP protocol violation in clusterCommandSyncSlotsFinish():
addReply(c, shared.ok) was sent unconditionally before validating the
job name and state, so on error the client would receive both +OK and
-ERR. Moved the OK reply after all validations pass.

Note that it is an internal command and the only clients sending it are
primary to replica connections and fake AOF clients. Both of those turn
off replies, so the reply doesn't actually get parsed and therefore the
violation has no effect. But good to clean it up

Signed-off-by: chx9 <lovelypiska@outlook.com>
valkeyrie-ops Bot pushed a commit that referenced this pull request May 29, 2026
…3723)

1. Fix double-call to finishSlotMigrationJob in
clusterUpdateSlotImportsOnOwnershipChange(): when n == myself, the
function was called once with 'assigned to myself' message, then fell
through to call it again with 'no longer owned by source'. Added else
branch to make the two error paths mutually exclusive.

2. Fix RESP protocol violation in clusterCommandSyncSlotsFinish():
addReply(c, shared.ok) was sent unconditionally before validating the
job name and state, so on error the client would receive both +OK and
-ERR. Moved the OK reply after all validations pass.

Note that it is an internal command and the only clients sending it are
primary to replica connections and fake AOF clients. Both of those turn
off replies, so the reply doesn't actually get parsed and therefore the
violation has no effect. But good to clean it up

Signed-off-by: chx9 <lovelypiska@outlook.com>
valkeyrie-ops Bot pushed a commit that referenced this pull request Jun 10, 2026
…3723)

1. Fix double-call to finishSlotMigrationJob in
clusterUpdateSlotImportsOnOwnershipChange(): when n == myself, the
function was called once with 'assigned to myself' message, then fell
through to call it again with 'no longer owned by source'. Added else
branch to make the two error paths mutually exclusive.

2. Fix RESP protocol violation in clusterCommandSyncSlotsFinish():
addReply(c, shared.ok) was sent unconditionally before validating the
job name and state, so on error the client would receive both +OK and
-ERR. Moved the OK reply after all validations pass.

Note that it is an internal command and the only clients sending it are
primary to replica connections and fake AOF clients. Both of those turn
off replies, so the reply doesn't actually get parsed and therefore the
violation has no effect. But good to clean it up

Signed-off-by: chx9 <lovelypiska@outlook.com>
ranshid added a commit that referenced this pull request Jun 11, 2026
# Backport sweep for 9.1

Automated cherry-picks from PRs marked "To be backported".

## Applied

| Source PR | Title | Detail |
|---|---|---|
| #3743 | Fix buffered_reply assert in HFE commands with module keyspace
notifications | cherry-picked in a prior sweep |
| #3766 | Fix flaky block_keyspace_notification test for HGETDEL notify
race | cherry-picked in a prior sweep |
| #3800 | Fix heap-use-after-free in ACL LOAD when client free is
deferred | cherry-picked in a prior sweep |
| #3723 | Fix double-finish and RESP reply violation in cluster slot
migration | cherry-picked in a prior sweep |
| #3872 | Redacting customer information when hide_user_data_from_log is
true in rdb.c, networking.c, debug.c and t_hash | cherry-picked in a
prior sweep |
| #3846 | Fix use-after-free in VM_RegisterClusterMessageReceiver |
cherry-picked in a prior sweep |
| #3806 | Add ALL_DBS flag to CLUSTER FLUSHSLOT for database-level ACL |
cherry-picked in a prior sweep |
| #3847 | Harden SENTINEL commands and config rewrite against
control-character injection | |
| #3801 | Validate every DB clause in COPY against ACL db= permissions |
|
| #3851 | Replace AUTOMATION_PAT with Valkeyrie Bot GitHub App token | |
| #3848 | Fix cluster AUX-field control-character and delimiter
injection | |
| #3544 | Revert "IO-Threads redesign cleanup work (#3544)" |
cherry-picked in a prior sweep |
| #3888 | Report exact dbid for COPY in ACL LOG when db= access is
denied | conflicts resolved by Claude Code |
| #3942 | Fix shard_id format specifier in UPDATE message log |  |
| #3941 | Avoid random() % 0 undefined behaviour when
cluster-node-timeout < 30 | |

---
*Generated by valkey-ci-agent using Claude Code.*

---------

Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
Signed-off-by: chx9 <lovelypiska@outlook.com>
Signed-off-by: zackcam <zackcam@amazon.com>
Signed-off-by: Eran Ifrah <eifrah@amazon.com>
Signed-off-by: Jules Lasarte <lasartej@amazon.com>
Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com>
Signed-off-by: akash kumar <akumdev@amazon.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Ran Shidlansik <ranshid@amazon.com>
Co-authored-by: lovelypiska <lovelypiska@outlook.com>
Co-authored-by: zackcam <zackcam@amazon.com>
Co-authored-by: eifrah-aws <eifrah@amazon.com>
Co-authored-by: jjuleslasarte <jules.lasarte@gmail.com>
Co-authored-by: Jules Lasarte <lasartej@amazon.com>
Co-authored-by: Akash Kumar <45854686+akashkgit@users.noreply.github.com>
@sarthakaggarwal97 sarthakaggarwal97 moved this from To be backported to Done in Valkey 9.1 Jun 15, 2026
valkeyrie-ops Bot pushed a commit that referenced this pull request Jun 18, 2026
…3723)

1. Fix double-call to finishSlotMigrationJob in
clusterUpdateSlotImportsOnOwnershipChange(): when n == myself, the
function was called once with 'assigned to myself' message, then fell
through to call it again with 'no longer owned by source'. Added else
branch to make the two error paths mutually exclusive.

2. Fix RESP protocol violation in clusterCommandSyncSlotsFinish():
addReply(c, shared.ok) was sent unconditionally before validating the
job name and state, so on error the client would receive both +OK and
-ERR. Moved the OK reply after all validations pass.

Note that it is an internal command and the only clients sending it are
primary to replica connections and fake AOF clients. Both of those turn
off replies, so the reply doesn't actually get parsed and therefore the
violation has no effect. But good to clean it up

Signed-off-by: chx9 <lovelypiska@outlook.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: To be backported
Status: Done

Development

Successfully merging this pull request may close these issues.

4 participants