Skip to content

Regression from PR #445 Incorrectly Allows Slot Ownership Updates via Replica #753

Description

@PingXie

I took a look too and realized it’s a regression introduced by my slot migration PR #445. This change started allowing a replica to report its primary’s slot states and trigger clusterUdpateSlotsConfigWith.

image

PR #445 - Slot Migration Changes.

Here's what I think happens in these test failures involving a 3-node shard:

[T1] - Node A, B, and C are in the same shard with A as the primary.
[T2] - Node A loses its primaryship to B via a graceful/manual failover.
[T3] - After winning the election, B broadcasts the news to every node in the cluster, including C.
[T4] - C receives B's latest PING message and correctly registers B as its new primary.
[T5] - C then sends a new PING message to A, claiming B is its primary with all the slots.
[T6] - A still hasn't received B's broadcast message from [T3], and C's PING message from [T4] arrives at A.
And this is where things go wrong—a replicaof cycle is created.

At this point, A still thinks it’s the primary of the shard, and B -> replicaof == A. Since C is still a replica (as before), the role change handling logic doesn’t apply. So, A enters clusterUdpateSlotsConfigWith using C’s slot information (which is up to date with B’s). More importantly, B is the sender, and A assumes B -> replicaof == A. The slot ownership update logic correctly gives the ownership of the slots to B. Because A loses all its slots to another node in the same shard with a higher config epoch, this demotes A to a replica of the winner, B. And we set A -> replicaof = B, completing the replicaof cycle.

Originally posted by @PingXie in #573 (comment)

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type

Fields

No fields configured for issues without a type.

Projects

Status
Done

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions