I took a look too and realized it’s a regression introduced by my slot migration PR #445. This change started allowing a replica to report its primary’s slot states and trigger clusterUdpateSlotsConfigWith.

PR #445 - Slot Migration Changes.
Here's what I think happens in these test failures involving a 3-node shard:
[T1] - Node A, B, and C are in the same shard with A as the primary.
[T2] - Node A loses its primaryship to B via a graceful/manual failover.
[T3] - After winning the election, B broadcasts the news to every node in the cluster, including C.
[T4] - C receives B's latest PING message and correctly registers B as its new primary.
[T5] - C then sends a new PING message to A, claiming B is its primary with all the slots.
[T6] - A still hasn't received B's broadcast message from [T3], and C's PING message from [T4] arrives at A.
And this is where things go wrong—a replicaof cycle is created.
At this point, A still thinks it’s the primary of the shard, and B -> replicaof == A. Since C is still a replica (as before), the role change handling logic doesn’t apply. So, A enters clusterUdpateSlotsConfigWith using C’s slot information (which is up to date with B’s). More importantly, B is the sender, and A assumes B -> replicaof == A. The slot ownership update logic correctly gives the ownership of the slots to B. Because A loses all its slots to another node in the same shard with a higher config epoch, this demotes A to a replica of the winner, B. And we set A -> replicaof = B, completing the replicaof cycle.
Originally posted by @PingXie in #573 (comment)
I took a look too and realized it’s a regression introduced by my slot migration PR #445. This change started allowing a replica to report its primary’s slot states and trigger clusterUdpateSlotsConfigWith.
PR #445 - Slot Migration Changes.
Here's what I think happens in these test failures involving a 3-node shard:
[T1] - Node
A,B, andCare in the same shard withAas the primary.[T2] - Node
Aloses its primaryship toBvia a graceful/manual failover.[T3] - After winning the election,
Bbroadcasts the news to every node in the cluster, includingC.[T4] -
CreceivesB's latestPINGmessage and correctly registersBas its new primary.[T5] -
Cthen sends a newPINGmessage toA, claimingBis its primary with all the slots.[T6] -
Astill hasn't receivedB's broadcast message from [T3], andC'sPINGmessage from [T4] arrives atA.And this is where things go wrong—a
replicaofcycle is created.At this point,
Astill thinks it’s the primary of the shard, andB -> replicaof == A. SinceCis still a replica (as before), the role change handling logic doesn’t apply. So,AentersclusterUdpateSlotsConfigWithusingC’s slot information (which is up to date withB’s). More importantly,Bis the sender, andAassumesB -> replicaof == A. The slot ownership update logic correctly gives the ownership of the slots toB. BecauseAloses all its slots to another node in the same shard with a higher config epoch, this demotesAto a replica of the winner,B. And we setA -> replicaof = B, completing thereplicaofcycle.Originally posted by @PingXie in #573 (comment)