I have a theory about how this could happen.
- We had a stale
PONG message issue, which was fixed in commit 28976a9
|
if (sender->configEpoch > sender_claimed_config_epoch) { |
- However we didn't bail after detecting this stale message. We proceed to
|
if (sender_claimed_primary && sender->replicaof != sender_claimed_primary) { |
- And then update
sender's replicaof based on the stale message at:
|
sender->replicaof = sender_claimed_primary; |
Now, imagine the following scenario
[T0] Three nodes: primary A with replica B, and an observer node N
[T1] B's PONG message to N claiming A is its primary gets stuck somewhere on the way to N
[T2] B becomes primary after a manual failover and then notifies A (and N but that message will get stuck behind the PONG message at T1)
[T3] A becomes a replica of B
[T4] A, now a replica of B, sends PING to N, which goes through the following steps that end up "promote" B to a primary, indirectly
-
|
if (sender_last_reported_as_primary) { |
|
if (sender_claimed_primary && areInSameShard(sender_claimed_primary, sender)) { |
|
clusterSetNodeAsPrimary(sender_claimed_primary); |
and sets A's replicaof to B
|
if (sender_claimed_primary && sender->replicaof != sender_claimed_primary) { |
|
sender->replicaof = sender_claimed_primary; |
[T5] Finally, B's PONG message to N from [T1] arrives on N and it goes through the following steps
-
-
Due to step 4, B got promoted to primary, implicitly
|
if (sender_last_reported_as_primary) { |
However the epoch is stale, which is correctly handled
|
if (sender->configEpoch > sender_claimed_config_epoch) { |
|
"Ignore stale message from %.40s (%s) in shard %.40s;" |
We don't bail but instead continue to
|
if (sender_claimed_primary && sender->replicaof != sender_claimed_primary) { |
and finally updates B->replicaof to A, completing the loop
|
sender->replicaof = sender_claimed_primary; |
I have seen stale messages in the past and I also notice that the latest failure in the codecov run, which could alter the timing quite a bit so I think this theory is very plausible.
The fix would be to bail immediately after detecting the stale message
|
"Ignore stale message from %.40s (%s) in shard %.40s;" |
BTW, we have another undetected stale message issue (#798)
Originally posted by @PingXie in #573 (comment)
I have a theory about how this could happen.
PONGmessage issue, which was fixed in commit 28976a9valkey/src/cluster_legacy.c
Line 3271 in 2b76c8f
valkey/src/cluster_legacy.c
Line 3311 in 2b76c8f
sender'sreplicaofbased on the stale message at:valkey/src/cluster_legacy.c
Line 3317 in 2b76c8f
Now, imagine the following scenario
[
T0] Three nodes: primaryAwith replicaB, and an observer nodeN[
T1]B'sPONGmessage toNclaimingAis its primary gets stuck somewhere on the way toN[
T2]Bbecomes primary after a manual failover and then notifiesA(andNbut that message will get stuck behind thePONGmessage atT1)[
T3]Abecomes a replica ofB[
T4]A, now a replica ofB, sendsPINGtoN, which goes through the following steps that end up "promote"Bto a primary, indirectlyvalkey/src/cluster_legacy.c
Line 3257 in 2b76c8f
valkey/src/cluster_legacy.c
Line 3267 in 2b76c8f
valkey/src/cluster_legacy.c
Line 3269 in 2b76c8f
valkey/src/cluster_legacy.c
Line 3281 in 2b76c8f
and sets
A'sreplicaoftoBvalkey/src/cluster_legacy.c
Line 3311 in 2b76c8f
valkey/src/cluster_legacy.c
Line 3317 in 2b76c8f
[
T5] Finally,B'sPONGmessage toNfrom [T1] arrives onNand it goes through the following stepsvalkey/src/cluster_legacy.c
Line 3257 in 2b76c8f
valkey/src/cluster_legacy.c
Line 3264 in 2b76c8f
Due to step 4,
Bgot promoted to primary, implicitlyvalkey/src/cluster_legacy.c
Line 3267 in 2b76c8f
However the epoch is stale, which is correctly handled
valkey/src/cluster_legacy.c
Line 3271 in 2b76c8f
valkey/src/cluster_legacy.c
Line 3273 in 2b76c8f
We don't bail but instead continue to
valkey/src/cluster_legacy.c
Line 3311 in 2b76c8f
and finally updates
B->replicaoftoA, completing the loopvalkey/src/cluster_legacy.c
Line 3317 in 2b76c8f
I have seen stale messages in the past and I also notice that the latest failure in the codecov run, which could alter the timing quite a bit so I think this theory is very plausible.
The fix would be to bail immediately after detecting the stale message
valkey/src/cluster_legacy.c
Line 3273 in 2b76c8f
BTW, we have another undetected stale message issue (#798)
Originally posted by @PingXie in #573 (comment)