Stale `PONG` message causes incorrect `replicaof` updates leading to `replicaof` loops



I have a theory about how this could happen. 

1. We had a stale `PONG` message issue, which was fixed in commit https://github.com/valkey-io/valkey/commit/28976a9003c6dd5cdd7225c5bc90743b4fcde13c
https://github.com/valkey-io/valkey/blob/2b76c8fbe2ccadaee2149e4b9b7c7df7ff0d07b6/src/cluster_legacy.c#L3271
2. However, we don't bail after detecting the stale message. Instead, we proceed to https://github.com/valkey-io/valkey/blob/2b76c8fbe2ccadaee2149e4b9b7c7df7ff0d07b6/src/cluster_legacy.c#L3311
3. And then update `sender`'s `replicaof` based on the stale message at: https://github.com/valkey-io/valkey/blob/2b76c8fbe2ccadaee2149e4b9b7c7df7ff0d07b6/src/cluster_legacy.c#L3317

Now, imagine the following scenario:

[`T0`] Three nodes: primary `A` with replica `B`, and an observer node `N`
[`T1`] `B`'s `PONG` message to `N` claiming `A` is its primary gets stuck somewhere on the way to `N`
[`T2`] `B` becomes primary after a manual failover and then notifies `A` (and `N` but that message will get stuck behind the `PONG` message at `T1`)
[`T3`] `A` becomes a replica of `B`
[`T4`] `A`, now a replica of `B`, sends a `PING` to `N`. `N` processes it through the following steps, which end up indirectly "promoting" `B` to primary
1. https://github.com/valkey-io/valkey/blob/2b76c8fbe2ccadaee2149e4b9b7c7df7ff0d07b6/src/cluster_legacy.c#L3257
2. https://github.com/valkey-io/valkey/blob/2b76c8fbe2ccadaee2149e4b9b7c7df7ff0d07b6/src/cluster_legacy.c#L3267
3. https://github.com/valkey-io/valkey/blob/2b76c8fbe2ccadaee2149e4b9b7c7df7ff0d07b6/src/cluster_legacy.c#L3269
4. https://github.com/valkey-io/valkey/blob/2b76c8fbe2ccadaee2149e4b9b7c7df7ff0d07b6/src/cluster_legacy.c#L3281
and sets `A`'s `replicaof` to `B`
5. https://github.com/valkey-io/valkey/blob/2b76c8fbe2ccadaee2149e4b9b7c7df7ff0d07b6/src/cluster_legacy.c#L3311
6. https://github.com/valkey-io/valkey/blob/2b76c8fbe2ccadaee2149e4b9b7c7df7ff0d07b6/src/cluster_legacy.c#L3317
[`T5`] Finally, `B`'s `PONG` message to `N` from [`T1`] arrives at `N` and goes through the following steps
1. https://github.com/valkey-io/valkey/blob/2b76c8fbe2ccadaee2149e4b9b7c7df7ff0d07b6/src/cluster_legacy.c#L3257
2. https://github.com/valkey-io/valkey/blob/2b76c8fbe2ccadaee2149e4b9b7c7df7ff0d07b6/src/cluster_legacy.c#L3264
Due to step 4 in [`T4`], `B` was implicitly promoted to primary
3. https://github.com/valkey-io/valkey/blob/2b76c8fbe2ccadaee2149e4b9b7c7df7ff0d07b6/src/cluster_legacy.c#L3267
However, the epoch is stale, which is detected correctly
4. https://github.com/valkey-io/valkey/blob/2b76c8fbe2ccadaee2149e4b9b7c7df7ff0d07b6/src/cluster_legacy.c#L3271
5. https://github.com/valkey-io/valkey/blob/2b76c8fbe2ccadaee2149e4b9b7c7df7ff0d07b6/src/cluster_legacy.c#L3273
We don't bail here but instead continue to
6. https://github.com/valkey-io/valkey/blob/2b76c8fbe2ccadaee2149e4b9b7c7df7ff0d07b6/src/cluster_legacy.c#L3311
and finally update `B->replicaof` to `A`, completing the loop
7. https://github.com/valkey-io/valkey/blob/2b76c8fbe2ccadaee2149e4b9b7c7df7ff0d07b6/src/cluster_legacy.c#L3317

I have seen stale messages in the past, and I also noticed that the latest failure was in the codecov run, which could alter the timing quite a bit, so I think this theory is very plausible.

The fix would be to bail immediately after detecting the stale message at:

https://github.com/valkey-io/valkey/blob/2b76c8fbe2ccadaee2149e4b9b7c7df7ff0d07b6/src/cluster_legacy.c#L3273
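
To make the timeline easier to reason about, here is a small standalone model of `N`'s view of `A` and `B`. It is only a sketch of the theory above, not the actual code in `cluster_legacy.c`: the `Node` and `Msg` structs, `process_msg`, `has_replicaof_loop`, `run_scenario`, the `bail_on_stale` switch, and the epoch values 1 and 2 are all made up for illustration, and the handler is reduced to the role-switch logic that matters here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Observer N's simplified view of a peer (hypothetical, not clusterNode). */
typedef struct Node {
    bool is_primary;
    unsigned long long config_epoch;
    struct Node *replicaof; /* NULL when the node is seen as a primary */
} Node;

/* The parts of a PING/PONG header this model cares about (hypothetical). */
typedef struct Msg {
    Node *sender;
    Node *claimed_primary; /* NULL means "I am a primary" */
    unsigned long long claimed_config_epoch;
} Msg;

static void set_as_primary(Node *n) {
    n->is_primary = true;
    n->replicaof = NULL;
}

/* Applies one message to N's view. Returns true if it was detected as stale. */
static bool process_msg(const Msg *m, bool bail_on_stale) {
    Node *sender = m->sender;
    Node *primary = m->claimed_primary;
    bool stale = false;
    assert(sender != NULL);

    if (primary == NULL) { /* sender claims to be a primary */
        set_as_primary(sender);
        if (m->claimed_config_epoch > sender->config_epoch)
            sender->config_epoch = m->claimed_config_epoch;
        return false;
    }
    if (sender->is_primary) { /* seen as a primary, but claims to be a replica */
        if (sender->config_epoch > m->claimed_config_epoch) {
            stale = true;                     /* stale message detected */
            if (bail_on_stale) return true;   /* the proposed fix */
        } else {
            set_as_primary(primary);          /* implicit promotion (T4) */
            primary->config_epoch = m->claimed_config_epoch;
        }
        sender->is_primary = false;
    }
    /* Without the bail, this also runs for a stale message. */
    if (sender->replicaof != primary) sender->replicaof = primary;
    return stale;
}

/* Follows replicaof pointers from start; true if the chain cycles. */
static bool has_replicaof_loop(const Node *start) {
    const Node *slow = start, *fast = start;
    while (fast && fast->replicaof) {
        slow = slow->replicaof;
        fast = fast->replicaof->replicaof;
        if (slow == fast) return true;
    }
    return false;
}

/* Replays T0..T5 as seen by N; returns true if a replicaof loop forms. */
bool run_scenario(bool bail_on_stale) {
    Node a = {true, 1, NULL};  /* T0: A is primary */
    Node b = {false, 1, &a};   /* T0: B is a replica of A */
    Msg stale_pong = {&b, &a, 1};  /* T1: B's PONG, delayed in flight */
    Msg ping_from_a = {&a, &b, 2}; /* T4: A is now a replica of B */

    process_msg(&ping_from_a, bail_on_stale); /* T4 */
    process_msg(&stale_pong, bail_on_stale);  /* T5 */
    return has_replicaof_loop(&a);
}
```

In this model, `run_scenario(false)` returns `true` (`A` and `B` end up pointing at each other), while `run_scenario(true)` returns `false` because the stale `PONG` is dropped before it can touch `B`'s `replicaof`.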

BTW, we have another undetected stale message issue (#798)

_Originally posted by @PingXie in https://github.com/valkey-io/valkey/issues/573#issuecomment-2319617485_
            
