Skip to content

Stale PONG message causes incorrect replicaof updates leading to replicaof loops #1015

Description

@PingXie

I have a theory about how this could happen.

  1. We had a stale PONG message issue, which was fixed in commit 28976a9
    if (sender->configEpoch > sender_claimed_config_epoch) {
  2. However we didn't bail after detecting this stale message. We proceed to
    if (sender_claimed_primary && sender->replicaof != sender_claimed_primary) {
  3. And then update sender's replicaof based on the stale message at:
    sender->replicaof = sender_claimed_primary;

Now, imagine the following scenario

[T0] Three nodes: primary A with replica B, and an observer node N
[T1] B's PONG message to N claiming A is its primary gets stuck somewhere on the way to N
[T2] B becomes primary after a manual failover and then notifies A (and N but that message will get stuck behind the PONG message at T1)
[T3] A becomes a replica of B
[T4] A, now a replica of B, sends PING to N, which goes through the following steps that end up "promote" B to a primary, indirectly

  1. if (sender) {
  2. if (sender_last_reported_as_primary) {
  3. if (sender_claimed_primary && areInSameShard(sender_claimed_primary, sender)) {
  4. clusterSetNodeAsPrimary(sender_claimed_primary);

    and sets A's replicaof to B
  5. if (sender_claimed_primary && sender->replicaof != sender_claimed_primary) {
  6. sender->replicaof = sender_claimed_primary;

    [T5] Finally, B's PONG message to N from [T1] arrives on N and it goes through the following steps
  7. if (sender) {
  8. /* Node is a replica. */

    Due to step 4, B got promoted to primary, implicitly
  9. if (sender_last_reported_as_primary) {

    However the epoch is stale, which is correctly handled
  10. if (sender->configEpoch > sender_claimed_config_epoch) {
  11. "Ignore stale message from %.40s (%s) in shard %.40s;"

    We don't bail but instead continue to
  12. if (sender_claimed_primary && sender->replicaof != sender_claimed_primary) {

    and finally updates B->replicaof to A, completing the loop
  13. sender->replicaof = sender_claimed_primary;

I have seen stale messages in the past and I also notice that the latest failure in the codecov run, which could alter the timing quite a bit so I think this theory is very plausible.

The fix would be to bail immediately after detecting the stale message

"Ignore stale message from %.40s (%s) in shard %.40s;"

BTW, we have another undetected stale message issue (#798)

Originally posted by @PingXie in #573 (comment)

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Fields

    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions