Skip to content

Conversation

@sundb
Copy link
Collaborator

@sundb sundb commented Jul 18, 2024

After #12805
Fix #12761

Issue

  1. If the master is an older version, the node will not send the shard-id extension, it will use a randomly generated shard-id.
  2. If the replica is a newer version but its master is an older version, the node will send the shard-id.
    However, before this PR, we would update its shard-id, leading to a discrepancy between the master and replica nodes' shard-id, causing the replica to fail when loading the cluster node conf on restart.

Solution

If the replica's shard -id is found to be inconsistent with the master's, update it to match the master's shard-id

@sundb sundb marked this pull request as ready for review July 24, 2024 06:49
Comment on lines +928 to +932
for (int i = 0; i < clusterNodeNumSlaves(node); i++) {
clusterNode *slavenode = clusterNodeGetSlave(node, i);
if (memcmp(slavenode->shard_id, shard_id, CLUSTER_NAMELEN) != 0)
assignShardIdToNode(slavenode, shard_id, CLUSTER_TODO_SAVE_CONFIG|CLUSTER_TODO_FSYNC_CONFIG);
}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although all masters and replicas can synchronize shardid eventually even without this update (sync all replicas' shardid when the master's shardid changes), if the process is shut down in the middle of this process, the cluster config may still be corrupted and fail to start.

@stevelipinski
Copy link
Contributor

Consider #13468 to cover some other edge cases

@sundb
Copy link
Collaborator Author

sundb commented Aug 19, 2024

close this due to dup with #13468

@sundb sundb closed this Aug 19, 2024
sundb added a commit that referenced this pull request Sep 12, 2024
PR #13428 doesn't fully resolve an issue where corruption errors can
still occur on loading of cluster.nodes file - seen on upgrade where
there were no shard_ids (from old Redis), 7.2.5 loading generated new
random ones, and persisted them to the file before gossip/handshake
could propagate the correct ones (or some other nodes unreachable).
This results in a primary/replica having differing shard_id in the
cluster.nodes and then the server cannot startup - reports corruption.

This PR builds on #13428 by simply ignoring the replica's shard_id in
cluster.nodes (if it exists), and uses the replica's primary's shard_id.
Additional handling was necessary to cover the case where the replica
appears before the primary in cluster.nodes, where it will first use a
generated shard_id for the primary, and then correct after it loads the
primary cluster.nodes entry.

---------

Co-authored-by: debing.sun <debing.sun@redis.com>
YaacovHazan pushed a commit that referenced this pull request Oct 30, 2024
PR #13428 doesn't fully resolve an issue where corruption errors can
still occur on loading of cluster.nodes file - seen on upgrade where
there were no shard_ids (from old Redis), 7.2.5 loading generated new
random ones, and persisted them to the file before gossip/handshake
could propagate the correct ones (or some other nodes unreachable).
This results in a primary/replica having differing shard_id in the
cluster.nodes and then the server cannot startup - reports corruption.

This PR builds on #13428 by simply ignoring the replica's shard_id in
cluster.nodes (if it exists), and uses the replica's primary's shard_id.
Additional handling was necessary to cover the case where the replica
appears before the primary in cluster.nodes, where it will first use a
generated shard_id for the primary, and then correct after it loads the
primary cluster.nodes entry.

---------

Co-authored-by: debing.sun <debing.sun@redis.com>
YaacovHazan pushed a commit that referenced this pull request Jan 6, 2025
PR #13428 doesn't fully resolve an issue where corruption errors can
still occur on loading of cluster.nodes file - seen on upgrade where
there were no shard_ids (from old Redis), 7.2.5 loading generated new
random ones, and persisted them to the file before gossip/handshake
could propagate the correct ones (or some other nodes unreachable).
This results in a primary/replica having differing shard_id in the
cluster.nodes and then the server cannot startup - reports corruption.

This PR builds on #13428 by simply ignoring the replica's shard_id in
cluster.nodes (if it exists), and uses the replica's primary's shard_id.
Additional handling was necessary to cover the case where the replica
appears before the primary in cluster.nodes, where it will first use a
generated shard_id for the primary, and then correct after it loads the
primary cluster.nodes entry.

---------

Co-authored-by: debing.sun <debing.sun@redis.com>
YaacovHazan pushed a commit to YaacovHazan/redis that referenced this pull request Apr 22, 2025
…#13468)

PR redis#13428 doesn't fully resolve an issue where corruption errors can
still occur on loading of cluster.nodes file - seen on upgrade where
there were no shard_ids (from old Redis), 7.2.5 loading generated new
random ones, and persisted them to the file before gossip/handshake
could propagate the correct ones (or some other nodes unreachable).
This results in a primary/replica having differing shard_id in the
cluster.nodes and then the server cannot startup - reports corruption.

This PR builds on redis#13428 by simply ignoring the replica's shard_id in
cluster.nodes (if it exists), and uses the replica's primary's shard_id.
Additional handling was necessary to cover the case where the replica
appears before the primary in cluster.nodes, where it will first use a
generated shard_id for the primary, and then correct after it loads the
primary cluster.nodes entry.

---------

Co-authored-by: debing.sun <debing.sun@redis.com>
funny-dog pushed a commit to funny-dog/redis that referenced this pull request Sep 17, 2025
…#13468)

PR redis#13428 doesn't fully resolve an issue where corruption errors can
still occur on loading of cluster.nodes file - seen on upgrade where
there were no shard_ids (from old Redis), 7.2.5 loading generated new
random ones, and persisted them to the file before gossip/handshake
could propagate the correct ones (or some other nodes unreachable).
This results in a primary/replica having differing shard_id in the
cluster.nodes and then the server cannot startup - reports corruption.

This PR builds on redis#13428 by simply ignoring the replica's shard_id in
cluster.nodes (if it exists), and uses the replica's primary's shard_id.
Additional handling was necessary to cover the case where the replica
appears before the primary in cluster.nodes, where it will first use a
generated shard_id for the primary, and then correct after it loads the
primary cluster.nodes entry.

---------

Co-authored-by: debing.sun <debing.sun@redis.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG]"corrupted cluster config file" on redis 7.2.3 error when running redis cluster with mixed 7.0 and 7.2 nodes

2 participants