Use shard-id of the master if the replica is inconsistent with master #13428
Conversation
/* When a shard-id is assigned to a master, propagate it to any replica whose
 * shard-id differs, so the master and its replicas never persist different shard-ids. */
for (int i = 0; i < clusterNodeNumSlaves(node); i++) {
    clusterNode *slavenode = clusterNodeGetSlave(node, i);
    if (memcmp(slavenode->shard_id, shard_id, CLUSTER_NAMELEN) != 0)
        assignShardIdToNode(slavenode, shard_id, CLUSTER_TODO_SAVE_CONFIG|CLUSTER_TODO_FSYNC_CONFIG);
}
Although all masters and replicas would eventually synchronize their shard-ids even without this update (which syncs every replica's shard-id whenever the master's shard-id changes), if the process is shut down in the middle of that synchronization, the cluster config may still be left corrupted and fail to start.
Consider #13468 to cover some other edge cases.
Closing this as a duplicate of #13468.
PR #13428 doesn't fully resolve an issue where corruption errors can still occur when loading the cluster.nodes file. This was seen on upgrade from an old Redis version that had no shard_ids: 7.2.5 generated new random shard_ids on load and persisted them to the file before gossip/handshake could propagate the correct ones (or while some other nodes were unreachable). The result is a primary and replica with differing shard_id values in cluster.nodes, after which the server cannot start up and reports corruption.

This PR builds on #13428 by simply ignoring the replica's shard_id in cluster.nodes (if it exists) and using the replica's primary's shard_id instead. Additional handling was necessary for the case where the replica appears before the primary in cluster.nodes: the replica first uses a generated shard_id for the primary and then corrects it after the primary's cluster.nodes entry is loaded.

Co-authored-by: debing.sun <debing.sun@redis.com>
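The change in the follow-up PR (#13468) is easiest to see as a small loading loop. Below is a minimal, hypothetical sketch, not the actual Redis implementation: names such as load_record, lookup_or_create, and generate_shard_id are assumptions made for illustration. It shows a loader that ignores the shard id persisted for a replica and inherits the primary's instead, and that handles a replica record appearing before its primary by first using a generated shard id and correcting it once the primary's record is read.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAXN 16
#define IDLEN 41

typedef struct Node {
    char name[IDLEN];
    char shard_id[IDLEN];
    struct Node *master;      /* NULL for primaries */
} Node;

static Node table[MAXN];
static int table_len = 0;

/* Hypothetical stand-in for generating a random shard id. */
static void generate_shard_id(char *dst) {
    snprintf(dst, IDLEN, "generated-%d", rand() % 1000);
}

static Node *lookup_or_create(const char *name) {
    for (int i = 0; i < table_len; i++)
        if (strcmp(table[i].name, name) == 0) return &table[i];
    if (table_len == MAXN) exit(1);      /* demo-sized table */
    Node *n = &table[table_len++];
    memset(n, 0, sizeof(*n));
    snprintf(n->name, IDLEN, "%s", name);
    generate_shard_id(n->shard_id);      /* provisional until its own record is read */
    return n;
}

/* Once a primary's real shard id is known, fix up replicas loaded earlier. */
static void propagate_shard_id(Node *primary) {
    for (int i = 0; i < table_len; i++)
        if (table[i].master == primary)
            memcpy(table[i].shard_id, primary->shard_id, IDLEN);
}

/* Load one record: node name, master name ("-" for primaries), persisted shard id. */
static void load_record(const char *name, const char *master_name, const char *file_shard_id) {
    Node *n = lookup_or_create(name);
    if (strcmp(master_name, "-") == 0) {
        /* Primary: trust the persisted shard id, then correct its replicas. */
        snprintf(n->shard_id, IDLEN, "%s", file_shard_id);
        propagate_shard_id(n);
    } else {
        /* Replica: ignore the persisted shard id and inherit the primary's
         * (which may still be a provisional, generated one). */
        n->master = lookup_or_create(master_name);
        memcpy(n->shard_id, n->master->shard_id, IDLEN);
    }
}

int main(void) {
    /* The replica appears before its primary and carries a stale shard id. */
    load_record("replica-1", "primary-1", "stale-shard-id");
    load_record("primary-1", "-", "shard-A");
    for (int i = 0; i < table_len; i++)
        printf("%s -> %s\n", table[i].name, table[i].shard_id); /* both end up on shard-A */
    return 0;
}

The key point of this approach is that the replica's persisted shard id is never trusted, so a file written mid-propagation can no longer leave a primary and its replica disagreeing at startup.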
After #12805
Fix #12761
Issue
However, before this PR, we would update its shard-id, leaving the master's and replica's shard-ids inconsistent and causing the replica to fail to load the cluster node config on restart.
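For illustration, here is a minimal sketch of the kind of load-time consistency check that turns such a discrepancy into a startup failure; the function name, struct layout, and error message below are assumptions for the example, not the exact Redis code.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define CLUSTER_NAMELEN 40

typedef struct node_rec {
    char shard_id[CLUSTER_NAMELEN];
    struct node_rec *master;   /* NULL if the record is itself a master */
} node_rec;

/* Refuse to start if a replica's persisted shard-id differs from its master's. */
static void verify_replica_shard_id(const node_rec *n) {
    if (n->master &&
        memcmp(n->shard_id, n->master->shard_id, CLUSTER_NAMELEN) != 0) {
        fprintf(stderr, "Unrecoverable error: corrupted cluster config file\n");
        exit(1);
    }
}

int main(void) {
    node_rec master  = {"shard-A", NULL};
    node_rec replica = {"shard-B", &master};
    verify_replica_shard_id(&replica);   /* exits: the shard-ids disagree */
    return 0;
}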
Solution
If the replica's shard-id is found to be inconsistent with the master's, update it to match the master's shard-id.