Use shard-id of the master if the replica is inconsistent with master #13428

sundb · 2024-07-18T10:42:35Z

Issue

If the master is an older version, it does not send the shard-id extension, so a randomly generated shard-id is used for it instead.
If the replica is a newer version but its master is an older version, the replica does send its shard-id.
However, before this PR, we would update the replica's shard-id in that case, leading to a discrepancy between the shard-ids of the master and the replica, which causes the replica to fail when loading the cluster node conf on restart.

Solution

If the replica's shard-id is found to be inconsistent with the master's, update it to match the master's shard-id.
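
As a rough illustration of this rule, here is a minimal, self-contained sketch. The `Node` struct, `NAMELEN`, and `reconcile_replica_shard_id` are hypothetical stand-ins invented for this example; they are not the actual `clusterNode` type or any function in `src/cluster_legacy.c`.

```c
#include <string.h>

#define NAMELEN 40 /* Hypothetical; stands in for CLUSTER_NAMELEN. */

/* Hypothetical, simplified stand-in for a cluster node. */
typedef struct Node {
    char shard_id[NAMELEN];
    struct Node *master; /* NULL when the node is itself a master. */
} Node;

/* If `node` is a replica whose shard-id differs from its master's,
 * overwrite it with the master's shard-id.
 * Returns 1 if the shard-id changed (a caller would then want to
 * persist the config), 0 otherwise. */
int reconcile_replica_shard_id(Node *node) {
    if (node->master == NULL) return 0; /* Masters keep their own id. */
    if (memcmp(node->shard_id, node->master->shard_id, NAMELEN) == 0)
        return 0; /* Already consistent. */
    memcpy(node->shard_id, node->master->shard_id, NAMELEN);
    return 1;
}
```

The return value is only a convenience of this sketch: it lets a caller decide whether the config needs to be saved, which is what the `CLUSTER_TODO_SAVE_CONFIG` flag requests in the real code.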

sundb · 2024-08-08T13:37:01Z

src/cluster_legacy.c

+            for (int i = 0; i < clusterNodeNumSlaves(node); i++) {
+                clusterNode *slavenode = clusterNodeGetSlave(node, i);
+                if (memcmp(slavenode->shard_id, shard_id, CLUSTER_NAMELEN) != 0)
+                    assignShardIdToNode(slavenode, shard_id, CLUSTER_TODO_SAVE_CONFIG|CLUSTER_TODO_FSYNC_CONFIG);
+            }


Even without this update (which syncs all replicas' shard-ids when the master's shard-id changes), all masters and replicas eventually synchronize their shard-ids. However, if the process is shut down in the middle of that synchronization, the cluster config may still be corrupted and the node may fail to start.

stevelipinski · 2024-08-08T21:01:39Z

Consider #13468 to cover some other edge cases

sundb · 2024-08-19T15:01:29Z

Closing this as a duplicate of #13468.

PR #13428 doesn't fully resolve an issue where corruption errors can still occur when loading the cluster.nodes file. This was seen on upgrade: there were no shard_ids (from old Redis), so 7.2.5 generated new random ones on load and persisted them to the file before gossip/handshake could propagate the correct ones (or while some other nodes were unreachable). This results in a primary and its replica having differing shard_ids in cluster.nodes, and then the server cannot start up; it reports corruption.

This PR builds on #13428 by simply ignoring the replica's shard_id in cluster.nodes (if it exists) and using the shard_id of the replica's primary. Additional handling was necessary to cover the case where the replica appears before the primary in cluster.nodes: it first uses a generated shard_id for the primary, and then corrects it after loading the primary's cluster.nodes entry.

Co-authored-by: debing.sun <debing.sun@redis.com>
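
The load-time handling described above, including the case where the replica's entry precedes its primary's, can be sketched as follows. This is a simplified model under stated assumptions, not the code from #13468: the `Table` and `Node` types, `load_entry`, `lookup_or_create`, and the `generated-N` placeholder ids are all hypothetical, and the real cluster.nodes parser works differently.

```c
#include <stdio.h>
#include <string.h>

#define MAX_NODES 16
#define ID_LEN 41

/* Hypothetical, simplified node record. */
typedef struct {
    char name[ID_LEN];
    char shard_id[ID_LEN];
    int master; /* Index of the primary in the table, -1 if none. */
} Node;

/* Hypothetical node table; zero-initialize before use. */
typedef struct {
    Node nodes[MAX_NODES];
    int count;
    int generated; /* Counter used to fake "random" shard-ids. */
} Table;

/* Find a node by name, or create it with a generated placeholder shard-id. */
static int lookup_or_create(Table *t, const char *name) {
    for (int i = 0; i < t->count; i++)
        if (strcmp(t->nodes[i].name, name) == 0) return i;
    if (t->count == MAX_NODES) return -1;
    Node *n = &t->nodes[t->count];
    snprintf(n->name, ID_LEN, "%s", name);
    snprintf(n->shard_id, ID_LEN, "generated-%d", t->generated++);
    n->master = -1;
    return t->count++;
}

/* Load one entry. `master_name` is NULL for a primary; `file_shard_id`
 * is NULL when the file has no shard-id for this node.
 * Returns the node's index, or -1 if the table is full. */
int load_entry(Table *t, const char *name, const char *master_name,
               const char *file_shard_id) {
    int idx = lookup_or_create(t, name);
    if (idx < 0) return -1;
    if (master_name != NULL) {
        /* Replica: ignore file_shard_id and inherit the primary's id,
         * which may still be a generated placeholder at this point. */
        int m = lookup_or_create(t, master_name);
        if (m < 0) return -1;
        t->nodes[idx].master = m;
        memcpy(t->nodes[idx].shard_id, t->nodes[m].shard_id, ID_LEN);
    } else if (file_shard_id != NULL) {
        /* Primary: take the id from the file, then correct any replicas
         * that were loaded earlier with the placeholder id. */
        snprintf(t->nodes[idx].shard_id, ID_LEN, "%s", file_shard_id);
        for (int i = 0; i < t->count; i++)
            if (t->nodes[i].master == idx)
                memcpy(t->nodes[i].shard_id, t->nodes[idx].shard_id, ID_LEN);
    }
    return idx;
}
```

In this model, loading a replica entry with a stale shard-id before its primary's entry still ends with both nodes sharing the primary's shard-id from the file, so the primary and replica can never disagree once loading finishes.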

sundb added 5 commits July 18, 2024 18:33

Use shard-id of the master if the replica is inconsistent with master

1fc3565

Update the shards of master or replicas when inconsistent

78fb033

Update comment

258dd82

Format

fe5cd2c

Format

d0f2eb6

sundb marked this pull request as ready for review July 24, 2024 06:49

Merge branch 'unstable' into cluster_shard_id_inconsistent

15e4b6c

ihussainbadshah mentioned this pull request Jul 31, 2024

[BUG] Unrecoverable error: corrupted cluster config file "5270a2453e7db28eee53f976faca81306e649b19... #13456

Closed

sundb added 2 commits August 5, 2024 18:04

Rename assignShardToNode to assignShardIdToNode

840950c

Updating myself is already handled above

aa976d3

sundb commented Aug 8, 2024

stevelipinski mentioned this pull request Aug 8, 2024

Avoid cluster.nodes load corruption due to shard-id generation #13468

Merged

sundb mentioned this pull request Aug 19, 2024

Use shard-id of the master if the replica does not support shard-id #12805

Merged

sundb closed this Aug 19, 2024
