Skip to content

Conversation

@jdork0
Copy link
Contributor

@jdork0 jdork0 commented Mar 24, 2025

Close #13868

This bug was introduced by #13468

Issue

To maintain compatibility with older versions that do not support shardid, when a replica passes a shardid, we also update the master’s shardid accordingly.

However, when both the master and replica support shardid, an issue arises: in one moment, the master may pass a shardid, causing us to update both the master and all its replicas to match the master’s shardid. But if the replica later passes a different shardid, we would then update the master’s shardid again, leading to continuous changes in shardid.

Solution

Regardless of the situation, we always ensure that the replica’s shardid remains consistent with the master’s shardid.

@sundb
Copy link
Collaborator

sundb commented Mar 26, 2025

@jdork0 It seems that this fix is not correct, there are two scenarios:

  1. master supports shardid but replica doesn't.
    We have handled this situation correctly. In any case, we need to update the shardid of all replicas under this Master
  2. master doesn't support shardid but replica does.
    This scene is actually being fixed by this PR,
    If the replica supports shardid, the replica will give a shardid different from its master.
    If we update it, its shardid and master will be different.
    Therefore, I think we should verify whether the incoming shardid is consistent with the master at any time, and we will update it only if it is consistent.
        if (node->slaveof == NULL) {
            assignShardIdToNode(node, shard_id, CLUSTER_TODO_SAVE_CONFIG);
            for (int i = 0; i < clusterNodeNumSlaves(node); i++) {
                clusterNode *slavenode = clusterNodeGetSlave(node, i);
                if (memcmp(slavenode->shard_id, shard_id, CLUSTER_NAMELEN) != 0)
                    assignShardIdToNode(slavenode, shard_id, CLUSTER_TODO_SAVE_CONFIG|CLUSTER_TODO_FSYNC_CONFIG);
            }
        } else if (memcmp(node->slaveof->shard_id, shard_id, CLUSTER_NAMELEN) == 0) {
            assignShardIdToNode(node, shard_id, CLUSTER_TODO_SAVE_CONFIG);
        }

@jdork0
Copy link
Contributor Author

jdork0 commented Mar 26, 2025

@sundb, I think that looks ok. I'll do some testing and update the PR.

@jdork0
Copy link
Contributor Author

jdork0 commented Mar 26, 2025

@sundb, updated. I didn't run into any problems in my testing with the proposed change.

Comment on lines 917 to 922
* 1. Master supports but the replica does not.
* We might first update the replica's shard-id to the master's randomly
* generated shard-id. Then, when the master's shard-id arrives, we must
* also update all its replicas.
* All replicas under master will receive master shard-id. When coming
* from a release with no shard-id the master shard-id will be randomly
* generated.
* 2. If the master does not support but the replica does.
* We also need to synchronize the master's shard-id with the replica.
* 3. If neither of master and replica supports it.
* The master will have a randomly generated shard-id and will update
* the replica to match the master's shard-id. */
* Only update replica shard-id after master has provided one. */
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about

        /* We always make our best effort to keep the shard-id consistent
         * between the master and its replicas:
         *
         * 1. When updating the master's shard-id, we simultaneously update the
         *    shard-id of all its replicas to ensure consistency.
         * 2. When updating replica's shard-id, if it differs from its master's shard-id,
         *    we discard this replica's shard-id and continue using master's shard-id.
         *    This applies even if the master does not support shard-id, in which
         *    case we rely on the master's randomly generated shard-id. */

@jdork0
Copy link
Contributor Author

jdork0 commented Mar 27, 2025

@sundb, is there any timeframe where we could expect this fix to make it back into 7.2?

@sundb sundb merged commit aa8e2d1 into redis:unstable Mar 30, 2025
18 checks passed
@github-project-automation github-project-automation bot moved this from Todo to Done in Redis 8.2 Mar 30, 2025
YaacovHazan pushed a commit to YaacovHazan/redis that referenced this pull request Apr 22, 2025
Close redis#13868

This bug was introduced by redis#13468

## Issue
To maintain compatibility with older versions that do not support
shardid, when a replica passes a shardid, we also update the master’s
shardid accordingly.

However, when both the master and replica support shardid, an issue
arises: in one moment, the master may pass a shardid, causing us to
update both the master and all its replicas to match the master’s
shardid. But if the replica later passes a different shardid, we would
then update the master’s shardid again, leading to continuous changes in
shardid.

## Solution
Regardless of the situation, we always ensure that the replica’s shardid
remains consistent with the master’s shardid.
YaacovHazan pushed a commit to YaacovHazan/redis that referenced this pull request Apr 22, 2025
Close redis#13868

This bug was introduced by redis#13468

To maintain compatibility with older versions that do not support
shardid, when a replica passes a shardid, we also update the master’s
shardid accordingly.

However, when both the master and replica support shardid, an issue
arises: in one moment, the master may pass a shardid, causing us to
update both the master and all its replicas to match the master’s
shardid. But if the replica later passes a different shardid, we would
then update the master’s shardid again, leading to continuous changes in
shardid.

Regardless of the situation, we always ensure that the replica’s shardid
remains consistent with the master’s shardid.
@YaacovHazan YaacovHazan moved this from Todo to Done in Redis 7.4 Backport Apr 29, 2025
@YaacovHazan YaacovHazan moved this from Todo to Done in Redis 7.2 Backport Apr 29, 2025
@sundb sundb removed this from Redis 8.2 Aug 15, 2025
@sundb sundb added this to Redis 8.0 Aug 15, 2025
@sundb sundb moved this from Todo to Done in Redis 8.0 Aug 15, 2025
funny-dog pushed a commit to funny-dog/redis that referenced this pull request Sep 17, 2025
Close redis#13868

This bug was introduced by redis#13468

## Issue
To maintain compatibility with older versions that do not support
shardid, when a replica passes a shardid, we also update the master’s
shardid accordingly.

However, when both the master and replica support shardid, an issue
arises: in one moment, the master may pass a shardid, causing us to
update both the master and all its replicas to match the master’s
shardid. But if the replica later passes a different shardid, we would
then update the master’s shardid again, leading to continuous changes in
shardid.

## Solution
Regardless of the situation, we always ensure that the replica’s shardid
remains consistent with the master’s shardid.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

release-notes indication that this issue needs to be mentioned in the release notes

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

[BUG] cluster nodes file never stabilizes after cluster creation in 7.2.7

3 participants