
Conversation

@enjoy-binbin
Contributor

@enjoy-binbin enjoy-binbin commented Sep 23, 2023

In #10536, we introduced the assert. Some older versions of the server
(like 7.0) don't gossip shard_id, so we will not add the node to
cluster->shards, and node->shard_id is filled in randomly and may not
be found here.

As a result, if we add a 7.2 node to a 7.0 cluster and allocate slots
to the 7.2 node, the 7.2 node will crash when it hits this assert,
similar to #12538.

In this PR, we remove the assert and the search, and just call
removeChannelsInSlot, since it returns early if there is no active
subscription for a given slot.

Fixes #12603.
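
For reference, a rough before/after sketch of the slot-deletion path, reconstructed from the description above; the clusterGetNodesInMyShard/listSearchKey lookup and the surrounding code are shown from memory and may not match the tree exactly:

/* Before (sketch): the lookup assumed the slot owner's shard is always
 * known, which is not true when 7.0 peers never gossiped a shard_id. */
list *nodes_for_slot = clusterGetNodesInMyShard(n);
serverAssert(nodes_for_slot != NULL);   /* crashes in a mixed 7.0/7.2 cluster */
listNode *ln = listSearchKey(nodes_for_slot, myself);
if (ln != NULL) removeChannelsInSlot(slot);

/* After (sketch): drop the assert and the search; removeChannelsInSlot()
 * simply returns when the slot has no active shard-channel subscriptions. */
removeChannelsInSlot(slot);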

Release notes
Fix crash when running rebalance command in a mixed cluster of 7.0 and 7.2

@enjoy-binbin
Contributor Author

@PingXie please also take a look, thanks

@madolson madolson added the release-notes label Oct 12, 2023
@madolson madolson merged commit e5ef161 into redis:unstable Oct 12, 2023
@enjoy-binbin enjoy-binbin deleted the fix_crash_cluster branch October 12, 2023 06:04
@enjoy-binbin
Contributor Author

@madolson @PingXie after this was merged, I found another problem. In this case, the 7.2 node's nodes.conf will report an Unrecoverable error:

8301:M 12 Oct 2023 15:41:42.559 # Unrecoverable error: corrupted cluster config file "0f5d7138f025c12ff3db57a7844f8d53e34ef5d6 127.0.0.1:30002@40002,,tls-port=0,shard-id=0158d144b77f7f6b9b150c38ad53794647100d83 master - 0 1697092244000 8 connected 2731-10922
".

since the old server did not gossip the shard id, and we have these checks when we are loading nodes.conf:

first:
            else if (clusterGetNodesInMyShard(master) != NULL &&
                     memcmp(master->shard_id, n->shard_id, CLUSTER_NAMELEN) != 0)
            {
                /* If the primary has been added to a shard, make sure this
                 * node has the same persisted shard id as the primary. */
                goto fmterr;
            }

second:
int auxShardIdSetter(clusterNode *n, void *value, int length) {
    if (verifyClusterNodeId(value, length) == C_ERR) {
        return C_ERR;
    }
    memcpy(n->shard_id, value, CLUSTER_NAMELEN);
    /* if n already has replicas, make sure they all agree
     * on the shard id */
    for (int i = 0; i < n->numslaves; i++) {
        if (memcmp(n->slaves[i]->shard_id, n->shard_id, CLUSTER_NAMELEN) != 0) {
            return C_ERR;
        }
    }
    clusterAddNodeToShard(value, n);
    return C_OK;
}

Any ideas on this?

@salarali

Does this mean there are still issues with migration? Or is this issue unrelated?

@jdork0
Contributor

jdork0 commented Oct 16, 2023

Is there an issue open tracking the corrupted cluster config file problem that I can follow?

@enjoy-binbin
Contributor Author

Please don't worry, we are dealing with it.

@oranagra oranagra mentioned this pull request Oct 17, 2023
oranagra pushed a commit that referenced this pull request Oct 18, 2023
…d 7.2 (#12604)

In #10536, we introduced the assert. Some older versions of the server
(like 7.0) don't gossip shard_id, so we will not add the node to
cluster->shards, and node->shard_id is filled in randomly and may not
be found here.

As a result, if we add a 7.2 node to a 7.0 cluster and allocate slots
to the 7.2 node, the 7.2 node will crash when it hits this assert,
similar to #12538.

In this PR, we remove the assert and replace it with an unconditional removal.

(cherry picked from commit e5ef161)
@oranagra
Member

oranagra commented Nov 1, 2023

@enjoy-binbin what's the status here?
this #12604 (comment) suggests that there's still some issue.
@madolson FYI

@enjoy-binbin
Contributor Author

I'm not quite sure what the correct fix is and am waiting for @madolson's opinion.
I can take a look again.

@hpatro
Contributor

hpatro commented Nov 21, 2023

@PingXie Would you be able to take a look at this issue?

@enjoy-binbin
Contributor Author

enjoy-binbin commented Nov 23, 2023

OK, I think I've finally thought of a way. If a replica node does not send the shard-id, it means it does not support it, so we set its shard-id to the shard-id of its master node.

#12805
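
A minimal sketch of that idea (the helper name inheritShardIdFromPrimary is made up, and the calls to clusterRemoveNodeFromShard/clusterDoBeforeSleep assume the usual cluster.c helpers; #12805 may implement it differently):

/* Illustrative only: if a replica never announced a shard-id (e.g. an old
 * 7.0 peer), let it inherit the shard-id of its primary instead of keeping
 * the randomly generated one, so nodes.conf stays consistent. */
static void inheritShardIdFromPrimary(clusterNode *replica) {
    clusterNode *primary = replica->slaveof;
    if (primary == NULL) return;
    if (memcmp(replica->shard_id, primary->shard_id, CLUSTER_NAMELEN) == 0) return;

    clusterRemoveNodeFromShard(replica);                /* leave the random shard */
    memcpy(replica->shard_id, primary->shard_id, CLUSTER_NAMELEN);
    clusterAddNodeToShard(replica->shard_id, replica);  /* join the primary's shard */
    clusterDoBeforeSleep(CLUSTER_TODO_SAVE_CONFIG);     /* persist the corrected id */
}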

@PingXie
Contributor

PingXie commented Nov 26, 2023

Sorry I missed this thread. The fix for the original crash seems fine to me. I will comment on the new fix in its own thread.


Labels

release-notes indication that this issue needs to be mentioned in the release notes

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

[CRASH] While running the rebalancing command in a cluster mixture of 7.0.8 and 7.2.1

7 participants