
Conversation

@enjoy-binbin
Contributor

@enjoy-binbin enjoy-binbin commented Sep 23, 2023

In #10536, we introduced the assert. Some older versions of the server
(like 7.0) don't gossip shard_id, so we will not add the node to
cluster->shards, and node->shard_id is filled in randomly and may not
be found here.

As a result, if we add a 7.2 node to a 7.0 cluster and allocate slots
to the 7.2 node, the 7.2 node will crash when it hits this assert,
similar to #12538.

In this PR, we remove the assert and the search, and just call
removeChannelsInSlot, since it returns early if there is no active
subscription for a given slot.

Fixes #12603.
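
For reference, a rough before/after sketch of the slot-deletion path, reconstructed from the description above; the clusterGetNodesInMyShard/listSearchKey lookup and the surrounding code are shown from memory and may not match the tree exactly:

/* Before (sketch): the lookup assumed the slot owner's shard is always
 * known, which is not true when 7.0 peers never gossiped a shard_id. */
list *nodes_for_slot = clusterGetNodesInMyShard(n);
serverAssert(nodes_for_slot != NULL);   /* crashes in a mixed 7.0/7.2 cluster */
listNode *ln = listSearchKey(nodes_for_slot, myself);
if (ln != NULL) removeChannelsInSlot(slot);

/* After (sketch): drop the assert and the search; removeChannelsInSlot()
 * simply returns when the slot has no active shard-channel subscriptions. */
removeChannelsInSlot(slot);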

Release notes
Fix crash when running rebalance command in a mixed cluster of 7.0 and 7.2

@enjoy-binbin
Contributor Author

@PingXie please also take a look, thanks

@madolson madolson added the release-notes label Oct 12, 2023
@madolson madolson merged commit e5ef161 into redis:unstable Oct 12, 2023
@enjoy-binbin enjoy-binbin deleted the fix_crash_cluster branch October 12, 2023 06:04
@enjoy-binbin
Contributor Author

@madolson @PingXie after this was merged, I found another problem. In this case, the 7.2 node's nodes.conf will report an Unrecoverable error:

8301:M 12 Oct 2023 15:41:42.559 # Unrecoverable error: corrupted cluster config file "0f5d7138f025c12ff3db57a7844f8d53e34ef5d6 127.0.0.1:30002@40002,,tls-port=0,shard-id=0158d144b77f7f6b9b150c38ad53794647100d83 master - 0 1697092244000 8 connected 2731-10922
".

since the old server did not gossip the shard id, and we have these checks when we are loading nodes.conf:

first:
            else if (clusterGetNodesInMyShard(master) != NULL &&
                     memcmp(master->shard_id, n->shard_id, CLUSTER_NAMELEN) != 0)
            {
                /* If the primary has been added to a shard, make sure this
                 * node has the same persisted shard id as the primary. */
                goto fmterr;
            }

second:
int auxShardIdSetter(clusterNode *n, void *value, int length) {
    if (verifyClusterNodeId(value, length) == C_ERR) {
        return C_ERR;
    }
    memcpy(n->shard_id, value, CLUSTER_NAMELEN);
    /* if n already has replicas, make sure they all agree
     * on the shard id */
    for (int i = 0; i < n->numslaves; i++) {
        if (memcmp(n->slaves[i]->shard_id, n->shard_id, CLUSTER_NAMELEN) != 0) {
            return C_ERR;
        }
    }
    clusterAddNodeToShard(value, n);
    return C_OK;
}

Any ideas on this?

@salarali

Does this mean there are still issues with migration? Or is this issue unrelated?

@jdork0
Contributor

jdork0 commented Oct 16, 2023

Is there an issue open tracking the corrupted cluster config file problem that I can follow?

@enjoy-binbin
Contributor Author

Please don't worry, we are dealing with it.

@oranagra oranagra mentioned this pull request Oct 17, 2023
oranagra pushed a commit that referenced this pull request Oct 18, 2023
…d 7.2 (#12604)

In #10536, we introduced the assert. Some older versions of the server
(like 7.0) don't gossip shard_id, so we will not add the node to
cluster->shards, and node->shard_id is filled in randomly and may not
be found here.

As a result, if we add a 7.2 node to a 7.0 cluster and allocate slots
to the 7.2 node, the 7.2 node will crash when it hits this assert,
similar to #12538.

In this PR, we remove the assert and replace it with an unconditional removal.

(cherry picked from commit e5ef161)
@oranagra
Member

oranagra commented Nov 1, 2023

@enjoy-binbin what's the status here?
this #12604 (comment) suggests that there's still some issue.
@madolson FYI

@enjoy-binbin
Contributor Author

I'm not quite sure what the correct fix is and am waiting for @madolson's opinion.
I can take a look again.

@hpatro
Contributor

hpatro commented Nov 21, 2023

@PingXie Would you be able to take a look at this issue?

@enjoy-binbin
Contributor Author

enjoy-binbin commented Nov 23, 2023

OK, I think I've finally thought of a way. If a replica node does not send the shard-id, it means it does not support it, so we set its shard-id to the shard-id of its master node.

#12805
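
A minimal sketch of that idea (the helper name inheritShardIdFromPrimary is made up, and the calls to clusterRemoveNodeFromShard/clusterDoBeforeSleep assume the usual cluster.c helpers; #12805 may implement it differently):

/* Illustrative only: if a replica never announced a shard-id (e.g. an old
 * 7.0 peer), let it inherit the shard-id of its primary instead of keeping
 * the randomly generated one, so nodes.conf stays consistent. */
static void inheritShardIdFromPrimary(clusterNode *replica) {
    clusterNode *primary = replica->slaveof;
    if (primary == NULL) return;
    if (memcmp(replica->shard_id, primary->shard_id, CLUSTER_NAMELEN) == 0) return;

    clusterRemoveNodeFromShard(replica);                /* leave the random shard */
    memcpy(replica->shard_id, primary->shard_id, CLUSTER_NAMELEN);
    clusterAddNodeToShard(replica->shard_id, replica);  /* join the primary's shard */
    clusterDoBeforeSleep(CLUSTER_TODO_SAVE_CONFIG);     /* persist the corrected id */
}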

@PingXie
Contributor

PingXie commented Nov 26, 2023

Sorry I missed this thread. The fix for the original crash seems fine to me. I will comment on the new fix in its own thread.


Labels

release-notes indication that this issue needs to be mentioned in the release notes

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

[CRASH] While running the rebalancing command in a cluster mixture of 7.0.8 and 7.2.1

7 participants