-
Notifications
You must be signed in to change notification settings - Fork 24.4k
Fix crash when running rebalance command in a mixed cluster of 7.0 and 7.2 #12604
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…d 7.2 In redis#10536, we introduced the assert, some older versions of servers (like 7.0) doesn't gossip shard_id, so we will not add the node to cluster->shards, and node->shard_id is filled in randomly and may not be found here. It causes that if we add a 7.2 node to a 7.0 cluster and allocate slots to the 7.2 node, the 7.2 node will crash when it hits this assert. Somehow like redis#12538. In this PR, we remove the assert and replace it with plain if. Fixes redis#12603.
|
@PingXie please also take a look, thanks |
|
@madolson @PingXie after this was merged, i found another problem. In this case, the node (7.2) 's nodes.conf will report Unrecoverable error: since the old server did not gossip the shard id, and we have these check when we are loading a node.conf: any ideas on this? |
|
Does this mean, there are still issues with migration? Or is this issue unrelated? |
|
Is there an issue open tracking the corrupted cluster config file problem that I can follow? |
|
please don't worry, we are dealing with it. |
…d 7.2 (#12604) In #10536, we introduced the assert, some older versions of servers (like 7.0) doesn't gossip shard_id, so we will not add the node to cluster->shards, and node->shard_id is filled in randomly and may not be found here. It causes that if we add a 7.2 node to a 7.0 cluster and allocate slots to the 7.2 node, the 7.2 node will crash when it hits this assert. Somehow like #12538. In this PR, we remove the assert and replace it with an unconditional removal. (cherry picked from commit e5ef161)
|
@enjoy-binbin what's the status here? |
|
I'm not quite sure what the correct fix is and am waiting for @madolson opinion. |
|
@PingXie Would you be able to take a look at this issue? |
|
ok, i think i finally have thought of a way. If the replica nodes does not send the shard-id, it means they do not support it, so set their shard-id to the shard-id of the master node. |
|
Sorry I missed this thread. The fix for the original crash seems fine to me. I will comment on the new fix in its own thread. |
In #10536, we introduced the assert, some older versions of servers
(like 7.0) doesn't gossip shard_id, so we will not add the node to
cluster->shards, and node->shard_id is filled in randomly and may not
be found here.
It causes that if we add a 7.2 node to a 7.0 cluster and allocate slots
to the 7.2 node, the 7.2 node will crash when it hits this assert. Somehow
like #12538.
In this PR, we just remove the assert and the search, and just using
removeChannelsInSlot since it will do an return if there are no active
subscription for a given slot.
Fixes #12603.