-
Notifications
You must be signed in to change notification settings - Fork 24.4k
Use shard-id of the master if the replica does not support shard-id #12805
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use shard-id of the master if the replica does not support shard-id #12805
Conversation
If there are nodes in the cluster that do not support shard-id, they will gossip shard-id. From the perspective of nodes that support shard-id, their shard-id is meaningless (since shard-id is randomly generated when we create a node.) Nodes that support shard-id will save the shard-id information in nodes.conf. If the node is restarted according to nodes.conf, the server will report a `corrupted cluster config file` error. Because auxShardIdSetter will reject configurations with inconsistent master-replica shard-ids. In this PR, when process the gossip, if sender is a replica and does not support shard-id, set the shard_id to the shard_id of its master. This fix redis#12761.
|
This may not be perfect, since each 7.2 node may see a different shard-id, |
This fix appears to be effective. A cluster-wide consensus for the node's shard_id is not necessary. The key is maintaining consistency of the shard_id on each individual 7.2 node. As the cluster progressively upgrades to version 7.2, we can expect the shard_ids across all nodes to naturally converge and align. Can you consider adding a unit test to validate the fix? I think we could add a new |
PingXie
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Binbin!
zuiderkwast
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this bug fail rolling upgrade to 7.2 for every cluster? In that case, it's a serious bug. Let's merge and backport to 7.2.
I think there should be no crash during the process, it is just the nodes.conf saved during the process cannot be loaded by 7.2 nodes. |
This causes no crash when running mixed version clusters. But this bug corrupts nodes.conf, causing a crash when trying to restart any 7.2 node in such clusters. IMHO this also causes serious problems when doing rolling updates, especially for large clusters. |
I think a unit test would be helpful, but we have inadequate testing between versions today which is a bigger gap. I think we can address that separately, just to make sure this fix goes out. |
…12805) If there are nodes in the cluster that do not support shard-id, they will gossip shard-id. From the perspective of nodes that support shard-id, their shard-id is meaningless (since shard-id is randomly generated when we create a node.) Nodes that support shard-id will save the shard-id information in nodes.conf. If the node is restarted according to nodes.conf, the server will report a corrupted cluster config file error. Because auxShardIdSetter will reject configurations with inconsistent master-replica shard-ids. A cluster-wide consensus for the node's shard_id is not necessary. The key is maintaining consistency of the shard_id on each individual 7.2 node. As the cluster progressively upgrades to version 7.2, we can expect the shard_ids across all nodes to naturally converge and align. In this PR, when processing the gossip, if sender is a replica and does not support shard-id, set the shard_id to the shard_id of its master. (cherry picked from commit 4cae66f)
…edis#12805) If there are nodes in the cluster that do not support shard-id, they will gossip shard-id. From the perspective of nodes that support shard-id, their shard-id is meaningless (since shard-id is randomly generated when we create a node.) Nodes that support shard-id will save the shard-id information in nodes.conf. If the node is restarted according to nodes.conf, the server will report a corrupted cluster config file error. Because auxShardIdSetter will reject configurations with inconsistent master-replica shard-ids. A cluster-wide consensus for the node's shard_id is not necessary. The key is maintaining consistency of the shard_id on each individual 7.2 node. As the cluster progressively upgrades to version 7.2, we can expect the shard_ids across all nodes to naturally converge and align. In this PR, when processing the gossip, if sender is a replica and does not support shard-id, set the shard_id to the shard_id of its master.
👋 Is there any idea how to fix the issue where 7.2 node crashes after the restart? If so, is there any timeline for the fix? This is complicating the rolling upgrade process significantly |
Manually removing all occurences of shard-id from cluster config file allows the 7.2 node to start. Not ideal, but at least allows manual intervention. When all the nodes in a cluster are running 7.2, the problem goes away. |
|
@zygisa doesn't 7.2.4 solves the problem? |
As far as we've seen this issue only presents itself when running multi version (mix of 7.0.10 and 7.2.4) cluster, after the upgrade to 7.2.4 is completed everything works fine. Removing shard-id from cluster config files helps restore any nodes that crashed as well, as @rraptorr suggested 🙏 However, we ran into another issue during the upgrade process. Not sure if it's related. |
…edis#12805) If there are nodes in the cluster that do not support shard-id, they will gossip shard-id. From the perspective of nodes that support shard-id, their shard-id is meaningless (since shard-id is randomly generated when we create a node.) Nodes that support shard-id will save the shard-id information in nodes.conf. If the node is restarted according to nodes.conf, the server will report a corrupted cluster config file error. Because auxShardIdSetter will reject configurations with inconsistent master-replica shard-ids. A cluster-wide consensus for the node's shard_id is not necessary. The key is maintaining consistency of the shard_id on each individual 7.2 node. As the cluster progressively upgrades to version 7.2, we can expect the shard_ids across all nodes to naturally converge and align. In this PR, when processing the gossip, if sender is a replica and does not support shard-id, set the shard_id to the shard_id of its master.
If there are nodes in the cluster that do not support shard-id, they
will gossip shard-id. From the perspective of nodes that support shard-id,
their shard-id is meaningless (since shard-id is randomly generated when
we create a node.)
Nodes that support shard-id will save the shard-id information in nodes.conf.
If the node is restarted according to nodes.conf, the server will report a
corrupted cluster config fileerror. Because auxShardIdSetter will rejectconfigurations with inconsistent master-replica shard-ids.
A cluster-wide consensus for the node's shard_id is not necessary. The key
is maintaining consistency of the shard_id on each individual 7.2 node.
As the cluster progressively upgrades to version 7.2, we can expect the
shard_ids across all nodes to naturally converge and align.
In this PR, when process the gossip, if sender is a replica and does not
support shard-id, set the shard_id to the shard_id of its master.
This fix #12761.