Use shard-id of the master if the replica does not support shard-id #12805

enjoy-binbin · 2023-11-23T14:24:01Z

If there are nodes in the cluster that do not support shard-id, they
will not gossip a shard-id. From the perspective of nodes that support
shard-id, the shard-id recorded for such a node is meaningless (since
shard-id is randomly generated when we create a node).

Nodes that support shard-id will save the shard-id information in nodes.conf.
If such a node is later restarted from that nodes.conf, the server will report
a corrupted cluster config file error, because auxShardIdSetter rejects
configurations with inconsistent master-replica shard-ids.
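
To make the failure mode concrete, here is a minimal sketch of the kind of consistency check described above. It is not the real `auxShardIdSetter`: the `cfg_node` type, the `NAMELEN` constant and `set_shard_id_checked` are hypothetical names for illustration, and the model assumes the related nodes already carry a shard-id when the check runs.

```c
#include <string.h>

#define NAMELEN 40 /* hypothetical stand-in for the real id length */

/* Hypothetical, simplified node record; not the real clusterNode. */
typedef struct cfg_node {
    char shard_id[NAMELEN];
    struct cfg_node *master;    /* NULL when the node is a master */
    struct cfg_node **replicas; /* replicas of this node, if any */
    int num_replicas;
} cfg_node;

/* Accept `value` as n's shard-id only if it agrees with n's master and
 * with all of n's replicas. Returns 0 on success, -1 on a mismatch
 * (which, in this model, corresponds to rejecting the config file). */
static int set_shard_id_checked(cfg_node *n, const char *value) {
    if (n->master != NULL &&
        memcmp(n->master->shard_id, value, NAMELEN) != 0) return -1;
    for (int i = 0; i < n->num_replicas; i++) {
        if (memcmp(n->replicas[i]->shard_id, value, NAMELEN) != 0) return -1;
    }
    memcpy(n->shard_id, value, NAMELEN);
    return 0;
}
```

In this model, a replica whose saved shard-id was randomly generated (because its real node never gossiped one) will almost never match its master's, so the load fails.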

A cluster-wide consensus for the node's shard_id is not necessary. The key
is maintaining consistency of the shard_id on each individual 7.2 node.
As the cluster progressively upgrades to version 7.2, we can expect the
shard_ids across all nodes to naturally converge and align.

In this PR, when processing the gossip, if the sender is a replica and does
not support shard-id, set its shard_id to the shard_id of its master.

This fixes #12761.
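
The rule can be sketched roughly as follows. This is a simplified model under stated assumptions, not the code in src/cluster_legacy.c: the `node` type, `NAMELEN`, `effective_shard_id` and `update_shard_id` are hypothetical names, and `ext_shardid == NULL` stands for "the sender did not send a shard-id".

```c
#include <stddef.h>
#include <string.h>

#define NAMELEN 40 /* hypothetical stand-in for the real id length */

/* Hypothetical, simplified node record; not the real clusterNode. */
typedef struct node {
    char shard_id[NAMELEN];
    struct node *master; /* NULL when the node is itself a master */
} node;

/* Choose the shard-id to record for `sender`.
 * ext_shardid is the id carried by the packet, or NULL when the sender
 * is an old node that does not support shard-id. */
static const char *effective_shard_id(const node *sender,
                                      const char *ext_shardid) {
    if (ext_shardid != NULL) return ext_shardid;   /* sender supports it */
    if (sender->master != NULL)                    /* old replica */
        return sender->master->shard_id;
    return sender->shard_id; /* old master: keep the locally generated id */
}

static void update_shard_id(node *sender, const char *ext_shardid) {
    const char *id = effective_shard_id(sender, ext_shardid);
    /* Skip the copy when the id is already the sender's own buffer. */
    if (id != sender->shard_id) memcpy(sender->shard_id, id, NAMELEN);
}
```

With this rule the receiving node always stores matching shard-ids for an old master and its old replicas, which is the local consistency the saved nodes.conf needs.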

Release notes
Fix a crash related to rebalancing clusters when running nodes on both 7.0 and 7.2.

enjoy-binbin · 2023-11-23T14:25:54Z

This may not be perfect, since each 7.2 node may see a different shard-id,
but I can't think of other ways for the time being.

src/cluster_legacy.c

PingXie · 2023-11-26T21:43:58Z

> This may not be perfect, since each 7.2 node may see a different shard-id, but I can't think of other ways for the time being.

This fix appears to be effective. A cluster-wide consensus for the node's shard_id is not necessary. The key is maintaining consistency of the shard_id on each individual 7.2 node. As the cluster progressively upgrades to version 7.2, we can expect the shard_ids across all nodes to naturally converge and align.

Can you consider adding a unit test to validate the fix? I think we could add a new DEBUG subcommand to emulate the 7.0 behavior.

enjoy-binbin · 2023-11-27T02:54:49Z

@PingXie thanks for the review and the text. I added the text to the top comment since it is a great summary. I'll try and see how to write the test. It would be great if we make #10214 happen.

src/cluster_legacy.c

PingXie

Thanks Binbin!

src/cluster_legacy.c

hpatro · 2023-12-11T23:09:20Z

> @PingXie thanks for the review and the text. I added the text to the top comment since it is a great summary. I'll try and see how to write the test. It would be great if we make #10214 happen.

@madolson mentioned she was experimenting with cross-version testing functionality. Ping.

zuiderkwast

Does this bug fail rolling upgrade to 7.2 for every cluster? In that case, it's a serious bug. Let's merge and backport to 7.2.

enjoy-binbin · 2023-12-15T10:50:22Z

> Does this bug fail rolling upgrade to 7.2 for every cluster? In that case, it's a serious bug. Let's merge and backport to 7.2.

I think there should be no crash during the process, it is just the nodes.conf saved during the process cannot be loaded by 7.2 nodes.

jdziemidowicz · 2023-12-21T10:45:17Z

> Does this bug fail rolling upgrade to 7.2 for every cluster? In that case, it's a serious bug. Let's merge and backport to 7.2.

This causes no crash when running mixed version clusters. But this bug corrupts nodes.conf, causing a crash when trying to restart any 7.2 node in such clusters. IMHO this also causes serious problems when doing rolling updates, especially for large clusters.

madolson · 2024-01-07T04:26:44Z

> Can you consider adding a unit test to validate the fix? I think we could add a new DEBUG subcommand to emulate the 7.0 behavior.

I think a unit test would be helpful, but we have inadequate testing between versions today which is a bigger gap. I think we can address that separately, just to make sure this fix goes out.

zygisa · 2024-05-28T11:48:19Z

> > Does this bug fail rolling upgrade to 7.2 for every cluster? In that case, it's a serious bug. Let's merge and backport to 7.2.
>
> This causes no crash when running mixed version clusters. But this bug corrupts nodes.conf, causing a crash when trying to restart any 7.2 node in such clusters. IMHO this also causes serious problems when doing rolling updates, especially for large clusters.

👋 Is there any idea how to fix the issue where a 7.2 node crashes after the restart? If so, is there any timeline for the fix? This is complicating the rolling upgrade process significantly.

jdziemidowicz · 2024-05-29T20:24:06Z

> 👋 Is there any idea how to fix the issue where a 7.2 node crashes after the restart? If so, is there any timeline for the fix? This is complicating the rolling upgrade process significantly.

Manually removing all occurrences of shard-id from the cluster config file allows the 7.2 node to start. Not ideal, but at least it allows manual intervention. When all the nodes in a cluster are running 7.2, the problem goes away.
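
For anyone scripting that workaround, here is a rough sketch of the string edit involved. It assumes the shard-id appears in each node line as a `,shard-id=<value>` fragment whose value ends at the next comma or whitespace; that layout and the `strip_shard_id` helper are assumptions for illustration, so check them against your own nodes.conf, stop the node first, and keep a backup of the file.

```c
#include <string.h>

/* Remove every ",shard-id=<value>" fragment from one config line, in place.
 * Assumption: the value runs until the next ',' or whitespace (or the end
 * of the string). The line must be a writable, NUL-terminated buffer. */
static void strip_shard_id(char *line) {
    static const char key[] = ",shard-id=";
    char *hit;
    while ((hit = strstr(line, key)) != NULL) {
        char *end = hit + sizeof(key) - 1; /* first character of the value */
        while (*end != '\0' && *end != ',' && *end != ' ' &&
               *end != '\t' && *end != '\n') {
            end++;
        }
        /* Close the gap, including the terminating NUL. */
        memmove(hit, end, strlen(end) + 1);
    }
}
```

Applied to each line of the file, this leaves every other field untouched; lines without a shard-id fragment are returned unchanged.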

oranagra · 2024-05-30T11:17:30Z

@zygisa doesn't 7.2.4 solve the problem?

zygisa · 2024-05-30T14:24:53Z

> @zygisa doesn't 7.2.4 solve the problem?

As far as we've seen, this issue only presents itself when running a mixed-version cluster (a mix of 7.0.10 and 7.2.4); after the upgrade to 7.2.4 is completed, everything works fine. Removing shard-id from the cluster config files also helps restore any nodes that crashed, as @rraptorr suggested 🙏

However, we ran into another issue during the upgrade process. Not sure if it's related.

jdziemidowicz · 2024-05-30T20:38:19Z

> @zygisa doesn't 7.2.4 solve the problem?

It was supposed to but it doesn't. 7.2.5 also has this problem. It seems #12761 is still unfixed.

sundb · 2024-08-19T14:50:33Z

> > @zygisa doesn't 7.2.4 solve the problem?
>
> It was supposed to but it doesn't. 7.2.5 also has this problem. It seems #12761 is still unfixed.

This will be fixed by #13468.

enjoy-binbin requested review from hpatro and madolson November 23, 2023 14:24

enjoy-binbin mentioned this pull request Nov 23, 2023

Fix crash when running rebalance command in a mixed cluster of 7.0 and 7.2 #12604

Merged

handle more n->slaveof

a34c012

PingXie reviewed Nov 26, 2023

enjoy-binbin mentioned this pull request Nov 27, 2023

[CRASH] Assertion Failed when running rebalance command when upgrading from 7.0.11 to 7.2.2 #12695

Closed

code review from PingXie

7531c2a

PingXie reviewed Nov 27, 2023

clusterNodeGetSlaveofMaster -> clusterNodeGetMaster

9804a64

PingXie approved these changes Nov 27, 2023

enjoy-binbin requested a review from zuiderkwast December 11, 2023 16:15

hpatro reviewed Dec 11, 2023

zuiderkwast approved these changes Dec 14, 2023

madolson approved these changes Jan 7, 2024

madolson merged commit 4cae66f into redis:unstable Jan 7, 2024

madolson added the release-notes label (indication that this issue needs to be mentioned in the release notes) Jan 7, 2024

enjoy-binbin deleted the fix_shardid_with_old_version branch January 8, 2024 01:49

oranagra mentioned this pull request Jan 9, 2024

Release 7.2.4 #12923

Merged

sundb mentioned this pull request Jul 18, 2024

Use shard-id of the master if the replica is inconsistent with master #13428

Closed

sundb added this to Redis 7.4 Aug 15, 2025

github-project-automation bot moved this to Todo in Redis 7.4 Aug 15, 2025

sundb removed this from Redis 8.2 Aug 15, 2025

sundb moved this from Todo to Done in Redis 7.4 Aug 15, 2025
