Skip to content

Conversation

@enjoy-binbin
Copy link
Contributor

@enjoy-binbin enjoy-binbin commented Nov 23, 2023

If there are nodes in the cluster that do not support shard-id, they
will gossip shard-id. From the perspective of nodes that support shard-id,
their shard-id is meaningless (since shard-id is randomly generated when
we create a node.)

Nodes that support shard-id will save the shard-id information in nodes.conf.
If the node is restarted according to nodes.conf, the server will report a
corrupted cluster config file error. Because auxShardIdSetter will reject
configurations with inconsistent master-replica shard-ids.

A cluster-wide consensus for the node's shard_id is not necessary. The key
is maintaining consistency of the shard_id on each individual 7.2 node.
As the cluster progressively upgrades to version 7.2, we can expect the
shard_ids across all nodes to naturally converge and align.

In this PR, when process the gossip, if sender is a replica and does not
support shard-id, set the shard_id to the shard_id of its master.

This fix #12761.

Release notes
Fix a crash related to rebalancing clusters when running nodes on both 7.0 and 7.2.

If there are nodes in the cluster that do not support shard-id, they
will gossip shard-id. From the perspective of nodes that support shard-id,
their shard-id is meaningless (since shard-id is randomly generated when
we create a node.)

Nodes that support shard-id will save the shard-id information in nodes.conf.
If the node is restarted according to nodes.conf, the server will report a
`corrupted cluster config file` error. Because auxShardIdSetter will reject
configurations with inconsistent master-replica shard-ids.

In this PR, when process the gossip, if sender is a replica and does not
support shard-id, set the shard_id to the shard_id of its master.

This fix redis#12761.
@enjoy-binbin
Copy link
Contributor Author

This may not be perfect, since each 7.2 node may see a different shard-id,
but i can't think of other ways for the time being.

@PingXie
Copy link
Contributor

PingXie commented Nov 26, 2023

This may not be perfect, since each 7.2 node may see a different shard-id, but i can't think of other ways for the time being.

This fix appears to be effective. A cluster-wide consensus for the node's shard_id is not necessary. The key is maintaining consistency of the shard_id on each individual 7.2 node. As the cluster progressively upgrades to version 7.2, we can expect the shard_ids across all nodes to naturally converge and align.

Can you consider adding a unit test to validate the fix? I think we could add a new DEBUG subcommand to emulate the 7.0 behavior.

@enjoy-binbin
Copy link
Contributor Author

@PingXie thanks for the review, and the text, i added the text to the top comment since it is a great summary. I'll try and see how to write the test. It would be great if we make #10214 happen.

Copy link
Contributor

@PingXie PingXie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Binbin!

@hpatro
Copy link
Contributor

hpatro commented Dec 11, 2023

@PingXie thanks for the review, and the text, i added the text to the top comment since it is a great summary. I'll try and see how to write the test. It would be great if we make #10214 happen.

@madolson mentioned she was experimenting cross version testing functionality. Ping.

Copy link
Contributor

@zuiderkwast zuiderkwast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this bug fail rolling upgrade to 7.2 for every cluster? In that case, it's a serious bug. Let's merge and backport to 7.2.

@enjoy-binbin
Copy link
Contributor Author

Does this bug fail rolling upgrade to 7.2 for every cluster? In that case, it's a serious bug. Let's merge and backport to 7.2.

I think there should be no crash during the process, it is just the nodes.conf saved during the process cannot be loaded by 7.2 nodes.

@jdziemidowicz
Copy link

Does this bug fail rolling upgrade to 7.2 for every cluster? In that case, it's a serious bug. Let's merge and backport to 7.2.

This causes no crash when running mixed version clusters. But this bug corrupts nodes.conf, causing a crash when trying to restart any 7.2 node in such clusters. IMHO this also causes serious problems when doing rolling updates, especially for large clusters.

@madolson madolson merged commit 4cae66f into redis:unstable Jan 7, 2024
@madolson madolson added the release-notes indication that this issue needs to be mentioned in the release notes label Jan 7, 2024
@madolson
Copy link
Contributor

madolson commented Jan 7, 2024

Can you consider adding a unit test to validate the fix? I think we could add a new DEBUG subcommand to emulate the 7.0 behavior.

I think a unit test would be helpful, but we have inadequate testing between versions today which is a bigger gap. I think we can address that separately, just to make sure this fix goes out.

@enjoy-binbin enjoy-binbin deleted the fix_shardid_with_old_version branch January 8, 2024 01:49
@oranagra oranagra mentioned this pull request Jan 9, 2024
oranagra pushed a commit that referenced this pull request Jan 9, 2024
…12805)

If there are nodes in the cluster that do not support shard-id, they
will gossip shard-id. From the perspective of nodes that support shard-id,
their shard-id is meaningless (since shard-id is randomly generated when
we create a node.)

Nodes that support shard-id will save the shard-id information in nodes.conf.
If the node is restarted according to nodes.conf, the server will report a
corrupted cluster config file error. Because auxShardIdSetter will reject
configurations with inconsistent master-replica shard-ids.

A cluster-wide consensus for the node's shard_id is not necessary. The key
is maintaining consistency of the shard_id on each individual 7.2 node.
As the cluster progressively upgrades to version 7.2, we can expect the
shard_ids across all nodes to naturally converge and align.

In this PR, when processing the gossip, if sender is a replica and does not
support shard-id, set the shard_id to the shard_id of its master.

(cherry picked from commit 4cae66f)
roggervalf pushed a commit to roggervalf/redis that referenced this pull request Feb 11, 2024
…edis#12805)

If there are nodes in the cluster that do not support shard-id, they
will gossip shard-id. From the perspective of nodes that support shard-id,
their shard-id is meaningless (since shard-id is randomly generated when
we create a node.)

Nodes that support shard-id will save the shard-id information in nodes.conf.
If the node is restarted according to nodes.conf, the server will report a
corrupted cluster config file error. Because auxShardIdSetter will reject
configurations with inconsistent master-replica shard-ids.

A cluster-wide consensus for the node's shard_id is not necessary. The key
is maintaining consistency of the shard_id on each individual 7.2 node.
As the cluster progressively upgrades to version 7.2, we can expect the
shard_ids across all nodes to naturally converge and align.

In this PR, when processing the gossip, if sender is a replica and does not
support shard-id, set the shard_id to the shard_id of its master.
@zygisa
Copy link

zygisa commented May 28, 2024

Does this bug fail rolling upgrade to 7.2 for every cluster? In that case, it's a serious bug. Let's merge and backport to 7.2.

This causes no crash when running mixed version clusters. But this bug corrupts nodes.conf, causing a crash when trying to restart any 7.2 node in such clusters. IMHO this also causes serious problems when doing rolling updates, especially for large clusters.

👋 Is there any idea how to fix the issue where 7.2 node crashes after the restart? If so, is there any timeline for the fix? This is complicating the rolling upgrade process significantly

@jdziemidowicz
Copy link

👋 Is there any idea how to fix the issue where 7.2 node crashes after the restart? If so, is there any timeline for the fix? This is complicating the rolling upgrade process significantly

Manually removing all occurences of shard-id from cluster config file allows the 7.2 node to start. Not ideal, but at least allows manual intervention. When all the nodes in a cluster are running 7.2, the problem goes away.

@oranagra
Copy link
Member

@zygisa doesn't 7.2.4 solves the problem?

@zygisa
Copy link

zygisa commented May 30, 2024

@zygisa doesn't 7.2.4 solves the problem?

As far as we've seen this issue only presents itself when running multi version (mix of 7.0.10 and 7.2.4) cluster, after the upgrade to 7.2.4 is completed everything works fine. Removing shard-id from cluster config files helps restore any nodes that crashed as well, as @rraptorr suggested 🙏

However, we ran into another issue during the upgrade process. Not sure if it's related.

@jdziemidowicz
Copy link

@zygisa doesn't 7.2.4 solves the problem?

It was supposed to but it doesn't. 7.2.5 also has this problem. It seems #12761 is still unfixed.

@sundb
Copy link
Collaborator

sundb commented Aug 19, 2024

@zygisa doesn't 7.2.4 solves the problem?

It was supposed to but it doesn't. 7.2.5 also has this problem. It seems #12761 is still unfixed.

this will be fixed by #13468

@sundb sundb added this to Redis 7.4 Aug 15, 2025
@sundb sundb removed this from Redis 8.2 Aug 15, 2025
@sundb sundb moved this from Todo to Done in Redis 7.4 Aug 15, 2025
funny-dog pushed a commit to funny-dog/redis that referenced this pull request Sep 17, 2025
…edis#12805)

If there are nodes in the cluster that do not support shard-id, they
will gossip shard-id. From the perspective of nodes that support shard-id,
their shard-id is meaningless (since shard-id is randomly generated when
we create a node.)

Nodes that support shard-id will save the shard-id information in nodes.conf.
If the node is restarted according to nodes.conf, the server will report a
corrupted cluster config file error. Because auxShardIdSetter will reject
configurations with inconsistent master-replica shard-ids.

A cluster-wide consensus for the node's shard_id is not necessary. The key
is maintaining consistency of the shard_id on each individual 7.2 node.
As the cluster progressively upgrades to version 7.2, we can expect the
shard_ids across all nodes to naturally converge and align.

In this PR, when processing the gossip, if sender is a replica and does not
support shard-id, set the shard_id to the shard_id of its master.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

release-notes indication that this issue needs to be mentioned in the release notes

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

[BUG]"corrupted cluster config file" on redis 7.2.3 error when running redis cluster with mixed 7.0 and 7.2 nodes

9 participants