Skip to content

Fix PONG message processing for primary-ship tracking during failovers#13055

Merged
madolson merged 14 commits into
redis:unstablefrom
PingXie:issue-13018
Mar 5, 2024
Merged

Fix PONG message processing for primary-ship tracking during failovers#13055
madolson merged 14 commits into
redis:unstablefrom
PingXie:issue-13018

Conversation

@PingXie

@PingXie PingXie commented Feb 16, 2024

Copy link
Copy Markdown
Contributor

This commit updates the processing of PONG gossip messages in the cluster. When a node (B) becomes a replica due to a failover, its PONG messages include its new primary node's (A) information and B's configuration epoch is aligned with A's. This allows observer nodes to identify changes in primary-ship, addressing issues of intermediate states and enhancing cluster state consistency during topology changes.

Fix #13018

Ping Xie added 3 commits February 16, 2024 00:13
This commit updates the processing of PONG gossip messages in the cluster. When a node (B)
becomes a replica due to a failover, its PONG messages include its new primary node's (A)
information and B's configuration epoch is aligned with A's. This allows observer nodes to
identify changes in primary-ship, addressing issues of intermediate states and enhancing
cluster state consistency during topology changes.

Fix #13018

@zuiderkwast zuiderkwast left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Comment thread src/cluster_legacy.c
@PingXie PingXie requested a review from zuiderkwast February 20, 2024 04:33
@madolson

Copy link
Copy Markdown
Contributor

I worry about more edge cases on assuming where slots went based off of a node giving up ownership. For slot migration, we created the owner_not_claiming_slot to track that a node has given up a slot but nobody has taken ownership of it yet. This allowed us to keep full coverage and send the requests to nodes that SHOULD know the topology (since they were the last confirmed owners of the slot). @srgsanky It seems like that might solve this problem as well?

@PingXie

PingXie commented Feb 27, 2024

Copy link
Copy Markdown
Contributor Author

I don't think another owner_not_claiming_slot like state would be needed in this case. The key difference here is that we are NOT creating slot coverage gaps in this case as in the slot migration case. Stale slot ownership is fine, in the sense that the request will get bounced multiple times, assuming node A's view is correct but then that is a different problem (and is indeed improved upon via owner_not_claiming_slot).

@madolson madolson left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I think I'm convinced this is a good solution. Most of the comments are minor.

Comment thread src/cluster_legacy.c Outdated
Comment thread src/cluster_legacy.c Outdated
Comment thread src/cluster_legacy.c Outdated
Comment thread src/cluster_legacy.c Outdated
Comment thread src/cluster_legacy.c Outdated
PingXie and others added 4 commits February 27, 2024 10:57
Co-authored-by: Madelyn Olson <34459052+madolson@users.noreply.github.com>
Co-authored-by: Madelyn Olson <34459052+madolson@users.noreply.github.com>
Co-authored-by: Madelyn Olson <34459052+madolson@users.noreply.github.com>
Co-authored-by: Madelyn Olson <34459052+madolson@users.noreply.github.com>
Comment thread src/cluster_legacy.c Outdated
PingXie and others added 2 commits February 27, 2024 15:41
Co-authored-by: Madelyn Olson <34459052+madolson@users.noreply.github.com>
@madolson

Copy link
Copy Markdown
Contributor

Comment thread src/cluster_legacy.c Outdated
Comment thread src/cluster_legacy.c Outdated
Comment thread src/cluster_legacy.c
@madolson madolson merged commit 28976a9 into redis:unstable Mar 5, 2024
@PingXie PingXie deleted the issue-13018 branch March 5, 2024 05:23
sundb pushed a commit to sundb/redis that referenced this pull request Jan 13, 2025
redis#13055)

This commit updates the processing of PONG gossip messages in the
cluster. When a node (B) becomes a replica due to a failover, its PONG
messages include its new primary node's (A) information and B's
configuration epoch is aligned with A's. This allows observer nodes to
identify changes in primary-ship, addressing issues of intermediate
states and enhancing cluster state consistency during topology changes.

Fix redis#13018
@sundb

sundb commented Jan 15, 2025

Copy link
Copy Markdown
Collaborator

mark this PR as 7.2 backport, which fixes an infinite loop in clusterNodeGetPrimary().
failed CI: https://github.com/redis/redis/actions/runs/12630886845/job/35204596944?pr=13729

YaacovHazan pushed a commit to YaacovHazan/redis that referenced this pull request Apr 22, 2025
redis#13055)

This commit updates the processing of PONG gossip messages in the
cluster. When a node (B) becomes a replica due to a failover, its PONG
messages include its new primary node's (A) information and B's
configuration epoch is aligned with A's. This allows observer nodes to
identify changes in primary-ship, addressing issues of intermediate
states and enhancing cluster state consistency during topology changes.

Fix redis#13018
@YaacovHazan YaacovHazan mentioned this pull request Apr 22, 2025
@YaacovHazan YaacovHazan moved this from Todo to Done in Redis 7.2 Backport Apr 29, 2025
enjoy-binbin added a commit to enjoy-binbin/redis that referenced this pull request May 25, 2026
This commit updates the processing of PONG gossip messages in the
cluster. When a node (B) becomes a replica due to a failover, its PONG
messages include its new primary node's (A) information and B's
configuration epoch is aligned with A's. This allows observer nodes to
identify changes in primary-ship, addressing issues of intermediate
states and enhancing cluster state consistency during topology changes.

See redis#13055 for more details.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

Preventing temporary circular replication and slot loss in Redis Cluster Failover

7 participants