Skip to content

Change the same shard failover assert to if condition to avoid crash#2431

Merged
enjoy-binbin merged 6 commits into
valkey-io:unstablefrom
enjoy-binbin:remove_assert
Aug 12, 2025
Merged

Change the same shard failover assert to if condition to avoid crash#2431
enjoy-binbin merged 6 commits into
valkey-io:unstablefrom
enjoy-binbin:remove_assert

Conversation

@enjoy-binbin

@enjoy-binbin enjoy-binbin commented Aug 5, 2025

Copy link
Copy Markdown
Member

The assert was added in #2301 and we found that there are some situations
would trigger assert and crash the server.

The reason we added the assert is because, in the code:

  1. sender_claimed_primary and sender are in the same shard
  2. and sender is the old primary, sender_claimed_primary is the old replica
  3. and now sender become a replica, sender_claimed_primary become a primary

That means a failover happend in the shard, and sender should be the primary
of sender_claimed_primary. But obviously this assumption may be wrong, we rely
on shard_id to determine whether it is in a same shard, and assume that a shard
can only have one primary.

But this is wrong, from #2279 we can know there will be a case that we can
create two primaries in the same shard due to the untimely update of shard_id.
So we can create a test that trigger the assert in this way:

  1. pre condition: two primaries in the same shard, one has slots and one is empty.
  2. replica doing a cluster failover
  3. the empty primary doing a cluster replicate with the replica (new primary)

We change the assert to an if condition to fix it.

Closes #2423.

Note that the test written here also exposes the issue in #2441, so these two
may need to be addressed together.

The assert was added in valkey-io#2301 and we found that there are some situations
would trigger assert and crash the server.

The reason we added the assert is because, in the code:
1. sender_claimed_primary and sender are in the same shard
2. and sender is the old primary, sender_claimed_primary is the old replica
3. and now sender become a replica, sender_claimed_primary become a primary

That means a failover happend in the shard, and sender should be the primary
of sender_claimed_primary. But obviously this assumption may be wrong, we rely
on shard_id to determine whether it is in a same shard, and assume that a shard
can only have one primary.

But this is wrong, from valkey-io#2279 we can know there will be a case that we can
create two primaries in the same shard due to the untimely update of shard_id.
So we can create a test that trigger the assert in this way:
1. pre condition: two primaries in the same shard, one has slots and one is empty.
2. replica doing a cluster failover
3. the empty primary doing a cluster replicate with the replica (new primary)

We change the assert to an if condition to fix it.

Signed-off-by: Binbin <binloveplay1314@qq.com>
@enjoy-binbin enjoy-binbin added the run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP) label Aug 5, 2025
@enjoy-binbin

Copy link
Copy Markdown
Member Author

also see #2301 (review) for some details, i will add the test later tomorrow.

@sarthakaggarwal97

Copy link
Copy Markdown
Contributor

The removal of this assert will fix: #2423.

enjoy-binbin added a commit to enjoy-binbin/valkey that referenced this pull request Aug 6, 2025
… in advance

When writing test case for valkey-io#2431, i found that the corresponding test
case occasionally triggered this debug assert:
```
Node ef54ff291855537b7f4f61c815955310e55bc9e2 () is no longer primary of shard ab1e16d7e11cbb0d81acf0bab4191d84e19df092; removed all 5462 slot(s) it used to own
Node ef54ff291855537b7f4f61c815955310e55bc9e2 () is now part of shard aad3ac4bf806c2bd5441ed5c421893be20f02797
Node ef54ff291855537b7f4f61c815955310e55bc9e2 () is now a replica of node a1d09af28a10d58deb43ccc8bddd8c8d095a5af5 () in shard aad3ac4bf806c2bd5441ed5c421893be20f02797

=== VALKEY BUG REPORT START: Cut & paste starting from here ===
=== ASSERTION FAILED ===
==> cluster_legacy.c:7004 'primary->replicaof == ((void*)0)' is not true

------ STACK TRACE ------

Backtrace:
0   valkey-server                       0x000000010507ac70 clusterUpdateSlotsConfigWith + 0
1   valkey-server                       0x0000000105077564 clusterReadHandler + 6180
2   valkey-server                       0x00000001050f3b0c connSocketEventHandler + 180
3   valkey-server                       0x0000000104fb7f9c aeProcessEvents + 372
4   valkey-server                       0x0000000104fe2d30 main + 27676
5   dyld                                0x0000000199177154 start + 2476
```

And we can see from the cluster nodes output, there are indeed a replication loop:
```
sender: ef54ff291855537b7f4f61c815955310e55bc9e2
sender_claimed_primary: a1d09af28a10d58deb43ccc8bddd8c8d095a5af5

ef54ff291855537b7f4f61c815955310e55bc9e2 shard-id=ab1e16d7e11cbb0d81acf0bab4191d84e19df092 slave a1d09af28a10d58deb43ccc8bddd8c8d095a5af5
a1d09af28a10d58deb43ccc8bddd8c8d095a5af5 shard-id=aad3ac4bf806c2bd5441ed5c421893be20f02797 slave ef54ff291855537b7f4f61c815955310e55bc9e2
```

The sender and sender_claimed_primary are not in the same shard, and
sender_claimed_primary is a replica in myself view, if we don't update
sender_claimed_primary role, in the next `sender->replicaof = sender_claimed_primary`,
we will create the replication loop.

Signed-off-by: Binbin <binloveplay1314@qq.com>
@madolson madolson moved this to Optional for next release candidate in Valkey 9.0 Aug 6, 2025
Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
@enjoy-binbin enjoy-binbin requested a review from PingXie August 7, 2025 08:20
@sarthakaggarwal97

Copy link
Copy Markdown
Contributor

@enjoy-binbin @PingXie not sure what happened recently, but the assertion is failing in some of the PR runs now. Eg: https://github.com/valkey-io/valkey/actions/runs/16834099551/job/47689042910?pr=2452

It would be good to have this PR merged.

@enjoy-binbin

Copy link
Copy Markdown
Member Author

Just waitting for Ping to review it. There is also another issue need to be fix: #2441. The test i write in this PR will trigger #2441 issue, so these two might need to be fix together.

Or the other way is i can merge this PR first and then leave #2441 for a while since it is a debugAssert. But let's wait @PingXie for another one or two days. I will ping he in the private thread.

Comment thread src/cluster_legacy.c
@madolson madolson moved this from Optional for next release candidate to Todo in Valkey 9.0 Aug 11, 2025

@PingXie PingXie left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this change make senses. basically the removed assert will fire for any "multiple primaries" case and #2301 could just be one of them.

Comment thread tests/unit/cluster/manual-failover.tcl Outdated
Comment thread tests/unit/cluster/manual-failover.tcl Outdated
Comment thread tests/unit/cluster/manual-failover.tcl Outdated
Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
@enjoy-binbin enjoy-binbin merged commit 7a9ef29 into valkey-io:unstable Aug 12, 2025
13 of 57 checks passed
@github-project-automation github-project-automation Bot moved this from Todo to Done in Valkey 9.0 Aug 12, 2025
@enjoy-binbin enjoy-binbin deleted the remove_assert branch August 12, 2025 02:37
enjoy-binbin added a commit to enjoy-binbin/valkey that referenced this pull request Aug 13, 2025
In valkey-io#2431 we changed the assert to a if condition, and the test
cause some trouble, now we just remove the assert (if condition)
and disable the test for now due to valkey-io#2441.

Signed-off-by: Binbin <binloveplay1314@qq.com>
enjoy-binbin added a commit that referenced this pull request Aug 13, 2025
In #2431 we changed the assert to a if condition, and the test cause
some trouble, now we just remove the assert (if condition) and disable
the test for now due to #2441.

Signed-off-by: Binbin <binloveplay1314@qq.com>
allenss-amazon pushed a commit to allenss-amazon/valkey-core that referenced this pull request Aug 19, 2025
…alkey-io#2431)

The assert was added in valkey-io#2301 and we found that there are some situations
would trigger assert and crash the server.

The reason we added the assert is because, in the code:
1. sender_claimed_primary and sender are in the same shard
2. and sender is the old primary, sender_claimed_primary is the old replica
3. and now sender become a replica, sender_claimed_primary become a primary

That means a failover happend in the shard, and sender should be the primary
of sender_claimed_primary. But obviously this assumption may be wrong, we rely
on shard_id to determine whether it is in a same shard, and assume that a shard
can only have one primary.

But this is wrong, from valkey-io#2279 we can know there will be a case that we can
create two primaries in the same shard due to the untimely update of shard_id.
So we can create a test that trigger the assert in this way:
1. pre condition: two primaries in the same shard, one has slots and one is empty.
2. replica doing a cluster failover
3. the empty primary doing a cluster replicate with the replica (new primary)

We change the assert to an if condition to fix it.

Closes valkey-io#2423.

Note that the test written here also exposes the issue in valkey-io#2441, so these two
may need to be addressed together.

Signed-off-by: Binbin <binloveplay1314@qq.com>
allenss-amazon pushed a commit to allenss-amazon/valkey-core that referenced this pull request Aug 19, 2025
In valkey-io#2431 we changed the assert to a if condition, and the test cause
some trouble, now we just remove the assert (if condition) and disable
the test for now due to valkey-io#2441.

Signed-off-by: Binbin <binloveplay1314@qq.com>
rjd15372 pushed a commit to rjd15372/valkey that referenced this pull request Sep 19, 2025
In valkey-io#2431 we changed the assert to a if condition, and the test cause
some trouble, now we just remove the assert (if condition) and disable
the test for now due to valkey-io#2441.

Signed-off-by: Binbin <binloveplay1314@qq.com>
rjd15372 pushed a commit that referenced this pull request Sep 23, 2025
In #2431 we changed the assert to a if condition, and the test cause
some trouble, now we just remove the assert (if condition) and disable
the test for now due to #2441.

Signed-off-by: Binbin <binloveplay1314@qq.com>
@enjoy-binbin enjoy-binbin moved this to To be backported in Valkey 8.0 Sep 30, 2025
@enjoy-binbin enjoy-binbin moved this to To be backported in Valkey 8.1 Sep 30, 2025
@zuiderkwast zuiderkwast moved this from To be backported to Don't backport yet in Valkey 8.0 Sep 30, 2025
@ranshid ranshid moved this from To be backported to Don't backport yet in Valkey 8.1 Sep 30, 2025
hpatro pushed a commit to hpatro/valkey that referenced this pull request Oct 3, 2025
…alkey-io#2431)

The assert was added in valkey-io#2301 and we found that there are some situations
would trigger assert and crash the server.

The reason we added the assert is because, in the code:
1. sender_claimed_primary and sender are in the same shard
2. and sender is the old primary, sender_claimed_primary is the old replica
3. and now sender become a replica, sender_claimed_primary become a primary

That means a failover happend in the shard, and sender should be the primary
of sender_claimed_primary. But obviously this assumption may be wrong, we rely
on shard_id to determine whether it is in a same shard, and assume that a shard
can only have one primary.

But this is wrong, from valkey-io#2279 we can know there will be a case that we can
create two primaries in the same shard due to the untimely update of shard_id.
So we can create a test that trigger the assert in this way:
1. pre condition: two primaries in the same shard, one has slots and one is empty.
2. replica doing a cluster failover
3. the empty primary doing a cluster replicate with the replica (new primary)

We change the assert to an if condition to fix it.

Closes valkey-io#2423.

Note that the test written here also exposes the issue in valkey-io#2441, so these two
may need to be addressed together.

Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
hpatro pushed a commit to hpatro/valkey that referenced this pull request Oct 3, 2025
In valkey-io#2431 we changed the assert to a if condition, and the test cause
some trouble, now we just remove the assert (if condition) and disable
the test for now due to valkey-io#2441.

Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cluster run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP)

Projects

Status: Don't backport yet
Status: Don't backport yet
Status: Done

Development

Successfully merging this pull request may close these issues.

[test-failure] AddressSanitizer test failure in cluster

6 participants