
Conversation

@secwall (Contributor) commented Sep 1, 2023

Hello. While testing an upgrade of a sharded cluster from 7.0 to 7.2, we got a segmentation fault in updateShardId.
It seems that 7.0 nodes have no shard id in their ping extensions.
Backtrace with gdb:

gdb /tmp/redis-server-asan /var/cores/redis-server-asan-26343-S11.core
Reading symbols from /tmp/redis-server-asan...
[New LWP 26343]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/tmp/redis-server-asan *:6379 [cluster]     '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f139f43f75b in kill () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0  0x00007f139f43f75b in kill () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x0000555c46cd6b4f in bugReportEnd (killViaSignal=1, sig=11) at /home/secwall/redis/src/debug.c:2224
#2  <signal handler called>
#3  0x0000555c46ce10eb in updateShardId (node=0x555c47df2b20, shard_id=0x0) at /home/secwall/redis/src/cluster.c:941
#4  0x0000555c46cec076 in clusterProcessPacket (link=link@entry=0x555c47df0990) at /home/secwall/redis/src/cluster.c:3099
#5  0x0000555c46ceca96 in clusterReadHandler (conn=0x555c47df09f0) at /home/secwall/redis/src/cluster.c:3389
#6  0x0000555c46d77e6c in callHandler (handler=<optimized out>, conn=0x555c47df09f0) at /home/secwall/redis/src/connhelpers.h:79
#7  connSocketEventHandler (el=<optimized out>, fd=<optimized out>, clientData=0x555c47df09f0, mask=<optimized out>) at /home/secwall/redis/src/socket.c:298
#8  0x0000555c46c41469 in aeProcessEvents (flags=27, eventLoop=0x555c47d46670) at /home/secwall/redis/src/ae.c:436
#9  aeMain (eventLoop=0x555c47d46670) at /home/secwall/redis/src/ae.c:496
#10 0x0000555c46c35b9f in main (argc=2, argv=<optimized out>) at /home/secwall/redis/src/server.c:7378

This should fix issue #12507
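
For context: frame #3 above shows clusterProcessPacket calling updateShardId with shard_id == NULL, because the 7.0 sender's ping extensions carry no shard id. Below is a minimal standalone sketch of the kind of NULL guard involved; the simplified clusterNode and the guard's placement are illustrative assumptions, not the literal patch.

#include <string.h>

#define CLUSTER_NAMELEN 40

typedef struct clusterNode {
    char shard_id[CLUSTER_NAMELEN]; /* hex shard id, as in 7.2's clusterNode */
    /* ... other fields elided ... */
} clusterNode;

/* Sketch only: bail out when the sender supplied no shard id (as a 7.0
 * peer does), instead of dereferencing the NULL pointer as the pre-fix
 * code at cluster.c:941 did. */
static void updateShardId(clusterNode *node, const char *shard_id) {
    if (node == NULL || shard_id == NULL) return;
    if (memcmp(node->shard_id, shard_id, CLUSTER_NAMELEN) != 0) {
        memcpy(node->shard_id, shard_id, CLUSTER_NAMELEN);
        /* ... the real function also re-buckets the node in
         * cluster->shards ... */
    }
}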

@madolson (Contributor) left a comment

I think this is comprehensive, but I'm going to test it a bit before merging.

@madolson (Contributor) commented Sep 3, 2023

Ok, I think the best way to test this would have been #10214. It's probably worth revisiting implementing that.

@madolson madolson added the release-notes indication that this issue needs to be mentioned in the release notes label Sep 3, 2023
@madolson madolson merged commit a2046c1 into redis:unstable Sep 3, 2023
@oranagra oranagra mentioned this pull request Sep 6, 2023
oranagra pushed a commit that referenced this pull request Sep 6, 2023
When connecting a 7.0 cluster to a 7.2 cluster, the 7.0 nodes will not populate the shard_id field, which is expected by the 7.2 cluster. This is not intended behavior, as the 7.2 cluster is supposed to use a temporary shard_id while the node is in the upgrading state, but it wasn't being correctly set in this case.

(cherry picked from commit a2046c1)
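
The "temporary shard_id" fallback described above can be sketched as follows. The helper resolveShardId and its shape are illustrative assumptions, not the literal diff; getRandomHexChars is the real helper from src/util.c, stubbed here so the sketch compiles on its own.

#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define CLUSTER_NAMELEN 40

/* Stand-in for the real getRandomHexChars() from src/util.c, which
 * uses a SHA1-seeded generator rather than rand(). */
static void getRandomHexChars(char *p, size_t len) {
    const char charset[] = "0123456789abcdef";
    for (size_t i = 0; i < len; i++) p[i] = charset[rand() % 16];
}

/* Hypothetical helper: when the gossiped ping extension carried no
 * shard id (ext_shard_id == NULL, as a 7.0 peer sends), keep or mint a
 * temporary random shard id instead of propagating NULL onward. */
static void resolveShardId(char *node_shard_id, const char *ext_shard_id) {
    if (ext_shard_id != NULL) {
        memcpy(node_shard_id, ext_shard_id, CLUSTER_NAMELEN);
        return;
    }
    static const char zeros[CLUSTER_NAMELEN] = {0};
    if (memcmp(node_shard_id, zeros, CLUSTER_NAMELEN) == 0) {
        /* Temporary id for a not-yet-upgraded peer; overwritten once
         * the peer starts gossiping its real shard id after upgrade. */
        getRandomHexChars(node_shard_id, CLUSTER_NAMELEN);
    }
}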
enjoy-binbin added a commit to enjoy-binbin/redis that referenced this pull request Sep 23, 2023
…d 7.2

In redis#10536, we introduced an assert. Some older server versions
(like 7.0) don't gossip shard_id, so we will not add the node to
cluster->shards, and node->shard_id is filled in randomly and may not
be found here.

As a result, if we add a 7.2 node to a 7.0 cluster and allocate slots
to the 7.2 node, the 7.2 node will crash when it hits this assert,
much like redis#12538.

In this PR, we remove the assert and replace it with a plain if.
Fixes redis#12603.
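
A before/after illustration of that change; the surrounding function context and the shard_id_key variable are assumptions, while dictFind, dictEntry, serverAssert, and server.cluster->shards are the actual Redis identifiers.

/* Before: the shard lookup was asserted to succeed, which crashes when
 * node->shard_id was filled in randomly and never registered in
 * cluster->shards (mixed 7.0/7.2 cluster). */
{
    dictEntry *de = dictFind(server.cluster->shards, shard_id_key);
    serverAssert(de != NULL); /* fires on a mixed-version cluster */
    /* ... detach the node from the shard's node list ... */
}

/* After: tolerate a missing entry with a plain if. */
{
    dictEntry *de = dictFind(server.cluster->shards, shard_id_key);
    if (de != NULL) {
        /* ... detach the node from the shard's node list ... */
    }
}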
madolson pushed a commit that referenced this pull request Oct 12, 2023
…d 7.2 (#12604)

In #10536, we introduced an assert. Some older server versions
(like 7.0) don't gossip shard_id, so we will not add the node to
cluster->shards, and node->shard_id is filled in randomly and may not
be found here.

As a result, if we add a 7.2 node to a 7.0 cluster and allocate slots
to the 7.2 node, the 7.2 node will crash when it hits this assert,
much like #12538.

In this PR, we remove the assert and replace it with an unconditional removal.
oranagra pushed a commit that referenced this pull request Oct 18, 2023
…d 7.2 (#12604)

(cherry picked from commit e5ef161)
Successfully merging this pull request may close these issues:

[CRASH] Redis cluster 7.2.0 crashed by signal: 11, si_code: 1 when joining existing cluster (ubuntu 20.04)