
Conversation

@secwall (Contributor) commented Sep 1, 2023

Hello. While testing an upgrade of a sharded cluster from 7.0 to 7.2, we got a segmentation fault in updateShardId.
It seems that 7.0 nodes have no shard id in their ping extensions.
Backtrace with gdb:

gdb /tmp/redis-server-asan /var/cores/redis-server-asan-26343-S11.core
Reading symbols from /tmp/redis-server-asan...
[New LWP 26343]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/tmp/redis-server-asan *:6379 [cluster]     '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f139f43f75b in kill () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0  0x00007f139f43f75b in kill () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x0000555c46cd6b4f in bugReportEnd (killViaSignal=1, sig=11) at /home/secwall/redis/src/debug.c:2224
#2  <signal handler called>
#3  0x0000555c46ce10eb in updateShardId (node=0x555c47df2b20, shard_id=0x0) at /home/secwall/redis/src/cluster.c:941
#4  0x0000555c46cec076 in clusterProcessPacket (link=link@entry=0x555c47df0990) at /home/secwall/redis/src/cluster.c:3099
#5  0x0000555c46ceca96 in clusterReadHandler (conn=0x555c47df09f0) at /home/secwall/redis/src/cluster.c:3389
#6  0x0000555c46d77e6c in callHandler (handler=<optimized out>, conn=0x555c47df09f0) at /home/secwall/redis/src/connhelpers.h:79
#7  connSocketEventHandler (el=<optimized out>, fd=<optimized out>, clientData=0x555c47df09f0, mask=<optimized out>) at /home/secwall/redis/src/socket.c:298
#8  0x0000555c46c41469 in aeProcessEvents (flags=27, eventLoop=0x555c47d46670) at /home/secwall/redis/src/ae.c:436
#9  aeMain (eventLoop=0x555c47d46670) at /home/secwall/redis/src/ae.c:496
#10 0x0000555c46c35b9f in main (argc=2, argv=<optimized out>) at /home/secwall/redis/src/server.c:7378

This should fix issue #12507
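
For context: frame #3 above shows clusterProcessPacket calling updateShardId with shard_id == NULL, because the 7.0 sender's ping extensions carry no shard id. Below is a minimal standalone sketch of the kind of NULL guard involved; the simplified clusterNode and the guard's placement are illustrative assumptions, not the literal patch.

#include <string.h>

#define CLUSTER_NAMELEN 40

typedef struct clusterNode {
    char shard_id[CLUSTER_NAMELEN]; /* hex shard id, as in 7.2's clusterNode */
    /* ... other fields elided ... */
} clusterNode;

/* Sketch only: bail out when the sender supplied no shard id (as a 7.0
 * peer does), instead of dereferencing the NULL pointer as the pre-fix
 * code at cluster.c:941 did. */
static void updateShardId(clusterNode *node, const char *shard_id) {
    if (node == NULL || shard_id == NULL) return;
    if (memcmp(node->shard_id, shard_id, CLUSTER_NAMELEN) != 0) {
        memcpy(node->shard_id, shard_id, CLUSTER_NAMELEN);
        /* ... the real function also re-buckets the node in
         * cluster->shards ... */
    }
}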

@madolson (Contributor) left a comment

I think this is comprehensive, but I'm going to test it a bit before merging.

@madolson (Contributor) commented Sep 3, 2023

Ok, I think the best way to test this would have been #10214. It's probably worth revisiting implementing that.

@madolson madolson added the release-notes indication that this issue needs to be mentioned in the release notes label Sep 3, 2023
@madolson madolson merged commit a2046c1 into redis:unstable Sep 3, 2023
@oranagra oranagra mentioned this pull request Sep 6, 2023
oranagra pushed a commit that referenced this pull request Sep 6, 2023
When connecting a 7.0 cluster to a 7.2 cluster, the 7.0 nodes will not populate the shard_id field, which is expected by the 7.2 cluster. This is not intended behavior, as the 7.2 cluster is supposed to use a temporary shard_id while the node is in the upgrading state, but it wasn't being correctly set in this case.

(cherry picked from commit a2046c1)
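
The "temporary shard_id" fallback described above can be sketched as follows. The helper resolveShardId and its shape are illustrative assumptions, not the literal diff; getRandomHexChars is the real helper from src/util.c, stubbed here so the sketch compiles on its own.

#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define CLUSTER_NAMELEN 40

/* Stand-in for the real getRandomHexChars() from src/util.c, which
 * uses a SHA1-seeded generator rather than rand(). */
static void getRandomHexChars(char *p, size_t len) {
    const char charset[] = "0123456789abcdef";
    for (size_t i = 0; i < len; i++) p[i] = charset[rand() % 16];
}

/* Hypothetical helper: when the gossiped ping extension carried no
 * shard id (ext_shard_id == NULL, as a 7.0 peer sends), keep or mint a
 * temporary random shard id instead of propagating NULL onward. */
static void resolveShardId(char *node_shard_id, const char *ext_shard_id) {
    if (ext_shard_id != NULL) {
        memcpy(node_shard_id, ext_shard_id, CLUSTER_NAMELEN);
        return;
    }
    static const char zeros[CLUSTER_NAMELEN] = {0};
    if (memcmp(node_shard_id, zeros, CLUSTER_NAMELEN) == 0) {
        /* Temporary id for a not-yet-upgraded peer; overwritten once
         * the peer starts gossiping its real shard id after upgrade. */
        getRandomHexChars(node_shard_id, CLUSTER_NAMELEN);
    }
}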
enjoy-binbin added a commit to enjoy-binbin/redis that referenced this pull request Sep 23, 2023
…d 7.2

In redis#10536, we introduced an assert. Some older server versions
(like 7.0) don't gossip shard_id, so we will not add the node to
cluster->shards, and node->shard_id is filled in randomly and may not
be found here.

As a result, if we add a 7.2 node to a 7.0 cluster and allocate slots
to the 7.2 node, the 7.2 node will crash when it hits this assert,
much like redis#12538.

In this PR, we remove the assert and replace it with a plain if.
Fixes redis#12603.
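
A before/after illustration of that change; the surrounding function context and the shard_id_key variable are assumptions, while dictFind, dictEntry, serverAssert, and server.cluster->shards are the actual Redis identifiers.

/* Before: the shard lookup was asserted to succeed, which crashes when
 * node->shard_id was filled in randomly and never registered in
 * cluster->shards (mixed 7.0/7.2 cluster). */
{
    dictEntry *de = dictFind(server.cluster->shards, shard_id_key);
    serverAssert(de != NULL); /* fires on a mixed-version cluster */
    /* ... detach the node from the shard's node list ... */
}

/* After: tolerate a missing entry with a plain if. */
{
    dictEntry *de = dictFind(server.cluster->shards, shard_id_key);
    if (de != NULL) {
        /* ... detach the node from the shard's node list ... */
    }
}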
madolson pushed a commit that referenced this pull request Oct 12, 2023
…d 7.2 (#12604)

In #10536, we introduced an assert. Some older server versions
(like 7.0) don't gossip shard_id, so we will not add the node to
cluster->shards, and node->shard_id is filled in randomly and may not
be found here.

As a result, if we add a 7.2 node to a 7.0 cluster and allocate slots
to the 7.2 node, the 7.2 node will crash when it hits this assert,
much like #12538.

In this PR, we remove the assert and replace it with an unconditional removal.
oranagra pushed a commit that referenced this pull request Oct 18, 2023
…d 7.2 (#12604)

(cherry picked from commit e5ef161)
Successfully merging this pull request may close these issues:

[CRASH] Redis cluster 7.2.0 crashed by signal: 11, si_code: 1 when joining existing cluster (ubuntu 20.04)