-
Notifications
You must be signed in to change notification settings - Fork 24.4k
Closed
Description
This happened upon issuing a manual CLUSTER FAILOVER to the slave. It was the former master where the error occured, the slave is fine and the cluster check reports okay, so hooray for that :)
This was during a test setup with a client molesting the cluster with a number of SET/MGET. Before the manual failover the cluster looked like this:
Connecting to node 192.168.67.140:7001: OK
Connecting to node 192.168.67.140:7003: OK
Connecting to node 192.168.67.140:7002: OK
Connecting to node 192.168.67.140:7004: OK
>>> Performing Cluster Check (using node 192.168.67.140:7001)
M: d6f491085bba2bbcb651f84aee5aaef44b5714c4 192.168.67.140:7001
slots:2000-6461,10923-11921 (5461 slots) master
1 additional replica(s)
M: 1419cb638df1943b92ced1afc087d538fb92cbba 192.168.67.140:7003
slots:0-1999,6462-9923 (5462 slots) master
0 additional replica(s)
M: d5a68246ace4982f403f95f6a850d2192454c4db 192.168.67.140:7002
slots:9924-10922,11922-16383 (5461 slots) master
0 additional replica(s)
S: 3c97c3f44fe2a231335348e2d7cbdb81312115e9 192.168.67.140:7004
slots: (0 slots) slave
replicates d6f491085bba2bbcb651f84aee5aaef44b5714c4
[OK] All nodes agree about slots configuration.
>>> Check for open slots...
>>> Check slots coverage...
[OK] All 16384 slots covered.
And after the manual failover (and crash of node 1419cb638df1943b92ced1afc087d538fb92cbba) it looked like this:
Connecting to node 192.168.67.140:7002: OK
Connecting to node 192.168.67.140:7001: OK
Connecting to node 192.168.67.140:7004: OK
>>> Performing Cluster Check (using node 192.168.67.140:7002)
M: d5a68246ace4982f403f95f6a850d2192454c4db 192.168.67.140:7002
slots:9924-10922,11922-16383 (5461 slots) master
0 additional replica(s)
M: d6f491085bba2bbcb651f84aee5aaef44b5714c4 192.168.67.140:7001
slots:2000-6461,10923-11921 (5461 slots) master
0 additional replica(s)
M: 3c97c3f44fe2a231335348e2d7cbdb81312115e9 192.168.67.140:7004
slots:0-1999,6462-9923 (5462 slots) master
0 additional replica(s)
[OK] All nodes agree about slots configuration.
>>> Check for open slots...
>>> Check slots coverage...
[OK] All 16384 slots covered.
Crash report:
21402:M 15 Jan 10:23:36.048 # Manual failover requested by slave 3c97c3f44fe2a231335348e2d7cbdb81312115e9.
21402:M 15 Jan 10:23:36.203 # Failover auth granted to 3c97c3f44fe2a231335348e2d7cbdb81312115e9 for epoch 9
21402:M 15 Jan 10:23:36.204 # Configuration change detected. Reconfiguring myself as a replica of 3c97c3f44fe2a231335348e2d7cbdb81312115e9
21402:S 15 Jan 10:23:36.204 # Connection with slave 192.168.67.140:7004 lost.
21402:S 15 Jan 10:23:36.204 #
=== REDIS BUG REPORT START: Cut & paste starting from here ===
21402:S 15 Jan 10:23:36.205 # === ASSERTION FAILED CLIENT CONTEXT ===
21402:S 15 Jan 10:23:36.205 # client->flags = 0
21402:S 15 Jan 10:23:36.205 # client->fd = 18
21402:S 15 Jan 10:23:36.205 # client->argc = 2
21402:S 15 Jan 10:23:36.205 # client->argv[0] = "GET" (refcount: 1)
21402:S 15 Jan 10:23:36.205 # client->argv[1] = "696" (refcount: 1)
21402:S 15 Jan 10:23:36.205 # === ASSERTION FAILED OBJECT CONTEXT ===
21402:S 15 Jan 10:23:36.205 # Object type: 0
21402:S 15 Jan 10:23:36.205 # Object encoding: 8
21402:S 15 Jan 10:23:36.205 # Object refcount: 1
21402:S 15 Jan 10:23:36.205 # Object raw string len: 3
21402:S 15 Jan 10:23:36.205 # Object raw string content: "696"
21402:S 15 Jan 10:23:36.205 # === ASSERTION FAILED ===
21402:S 15 Jan 10:23:36.205 # ==> cluster.c:4641 'n != NULL' is not true
21402:S 15 Jan 10:23:36.205 # (forcing SIGSEGV to print the bug report.)
21402:S 15 Jan 10:23:36.205 # Redis 2.9.102 crashed by signal: 11
21402:S 15 Jan 10:23:36.205 # Failed assertion: n != NULL (cluster.c:4641)
21402:S 15 Jan 10:23:36.205 # --- STACK TRACE
redis-server *:7003 [cluster](logStackTrace+0x3e)[0x4507be]
redis-server *:7003 [cluster](_redisAssert+0x6f)[0x44f78f]
/lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0)[0x7f0bbc518cb0]
redis-server *:7003 [cluster](_redisAssert+0x6f)[0x44f78f]
redis-server *:7003 [cluster](getNodeByQuery+0x3e5)[0x45a7c5]
redis-server *:7003 [cluster](processCommand+0x21d)[0x422acd]
redis-server *:7003 [cluster](processInputBuffer+0x51)[0x42cb01]
redis-server *:7003 [cluster](processUnblockedClients+0x5a)[0x46719a]
redis-server *:7003 [cluster](beforeSleep+0x112)[0x41e442]
redis-server *:7003 [cluster](aeMain+0x1e)[0x41acde]
redis-server *:7003 [cluster](main+0x334)[0x4199e4]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed)[0x7f0bbc16a76d]
redis-server *:7003 [cluster][0x419b8d]
21402:S 15 Jan 10:23:36.208 # --- INFO OUTPUT
21402:S 15 Jan 10:23:36.208 # # Server
redis_version:2.9.102
redis_git_sha1:00000000
redis_git_dirty:0
redis_build_id:1aa46b248d9d0fea
redis_mode:cluster
os:Linux 3.5.0-23-generic x86_64
arch_bits:64
multiplexing_api:epoll
gcc_version:4.6.3
process_id:21402
run_id:27bacc2ba37e7b91306c430bd84ae8aae01318d3
tcp_port:7003
uptime_in_seconds:259
uptime_in_days:0
hz:10
lru_clock:12060072
config_file:/home/ingmar/redis3/redis.conf
# Clients
connected_clients:1
client_longest_output_list:0
client_biggest_input_buf:0
blocked_clients:0
# Memory
used_memory:1029640
used_memory_human:1005.51K
used_memory_rss:3903488
used_memory_peak:2171840
used_memory_peak_human:2.07M
used_memory_lua:35840
mem_fragmentation_ratio:3.79
mem_allocator:jemalloc-3.6.0
# Persistence
loading:0
rdb_changes_since_last_save:48651
rdb_bgsave_in_progress:0
rdb_last_save_time:1421346149
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:0
rdb_current_bgsave_time_sec:-1
aof_enabled:1
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:-1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_last_write_status:ok
aof_current_size:3787424
aof_base_size:0
aof_pending_rewrite:0
aof_buffer_length:0
aof_rewrite_buffer_length:0
aof_pending_bio_fsync:0
aof_delayed_fsync:0
# Stats
total_connections_received:89
total_commands_processed:246918
instantaneous_ops_per_sec:1312
total_net_input_bytes:7525025
total_net_output_bytes:3539185
instantaneous_input_kbps:35.76
instantaneous_output_kbps:32.09
rejected_connections:0
sync_full:1
sync_partial_ok:0
sync_partial_err:0
expired_keys:0
evicted_keys:0
keyspace_hits:117153
keyspace_misses:686
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:140
migrate_cached_sockets:0
# Replication
role:slave
master_host:192.168.67.140
master_port:7004
master_link_status:down
master_last_io_seconds_ago:-1
master_sync_in_progress:0
slave_repl_offset:1
master_link_down_since_seconds:1421346216
slave_priority:100
slave_read_only:1
connected_slaves:0
master_repl_offset:0
repl_backlog_active:0
repl_backlog_size:1048576
repl_backlog_first_byte_offset:504936
repl_backlog_histlen:1048576
# CPU
used_cpu_sys:15.12
used_cpu_user:0.28
used_cpu_sys_children:0.00
used_cpu_user_children:0.00
# Commandstats
cmdstat_get:calls=117153,usec=170044,usec_per_call=1.45
cmdstat_set:calls=117290,usec=236851,usec_per_call=2.02
cmdstat_select:calls=686,usec=721,usec_per_call=1.05
cmdstat_ping:calls=8,usec=8,usec_per_call=1.00
cmdstat_psync:calls=1,usec=252,usec_per_call=252.00
cmdstat_replconf:calls=68,usec=110,usec_per_call=1.62
cmdstat_info:calls=2,usec=84,usec_per_call=42.00
cmdstat_cluster:calls=11024,usec=42181,usec_per_call=3.83
cmdstat_restore-asking:calls=686,usec=2780,usec_per_call=4.05
# Cluster
cluster_enabled:1
# Keyspace
db0:keys=686,expires=0,avg_ttl=0
hash_init_value: 1421558963
21402:S 15 Jan 10:23:36.211 # --- CLIENT LIST OUTPUT
21402:S 15 Jan 10:23:36.211 # id=7 addr=192.168.67.1:65129 fd=18 name= age=201 idle=1 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=22 obl=0 oll=0 omem=0 events=r cmd=get
21402:S 15 Jan 10:23:36.211 # --- CURRENT CLIENT INFO
21402:S 15 Jan 10:23:36.211 # client: id=7 addr=192.168.67.1:65129 fd=18 name= age=201 idle=1 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=22 obl=0 oll=0 omem=0 events=r cmd=get
21402:S 15 Jan 10:23:36.211 # argv[0]: 'GET'
21402:S 15 Jan 10:23:36.211 # argv[1]: '696'
21402:S 15 Jan 10:23:36.211 # key '696' found in DB containing the following object:
21402:S 15 Jan 10:23:36.211 # Object type: 0
21402:S 15 Jan 10:23:36.211 # Object encoding: 1
21402:S 15 Jan 10:23:36.211 # Object refcount: 2
21402:S 15 Jan 10:23:36.211 # --- REGISTERS
21402:S 15 Jan 10:23:36.211 #
RAX:0000000000000000 RBX:0000000000001221
RCX:0000000000000001 RDX:0000000000000000
RDI:00007f0bbc503260 RSI:00007f0bbc504ab0
RBP:00000000004c2811 RSP:00007fff19391230
R8 :00007f0bbcc38780 R9 :0000000000000001
R10:7562206568742074 R11:0000000000000000
R12:00000000004bdddb R13:00007f0bbb581760
R14:0000000000000000 R15:0000000000000000
RIP:000000000044f78f EFL:0000000000010202
CSGSFS:0000000000000033
21402:S 15 Jan 10:23:36.211 # (00007fff1939123f) -> 00007fff19391348
21402:S 15 Jan 10:23:36.211 # (00007fff1939123e) -> 00007f0bbb5a0000
21402:S 15 Jan 10:23:36.211 # (00007fff1939123d) -> 00000000006ed9a0
21402:S 15 Jan 10:23:36.211 # (00007fff1939123c) -> 00007fff1939134c
21402:S 15 Jan 10:23:36.211 # (00007fff1939123b) -> 00007fff193912b0
21402:S 15 Jan 10:23:36.211 # (00007fff1939123a) -> 0000000000000000
21402:S 15 Jan 10:23:36.211 # (00007fff19391239) -> 0000000000000000
21402:S 15 Jan 10:23:36.211 # (00007fff19391238) -> 0000000000000000
21402:S 15 Jan 10:23:36.211 # (00007fff19391237) -> 0000000000000000
21402:S 15 Jan 10:23:36.211 # (00007fff19391236) -> 00007f0bbb571e30
21402:S 15 Jan 10:23:36.211 # (00007fff19391235) -> 00007f0bbb40e070
21402:S 15 Jan 10:23:36.211 # (00007fff19391234) -> 0000000000000010
21402:S 15 Jan 10:23:36.211 # (00007fff19391233) -> 000000000045a7c5
21402:S 15 Jan 10:23:36.211 # (00007fff19391232) -> 0000000000001e9d
21402:S 15 Jan 10:23:36.211 # (00007fff19391231) -> 0000000000000000
21402:S 15 Jan 10:23:36.211 # (00007fff19391230) -> 0000000000000000
21402:S 15 Jan 10:23:36.211 # --- FAST MEMORY TEST
21402:S 15 Jan 10:23:36.211 # Bio thread for job type #0 terminated
21402:S 15 Jan 10:23:36.212 # Bio thread for job type #1 terminated
Testing 6f2000 90112
Testing 1a9a000 135168
Testing 7f0bba3ff000 8388608
Testing 7f0bbac00000 12582912
Testing 7f0bbbc00000 4194304
Testing 7f0bbc504000 20480
Testing 7f0bbc722000 16384
Testing 7f0bbcc38000 16384
Testing 7f0bbcc40000 16384
21402:S 15 Jan 10:23:36.404 # Fast memory test PASSED, however your memory can still be broken. Please run a memory test for several hours if possible.
21402:S 15 Jan 10:23:36.404 #
=== REDIS BUG REPORT END. Make sure to include from START to END. ===
The test script was very simple:
from rediscluster.client import RedisCluster
from time import time
startup_nodes = [
{'host': '192.168.67.140', 'port': '7002'},
{'host': '192.168.67.140', 'port': '7004'},
]
rc = RedisCluster(startup_nodes=startup_nodes, decode_responses=True)
while True:
t = time()
for i in range(2048):
rc.set(i, i)
l = rc.mget(list(range(2048)))
duration = time() - t
print len(l), "%f ms" % (duration * 1000.0)