
@oranagra

Recently the cluster tests have been consistently failing when executed with ASAN in the CI.
The failure is usually in 04-resharding.tcl:

00:42:54> Verify 50000 keys for consistency with logical content: FAILED: caught an error in the test CLUSTERDOWN The cluster is down

When it happens with test-sanitizer-address (gcc), we can later see this:

=== REDIS BUG REPORT START: Cut & paste starting from here ===
38902:M 27 Mar 2022 02:33:24.780 # Redis 255.255.255 crashed by signal: 11, si_code: 0
38902:M 27 Mar 2022 02:33:24.780 # Accessing address: 0x3e90000b308
38902:M 27 Mar 2022 02:33:24.781 # Killed by PID: 45832, UID: 1001
38902:M 27 Mar 2022 02:33:24.821 # Crashed running the instruction at: 0x7f04a759789b

------ STACK TRACE ------
EIP:
/lib/x86_64-linux-gnu/libc.so.6(__sched_yield+0xb)[0x7f04a759789b]

Backtrace:
../../../src/redis-server *:30000 [cluster](sigsegvHandler+0x1d2)[0x56170ae5e7b2]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0)[0x7f04a76993c0]
/lib/x86_64-linux-gnu/libc.so.6(__sched_yield+0xb)[0x7f04a759789b]
/lib/x86_64-linux-gnu/libasan.so.5(+0x1323f5)[0x7f04a792e3f5]
/lib/x86_64-linux-gnu/libasan.so.5(+0x13b874)[0x7f04a7937874]
/lib/x86_64-linux-gnu/libc.so.6(dl_iterate_phdr+0x185)[0x7f04a75f4375]
/lib/x86_64-linux-gnu/libasan.so.5(+0x13bbd0)[0x7f04a7937bd0]
/lib/x86_64-linux-gnu/libasan.so.5(+0x13b031)[0x7f04a7937031]
/lib/x86_64-linux-gnu/libasan.so.5(+0x13b3b9)[0x7f04a79373b9]
/lib/x86_64-linux-gnu/libc.so.6(__cxa_finalize+0xce)[0x7f04a74dc15e]
/lib/x86_64-linux-gnu/libasan.so.5(+0x22be7)[0x7f04a781ebe7]

When it happens with test-sanitizer-address (clang), we see:

46928:M 28 Mar 2022 01:11:25.753 # Redis 255.255.255 crashed by signal: 11, si_code: 0
46928:M 28 Mar 2022 01:11:25.753 # Accessing address: 0x3e90000c3dc
46928:M 28 Mar 2022 01:11:25.754 # Killed by PID: 50140, UID: 1001
46928:M 28 Mar 2022 01:11:25.754 # Crashed running the instruction at: 0x7f2a964f289b

------ STACK TRACE ------
EIP:
/lib/x86_64-linux-gnu/libc.so.6(__sched_yield+0xb)[0x7f2a964f289b]

Backtrace:
../../../src/redis-server *:30000 [cluster](sigsegvHandler+0x193)[0x688c33]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0)[0x7f2a9660f3c0]
/lib/x86_64-linux-gnu/libc.so.6(__sched_yield+0xb)[0x7f2a964f289b]
../../../src/redis-server *:30000 [cluster][0x4e8765]
../../../src/redis-server *:30000 [cluster][0x4f32ba]
/lib/x86_64-linux-gnu/libc.so.6(dl_iterate_phdr+0x185)[0x7f2a9654f375]
../../../src/redis-server *:30000 [cluster][0x4f328f]
../../../src/redis-server *:30000 [cluster][0x4f0a58]
../../../src/redis-server *:30000 [cluster][0x4f09c2]
/lib/x86_64-linux-gnu/libc.so.6(+0x49a27)[0x7f2a96436a27]
/lib/x86_64-linux-gnu/libc.so.6(on_exit+0x0)[0x7f2a96436be0]
../../../src/redis-server *:30000 [cluster](serverCron+0xfa8)[0x5267b8]
../../../src/redis-server *:30000 [cluster](aeProcessEvents+0xb0a)[0x515eba]
../../../src/redis-server *:30000 [cluster](aeMain+0x3d)[0x5165fd]
../../../src/redis-server *:30000 [cluster](main+0xb81)[0x540251]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7f2a964140b3]
../../../src/redis-server *:30000 [cluster](_start+0x2e)[0x44f9ee]

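These traces indicate that the test suite infra is attempting to terminate the redis instance, and when the instance does not terminate, the test suite sends it a SIGSEGV in order to see where it is hung.
In both traces it is hung inside exit (the gcc trace goes through `__cxa_finalize` into libasan, the clang trace goes through `on_exit`).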

I tried to track down the commit that started it, and it appears to be #10293.
Looking at that commit, I realized it didn't affect this test / flow, other than replacing slots_info_pairs from an sds with a list.
I concluded that what could be happening is that the slot range is very fragmented, and that results in many allocations.
With sds, the result is a single allocation, and sds also has a greedy growth mechanism, but with adlist we end up with a great many small allocations.
This probably puts stress on ASAN and causes it to be slow at termination.

This commit improves the malloc efficiency of this mechanism by changing the adlist into an array that is realloc'ed with a greedy growth mechanism.
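
To illustrate the idea (this is a minimal sketch and not the code from this PR; the `slot_pairs` type, its fields and the helper names are all hypothetical), a flat array of start/end pairs that doubles its capacity when it fills up could look roughly like this:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical container: slot ranges stored as a flat array of
 * (start, end) values instead of one list node per value. */
typedef struct {
    uint16_t *pairs;   /* start0, end0, start1, end1, ... */
    size_t used;       /* number of uint16_t entries in use */
    size_t cap;        /* number of uint16_t entries allocated */
    size_t reallocs;   /* realloc() calls so far, kept only for illustration */
} slot_pairs;

/* Append one range. Returns 0 on success, -1 if realloc() fails. */
static int slot_pairs_add(slot_pairs *sp, uint16_t start, uint16_t end) {
    assert(start <= end);
    if (sp->used + 2 > sp->cap) {
        /* Greedy growth: double the capacity so the number of
         * reallocations stays logarithmic in the number of ranges. */
        size_t newcap = sp->cap ? sp->cap * 2 : 8;
        uint16_t *p = realloc(sp->pairs, newcap * sizeof(*p));
        if (p == NULL) return -1;
        sp->pairs = p;
        sp->cap = newcap;
        sp->reallocs++;
    }
    sp->pairs[sp->used++] = start;
    sp->pairs[sp->used++] = end;
    return 0;
}

/* Release the array and reset the container to its empty state. */
static void slot_pairs_free(slot_pairs *sp) {
    free(sp->pairs);
    sp->pairs = NULL;
    sp->used = sp->cap = sp->reallocs = 0;
}
```

In this sketch, a maximally fragmented node holding 8192 single-slot ranges ends up with one live allocation that was grown 12 times, whereas a linked list needs at least one allocation per node, all of which stay live until teardown.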
tests: https://github.com/redis/redis/actions/runs/2052717847

@oranagra oranagra requested a review from madolson March 28, 2022 14:15
Co-authored-by: Madelyn Olson <34459052+madolson@users.noreply.github.com>
@oranagra oranagra merged commit 3b1e65a into redis:unstable Mar 29, 2022
@oranagra oranagra deleted the cluster_slot_info_pairs branch March 29, 2022 07:05