Fix argument getpeername #6
Merged
kostja merged 1 commit into tarantool:master from zloidemon:master on Nov 2, 2012
Conversation
This was referenced Aug 15, 2014
zloidemon added a commit that referenced this pull request on Mar 24, 2015
lenkis added a commit that referenced this pull request on Jan 25, 2016
Revised the entire tutorial (file help_en_US.lua):

* Tested and fixed all sample code snippets.
* Corrected typos and inconsistent names in screen headers.
* Updated info about custom delimiters (screen #6).
* Replaced box.session with fiber (example on screen #10).
* Swapped screens #11/#12 (fibers/sockets) for smoother story flow.

Removed help-related stub logic (file help.lua).
lenkis added a commit that referenced this pull request on Jan 26, 2016
Revised the entire tutorial (file help_en_US.lua):

* Tested and fixed all sample code snippets.
* Corrected typos and inconsistent names in screen headers.
* Updated info about custom delimiters (screen #6).
* Replaced box.session with fiber (example on screen #10).
* Swapped screens #11/#12 (fibers/sockets) for smoother story flow.

Removed help-related stub logic (file help.lua).

Updated auto-tests to expect a new output result on help call.
bigbes pushed a commit that referenced this pull request on Jan 26, 2016
This was referenced Jun 7, 2016
nshy added a commit to nshy/tarantool that referenced this pull request on Dec 7, 2023
We may need to cancel a fiber that waits for a cord to finish. For this
purpose let's cancel the fiber started by cord_costart inside the cord.
Note that there is a race between stopping cancel_event in the cord and
triggering it using ev_async_send in the joining thread. AFAIU it is safe.
We also need to fix stopping the wal cord to address the stack-use-after-return
issue shown below. It arises because we did not stop the async that resides
in the wal endpoint, and the endpoint resides on the stack. Later, when we
stop the newly introduced cancel_event, we access the not-yet-stopped async,
which by that moment has gone out of scope.
```
==3224698==ERROR: AddressSanitizer: stack-use-after-return on address 0x7f654b3b0170 at pc 0x555a2817c282 bp 0x7f654ca55b30 sp 0x7f654ca55b28
WRITE of size 4 at 0x7f654b3b0170 thread T3
    #0 0x555a2817c281 in ev_async_stop /home/shiny/dev/tarantool/third_party/libev/ev.c:5492:37
    #1 0x555a27827738 in cord_thread_func /home/shiny/dev/tarantool/src/lib/core/fiber.c:1990:2
    #2 0x7f65574aa9ea in start_thread /usr/src/debug/glibc/glibc/nptl/pthread_create.c:444:8
    #3 0x7f655752e7cb in clone3 /usr/src/debug/glibc/glibc/misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
```
But after that we also need to temporarily comment out freeing the applier
threads. The issue is that applier_free runs first and then wal_free. The
former does not correctly free its resources: in particular, it does not
destroy the thread endpoint but frees its memory. As a result we get a
use-after-free on destroying the wal endpoint.
```
==3508646==ERROR: AddressSanitizer: heap-use-after-free on address 0x61b000001a30 at pc 0x5556ff1b08d8 bp 0x7f69cb7f65c0 sp 0x7f69cb7f65b8
WRITE of size 8 at 0x61b000001a30 thread T3
    #0 0x5556ff1b08d7 in rlist_del /home/shiny/dev/tarantool/src/lib/small/include/small/rlist.h:101:19
    #1 0x5556ff1b08d7 in cbus_endpoint_destroy /home/shiny/dev/tarantool/src/lib/core/cbus.c:256:2
    #2 0x5556feea1f2c in wal_writer_f /home/shiny/dev/tarantool/src/box/wal.c:1237:2
    #3 0x5556fea3eb57 in fiber_cxx_invoke(int (*)(__va_list_tag*), __va_list_tag*) /home/shiny/dev/tarantool/src/lib/core/fiber.h:1297:10
    #4 0x5556ff19af3e in fiber_loop /home/shiny/dev/tarantool/src/lib/core/fiber.c:1160:18
    #5 0x5556ffb0fbd2 in coro_init /home/shiny/dev/tarantool/third_party/coro/coro.c:108:3

0x61b000001a30 is located 1200 bytes inside of 1528-byte region [0x61b000001580,0x61b000001b78)
freed by thread T0 here:
    #0 0x5556fe9ed8a2 in __interceptor_free.part.0 asan_malloc_linux.cpp.o
    #1 0x5556fee4ef65 in applier_free /home/shiny/dev/tarantool/src/box/applier.cc:2175:3
    #2 0x5556fedfce01 in box_storage_free() /home/shiny/dev/tarantool/src/box/box.cc:5869:2
    #3 0x5556fedfce01 in box_free /home/shiny/dev/tarantool/src/box/box.cc:5936:2
    #4 0x5556fea3cfec in tarantool_free() /home/shiny/dev/tarantool/src/main.cc:575:2
    #5 0x5556fea3cfec in main /home/shiny/dev/tarantool/src/main.cc:1087:2
    #6 0x7f69d7445ccf in __libc_start_call_main /usr/src/debug/glibc/glibc/csu/../sysdeps/nptl/libc_start_call_main.h:58:16
```
Part of tarantool#8423
NO_DOC=internal
NO_CHANGELOG=internal
nshy added a commit to nshy/tarantool that referenced this pull request on Dec 7, 2023
We may need to cancel a fiber that waits for a cord to finish. For this
purpose let's cancel the fiber started by cord_costart inside the cord.
Note that there is a race between stopping cancel_event in the cord and
triggering it using ev_async_send in the joining thread. AFAIU it is safe.

We also need to fix stopping the wal cord to address the stack-use-after-return
issue shown below. It arises because we did not stop the async that resides
in the wal endpoint, and the endpoint resides on the stack. Later, when we
stop the newly introduced cancel_event, we access the not-yet-stopped async,
which by that moment has gone out of scope.
```
==3224698==ERROR: AddressSanitizer: stack-use-after-return on address 0x7f654b3b0170 at pc 0x555a2817c282 bp 0x7f654ca55b30 sp 0x7f654ca55b28
WRITE of size 4 at 0x7f654b3b0170 thread T3
    #0 0x555a2817c281 in ev_async_stop /home/shiny/dev/tarantool/third_party/libev/ev.c:5492:37
    #1 0x555a27827738 in cord_thread_func /home/shiny/dev/tarantool/src/lib/core/fiber.c:1990:2
    #2 0x7f65574aa9ea in start_thread /usr/src/debug/glibc/glibc/nptl/pthread_create.c:444:8
    #3 0x7f655752e7cb in clone3 /usr/src/debug/glibc/glibc/misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
```
But after that we need to properly destroy endpoints in the other threads
too. See, for example, the ASAN report for the applier thread below. The
issue is that the applier endpoint is linked with the wal endpoint, and the
applier endpoint's memory is freed (without proper destruction) when we
destroy the wal endpoint. A similar issue exists with the endpoints in
vinyl threads, although with them we get a SIGSEGV instead of a proper
ASAN report; the likely cause is that vinyl endpoints reside on the stack.
In the applier case we could just temporarily comment out freeing the
applier thread memory until proper applier shutdown is implemented, for the
sake of this patch, but we can't do the same for vinyl threads. Let's just
stop cbus_loop in both cases. This is not a full shutdown solution for
either applier or vinyl, as both may still have running fibers in their
threads; it is a temporary solution just for this patch. We add the
missing pieces in later patches.
```
==3508646==ERROR: AddressSanitizer: heap-use-after-free on address 0x61b000001a30 at pc 0x5556ff1b08d8 bp 0x7f69cb7f65c0 sp 0x7f69cb7f65b8
WRITE of size 8 at 0x61b000001a30 thread T3
    #0 0x5556ff1b08d7 in rlist_del /home/shiny/dev/tarantool/src/lib/small/include/small/rlist.h:101:19
    #1 0x5556ff1b08d7 in cbus_endpoint_destroy /home/shiny/dev/tarantool/src/lib/core/cbus.c:256:2
    #2 0x5556feea1f2c in wal_writer_f /home/shiny/dev/tarantool/src/box/wal.c:1237:2
    #3 0x5556fea3eb57 in fiber_cxx_invoke(int (*)(__va_list_tag*), __va_list_tag*) /home/shiny/dev/tarantool/src/lib/core/fiber.h:1297:10
    #4 0x5556ff19af3e in fiber_loop /home/shiny/dev/tarantool/src/lib/core/fiber.c:1160:18
    #5 0x5556ffb0fbd2 in coro_init /home/shiny/dev/tarantool/third_party/coro/coro.c:108:3

0x61b000001a30 is located 1200 bytes inside of 1528-byte region [0x61b000001580,0x61b000001b78)
freed by thread T0 here:
    #0 0x5556fe9ed8a2 in __interceptor_free.part.0 asan_malloc_linux.cpp.o
    #1 0x5556fee4ef65 in applier_free /home/shiny/dev/tarantool/src/box/applier.cc:2175:3
    #2 0x5556fedfce01 in box_storage_free() /home/shiny/dev/tarantool/src/box/box.cc:5869:2
    #3 0x5556fedfce01 in box_free /home/shiny/dev/tarantool/src/box/box.cc:5936:2
    #4 0x5556fea3cfec in tarantool_free() /home/shiny/dev/tarantool/src/main.cc:575:2
    #5 0x5556fea3cfec in main /home/shiny/dev/tarantool/src/main.cc:1087:2
    #6 0x7f69d7445ccf in __libc_start_call_main /usr/src/debug/glibc/glibc/csu/../sysdeps/nptl/libc_start_call_main.h:58:16
```
Part of tarantool#8423
NO_DOC=internal
NO_CHANGELOG=internal
nshy added a commit to nshy/tarantool that referenced this pull request on Dec 8, 2023
Here is the issue with replication shutdown: it crashes if bootstrap is in
progress. Bootstrap uses the resume_to_state API to wait for an applier
state of interest. resume_to_state normally pauses the applier fiber on
applier stop/off and wakes the bootstrap fiber to pass it the error. But if
the applier fiber is cancelled, as during shutdown, applier_pause returns
immediately, which leads to the assertion later. Even if we somehow ignored
this assertion, we would hit the assertion in the bootstrap fiber in
applier_wait_for_state, which expects the diag to be set in case of the
off/stop state; but the diag gets eaten by the applier fiber join in the
shutdown fiber.

On applier error, if there is a fiber using the resume_to_state API, we
first suspend the applier fiber and only exit it when the applier fiber
gets cancelled on applier stop. AFAIU this complicated mechanics of keeping
the fiber alive on errors exists only to preserve the fiber diag for the
bootstrap fiber. Instead, let's move the diag for bootstrap out of the
fiber; then we don't need to keep the fiber alive on errors. The current
approach to passing the diag has other oddities as well: we don't finish
disconnect immediately on errors, and applier_f has to return -1 or 0 on
errors depending on whether we expect another fiber to steal the fiber
diag or not.

Part of tarantool#8423
NO_TEST=rely on existing tests
NO_CHANGELOG=internal
NO_DOC=internal

Shutdown fiber stack:
```
#5  0x00007f84f0c54d26 in __assert_fail (assertion=0x564ffd5dfdec "fiber() == applier->fiber",
    file=0x564ffd5dedae "./src/box/applier.cc", line=2809,
    function=0x564ffd5dfdcf "void applier_pause(applier*)") at assert.c:101
#6  0x0000564ffd0a0780 in applier_pause (applier=0x564fff30b1e0)
    at /<snap>/dev/tarantool/src/box/applier.cc:2809
#7  0x0000564ffd0a08ab in applier_on_state_f (trigger=0x7f84f0780a60, event=0x564fff30b1e0)
    at /<snap>/dev/tarantool/src/box/applier.cc:2845
#8  0x0000564ffd20a873 in trigger_run_list (list=0x7f84f0680de0, event=0x564fff30b1e0)
    at /<snap>/tarantool/src/lib/core/trigger.cc:100
#9  0x0000564ffd20a991 in trigger_run (list=0x564fff30b818, event=0x564fff30b1e0)
    at /<snap>/tarantool/src/lib/core/trigger.cc:133
#10 0x0000564ffd0945cb in trigger_run_xc (list=0x564fff30b818, event=0x564fff30b1e0)
    at /<snap>/tarantool/src/lib/core/trigger.h:173
#11 0x0000564ffd09689a in applier_set_state (applier=0x564fff30b1e0, state=APPLIER_OFF)
    at /<snap>/tarantool/src/box/applier.cc:83
#12 0x0000564ffd0a0313 in applier_stop (applier=0x564fff30b1e0)
    at /<snap>/tarantool/src/box/applier.cc:2749
#13 0x0000564ffd08b8c7 in replication_shutdown ()
```

Bootstrap fiber stack:
```
#1  0x0000564ffd1e0c95 in fiber_yield_impl (will_switch_back=true)
    at /<snap>/tarantool/src/lib/core/fiber.c:863
#2  0x0000564ffd1e0caa in fiber_yield () at /<snap>/tarantool/src/lib/core/fiber.c:870
#3  0x0000564ffd1e0f0b in fiber_yield_timeout (delay=3153600000)
    at /<snap>/tarantool/src/lib/core/fiber.c:914
#4  0x0000564ffd1ed401 in fiber_cond_wait_timeout (c=0x7f84f0780aa8, timeout=3153600000)
    at /<snap>/tarantool/src/lib/core/fiber_cond.c:107
#5  0x0000564ffd1ed61c in fiber_cond_wait_deadline (c=0x7f84f0780aa8, deadline=3155292512.1470647)
    at /<snap>/tarantool/src/lib/core/fiber_cond.c:128
#6  0x0000564ffd0a0a00 in applier_wait_for_state(applier_on_state*, double) (trigger=0x7f84f0780a60, timeout=3153600000)
    at /<snap>/tarantool/src/box/applier.cc:2877
#7  0x0000564ffd0a0bbb in applier_resume_to_state(applier*, applier_state, double) (applier=0x564fff30b1e0, state=APPLIER_JOINED, timeout=3153600000)
    at /<snap>/tarantool/src/box/applier.cc:2898
#8  0x0000564ffd074bf4 in bootstrap_from_master(replica*) (master=0x564fff2b30d0)
    at /<snap>/tarantool/src/box/box.cc:4980
#9  0x0000564ffd074eed in bootstrap(bool*) (is_bootstrap_leader=0x7f84f0780b81)
    at /<snap>/tarantool/src/box/box.cc:5081
#10 0x0000564ffd07613d in box_cfg_xc() () at /<snap>/tarantool/src/box/box.cc:5427
```
nshy added a commit to nshy/tarantool that referenced this pull request on Dec 15, 2023
Actually, the newly introduced 'replication/shutdown_test.lua' has an issue
(in all three subtests): the master shuts down in time, but the replica
crashes. Below are the replica stacks during the crash.

The story is this. The master is shut down, so the replica gets a
`SocketError` and sleeps on the reconnect timeout. Now the replica is shut
down. The `applier_f` loop exits immediately without waking the bootstrap
fiber. Next we change the state to OFF in the shutdown fiber in
`applier_stop`, triggering the `resume_to_state` trigger, which calls
`applier_pause`; that function expects the current fiber to be the applier
fiber. So we had better wake up the bootstrap fiber in this case. But the
simple change of checking the cancel state after the reconnection sleep
does not work yet: this time `applier_pause` exits immediately, which is
not how `resume_to_state` is supposed to work. See, if we return 0 from the
applier fiber, we clear the fiber diag on fiber death, which
`resume_to_state` does not expect. If we return -1, then `resume_to_state`
steals the fiber diag, which `fiber_join` in `applier_stop` does not
expect. TODO (check the -1 case in an experiment!)

Even if we somehow ignored this assertion, we would hit the assertion in
the bootstrap fiber in applier_wait_for_state, which expects the diag to
be set in case of the off/stop state; but the diag gets eaten by the
applier fiber join in the shutdown fiber.

On applier error, if there is a fiber using the resume_to_state API, we
first suspend the applier fiber and only exit it when the applier fiber
gets cancelled on applier stop. AFAIU this complicated mechanics of keeping
the fiber alive on errors exists only to preserve the fiber diag for the
bootstrap fiber. Instead, let's move the diag for bootstrap out of the
fiber; then we don't need to keep the fiber alive on errors. The current
approach to passing the diag has other oddities as well: we don't finish
disconnect immediately on errors, and applier_f has to return -1 or 0 on
errors depending on whether we expect another fiber to steal the fiber
diag or not.

Part of tarantool#8423
NO_TEST=rely on existing tests
NO_CHANGELOG=internal
NO_DOC=internal

Shutdown fiber stack (crash):
```
#5  0x00007f84f0c54d26 in __assert_fail (assertion=0x564ffd5dfdec "fiber() == applier->fiber",
    file=0x564ffd5dedae "./src/box/applier.cc", line=2809,
    function=0x564ffd5dfdcf "void applier_pause(applier*)") at assert.c:101
#6  0x0000564ffd0a0780 in applier_pause (applier=0x564fff30b1e0)
    at /<snap>/dev/tarantool/src/box/applier.cc:2809
#7  0x0000564ffd0a08ab in applier_on_state_f (trigger=0x7f84f0780a60, event=0x564fff30b1e0)
    at /<snap>/dev/tarantool/src/box/applier.cc:2845
#8  0x0000564ffd20a873 in trigger_run_list (list=0x7f84f0680de0, event=0x564fff30b1e0)
    at /<snap>/tarantool/src/lib/core/trigger.cc:100
#9  0x0000564ffd20a991 in trigger_run (list=0x564fff30b818, event=0x564fff30b1e0)
    at /<snap>/tarantool/src/lib/core/trigger.cc:133
#10 0x0000564ffd0945cb in trigger_run_xc (list=0x564fff30b818, event=0x564fff30b1e0)
    at /<snap>/tarantool/src/lib/core/trigger.h:173
#11 0x0000564ffd09689a in applier_set_state (applier=0x564fff30b1e0, state=APPLIER_OFF)
    at /<snap>/tarantool/src/box/applier.cc:83
#12 0x0000564ffd0a0313 in applier_stop (applier=0x564fff30b1e0)
    at /<snap>/tarantool/src/box/applier.cc:2749
#13 0x0000564ffd08b8c7 in replication_shutdown ()
```

Bootstrap fiber stack:
```
#1  0x0000564ffd1e0c95 in fiber_yield_impl (will_switch_back=true)
    at /<snap>/tarantool/src/lib/core/fiber.c:863
#2  0x0000564ffd1e0caa in fiber_yield () at /<snap>/tarantool/src/lib/core/fiber.c:870
#3  0x0000564ffd1e0f0b in fiber_yield_timeout (delay=3153600000)
    at /<snap>/tarantool/src/lib/core/fiber.c:914
#4  0x0000564ffd1ed401 in fiber_cond_wait_timeout (c=0x7f84f0780aa8, timeout=3153600000)
    at /<snap>/tarantool/src/lib/core/fiber_cond.c:107
#5  0x0000564ffd1ed61c in fiber_cond_wait_deadline (c=0x7f84f0780aa8, deadline=3155292512.1470647)
    at /<snap>/tarantool/src/lib/core/fiber_cond.c:128
#6  0x0000564ffd0a0a00 in applier_wait_for_state(applier_on_state*, double) (trigger=0x7f84f0780a60, timeout=3153600000)
    at /<snap>/tarantool/src/box/applier.cc:2877
#7  0x0000564ffd0a0bbb in applier_resume_to_state(applier*, applier_state, double) (applier=0x564fff30b1e0, state=APPLIER_JOINED, timeout=3153600000)
    at /<snap>/tarantool/src/box/applier.cc:2898
#8  0x0000564ffd074bf4 in bootstrap_from_master(replica*) (master=0x564fff2b30d0)
    at /<snap>/tarantool/src/box/box.cc:4980
#9  0x0000564ffd074eed in bootstrap(bool*) (is_bootstrap_leader=0x7f84f0780b81)
    at /<snap>/tarantool/src/box/box.cc:5081
#10 0x0000564ffd07613d in box_cfg_xc() () at /<snap>/tarantool/src/box/box.cc:5427
```
nshy added a commit to nshy/tarantool that referenced this pull request on Dec 29, 2023
There is an issue with graceful replication shutdown. A good example (the
one `replication/shutdown_test.lua` is based on) is bootstrapping a replica
with wrong auth in the replication URI. In this case the applier is
sleeping in the reconnect delay and the bootstrap code is waiting for the
READY state. Now comes server shutdown: the applier is stopped during
shutdown and we hit assertion [1]. The issue is that we don't expect the
trigger set by `applier_resume_to_state` of bootstrap.

We can clear the trigger, as in the other places, by calling
`applier_disconnect(APPLIER_OFF)` in case the applier is cancelled during
the reconnect sleep. But this does not help: since the applier fiber is
cancelled, it returns immediately from `applier_disconnect`, unlike in the
other cases. Now, if we return 0 from the applier fiber, then the applier
fiber diag is cleared on fiber termination and we get the assertion in the
trigger [2]. If we return -1, then the diag is stolen by the trigger and we
get the assertion on the applier fiber join [3].

AFAIU the trigger installed by `applier_resume_to_state` pauses the applier
fiber on error so that the error diag can be stolen by that function; then
on stop the applier fiber is cancelled and exits with 0, so the join does
not expect a diag. Let's move away from this solution of pausing the fiber
on error just to keep the diag: let the fiber finish on error but keep the
diag in the applier state. Also, it looks like we return -1 from the
applier fiber only to keep the diag to be shown by `box.info.replication`,
as we never use the result of this fiber join. Now that we get this info
from `applier->diag`, we can return 0, as in the other places.

Part of tarantool#8423
NO_CHANGELOG=fixing unreleased issue
NO_DOC=bugfix

[1] Issue assertion stack:
```
#5  0x00007fe877a54d26 in __assert_fail (assertion=0x5637b683b07c "fiber() == applier->fiber",
    file=0x5637b683a03e "./src/box/applier.cc", line=2809,
    function=0x5637b683b05f "void applier_pause(applier*)") at assert.c:101
#6  0x00005637b62f0f20 in applier_pause (applier=0x5637b7a87a60)
    at /home/shiny/dev/tarantool/src/box/applier.cc:2809
#7  0x00005637b62f104b in applier_on_state_f (trigger=0x7fe877380a60, event=0x5637b7a87a60)
    at /home/shiny/dev/tarantool/src/box/applier.cc:2845
#8  0x00005637b645d2e3 in trigger_run_list (list=0x7fe877280de0, event=0x5637b7a87a60)
    at /home/shiny/dev/tarantool/src/lib/core/trigger.cc:100
#9  0x00005637b645d401 in trigger_run (list=0x5637b7a88098, event=0x5637b7a87a60)
    at /home/shiny/dev/tarantool/src/lib/core/trigger.cc:133
#10 0x00005637b62e4d6b in trigger_run_xc (list=0x5637b7a88098, event=0x5637b7a87a60)
    at /home/shiny/dev/tarantool/src/lib/core/trigger.h:173
#11 0x00005637b62e703a in applier_set_state (applier=0x5637b7a87a60, state=APPLIER_OFF)
    at /home/shiny/dev/tarantool/src/box/applier.cc:83
#12 0x00005637b62f0ab3 in applier_stop (applier=0x5637b7a87a60)
    at /home/shiny/dev/tarantool/src/box/applier.cc:2749
#13 0x00005637b62dc189 in replication_shutdown ()
```

[2] Stack if we return 0:
```
#5  0x00007f84fdc54d26 in __assert_fail (assertion=0x5590c7018130 "!diag_is_empty(&applier->fiber->diag)",
    file=0x5590c701703e "./src/box/applier.cc", line=2889,
    function=0x5590c70180b0 "int applier_wait_for_state(applier_on_state*, double)") at assert.c:101
#6  0x00005590c6ace340 in applier_wait_for_state (trigger=0x7f84fd780a60, timeout=3153600000)
    at /home/shiny/dev/tarantool/src/box/applier.cc:2889
#7  0x00005590c6ace426 in applier_resume_to_state (applier=0x5590c8c49a60, state=APPLIER_READY, timeout=3153600000)
    at /home/shiny/dev/tarantool/src/box/applier.cc:2903
#8  0x00005590c6aa230b in bootstrap_from_master (master=0x5590c8c003c0)
    at /home/shiny/dev/tarantool/src/box/box.cc:4943
#9  0x00005590c6aa27a5 in bootstrap (is_bootstrap_leader=0x7f84fd780b81)
    at /home/shiny/dev/tarantool/src/box/box.cc:5085
#10 0x00005590c6aa39f5 in box_cfg_xc () at /home/shiny/dev/tarantool/src/box/box.cc:5431
#11 0x00005590c6aa3fbf in box_cfg () at /home/shiny/dev/tarantool/src/box/box.cc:5560
```

[3] Stack if we return -1:
```
#5  0x00007f8dc1054d26 in __assert_fail (assertion=0x55ad141f08a8 "!diag_is_empty(&fiber->diag)",
    file=0x55ad141f03d8 "./src/lib/core/fiber.c", line=821,
    function=0x55ad141f10a0 <__PRETTY_FUNCTION__.45> "fiber_join_timeout") at assert.c:101
#6  0x000055ad13c582c2 in fiber_join_timeout (fiber=0x7f8dc0811150, timeout=3153600000)
    at /home/shiny/dev/tarantool/src/lib/core/fiber.c:821
#7  0x000055ad13c57f1a in fiber_join (fiber=0x7f8dc0811150)
    at /home/shiny/dev/tarantool/src/lib/core/fiber.c:768
#8  0x000055ad13b15b11 in applier_stop (applier=0x55ad14513a60)
    at /home/shiny/dev/tarantool/src/box/applier.cc:2752
#9  0x000055ad13b01189 in replication_shutdown ()
```
nshy
added a commit
to nshy/tarantool
that referenced
this pull request
Dec 29, 2023
There is an issue with graceful replication shutdown. A good example (on which `replication/shutdown_test.lua` is based) is bootstrapping a replica with a wrong auth in the replication URI. In this case the applier is sleeping in the reconnect delay and the bootstrap code is waiting for the READY state. Now comes server shutdown. The applier is stopped during shutdown and we hit assertion [1]. The issue is that we don't expect the trigger set by `applier_resume_to_state` during bootstrap. We can clear the trigger, as in other places, by calling `applier_disconnect(APPLIER_OFF)` in case the applier is cancelled during the reconnect sleep. But this does not help. The problem is that, since the applier fiber is cancelled, it returns immediately from `applier_disconnect`, unlike in the other cases. Now if we return 0 from the applier fiber, the applier fiber diag is cleared on fiber termination and we get an assertion in the trigger [2]. If we return -1, the diag is stolen by the trigger and we get an assertion on the applier fiber join [3]. AFAIU the trigger installed by `applier_resume_to_state` pauses the applier fiber on error so that the error diag can be stolen by that function. Then, on stop, the applier fiber is cancelled and exits with 0, so the join does not expect a diag. Let's move away from this solution of pausing the fiber on error just to keep the diag. Let the fiber finish on error but keep the diag in the applier state. It also looks like we return -1 from the applier fiber only to keep the diag to be shown by `box.info.replication`, as we never use the result of this fiber join. Now that we get this info from `applier->diag`, we can return 0 as in other places.

Part of tarantool#8423

NO_CHANGELOG=fixing unreleased issue
NO_DOC=bugfix

[1] Issue assertion stack:
```
tarantool#5  0x00007fe877a54d26 in __assert_fail (assertion=0x5637b683b07c "fiber() == applier->fiber", file=0x5637b683a03e "./src/box/applier.cc", line=2809, function=0x5637b683b05f "void applier_pause(applier*)") at assert.c:101
tarantool#6  0x00005637b62f0f20 in applier_pause (applier=0x5637b7a87a60) at /home/shiny/dev/tarantool/src/box/applier.cc:2809
tarantool#7  0x00005637b62f104b in applier_on_state_f (trigger=0x7fe877380a60, event=0x5637b7a87a60) at /home/shiny/dev/tarantool/src/box/applier.cc:2845
tarantool#8  0x00005637b645d2e3 in trigger_run_list (list=0x7fe877280de0, event=0x5637b7a87a60) at /home/shiny/dev/tarantool/src/lib/core/trigger.cc:100
tarantool#9  0x00005637b645d401 in trigger_run (list=0x5637b7a88098, event=0x5637b7a87a60) at /home/shiny/dev/tarantool/src/lib/core/trigger.cc:133
tarantool#10 0x00005637b62e4d6b in trigger_run_xc (list=0x5637b7a88098, event=0x5637b7a87a60) at /home/shiny/dev/tarantool/src/lib/core/trigger.h:173
tarantool#11 0x00005637b62e703a in applier_set_state (applier=0x5637b7a87a60, state=APPLIER_OFF) at /home/shiny/dev/tarantool/src/box/applier.cc:83
tarantool#12 0x00005637b62f0ab3 in applier_stop (applier=0x5637b7a87a60) at /home/shiny/dev/tarantool/src/box/applier.cc:2749
tarantool#13 0x00005637b62dc189 in replication_shutdown ()
```

[2] Stack if we return 0:
```
tarantool#5  0x00007f84fdc54d26 in __assert_fail (assertion=0x5590c7018130 "!diag_is_empty(&applier->fiber->diag)", file=0x5590c701703e "./src/box/applier.cc", line=2889, function=0x5590c70180b0 "int applier_wait_for_state(applier_on_state*, double)") at assert.c:101
tarantool#6  0x00005590c6ace340 in applier_wait_for_state (trigger=0x7f84fd780a60, timeout=3153600000) at /home/shiny/dev/tarantool/src/box/applier.cc:2889
tarantool#7  0x00005590c6ace426 in applier_resume_to_state (applier=0x5590c8c49a60, state=APPLIER_READY, timeout=3153600000) at /home/shiny/dev/tarantool/src/box/applier.cc:2903
tarantool#8  0x00005590c6aa230b in bootstrap_from_master (master=0x5590c8c003c0) at /home/shiny/dev/tarantool/src/box/box.cc:4943
tarantool#9  0x00005590c6aa27a5 in bootstrap (is_bootstrap_leader=0x7f84fd780b81) at /home/shiny/dev/tarantool/src/box/box.cc:5085
tarantool#10 0x00005590c6aa39f5 in box_cfg_xc () at /home/shiny/dev/tarantool/src/box/box.cc:5431
tarantool#11 0x00005590c6aa3fbf in box_cfg () at /home/shiny/dev/tarantool/src/box/box.cc:5560
```

[3] Stack if we return -1:
```
tarantool#5  0x00007f8dc1054d26 in __assert_fail (assertion=0x55ad141f08a8 "!diag_is_empty(&fiber->diag)", file=0x55ad141f03d8 "./src/lib/core/fiber.c", line=821, function=0x55ad141f10a0 <__PRETTY_FUNCTION__.45> "fiber_join_timeout") at assert.c:101
tarantool#6  0x000055ad13c582c2 in fiber_join_timeout (fiber=0x7f8dc0811150, timeout=3153600000) at /home/shiny/dev/tarantool/src/lib/core/fiber.c:821
tarantool#7  0x000055ad13c57f1a in fiber_join (fiber=0x7f8dc0811150) at /home/shiny/dev/tarantool/src/lib/core/fiber.c:768
tarantool#8  0x000055ad13b15b11 in applier_stop (applier=0x55ad14513a60) at /home/shiny/dev/tarantool/src/box/applier.cc:2752
tarantool#9  0x000055ad13b01189 in replication_shutdown ()
```
nshy
added a commit
to nshy/tarantool
that referenced
this pull request
Jan 10, 2024
There is an issue with graceful replication shutdown. A good example (on which `replication/shutdown_test.lua` is based) is bootstrapping a replica with a wrong auth in the replication URI. In this case the applier is sleeping in the reconnect delay and the bootstrap code is waiting for the READY state. Now comes server shutdown. The applier is stopped during shutdown and we hit assertion [1]. The issue is that we miss the bootstrap fiber notification that the applier fiber is cancelled. We can fix that, but then another issue arises. The bootstrap fiber steals the applier fiber's diag in `applier_wait_for_state`, and the later join in `applier_stop` hits an assertion because a diag is expected. We can fix that as well by copying the error in `applier_wait_for_state` instead of stealing it, and it looks like a simple solution. But the approach has some drawbacks. The first, already present, is that we rely on the dead applier fiber's diag for the `*.replication.*.upstream.message` statistic in case the applier is not yet stopped. The second, newly arising, is that during `applier_stop` we overwrite the current diag due to the applier fiber cancel and join. So let's instead keep the diag in `applier->diag`. Let's also copy the error in `applier_wait_for_state` so that the fiber error does not disappear in the bootstrap case. Let's also not pause the applier fiber in `applier_on_state_f` when the applier is stopped/off: AFAIU we do it only to keep the diag, which would otherwise be cleared on return from the fiber with a 0 result, and now we don't need this. Besides, if the fiber is cancelled this logic does not work anyway, and we'd better have a single logic for any type of error path. It also looks like we return -1 from the applier fiber only to keep the diag to be shown by `box.info.replication`, as we never use the result of this fiber join. Now that we get this info from `applier->diag`, we can return 0 as in other places.

Part of tarantool#8423

NO_CHANGELOG=fixing unreleased issue
NO_DOC=bugfix

[1] Issue assertion stack:
```
tarantool#5  0x00007fe877a54d26 in __assert_fail (assertion=0x5637b683b07c "fiber() == applier->fiber", file=0x5637b683a03e "./src/box/applier.cc", line=2809, function=0x5637b683b05f "void applier_pause(applier*)") at assert.c:101
tarantool#6  0x00005637b62f0f20 in applier_pause (applier=0x5637b7a87a60) at /home/shiny/dev/tarantool/src/box/applier.cc:2809
tarantool#7  0x00005637b62f104b in applier_on_state_f (trigger=0x7fe877380a60, event=0x5637b7a87a60) at /home/shiny/dev/tarantool/src/box/applier.cc:2845
tarantool#8  0x00005637b645d2e3 in trigger_run_list (list=0x7fe877280de0, event=0x5637b7a87a60) at /home/shiny/dev/tarantool/src/lib/core/trigger.cc:100
tarantool#9  0x00005637b645d401 in trigger_run (list=0x5637b7a88098, event=0x5637b7a87a60) at /home/shiny/dev/tarantool/src/lib/core/trigger.cc:133
tarantool#10 0x00005637b62e4d6b in trigger_run_xc (list=0x5637b7a88098, event=0x5637b7a87a60) at /home/shiny/dev/tarantool/src/lib/core/trigger.h:173
tarantool#11 0x00005637b62e703a in applier_set_state (applier=0x5637b7a87a60, state=APPLIER_OFF) at /home/shiny/dev/tarantool/src/box/applier.cc:83
tarantool#12 0x00005637b62f0ab3 in applier_stop (applier=0x5637b7a87a60) at /home/shiny/dev/tarantool/src/box/applier.cc:2749
tarantool#13 0x00005637b62dc189 in replication_shutdown ()
```
nshy
added a commit
to nshy/tarantool
that referenced
this pull request
Jan 11, 2024
This implies finishing replication TX fibers and stopping applier threads. This is easy to do using the existing applier_stop. We also need to make sure that no client fibers are in replication code after shutdown; otherwise we may have difficulties (assertions) while freeing replication resources. This goal has two sides: first, we have to finish client fibers waiting in replication code, and second, we should not allow them to wait after shutdown is done. We could probably achieve the first side by just stopping appliers, but in that case the client would get an error other than FiberIsCancelled, which is the nicer one to get. So the approach is to track client fibers in replication code and cancel them on shutdown. This approach is also aligned with the iproto/relay shutdown. There is an issue with graceful replication shutdown, though. A good example (on which `replication/shutdown_test.lua` is based) is bootstrapping a replica with a wrong auth in the replication URI. In this case the applier is sleeping in the reconnect delay and the bootstrap code is waiting for the READY state. Now comes server shutdown. The applier is stopped during shutdown and we hit assertion [1]. The issue is that we miss the bootstrap fiber notification that the applier fiber is cancelled. That's why the change with `fiber_testcancel` in `applier_f`. We also drop the assertion in `replica_on_applier_sync` because the applier can switch to the OFF state from any previous state if we cancel the applier fiber.

Part of tarantool#8423

[1] Issue assertion stack:
```
tarantool#5  0x00007fe877a54d26 in __assert_fail (assertion=0x5637b683b07c "fiber() == applier->fiber", file=0x5637b683a03e "./src/box/applier.cc", line=2809, function=0x5637b683b05f "void applier_pause(applier*)") at assert.c:101
tarantool#6  0x00005637b62f0f20 in applier_pause (applier=0x5637b7a87a60) at /home/shiny/dev/tarantool/src/box/applier.cc:2809
tarantool#7  0x00005637b62f104b in applier_on_state_f (trigger=0x7fe877380a60, event=0x5637b7a87a60) at /home/shiny/dev/tarantool/src/box/applier.cc:2845
tarantool#8  0x00005637b645d2e3 in trigger_run_list (list=0x7fe877280de0, event=0x5637b7a87a60) at /home/shiny/dev/tarantool/src/lib/core/trigger.cc:100
tarantool#9  0x00005637b645d401 in trigger_run (list=0x5637b7a88098, event=0x5637b7a87a60) at /home/shiny/dev/tarantool/src/lib/core/trigger.cc:133
tarantool#10 0x00005637b62e4d6b in trigger_run_xc (list=0x5637b7a88098, event=0x5637b7a87a60) at /home/shiny/dev/tarantool/src/lib/core/trigger.h:173
tarantool#11 0x00005637b62e703a in applier_set_state (applier=0x5637b7a87a60, state=APPLIER_OFF) at /home/shiny/dev/tarantool/src/box/applier.cc:83
tarantool#12 0x00005637b62f0ab3 in applier_stop (applier=0x5637b7a87a60) at /home/shiny/dev/tarantool/src/box/applier.cc:2749
tarantool#13 0x00005637b62dc189 in replication_shutdown ()
```

NO_CHANGELOG=internal
NO_DOC=internal
locker
pushed a commit
that referenced
this pull request
Jan 12, 2024
This implies finishing replication TX fibers and stopping applier threads. This is easy to do using the existing applier_stop. We also need to make sure that no client fibers are in replication code after shutdown; otherwise we may have difficulties (assertions) while freeing replication resources. This goal has two sides: first, we have to finish client fibers waiting in replication code, and second, we should not allow them to wait after shutdown is done. We could probably achieve the first side by just stopping appliers, but in that case the client would get an error other than FiberIsCancelled, which is the nicer one to get. So the approach is to track client fibers in replication code and cancel them on shutdown. This approach is also aligned with the iproto/relay shutdown. There is an issue with graceful replication shutdown, though. A good example (on which `replication/shutdown_test.lua` is based) is bootstrapping a replica with a wrong auth in the replication URI. In this case the applier is sleeping in the reconnect delay and the bootstrap code is waiting for the READY state. Now comes server shutdown. The applier is stopped during shutdown and we hit assertion [1]. The issue is that we miss the bootstrap fiber notification that the applier fiber is cancelled. That's why the change with `fiber_testcancel` in `applier_f`. We also drop the assertion in `replica_on_applier_sync` because the applier can switch to the OFF state from any previous state if we cancel the applier fiber.

Part of #8423

[1] Issue assertion stack:
```
#5  0x00007fe877a54d26 in __assert_fail (assertion=0x5637b683b07c "fiber() == applier->fiber", file=0x5637b683a03e "./src/box/applier.cc", line=2809, function=0x5637b683b05f "void applier_pause(applier*)") at assert.c:101
#6  0x00005637b62f0f20 in applier_pause (applier=0x5637b7a87a60) at /home/shiny/dev/tarantool/src/box/applier.cc:2809
#7  0x00005637b62f104b in applier_on_state_f (trigger=0x7fe877380a60, event=0x5637b7a87a60) at /home/shiny/dev/tarantool/src/box/applier.cc:2845
#8  0x00005637b645d2e3 in trigger_run_list (list=0x7fe877280de0, event=0x5637b7a87a60) at /home/shiny/dev/tarantool/src/lib/core/trigger.cc:100
#9  0x00005637b645d401 in trigger_run (list=0x5637b7a88098, event=0x5637b7a87a60) at /home/shiny/dev/tarantool/src/lib/core/trigger.cc:133
#10 0x00005637b62e4d6b in trigger_run_xc (list=0x5637b7a88098, event=0x5637b7a87a60) at /home/shiny/dev/tarantool/src/lib/core/trigger.h:173
#11 0x00005637b62e703a in applier_set_state (applier=0x5637b7a87a60, state=APPLIER_OFF) at /home/shiny/dev/tarantool/src/box/applier.cc:83
#12 0x00005637b62f0ab3 in applier_stop (applier=0x5637b7a87a60) at /home/shiny/dev/tarantool/src/box/applier.cc:2749
#13 0x00005637b62dc189 in replication_shutdown ()
```

NO_CHANGELOG=internal
NO_DOC=internal
nshy
added a commit
to nshy/tarantool
that referenced
this pull request
Feb 8, 2024
We create a snapshot on the SIGUSR1 signal in a newly spawned system fiber. It can interfere with Tarantool shutdown. In particular, there is an assertion on shutdown during such a snapshot. Let's shut down this fiber too.

```
tarantool#5 0x00007e7ec9a54d26 in __assert_fail (assertion=0x63ad06748400 "pm_atomic_load(&cord_count) == 0", file=0x63ad067478b8 "./src/lib/core/fiber.c", line=2290, function=0x63ad06748968 <__PRETTY_FUNCTION__.6> "fiber_free") at assert.c:101
tarantool#6 0x000063ad061a6a91 in fiber_free () at /home/shiny/dev/tarantool/src/lib/core/fiber.c:2290
tarantool#7 0x000063ad05edc216 in tarantool_free () at /home/shiny/dev/tarantool/src/main.cc:632
tarantool#8 0x000063ad05edd144 in main (argc=1, argv=0x63ad079ca3b0)
```

NO_CHANGELOG=internal
NO_DOC=internal
nshy
added a commit
to nshy/tarantool
that referenced
this pull request
Feb 9, 2024
We create a snapshot on the SIGUSR1 signal in a newly spawned system fiber. It can interfere with Tarantool shutdown. In particular, there is an assertion on shutdown during such a snapshot. Let's shut down this fiber too.

```
tarantool#5 0x00007e7ec9a54d26 in __assert_fail (assertion=0x63ad06748400 "pm_atomic_load(&cord_count) == 0", file=0x63ad067478b8 "./src/lib/core/fiber.c", line=2290, function=0x63ad06748968 <__PRETTY_FUNCTION__.6> "fiber_free") at assert.c:101
tarantool#6 0x000063ad061a6a91 in fiber_free () at /home/shiny/dev/tarantool/src/lib/core/fiber.c:2290
tarantool#7 0x000063ad05edc216 in tarantool_free () at /home/shiny/dev/tarantool/src/main.cc:632
tarantool#8 0x000063ad05edd144 in main (argc=1, argv=0x63ad079ca3b0)
```

Part of tarantool#8423

NO_CHANGELOG=internal
NO_DOC=internal
nshy
added a commit
to nshy/tarantool
that referenced
this pull request
Feb 14, 2024
We create a snapshot on the SIGUSR1 signal in a newly spawned system fiber. It can interfere with Tarantool shutdown. In particular, there is an assertion on shutdown during such a snapshot. Let's shut down this fiber too.

```
tarantool#5 0x00007e7ec9a54d26 in __assert_fail (assertion=0x63ad06748400 "pm_atomic_load(&cord_count) == 0", file=0x63ad067478b8 "./src/lib/core/fiber.c", line=2290, function=0x63ad06748968 <__PRETTY_FUNCTION__.6> "fiber_free") at assert.c:101
tarantool#6 0x000063ad061a6a91 in fiber_free () at /home/shiny/dev/tarantool/src/lib/core/fiber.c:2290
tarantool#7 0x000063ad05edc216 in tarantool_free () at /home/shiny/dev/tarantool/src/main.cc:632
tarantool#8 0x000063ad05edd144 in main (argc=1, argv=0x63ad079ca3b0)
```

NO_CHANGELOG=internal
NO_DOC=internal
nshy
added a commit
to nshy/tarantool
that referenced
this pull request
Feb 15, 2024
We create a snapshot on the SIGUSR1 signal in a newly spawned system fiber. It can interfere with Tarantool shutdown. In particular, there is an assertion on shutdown during such a snapshot. Let's shut down this fiber too.

```
tarantool#5 0x00007e7ec9a54d26 in __assert_fail (assertion=0x63ad06748400 "pm_atomic_load(&cord_count) == 0", file=0x63ad067478b8 "./src/lib/core/fiber.c", line=2290, function=0x63ad06748968 <__PRETTY_FUNCTION__.6> "fiber_free") at assert.c:101
tarantool#6 0x000063ad061a6a91 in fiber_free () at /home/shiny/dev/tarantool/src/lib/core/fiber.c:2290
tarantool#7 0x000063ad05edc216 in tarantool_free () at /home/shiny/dev/tarantool/src/main.cc:632
tarantool#8 0x000063ad05edd144 in main (argc=1, argv=0x63ad079ca3b0)
```

Part of tarantool#8423

NO_CHANGELOG=internal
NO_DOC=internal
nshy
added a commit
to nshy/tarantool
that referenced
this pull request
Feb 16, 2024
We create a snapshot on the SIGUSR1 signal in a newly spawned system fiber. It can interfere with Tarantool shutdown. In particular, there is an assertion on shutdown during such a snapshot because of the cord making the snapshot. Let's shut down this fiber too.

```
tarantool#5 0x00007e7ec9a54d26 in __assert_fail (assertion=0x63ad06748400 "pm_atomic_load(&cord_count) == 0", file=0x63ad067478b8 "./src/lib/core/fiber.c", line=2290, function=0x63ad06748968 <__PRETTY_FUNCTION__.6> "fiber_free") at assert.c:101
tarantool#6 0x000063ad061a6a91 in fiber_free () at /home/shiny/dev/tarantool/src/lib/core/fiber.c:2290
tarantool#7 0x000063ad05edc216 in tarantool_free () at /home/shiny/dev/tarantool/src/main.cc:632
tarantool#8 0x000063ad05edd144 in main (argc=1, argv=0x63ad079ca3b0)
```

Part of tarantool#8423

NO_CHANGELOG=internal
NO_DOC=internal
nshy
added a commit
to nshy/tarantool
that referenced
this pull request
Feb 16, 2024
We create a snapshot on the SIGUSR1 signal in a newly spawned system fiber. It can interfere with Tarantool shutdown. In particular, there is an assertion on shutdown during such a snapshot because of the cord making the snapshot. Let's just trigger snapshot making in the gc subsystem, in its own worker fiber.

```
tarantool#5 0x00007e7ec9a54d26 in __assert_fail (assertion=0x63ad06748400 "pm_atomic_load(&cord_count) == 0", file=0x63ad067478b8 "./src/lib/core/fiber.c", line=2290, function=0x63ad06748968 <__PRETTY_FUNCTION__.6> "fiber_free") at assert.c:101
tarantool#6 0x000063ad061a6a91 in fiber_free () at /home/shiny/dev/tarantool/src/lib/core/fiber.c:2290
tarantool#7 0x000063ad05edc216 in tarantool_free () at /home/shiny/dev/tarantool/src/main.cc:632
tarantool#8 0x000063ad05edd144 in main (argc=1, argv=0x63ad079ca3b0)
```

Part of tarantool#8423

NO_CHANGELOG=internal
NO_DOC=internal
nshy
added a commit
to nshy/tarantool
that referenced
this pull request
Feb 19, 2024
We create a snapshot on the SIGUSR1 signal in a newly spawned system fiber. It can interfere with Tarantool shutdown. In particular, there is an assertion on shutdown during such a snapshot because of the cord making the snapshot. Let's just trigger snapshot making in the gc subsystem, in its own worker fiber.

```
tarantool#5 0x00007e7ec9a54d26 in __assert_fail (assertion=0x63ad06748400 "pm_atomic_load(&cord_count) == 0", file=0x63ad067478b8 "./src/lib/core/fiber.c", line=2290, function=0x63ad06748968 <__PRETTY_FUNCTION__.6> "fiber_free") at assert.c:101
tarantool#6 0x000063ad061a6a91 in fiber_free () at /home/shiny/dev/tarantool/src/lib/core/fiber.c:2290
tarantool#7 0x000063ad05edc216 in tarantool_free () at /home/shiny/dev/tarantool/src/main.cc:632
tarantool#8 0x000063ad05edd144 in main (argc=1, argv=0x63ad079ca3b0)
```

Part of tarantool#8423

NO_CHANGELOG=internal
NO_DOC=internal
nshy
added a commit
to nshy/tarantool
that referenced
this pull request
Feb 19, 2024
We create a snapshot on the SIGUSR1 signal in a newly spawned system fiber. It can interfere with Tarantool shutdown. In particular, there is an assertion on shutdown during such a snapshot because of the cord making the snapshot. Let's just trigger snapshot making in the gc subsystem, in its own worker fiber.

```
tarantool#5 0x00007e7ec9a54d26 in __assert_fail (assertion=0x63ad06748400 "pm_atomic_load(&cord_count) == 0", file=0x63ad067478b8 "./src/lib/core/fiber.c", line=2290, function=0x63ad06748968 <__PRETTY_FUNCTION__.6> "fiber_free") at assert.c:101
tarantool#6 0x000063ad061a6a91 in fiber_free () at /home/shiny/dev/tarantool/src/lib/core/fiber.c:2290
tarantool#7 0x000063ad05edc216 in tarantool_free () at /home/shiny/dev/tarantool/src/main.cc:632
tarantool#8 0x000063ad05edd144 in main (argc=1, argv=0x63ad079ca3b0)
```

Part of tarantool#8423

NO_CHANGELOG=internal
NO_DOC=internal
locker
pushed a commit
that referenced
this pull request
Feb 19, 2024
We create a snapshot on the SIGUSR1 signal in a newly spawned system fiber. It can interfere with Tarantool shutdown. In particular, there is an assertion on shutdown during such a snapshot because of the cord making the snapshot. Let's just trigger snapshot making in the gc subsystem, in its own worker fiber.

```
#5 0x00007e7ec9a54d26 in __assert_fail (assertion=0x63ad06748400 "pm_atomic_load(&cord_count) == 0", file=0x63ad067478b8 "./src/lib/core/fiber.c", line=2290, function=0x63ad06748968 <__PRETTY_FUNCTION__.6> "fiber_free") at assert.c:101
#6 0x000063ad061a6a91 in fiber_free () at /home/shiny/dev/tarantool/src/lib/core/fiber.c:2290
#7 0x000063ad05edc216 in tarantool_free () at /home/shiny/dev/tarantool/src/main.cc:632
#8 0x000063ad05edd144 in main (argc=1, argv=0x63ad079ca3b0)
```

Part of #8423

NO_CHANGELOG=internal
NO_DOC=internal
ligurio
added a commit
to ligurio/nanodata
that referenced
this pull request
May 21, 2024
```
[001] tarantool#4  0x65481f151c11 in luaT_httpc_io_cleanup+33
[001] tarantool#5  0x65481f19ee63 in lj_BC_FUNCC+70
[001] tarantool#6  0x65481f1aa5d5 in gc_call_finalizer+133
[001] tarantool#7  0x65481f1ab1e3 in gc_onestep+211
[001] tarantool#8  0x65481f1aba68 in lj_gc_fullgc+120
[001] tarantool#9  0x65481f1a5fb5 in lua_gc+149
[001] tarantool#10 0x65481f1b57cf in lj_cf_collectgarbage+127
[001] tarantool#11 0x65481f19ee63 in lj_BC_FUNCC+70
[001] tarantool#12 0x65481f1a5c15 in lua_pcall+117
[001] tarantool#13 0x65481f14559f in luaT_call+15
[001] tarantool#14 0x65481f13c7e1 in lua_main+97
[001] tarantool#15 0x65481f13d000 in run_script_f+2032
```

NO_CHANGELOG=internal
NO_DOC=internal
NO_TEST=internal
ligurio
added a commit
to ligurio/nanodata
that referenced
this pull request
May 21, 2024
```
[001] tarantool#4  0x65481f151c11 in luaT_httpc_io_cleanup+33
[001] tarantool#5  0x65481f19ee63 in lj_BC_FUNCC+70
[001] tarantool#6  0x65481f1aa5d5 in gc_call_finalizer+133
[001] tarantool#7  0x65481f1ab1e3 in gc_onestep+211
[001] tarantool#8  0x65481f1aba68 in lj_gc_fullgc+120
[001] tarantool#9  0x65481f1a5fb5 in lua_gc+149
[001] tarantool#10 0x65481f1b57cf in lj_cf_collectgarbage+127
[001] tarantool#11 0x65481f19ee63 in lj_BC_FUNCC+70
[001] tarantool#12 0x65481f1a5c15 in lua_pcall+117
[001] tarantool#13 0x65481f14559f in luaT_call+15
[001] tarantool#14 0x65481f13c7e1 in lua_main+97
[001] tarantool#15 0x65481f13d000 in run_script_f+2032
```

NO_CHANGELOG=internal
NO_DOC=internal
NO_TEST=internal
[ 65%] Building CXX object src/box/CMakeFiles/ltbox.dir/__/assoc.m.o
[ 67%] Building CXX object src/box/CMakeFiles/ltbox.dir/__/replication.m.o
/usr/home/zloidemon/Repos/zloidemon/databases/tarantool/work/tarantool-1.4.7-328-g1af7b0b-src/src/replication.m: In function 'replication_relay_loop':
/usr/home/zloidemon/Repos/zloidemon/databases/tarantool/work/tarantool-1.4.7-328-g1af7b0b-src/src/replication.m:598:2: error: passing argument 2 of 'getpeername' from incompatible pointer type [-Werror]
/usr/include/sys/socket.h:630:5: note: expected 'struct sockaddr * restrict' but argument is of type 'struct sockaddr_in *'
cc1obj: all warnings being treated as errors
gmake[2]: *** [src/box/CMakeFiles/ltbox.dir/__/replication.m.o] Error 1
gmake[1]: *** [src/box/CMakeFiles/ltbox.dir/all] Error 2
gmake: *** [all] Error 2
*** Error code 1
Stop in /usr/home/zloidemon/Repos/zloidemon/databases/tarantool.
*** Error code 1
Stop in /usr/home/zloidemon/Repos/zloidemon/databases/tarantool.