Creating Unittests#5

Merged
bigbes merged 1 commit into java-connector from jcon
Aug 6, 2012

Conversation

@bigbes
Contributor

@bigbes bigbes commented Aug 6, 2012

No description provided.

bigbes added a commit that referenced this pull request Aug 6, 2012
@bigbes bigbes merged commit ebbb297 into java-connector Aug 6, 2012
@bigbes bigbes self-assigned this Oct 24, 2014
@avid avid mentioned this pull request Nov 1, 2014
zloidemon added a commit that referenced this pull request Mar 24, 2015
@mialinx mialinx mentioned this pull request Apr 3, 2015
@a0s a0s mentioned this pull request Oct 6, 2015
@YadrovSergey YadrovSergey mentioned this pull request Feb 8, 2016
@void234 void234 mentioned this pull request Dec 28, 2020
tsafin added a commit that referenced this pull request Dec 29, 2020
Introduced OUT_TITLE_(ibuf,title) in addition to OUT_TUPLE_TITLE
to be able to output only a key with the expected value following it,
unlike for a tuple, where we emit the whole named tuple, i.e.
	{ 'select': (...value...) }
ligurio added a commit that referenced this pull request Jan 26, 2021
- make test

Closes #5

Part of #12
Closes #11
Closes #10
Closes #3
drakonhg pushed a commit that referenced this pull request Sep 2, 2021
nshy added a commit to nshy/tarantool that referenced this pull request Dec 7, 2023
We may need to cancel a fiber that waits for a cord to finish. For this
purpose let's cancel the fiber started by cord_costart inside the cord.

Note that there is a race between stopping cancel_event in the cord and
triggering it using ev_async_send in the joining thread. AFAIU it is safe.

We also need to fix stopping the wal cord to address the
stack-use-after-return issue shown below. It arises because we did not
stop the async which resides in the wal endpoint, and the endpoint
resides on the stack. Later, when we stop the introduced cancel_event,
we access the async that was never stopped and that by this moment has
gone out of scope.

```
==3224698==ERROR: AddressSanitizer: stack-use-after-return on address 0x7f654b3b0170 at pc 0x555a2817c282 bp 0x7f654ca55b30 sp 0x7f654ca55b28
WRITE of size 4 at 0x7f654b3b0170 thread T3
    #0 0x555a2817c281 in ev_async_stop /home/shiny/dev/tarantool/third_party/libev/ev.c:5492:37
    #1 0x555a27827738 in cord_thread_func /home/shiny/dev/tarantool/src/lib/core/fiber.c:1990:2
    #2 0x7f65574aa9ea in start_thread /usr/src/debug/glibc/glibc/nptl/pthread_create.c:444:8
    #3 0x7f655752e7cb in clone3 /usr/src/debug/glibc/glibc/misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
```
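
The lifetime rule behind this report can be sketched in isolation. The following is a simplified model, not libev: `struct loop`, `struct watcher`, `watcher_start`, `watcher_stop` and `writer_body` are hypothetical names, and the loop is assumed to keep raw pointers to active watchers. A watcher that lives on a function's stack has to be stopped before that function returns, otherwise a later stop of another watcher walks through dead stack memory.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an event loop that tracks active watchers. */
struct watcher {
	struct watcher *next;
	int active;
};

struct loop {
	struct watcher *head; /* singly linked list of active watchers */
};

static void
watcher_start(struct loop *loop, struct watcher *w)
{
	w->active = 1;
	w->next = loop->head;
	loop->head = w;
}

static void
watcher_stop(struct loop *loop, struct watcher *w)
{
	if (!w->active)
		return;
	/* Walks the list, so every linked watcher must still be alive. */
	struct watcher **p = &loop->head;
	while (*p != NULL && *p != w)
		p = &(*p)->next;
	if (*p == w)
		*p = w->next;
	w->active = 0;
	w->next = NULL;
}

static int
loop_active_count(const struct loop *loop)
{
	int n = 0;
	for (const struct watcher *w = loop->head; w != NULL; w = w->next)
		n++;
	return n;
}

/* Thread body whose endpoint watcher lives on this function's stack. */
static void
writer_body(struct loop *loop)
{
	struct watcher endpoint_async; /* storage dies on return */
	watcher_start(loop, &endpoint_async);
	/* ... serve requests ... */
	watcher_stop(loop, &endpoint_async); /* must precede the return */
}

int
demo_stack_watcher(void)
{
	struct loop loop = { NULL };
	struct watcher cancel_event;
	watcher_start(&loop, &cancel_event);
	writer_body(&loop);
	/* Safe only because writer_body unlinked its own watcher. */
	watcher_stop(&loop, &cancel_event);
	return loop_active_count(&loop);
}
```

Dropping the `watcher_stop` call in `writer_body` would leave `loop.head` pointing into a dead stack frame, which is the kind of access the report above flags.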

But after that we also need to temporarily comment out freeing applier
threads. The issue is that applier_free goes first and then wal_free.
The first does not free resources correctly: in particular, it does not
destroy the thread endpoint but frees its memory. As a result we get a
use-after-free when destroying the wal endpoint.

```
==3508646==ERROR: AddressSanitizer: heap-use-after-free on address 0x61b000001a30 at pc 0x5556ff1b08d8 bp 0x7f69cb7f65c0 sp 0x7f69cb7f65b8
WRITE of size 8 at 0x61b000001a30 thread T3
    #0 0x5556ff1b08d7 in rlist_del /home/shiny/dev/tarantool/src/lib/small/include/small/rlist.h:101:19
    #1 0x5556ff1b08d7 in cbus_endpoint_destroy /home/shiny/dev/tarantool/src/lib/core/cbus.c:256:2
    #2 0x5556feea1f2c in wal_writer_f /home/shiny/dev/tarantool/src/box/wal.c:1237:2
    #3 0x5556fea3eb57 in fiber_cxx_invoke(int (*)(__va_list_tag*), __va_list_tag*) /home/shiny/dev/tarantool/src/lib/core/fiber.h:1297:10
    #4 0x5556ff19af3e in fiber_loop /home/shiny/dev/tarantool/src/lib/core/fiber.c:1160:18
    #5 0x5556ffb0fbd2 in coro_init /home/shiny/dev/tarantool/third_party/coro/coro.c:108:3

0x61b000001a30 is located 1200 bytes inside of 1528-byte region [0x61b000001580,0x61b000001b78)
freed by thread T0 here:
    #0 0x5556fe9ed8a2 in __interceptor_free.part.0 asan_malloc_linux.cpp.o
    #1 0x5556fee4ef65 in applier_free /home/shiny/dev/tarantool/src/box/applier.cc:2175:3
    #2 0x5556fedfce01 in box_storage_free() /home/shiny/dev/tarantool/src/box/box.cc:5869:2
    #3 0x5556fedfce01 in box_free /home/shiny/dev/tarantool/src/box/box.cc:5936:2
    #4 0x5556fea3cfec in tarantool_free() /home/shiny/dev/tarantool/src/main.cc:575:2
    #5 0x5556fea3cfec in main /home/shiny/dev/tarantool/src/main.cc:1087:2
    #6 0x7f69d7445ccf in __libc_start_call_main /usr/src/debug/glibc/glibc/csu/../sysdeps/nptl/libc_start_call_main.h:58:16
```
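
The ordering problem in this report (memory freed while still linked into a shared list) can be shown with a small intrusive list. This is a sketch under assumptions: `struct node`, `struct endpoint`, `struct worker`, `worker_new` and `worker_delete` are hypothetical names and do not reproduce the real cbus or rlist code. The point is only that an object must be unlinked from the registry before its memory is released, so that deleting a neighbour later touches live nodes only.

```c
#include <assert.h>
#include <stdlib.h>

/* Minimal intrusive doubly linked list (hypothetical, rlist-like). */
struct node {
	struct node *prev, *next;
};

static void
list_create(struct node *head)
{
	head->prev = head->next = head;
}

static void
list_add(struct node *head, struct node *item)
{
	item->next = head->next;
	item->prev = head;
	head->next->prev = item;
	head->next = item;
}

static void
list_del(struct node *item)
{
	/* Writes into both neighbours, so they must not be freed yet. */
	item->prev->next = item->next;
	item->next->prev = item->prev;
	item->prev = item->next = item;
}

static int
list_len(const struct node *head)
{
	int n = 0;
	for (const struct node *i = head->next; i != head; i = i->next)
		n++;
	return n;
}

struct endpoint {
	struct node in_registry; /* link in the shared endpoint registry */
};

struct worker {
	struct endpoint endpoint; /* embedded, freed together with worker */
};

static struct worker *
worker_new(struct node *registry)
{
	struct worker *w = malloc(sizeof(*w));
	if (w == NULL)
		return NULL;
	list_add(registry, &w->endpoint.in_registry);
	return w;
}

static void
worker_delete(struct worker *w)
{
	list_del(&w->endpoint.in_registry); /* destroy endpoint first */
	free(w);                            /* only then free the memory */
}

int
demo_unlink_before_free(void)
{
	struct node registry;
	list_create(&registry);
	struct worker *applier = worker_new(&registry);
	struct worker *wal = worker_new(&registry);
	if (applier == NULL || wal == NULL) {
		if (applier != NULL)
			worker_delete(applier);
		if (wal != NULL)
			worker_delete(wal);
		return -1;
	}
	worker_delete(applier);
	worker_delete(wal); /* neighbours are the registry head only */
	return list_len(&registry);
}
```

If `worker_delete` called `free` without `list_del`, the second deletion would write into the first worker's freed memory, matching the shape of the `rlist_del` frame above.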

Part of tarantool#8423

NO_DOC=internal
NO_CHANGELOG=internal
nshy added a commit to nshy/tarantool that referenced this pull request Dec 7, 2023
We may need to cancel a fiber that waits for a cord to finish. For this
purpose let's cancel the fiber started by cord_costart inside the cord.

Note that there is a race between stopping cancel_event in the cord and
triggering it using ev_async_send in the joining thread. AFAIU it is safe.

We also need to fix stopping the wal cord to address the
stack-use-after-return issue shown below. It arises because we did not
stop the async which resides in the wal endpoint, and the endpoint
resides on the stack. Later, when we stop the introduced cancel_event,
we access the async that was never stopped and that by this moment has
gone out of scope.

```
==3224698==ERROR: AddressSanitizer: stack-use-after-return on address 0x7f654b3b0170 at pc 0x555a2817c282 bp 0x7f654ca55b30 sp 0x7f654ca55b28
WRITE of size 4 at 0x7f654b3b0170 thread T3
    #0 0x555a2817c281 in ev_async_stop /home/shiny/dev/tarantool/third_party/libev/ev.c:5492:37
    #1 0x555a27827738 in cord_thread_func /home/shiny/dev/tarantool/src/lib/core/fiber.c:1990:2
    #2 0x7f65574aa9ea in start_thread /usr/src/debug/glibc/glibc/nptl/pthread_create.c:444:8
    #3 0x7f655752e7cb in clone3 /usr/src/debug/glibc/glibc/misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
```

But after that we need to properly destroy endpoints in other threads
too. See for example the ASAN report for the applier thread below. The
issue is that the applier endpoint is linked with the wal endpoint, and
the applier endpoint memory is already freed (without proper
destruction) when we destroy the wal endpoint.

A similar issue exists with endpoints in vinyl threads. However, we get
SIGSEGV with them instead of a proper ASAN report. It looks like the
cause is that vinyl endpoints reside on the stack. In the case of the
applier we can just temporarily comment out freeing the applier thread
memory until proper applier shutdown, for the sake of this patch. But we
can't do the same for vinyl threads. Let's just stop cbus_loop in both
cases. It is not a full shutdown solution for either applier or vinyl,
as both may have running fibers in threads. It is a temporary solution
just for this patch. We will add the missing pieces in later patches.

```
==3508646==ERROR: AddressSanitizer: heap-use-after-free on address 0x61b000001a30 at pc 0x5556ff1b08d8 bp 0x7f69cb7f65c0 sp 0x7f69cb7f65b8
WRITE of size 8 at 0x61b000001a30 thread T3
    #0 0x5556ff1b08d7 in rlist_del /home/shiny/dev/tarantool/src/lib/small/include/small/rlist.h:101:19
    #1 0x5556ff1b08d7 in cbus_endpoint_destroy /home/shiny/dev/tarantool/src/lib/core/cbus.c:256:2
    #2 0x5556feea1f2c in wal_writer_f /home/shiny/dev/tarantool/src/box/wal.c:1237:2
    #3 0x5556fea3eb57 in fiber_cxx_invoke(int (*)(__va_list_tag*), __va_list_tag*) /home/shiny/dev/tarantool/src/lib/core/fiber.h:1297:10
    #4 0x5556ff19af3e in fiber_loop /home/shiny/dev/tarantool/src/lib/core/fiber.c:1160:18
    #5 0x5556ffb0fbd2 in coro_init /home/shiny/dev/tarantool/third_party/coro/coro.c:108:3

0x61b000001a30 is located 1200 bytes inside of 1528-byte region [0x61b000001580,0x61b000001b78)
freed by thread T0 here:
    #0 0x5556fe9ed8a2 in __interceptor_free.part.0 asan_malloc_linux.cpp.o
    #1 0x5556fee4ef65 in applier_free /home/shiny/dev/tarantool/src/box/applier.cc:2175:3
    #2 0x5556fedfce01 in box_storage_free() /home/shiny/dev/tarantool/src/box/box.cc:5869:2
    #3 0x5556fedfce01 in box_free /home/shiny/dev/tarantool/src/box/box.cc:5936:2
    #4 0x5556fea3cfec in tarantool_free() /home/shiny/dev/tarantool/src/main.cc:575:2
    #5 0x5556fea3cfec in main /home/shiny/dev/tarantool/src/main.cc:1087:2
    #6 0x7f69d7445ccf in __libc_start_call_main /usr/src/debug/glibc/glibc/csu/../sysdeps/nptl/libc_start_call_main.h:58:16
```

Part of tarantool#8423

NO_DOC=internal
NO_CHANGELOG=internal
nshy added a commit to nshy/tarantool that referenced this pull request Dec 8, 2023
Here is the issue with replication shutdown. It crashes if bootstrap is
in progress. Bootstrap uses the resume_to_state API to wait for the
applier state of interest. resume_to_state usually pauses the applier
fiber on applier stop/off and wakes the bootstrap fiber to pass it the
error. But if the applier fiber is cancelled, as during shutdown, then
applier_pause returns immediately, which leads to the assertion later.

Even if we somehow ignore this assertion, we hit an assertion in the
bootstrap fiber in applier_wait_for_state, as it expects the diag to be
set in case of off/stop state. But the diag gets eaten by the applier
fiber join in the shutdown fiber.

On applier error, if there is a fiber using the resume_to_state API, we
first suspend the applier fiber and only exit it when the applier fiber
gets cancelled on applier stop. AFAIU this complicated mechanics of
keeping the fiber alive on errors exists only to keep the fiber diag for
the bootstrap fiber. Instead let's move the diag for bootstrap out of
the fiber. Now we don't need to keep the fiber alive on errors.

The current approach to passing the diag has other oddities. For
example, we won't finish disconnect on errors immediately, and we have
to return -1 or 0 from applier_f on errors depending on whether we
expect another fiber to steal the fiber diag or not.
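
The idea of moving the diag out of the fiber can be sketched as follows. This is a single-threaded model under assumptions, not the applier code: `struct applier_like`, `worker_run` and `wait_for_state` are hypothetical names, and the error is modelled as a short string. The worker records the error in the long-lived object and exits right away, and the waiter reads the error from the object rather than from the worker.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

enum state { ST_CONNECT, ST_READY, ST_OFF };

/* Hypothetical error slot; an empty string means "no error". */
struct diag {
	char msg[64];
};

struct applier_like {
	enum state state;
	struct diag diag; /* owned by the object, outlives the worker */
	int worker_alive;
};

static void
diag_set(struct diag *d, const char *msg)
{
	snprintf(d->msg, sizeof(d->msg), "%s", msg);
}

static int
diag_is_empty(const struct diag *d)
{
	return d->msg[0] == '\0';
}

/* Worker: on error it stores the diag in the object and exits at once. */
static int
worker_run(struct applier_like *a)
{
	a->worker_alive = 1;
	diag_set(&a->diag, "connection refused");
	a->state = ST_OFF;
	a->worker_alive = 0; /* no need to stay alive to hold the error */
	return 0;
}

/* Waiter: takes the error from the object, not from the worker. */
static int
wait_for_state(const struct applier_like *a, enum state wanted,
	       struct diag *out)
{
	if (a->state == wanted)
		return 0;
	assert(!diag_is_empty(&a->diag));
	*out = a->diag;
	return -1;
}

int
demo_diag_in_state(void)
{
	struct applier_like a = { ST_CONNECT, { "" }, 0 };
	struct diag err = { "" };
	worker_run(&a);
	int rc = wait_for_state(&a, ST_READY, &err);
	return rc == -1 && !a.worker_alive &&
	       strcmp(err.msg, "connection refused") == 0;
}
```

With the error stored on the object, the worker's return code no longer has to encode who is expected to consume the error.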

Part of tarantool#8423

NO_TEST=rely on existing tests
NO_CHANGELOG=internal
NO_DOC=internal

Shutdown fiber stack:
```
  #5  0x00007f84f0c54d26 in __assert_fail (
      assertion=0x564ffd5dfdec "fiber() == applier->fiber",
      file=0x564ffd5dedae "./src/box/applier.cc", line=2809,
      function=0x564ffd5dfdcf "void applier_pause(applier*)") at assert.c:101
  #6  0x0000564ffd0a0780 in applier_pause (applier=0x564fff30b1e0)
      at /<snap>/dev/tarantool/src/box/applier.cc:2809
  #7  0x0000564ffd0a08ab in applier_on_state_f (trigger=0x7f84f0780a60,
      event=0x564fff30b1e0) at /<snap>/dev/tarantool/src/box/applier.cc:2845
  #8  0x0000564ffd20a873 in trigger_run_list (list=0x7f84f0680de0,
      event=0x564fff30b1e0) at /<snap>/tarantool/src/lib/core/trigger.cc:100
  #9  0x0000564ffd20a991 in trigger_run (list=0x564fff30b818, event=0x564fff30b1e0)
      at /<snap>/tarantool/src/lib/core/trigger.cc:133
  #10 0x0000564ffd0945cb in trigger_run_xc (list=0x564fff30b818, event=0x564fff30b1e0)
      at /<snap>/tarantool/src/lib/core/trigger.h:173
  #11 0x0000564ffd09689a in applier_set_state (applier=0x564fff30b1e0,
      state=APPLIER_OFF) at /<snap>/tarantool/src/box/applier.cc:83
  #12 0x0000564ffd0a0313 in applier_stop (applier=0x564fff30b1e0)
      at /<snap>/tarantool/src/box/applier.cc:2749
  #13 0x0000564ffd08b8c7 in replication_shutdown ()
```

Bootstrap fiber stack:

```
  #1  0x0000564ffd1e0c95 in fiber_yield_impl (will_switch_back=true)
      at /<snap>/tarantool/src/lib/core/fiber.c:863
  #2  0x0000564ffd1e0caa in fiber_yield ()
      at /<snap>/tarantool/src/lib/core/fiber.c:870
  #3  0x0000564ffd1e0f0b in fiber_yield_timeout (delay=3153600000)
      at /<snap>/tarantool/src/lib/core/fiber.c:914
  #4  0x0000564ffd1ed401 in fiber_cond_wait_timeout
      (c=0x7f84f0780aa8, timeout=3153600000)
      at /<snap>/tarantool/src/lib/core/fiber_cond.c:107
  #5  0x0000564ffd1ed61c in fiber_cond_wait_deadline
      (c=0x7f84f0780aa8, deadline=3155292512.1470647)
      at /<snap>/tarantool/src/lib/core/fiber_cond.c:128
  #6  0x0000564ffd0a0a00 in applier_wait_for_state(applier_on_state*, double)
      (trigger=0x7f84f0780a60, timeout=3153600000)
      at /<snap>/tarantool/src/box/applier.cc:2877
  #7  0x0000564ffd0a0bbb in applier_resume_to_state(applier*, applier_state, double)
      (applier=0x564fff30b1e0, state=APPLIER_JOINED, timeout=3153600000)
      at /<snap>/tarantool/src/box/applier.cc:2898
  #8  0x0000564ffd074bf4 in bootstrap_from_master(replica*) (master=0x564fff2b30d0)
      at /<snap>/tarantool/src/box/box.cc:4980
  #9  0x0000564ffd074eed in bootstrap(bool*) (is_bootstrap_leader=0x7f84f0780b81)
      at /<snap>/tarantool/src/box/box.cc:5081
  #10 0x0000564ffd07613d in box_cfg_xc() ()
      at /<snap>/tarantool/src/box/box.cc:5427
```
nshy added a commit to nshy/tarantool that referenced this pull request Dec 15, 2023
Actually, the newly introduced 'replication/shutdown_test.lua' has an
issue (in all three subtests). The master shuts down in time but the
replica crashes. Below are the replica stacks during the crash. The
story is as follows. The master is shut down, the replica gets
`SocketError` and sleeps on the reconnect timeout. Now the replica is
shut down. The `applier_f` loop exits immediately without waking the
bootstrap fiber. Next we change the state to OFF in the shutdown fiber
in `applier_stop`, triggering the `resume_to_state` trigger. It calls
`applier_pause`, which expects the current fiber to be the applier fiber.

So we'd better wake up the bootstrap fiber in this case. But a simple
change to check the cancel state after the reconnection sleep does not
work yet. This time `applier_pause` exits immediately, which is not the
way `resume_to_state` should work. See, if we return 0 from the applier
fiber then we clear the fiber diag on fiber death, which is not expected
by `resume_to_state`. If we return -1, then `resume_to_state` will steal
the fiber diag, which is not expected by `fiber_join` in `applier_stop`.

TODO (check -1 case in experiment!)

Even if we somehow ignore this assertion, we hit an assertion in the
bootstrap fiber in applier_wait_for_state, as it expects the diag to be
set in case of off/stop state. But the diag gets eaten by the applier
fiber join in the shutdown fiber.

On applier error, if there is a fiber using the resume_to_state API, we
first suspend the applier fiber and only exit it when the applier fiber
gets cancelled on applier stop. AFAIU this complicated mechanics of
keeping the fiber alive on errors exists only to keep the fiber diag for
the bootstrap fiber. Instead let's move the diag for bootstrap out of
the fiber. Now we don't need to keep the fiber alive on errors.

The current approach to passing the diag has other oddities. For
example, we won't finish disconnect on errors immediately, and we have
to return -1 or 0 from applier_f on errors depending on whether we
expect another fiber to steal the fiber diag or not.

Part of tarantool#8423

NO_TEST=rely on existing tests
NO_CHANGELOG=internal
NO_DOC=internal

Shutdown fiber stack (crash):
```
  #5  0x00007f84f0c54d26 in __assert_fail (
      assertion=0x564ffd5dfdec "fiber() == applier->fiber",
      file=0x564ffd5dedae "./src/box/applier.cc", line=2809,
      function=0x564ffd5dfdcf "void applier_pause(applier*)") at assert.c:101
  #6  0x0000564ffd0a0780 in applier_pause (applier=0x564fff30b1e0)
      at /<snap>/dev/tarantool/src/box/applier.cc:2809
  #7  0x0000564ffd0a08ab in applier_on_state_f (trigger=0x7f84f0780a60,
      event=0x564fff30b1e0) at /<snap>/dev/tarantool/src/box/applier.cc:2845
  #8  0x0000564ffd20a873 in trigger_run_list (list=0x7f84f0680de0,
      event=0x564fff30b1e0) at /<snap>/tarantool/src/lib/core/trigger.cc:100
  #9  0x0000564ffd20a991 in trigger_run (list=0x564fff30b818, event=0x564fff30b1e0)
      at /<snap>/tarantool/src/lib/core/trigger.cc:133
  #10 0x0000564ffd0945cb in trigger_run_xc (list=0x564fff30b818, event=0x564fff30b1e0)
      at /<snap>/tarantool/src/lib/core/trigger.h:173
  #11 0x0000564ffd09689a in applier_set_state (applier=0x564fff30b1e0,
      state=APPLIER_OFF) at /<snap>/tarantool/src/box/applier.cc:83
  #12 0x0000564ffd0a0313 in applier_stop (applier=0x564fff30b1e0)
      at /<snap>/tarantool/src/box/applier.cc:2749
  #13 0x0000564ffd08b8c7 in replication_shutdown ()
```

Bootstrap fiber stack:

```
  #1  0x0000564ffd1e0c95 in fiber_yield_impl (will_switch_back=true)
      at /<snap>/tarantool/src/lib/core/fiber.c:863
  #2  0x0000564ffd1e0caa in fiber_yield ()
      at /<snap>/tarantool/src/lib/core/fiber.c:870
  #3  0x0000564ffd1e0f0b in fiber_yield_timeout (delay=3153600000)
      at /<snap>/tarantool/src/lib/core/fiber.c:914
  #4  0x0000564ffd1ed401 in fiber_cond_wait_timeout
      (c=0x7f84f0780aa8, timeout=3153600000)
      at /<snap>/tarantool/src/lib/core/fiber_cond.c:107
  #5  0x0000564ffd1ed61c in fiber_cond_wait_deadline
      (c=0x7f84f0780aa8, deadline=3155292512.1470647)
      at /<snap>/tarantool/src/lib/core/fiber_cond.c:128
  #6  0x0000564ffd0a0a00 in applier_wait_for_state(applier_on_state*, double)
      (trigger=0x7f84f0780a60, timeout=3153600000)
      at /<snap>/tarantool/src/box/applier.cc:2877
  #7  0x0000564ffd0a0bbb in applier_resume_to_state(applier*, applier_state, double)
      (applier=0x564fff30b1e0, state=APPLIER_JOINED, timeout=3153600000)
      at /<snap>/tarantool/src/box/applier.cc:2898
  #8  0x0000564ffd074bf4 in bootstrap_from_master(replica*) (master=0x564fff2b30d0)
      at /<snap>/tarantool/src/box/box.cc:4980
  #9  0x0000564ffd074eed in bootstrap(bool*) (is_bootstrap_leader=0x7f84f0780b81)
      at /<snap>/tarantool/src/box/box.cc:5081
  #10 0x0000564ffd07613d in box_cfg_xc() ()
      at /<snap>/tarantool/src/box/box.cc:5427
```
nshy added a commit to nshy/tarantool that referenced this pull request Dec 29, 2023
There is an issue with graceful replication shutdown. A good example (on
which `replication/shutdown_test.lua` is based) is bootstrapping
a replica with wrong auth in the replication URI. In this case the
applier is sleeping in the reconnect delay and the bootstrap code is
waiting for the READY state. Now comes server shutdown.

The applier is stopped during shutdown and we hit assertion [1]. The
issue is that we don't expect the trigger set by
`applier_resume_to_state` of bootstrap. We can clear the trigger, as in
other places, by calling `applier_disconnect(APPLIER_OFF)` in case the
applier is cancelled during the reconnect sleep. But this does not help.
The issue is that, as the applier fiber is cancelled, it returns
immediately from `applier_disconnect`, unlike in the other cases. Now if
we return 0 from the applier fiber then the applier fiber diag is
cleared on fiber termination and we get an assertion in the trigger [2].
If we return -1 then the diag is stolen by the trigger and we get an
assertion on applier fiber join [3].
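
The conflict between the two consumers can be illustrated with a single-owner error slot. This is a sketch under assumptions: `struct slot` and `slot_take` are hypothetical names that model "stealing" as a move, and do not reproduce the real diag API. Once one consumer has taken the error, a second consumer that also expects it finds the slot empty.

```c
#include <assert.h>
#include <stddef.h>

struct error {
	const char *msg;
};

/* Hypothetical single-owner error slot. */
struct slot {
	struct error *last;
};

/* Move the error out of `from`; the source slot is left empty. */
static void
slot_take(struct slot *from, struct slot *to)
{
	to->last = from->last;
	from->last = NULL;
}

static int
slot_is_empty(const struct slot *s)
{
	return s->last == NULL;
}

int
demo_single_owner(void)
{
	static struct error e = { "auth failed" };
	struct slot fiber_slot = { &e };
	struct slot waiter = { NULL };
	struct slot joiner = { NULL };
	slot_take(&fiber_slot, &waiter); /* first consumer gets the error */
	slot_take(&fiber_slot, &joiner); /* second consumer gets nothing */
	return !slot_is_empty(&waiter) && slot_is_empty(&joiner);
}
```

Whichever consumer runs second sees an empty slot, which is why neither return value satisfies both the trigger and the join in the scenario above.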

AFAIU the trigger installed by `applier_resume_to_state` pauses the
applier fiber on error so that the error diag can be stolen by the
function. Then on stop the applier fiber is cancelled and exits with 0,
so join does not expect a diag.

Let's move away from this solution of pausing the fiber on error just to
keep the diag. Let the fiber finish on error but keep the diag in the
applier state.

Also it looks like we return -1 from the applier fiber only to keep the
diag to be shown by `box.info.replication`, as we never use the result
of this fiber join. Now that we get this info from `applier->diag`, we
can return 0 as in other places.

Part of tarantool#8423

NO_CHANGELOG=fixing unreleased issue
NO_DOC=bugfix

[1] Issue assertion stack:
```
  tarantool#5  0x00007fe877a54d26 in __assert_fail (
      assertion=0x5637b683b07c "fiber() == applier->fiber",
      file=0x5637b683a03e "./src/box/applier.cc", line=2809,
      function=0x5637b683b05f "void applier_pause(applier*)") at assert.c:101
  tarantool#6  0x00005637b62f0f20 in applier_pause (applier=0x5637b7a87a60)
      at /home/shiny/dev/tarantool/src/box/applier.cc:2809
  tarantool#7  0x00005637b62f104b in applier_on_state_f (trigger=0x7fe877380a60,
      event=0x5637b7a87a60) at /home/shiny/dev/tarantool/src/box/applier.cc:2845
  tarantool#8  0x00005637b645d2e3 in trigger_run_list (list=0x7fe877280de0,
      event=0x5637b7a87a60) at /home/shiny/dev/tarantool/src/lib/core/trigger.cc:100
  tarantool#9  0x00005637b645d401 in trigger_run (list=0x5637b7a88098, event=0x5637b7a87a60)
      at /home/shiny/dev/tarantool/src/lib/core/trigger.cc:133
  tarantool#10 0x00005637b62e4d6b in trigger_run_xc (list=0x5637b7a88098, event=0x5637b7a87a60)
      at /home/shiny/dev/tarantool/src/lib/core/trigger.h:173
  tarantool#11 0x00005637b62e703a in applier_set_state (applier=0x5637b7a87a60,
      state=APPLIER_OFF) at /home/shiny/dev/tarantool/src/box/applier.cc:83
  tarantool#12 0x00005637b62f0ab3 in applier_stop (applier=0x5637b7a87a60)
      at /home/shiny/dev/tarantool/src/box/applier.cc:2749
  tarantool#13 0x00005637b62dc189 in replication_shutdown ()
```

[2] Stack if we return 0:

```
  tarantool#5  0x00007f84fdc54d26 in __assert_fail (
      assertion=0x5590c7018130 "!diag_is_empty(&applier->fiber->diag)",
      file=0x5590c701703e "./src/box/applier.cc", line=2889,
      function=0x5590c70180b0 "int applier_wait_for_state(applier_on_state*, double)")
      at assert.c:101
  tarantool#6  0x00005590c6ace340 in applier_wait_for_state (trigger=0x7f84fd780a60,
      timeout=3153600000) at /home/shiny/dev/tarantool/src/box/applier.cc:2889
  tarantool#7  0x00005590c6ace426 in applier_resume_to_state (applier=0x5590c8c49a60,
      state=APPLIER_READY, timeout=3153600000)
      at /home/shiny/dev/tarantool/src/box/applier.cc:2903
  tarantool#8  0x00005590c6aa230b in bootstrap_from_master (master=0x5590c8c003c0)
      at /home/shiny/dev/tarantool/src/box/box.cc:4943
  tarantool#9  0x00005590c6aa27a5 in bootstrap (is_bootstrap_leader=0x7f84fd780b81)
      at /home/shiny/dev/tarantool/src/box/box.cc:5085
  tarantool#10 0x00005590c6aa39f5 in box_cfg_xc ()
      at /home/shiny/dev/tarantool/src/box/box.cc:5431
  tarantool#11 0x00005590c6aa3fbf in box_cfg () at /home/shiny/dev/tarantool/src/box/box.cc:5560
```

[3] Stack if we return -1:

```
  tarantool#5  0x00007f8dc1054d26 in __assert_fail (
      assertion=0x55ad141f08a8 "!diag_is_empty(&fiber->diag)",
      file=0x55ad141f03d8 "./src/lib/core/fiber.c", line=821,
      function=0x55ad141f10a0 <__PRETTY_FUNCTION__.45> "fiber_join_timeout")
      at assert.c:101
  tarantool#6  0x000055ad13c582c2 in fiber_join_timeout (fiber=0x7f8dc0811150,
      timeout=3153600000) at /home/shiny/dev/tarantool/src/lib/core/fiber.c:821
  tarantool#7  0x000055ad13c57f1a in fiber_join (fiber=0x7f8dc0811150)
      at /home/shiny/dev/tarantool/src/lib/core/fiber.c:768
  tarantool#8  0x000055ad13b15b11 in applier_stop (applier=0x55ad14513a60)
      at /home/shiny/dev/tarantool/src/box/applier.cc:2752
  tarantool#9  0x000055ad13b01189 in replication_shutdown ()
```
nshy added a commit to nshy/tarantool that referenced this pull request Dec 29, 2023
There is an issue with graceful replication shutdown. A good example (on
which `replication/shutdown_test.lua` is based) is bootstrapping
a replica with wrong auth in the replication URI. In this case the applier
is sleeping in the reconnect delay and the bootstrap code is waiting for
the READY state. Then the server shuts down.

The applier is stopped during shutdown and we hit assertion [1]. The
issue is that we don't expect the trigger set by `applier_resume_to_state`
of bootstrap. We can clear the trigger, as in other places, by calling
`applier_disconnect(APPLIER_OFF)` in case the applier is cancelled during
the reconnect sleep. But this does not help. The issue is that, since the
applier fiber is cancelled, it returns immediately from
`applier_disconnect`, unlike the other cases. Now if we return 0 from
applier thread then the applier fiber diag is cleared on fiber termination
and we hit the assertion in the trigger [2]. If we return -1 then the diag
is stolen by the trigger and we hit the assertion on applier fiber join [3].

AFAIU the trigger installed by `applier_resume_to_state` pauses the
applier fiber on error so that the error diag can be stolen by the
function. Then on stop the applier fiber is cancelled and exits with 0,
so join does not expect a diag.

Let's move away from this solution of pausing the fiber on error just to
keep the diag. Let the fiber finish on error but keep the diag in the
applier state.

Also, it looks like we return -1 from the applier fiber only to keep the
diag to be shown by `box.info.replication`, as we never use the result of
this fiber join. Now that we get this info from `applier->diag`, we can
return 0 as in other places.

Part of tarantool#8423

NO_CHANGELOG=fixing unreleased issue
NO_DOC=bugfix

[1] Issue assertion stack:
```
  tarantool#5  0x00007fe877a54d26 in __assert_fail (
      assertion=0x5637b683b07c "fiber() == applier->fiber",
      file=0x5637b683a03e "./src/box/applier.cc", line=2809,
      function=0x5637b683b05f "void applier_pause(applier*)") at assert.c:101
  tarantool#6  0x00005637b62f0f20 in applier_pause (applier=0x5637b7a87a60)
      at /home/shiny/dev/tarantool/src/box/applier.cc:2809
  tarantool#7  0x00005637b62f104b in applier_on_state_f (trigger=0x7fe877380a60,
      event=0x5637b7a87a60) at /home/shiny/dev/tarantool/src/box/applier.cc:2845
  tarantool#8  0x00005637b645d2e3 in trigger_run_list (list=0x7fe877280de0,
      event=0x5637b7a87a60) at /home/shiny/dev/tarantool/src/lib/core/trigger.cc:100
  tarantool#9  0x00005637b645d401 in trigger_run (list=0x5637b7a88098, event=0x5637b7a87a60)
      at /home/shiny/dev/tarantool/src/lib/core/trigger.cc:133
  tarantool#10 0x00005637b62e4d6b in trigger_run_xc (list=0x5637b7a88098, event=0x5637b7a87a60)
      at /home/shiny/dev/tarantool/src/lib/core/trigger.h:173
  tarantool#11 0x00005637b62e703a in applier_set_state (applier=0x5637b7a87a60,
      state=APPLIER_OFF) at /home/shiny/dev/tarantool/src/box/applier.cc:83
  tarantool#12 0x00005637b62f0ab3 in applier_stop (applier=0x5637b7a87a60)
      at /home/shiny/dev/tarantool/src/box/applier.cc:2749
  tarantool#13 0x00005637b62dc189 in replication_shutdown ()
```

[2] Stack if we return 0:

```
  tarantool#5  0x00007f84fdc54d26 in __assert_fail (
      assertion=0x5590c7018130 "!diag_is_empty(&applier->fiber->diag)",
      file=0x5590c701703e "./src/box/applier.cc", line=2889,
      function=0x5590c70180b0 "int applier_wait_for_state(applier_on_state*, double)")
      at assert.c:101
  tarantool#6  0x00005590c6ace340 in applier_wait_for_state (trigger=0x7f84fd780a60,
      timeout=3153600000) at /home/shiny/dev/tarantool/src/box/applier.cc:2889
  tarantool#7  0x00005590c6ace426 in applier_resume_to_state (applier=0x5590c8c49a60,
      state=APPLIER_READY, timeout=3153600000)
      at /home/shiny/dev/tarantool/src/box/applier.cc:2903
  tarantool#8  0x00005590c6aa230b in bootstrap_from_master (master=0x5590c8c003c0)
      at /home/shiny/dev/tarantool/src/box/box.cc:4943
  tarantool#9  0x00005590c6aa27a5 in bootstrap (is_bootstrap_leader=0x7f84fd780b81)
      at /home/shiny/dev/tarantool/src/box/box.cc:5085
  tarantool#10 0x00005590c6aa39f5 in box_cfg_xc ()
      at /home/shiny/dev/tarantool/src/box/box.cc:5431
  tarantool#11 0x00005590c6aa3fbf in box_cfg () at /home/shiny/dev/tarantool/src/box/box.cc:5560
```

[3] Stack if we return -1:

```
  tarantool#5  0x00007f8dc1054d26 in __assert_fail (
      assertion=0x55ad141f08a8 "!diag_is_empty(&fiber->diag)",
      file=0x55ad141f03d8 "./src/lib/core/fiber.c", line=821,
      function=0x55ad141f10a0 <__PRETTY_FUNCTION__.45> "fiber_join_timeout")
      at assert.c:101
  tarantool#6  0x000055ad13c582c2 in fiber_join_timeout (fiber=0x7f8dc0811150,
      timeout=3153600000) at /home/shiny/dev/tarantool/src/lib/core/fiber.c:821
  tarantool#7  0x000055ad13c57f1a in fiber_join (fiber=0x7f8dc0811150)
      at /home/shiny/dev/tarantool/src/lib/core/fiber.c:768
  tarantool#8  0x000055ad13b15b11 in applier_stop (applier=0x55ad14513a60)
      at /home/shiny/dev/tarantool/src/box/applier.cc:2752
  tarantool#9  0x000055ad13b01189 in replication_shutdown ()
```
nshy added a commit to nshy/tarantool that referenced this pull request Jan 10, 2024
There is an issue with graceful replication shutdown. A good example (on
which `replication/shutdown_test.lua` is based) is bootstrapping
a replica with wrong auth in the replication URI. In this case the applier
is sleeping in the reconnect delay and the bootstrap code is waiting for
the READY state. Then the server shuts down.

The applier is stopped during shutdown and we hit assertion [1]. The
issue is that the bootstrap fiber is not notified that the applier fiber
is cancelled. We can fix that, but then another issue arises. The
bootstrap fiber steals the applier fiber diag in `applier_wait_for_state`,
and the later join in `applier_stop` hits an assertion as a diag is
expected.

We can fix it as well by copying the error in `applier_wait_for_state`
instead of stealing it. It looks like a simple solution, but this approach
has some drawbacks. The first, which is already present, is that we rely
on the dead applier fiber diag in the `*.replication.*.upstream.message`
statistic in case the applier is not yet stopped. The second, which newly
arises, is that during `applier_stop` we overwrite the current diag due to
applier fiber cancel and join.

So let's instead keep the diag in `applier->diag`. Also let's copy the
error in `applier_wait_for_state` so that the fiber error does not
disappear in case of bootstrap.
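
The difference between stealing and copying an error can be sketched with a toy model. All `toy_*` names are hypothetical and this is not Tarantool's diag API. The only behavior assumed is the one described above: a join on a worker that finished with -1 expects the worker's diag to be non-empty.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical error slot; an empty string means "no error". */
struct toy_diag {
	char msg[64];
};

bool
toy_diag_is_empty(const struct toy_diag *d)
{
	return d->msg[0] == '\0';
}

void
toy_diag_set(struct toy_diag *d, const char *msg)
{
	snprintf(d->msg, sizeof(d->msg), "%s", msg);
}

/* "Steal": move the error to dst and leave src empty. */
void
toy_diag_move(struct toy_diag *dst, struct toy_diag *src)
{
	*dst = *src;
	src->msg[0] = '\0';
}

/* Copy: both dst and src hold the error afterwards. */
void
toy_diag_copy(struct toy_diag *dst, const struct toy_diag *src)
{
	*dst = *src;
}

/*
 * Models the expectation of a join: a worker that finished with -1
 * must still carry its error.
 */
bool
toy_join_precondition_holds(const struct toy_diag *worker_diag, int rc)
{
	return rc == 0 || !toy_diag_is_empty(worker_diag);
}
```

With `toy_diag_move()` the waiter gets the error but the later join check fails. With `toy_diag_copy()` the waiter gets the error and the join check still holds.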

Let's also not pause the applier fiber in `applier_on_state_f` in
case the applier is stopped/off. AFAIU we do it only to keep the diag,
which will be cleared on return from the fiber with 0 result. Now we
don't need this. Also, if the fiber is cancelled this logic does not work
anyway, and we'd better have a single logic for any type of error path.

Also, it looks like we return -1 from the applier fiber only to keep the
diag to be shown by `box.info.replication`, as we never use the result of
this fiber join. Now that we get this info from `applier->diag`, we can
return 0 as in other places.

Part of tarantool#8423

NO_CHANGELOG=fixing unreleased issue
NO_DOC=bugfix

[1] Issue assertion stack:
```
  tarantool#5  0x00007fe877a54d26 in __assert_fail (
      assertion=0x5637b683b07c "fiber() == applier->fiber",
      file=0x5637b683a03e "./src/box/applier.cc", line=2809,
      function=0x5637b683b05f "void applier_pause(applier*)") at assert.c:101
  tarantool#6  0x00005637b62f0f20 in applier_pause (applier=0x5637b7a87a60)
      at /home/shiny/dev/tarantool/src/box/applier.cc:2809
  tarantool#7  0x00005637b62f104b in applier_on_state_f (trigger=0x7fe877380a60,
      event=0x5637b7a87a60) at /home/shiny/dev/tarantool/src/box/applier.cc:2845
  tarantool#8  0x00005637b645d2e3 in trigger_run_list (list=0x7fe877280de0,
      event=0x5637b7a87a60) at /home/shiny/dev/tarantool/src/lib/core/trigger.cc:100
  tarantool#9  0x00005637b645d401 in trigger_run (list=0x5637b7a88098, event=0x5637b7a87a60)
      at /home/shiny/dev/tarantool/src/lib/core/trigger.cc:133
  tarantool#10 0x00005637b62e4d6b in trigger_run_xc (list=0x5637b7a88098, event=0x5637b7a87a60)
      at /home/shiny/dev/tarantool/src/lib/core/trigger.h:173
  tarantool#11 0x00005637b62e703a in applier_set_state (applier=0x5637b7a87a60,
      state=APPLIER_OFF) at /home/shiny/dev/tarantool/src/box/applier.cc:83
  tarantool#12 0x00005637b62f0ab3 in applier_stop (applier=0x5637b7a87a60)
      at /home/shiny/dev/tarantool/src/box/applier.cc:2749
  tarantool#13 0x00005637b62dc189 in replication_shutdown ()
```
nshy added a commit to nshy/tarantool that referenced this pull request Jan 11, 2024
This implies finishing replication TX fibers and stopping applier
threads. This is easy to do using existing applier_stop.

We also need to make sure that no client fibers are in replication code
after shutdown. Otherwise we may have difficulties (assertions) while
freeing replication resources. This goal has two sides. First, we have to
finish client fibers waiting in replication code, and second, we should
not allow waiting after shutdown is done.

Here we can probably achieve the first side by just stopping appliers. But
in this case the client gets an error other than FiberIsCancelled, and
FiberIsCancelled is the one that is nice to have. So the approach is to
track client fibers in replication code and cancel them on shutdown. This
approach is also aligned with iproto/relay shutdown.
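
A rough sketch of the two sides of this goal, as a single-threaded toy model. Every `toy_*` name is hypothetical and the fixed-size registry is an illustrative simplification, not how Tarantool tracks fibers. Waiters register before waiting, shutdown marks every registered waiter as cancelled, and any wait attempted after shutdown is refused.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_WAITERS 8

/* Hypothetical record for one client waiting in replication code. */
struct toy_waiter {
	bool cancelled;
};

/* Hypothetical registry of current waiters. */
struct toy_replication {
	struct toy_waiter *waiters[TOY_MAX_WAITERS];
	size_t count;
	bool is_shutdown;
};

/* Register a client about to wait; refuse after shutdown or when full. */
int
toy_wait_begin(struct toy_replication *r, struct toy_waiter *w)
{
	if (r->is_shutdown || r->count == TOY_MAX_WAITERS)
		return -1;
	w->cancelled = false;
	r->waiters[r->count++] = w;
	return 0;
}

/* Unregister a client whose wait has ended. */
void
toy_wait_end(struct toy_replication *r, struct toy_waiter *w)
{
	for (size_t i = 0; i < r->count; i++) {
		if (r->waiters[i] == w) {
			r->waiters[i] = r->waiters[--r->count];
			return;
		}
	}
}

/* Cancel everyone still waiting and forbid any new waits. */
void
toy_shutdown(struct toy_replication *r)
{
	r->is_shutdown = true;
	for (size_t i = 0; i < r->count; i++)
		r->waiters[i]->cancelled = true;
}
```

In this model a waiter that sees `cancelled == true` on wakeup would report the cancellation to its caller instead of whatever error the stopped applier left behind.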

There is an issue with graceful replication shutdown though. A good
example (on which `replication/shutdown_test.lua` is based) is
bootstrapping a replica with wrong auth in the replication URI. In this
case the applier is sleeping in the reconnect delay and the bootstrap code
is waiting for the READY state. Then the server shuts down.

The applier is stopped during shutdown and we hit assertion [1]. The
issue is that the bootstrap fiber is not notified that the applier fiber
is cancelled. That's why the change with `fiber_testcancel` in `applier_f`.

We also drop the assertion in `replica_on_applier_sync` because the
applier can switch to the OFF state from any previous state if we cancel
the applier fiber.

Part of tarantool#8423

[1] Issue assertion stack:
```
  tarantool#5  0x00007fe877a54d26 in __assert_fail (
    assertion=0x5637b683b07c "fiber() == applier->fiber",
    file=0x5637b683a03e "./src/box/applier.cc", line=2809,
    function=0x5637b683b05f "void applier_pause(applier*)") at assert.c:101
  tarantool#6  0x00005637b62f0f20 in applier_pause (applier=0x5637b7a87a60)
    at /home/shiny/dev/tarantool/src/box/applier.cc:2809
  tarantool#7  0x00005637b62f104b in applier_on_state_f (trigger=0x7fe877380a60,
    event=0x5637b7a87a60) at /home/shiny/dev/tarantool/src/box/applier.cc:2845
  tarantool#8  0x00005637b645d2e3 in trigger_run_list (list=0x7fe877280de0,
    event=0x5637b7a87a60) at /home/shiny/dev/tarantool/src/lib/core/trigger.cc:100
  tarantool#9  0x00005637b645d401 in trigger_run (list=0x5637b7a88098, event=0x5637b7a87a60)
    at /home/shiny/dev/tarantool/src/lib/core/trigger.cc:133
  tarantool#10 0x00005637b62e4d6b in trigger_run_xc (list=0x5637b7a88098, event=0x5637b7a87a60)
    at /home/shiny/dev/tarantool/src/lib/core/trigger.h:173
  tarantool#11 0x00005637b62e703a in applier_set_state (applier=0x5637b7a87a60,
    state=APPLIER_OFF) at /home/shiny/dev/tarantool/src/box/applier.cc:83
  tarantool#12 0x00005637b62f0ab3 in applier_stop (applier=0x5637b7a87a60)
    at /home/shiny/dev/tarantool/src/box/applier.cc:2749
  tarantool#13 0x00005637b62dc189 in replication_shutdown ()
```

NO_CHANGELOG=internal
NO_DOC=internal
locker pushed a commit that referenced this pull request Jan 12, 2024
This implies finishing replication TX fibers and stopping applier
threads. This is easy to do using existing applier_stop.

We also need to make sure that no client fibers are in replication code
after shutdown. Otherwise we may have difficulties (assertions) while
freeing replication resources. This goal has two sides. First, we have to
finish client fibers waiting in replication code, and second, we should
not allow waiting after shutdown is done.

Here we can probably achieve the first side by just stopping appliers. But
in this case the client gets an error other than FiberIsCancelled, and
FiberIsCancelled is the one that is nice to have. So the approach is to
track client fibers in replication code and cancel them on shutdown. This
approach is also aligned with iproto/relay shutdown.

There is an issue with graceful replication shutdown though. A good
example (on which `replication/shutdown_test.lua` is based) is
bootstrapping a replica with wrong auth in the replication URI. In this
case the applier is sleeping in the reconnect delay and the bootstrap code
is waiting for the READY state. Then the server shuts down.

The applier is stopped during shutdown and we hit assertion [1]. The
issue is that the bootstrap fiber is not notified that the applier fiber
is cancelled. That's why the change with `fiber_testcancel` in `applier_f`.

We also drop the assertion in `replica_on_applier_sync` because the
applier can switch to the OFF state from any previous state if we cancel
the applier fiber.

Part of #8423

[1] Issue assertion stack:
```
  #5  0x00007fe877a54d26 in __assert_fail (
    assertion=0x5637b683b07c "fiber() == applier->fiber",
    file=0x5637b683a03e "./src/box/applier.cc", line=2809,
    function=0x5637b683b05f "void applier_pause(applier*)") at assert.c:101
  #6  0x00005637b62f0f20 in applier_pause (applier=0x5637b7a87a60)
    at /home/shiny/dev/tarantool/src/box/applier.cc:2809
  #7  0x00005637b62f104b in applier_on_state_f (trigger=0x7fe877380a60,
    event=0x5637b7a87a60) at /home/shiny/dev/tarantool/src/box/applier.cc:2845
  #8  0x00005637b645d2e3 in trigger_run_list (list=0x7fe877280de0,
    event=0x5637b7a87a60) at /home/shiny/dev/tarantool/src/lib/core/trigger.cc:100
  #9  0x00005637b645d401 in trigger_run (list=0x5637b7a88098, event=0x5637b7a87a60)
    at /home/shiny/dev/tarantool/src/lib/core/trigger.cc:133
  #10 0x00005637b62e4d6b in trigger_run_xc (list=0x5637b7a88098, event=0x5637b7a87a60)
    at /home/shiny/dev/tarantool/src/lib/core/trigger.h:173
  #11 0x00005637b62e703a in applier_set_state (applier=0x5637b7a87a60,
    state=APPLIER_OFF) at /home/shiny/dev/tarantool/src/box/applier.cc:83
  #12 0x00005637b62f0ab3 in applier_stop (applier=0x5637b7a87a60)
    at /home/shiny/dev/tarantool/src/box/applier.cc:2749
  #13 0x00005637b62dc189 in replication_shutdown ()
```

NO_CHANGELOG=internal
NO_DOC=internal
nshy added a commit to nshy/tarantool that referenced this pull request Feb 8, 2024
We create a snapshot on SIGUSR1 signal in a newly spawned system fiber.
It can interfere with Tarantool shutdown. In particular there is an
assertion on shutdown during such a snapshot. Let's shut down this fiber
too.

```
  tarantool#5  0x00007e7ec9a54d26 in __assert_fail (
      assertion=0x63ad06748400 "pm_atomic_load(&cord_count) == 0",
      file=0x63ad067478b8 "./src/lib/core/fiber.c", line=2290,
      function=0x63ad06748968 <__PRETTY_FUNCTION__.6> "fiber_free") at assert.c:101
  tarantool#6  0x000063ad061a6a91 in fiber_free ()
      at /home/shiny/dev/tarantool/src/lib/core/fiber.c:2290
  tarantool#7  0x000063ad05edc216 in tarantool_free ()
      at /home/shiny/dev/tarantool/src/main.cc:632
  tarantool#8  0x000063ad05edd144 in main (argc=1, argv=0x63ad079ca3b0)
```

NO_CHANGELOG=internal
NO_DOC=internal
nshy added a commit to nshy/tarantool that referenced this pull request Feb 9, 2024
We create a snapshot on SIGUSR1 signal in a newly spawned system fiber.
It can interfere with Tarantool shutdown. In particular there is an
assertion on shutdown during such a snapshot. Let's shut down this fiber
too.

```
  tarantool#5  0x00007e7ec9a54d26 in __assert_fail (
      assertion=0x63ad06748400 "pm_atomic_load(&cord_count) == 0",
      file=0x63ad067478b8 "./src/lib/core/fiber.c", line=2290,
      function=0x63ad06748968 <__PRETTY_FUNCTION__.6> "fiber_free") at assert.c:101
  tarantool#6  0x000063ad061a6a91 in fiber_free ()
      at /home/shiny/dev/tarantool/src/lib/core/fiber.c:2290
  tarantool#7  0x000063ad05edc216 in tarantool_free ()
      at /home/shiny/dev/tarantool/src/main.cc:632
  tarantool#8  0x000063ad05edd144 in main (argc=1, argv=0x63ad079ca3b0)
```

Part of tarantool#8423

NO_CHANGELOG=internal
NO_DOC=internal
nshy added a commit to nshy/tarantool that referenced this pull request Feb 14, 2024
We create a snapshot on SIGUSR1 signal in a newly spawned system fiber.
It can interfere with Tarantool shutdown. In particular there is an
assertion on shutdown during such a snapshot. Let's shut down this fiber
too.

```
  tarantool#5  0x00007e7ec9a54d26 in __assert_fail (
      assertion=0x63ad06748400 "pm_atomic_load(&cord_count) == 0",
      file=0x63ad067478b8 "./src/lib/core/fiber.c", line=2290,
      function=0x63ad06748968 <__PRETTY_FUNCTION__.6> "fiber_free") at assert.c:101
  tarantool#6  0x000063ad061a6a91 in fiber_free ()
      at /home/shiny/dev/tarantool/src/lib/core/fiber.c:2290
  tarantool#7  0x000063ad05edc216 in tarantool_free ()
      at /home/shiny/dev/tarantool/src/main.cc:632
  tarantool#8  0x000063ad05edd144 in main (argc=1, argv=0x63ad079ca3b0)
```

NO_CHANGELOG=internal
NO_DOC=internal
nshy added a commit to nshy/tarantool that referenced this pull request Feb 15, 2024
We create a snapshot on SIGUSR1 signal in a newly spawned system fiber.
It can interfere with Tarantool shutdown. In particular there is an
assertion on shutdown during such a snapshot. Let's shut down this fiber
too.

```
  tarantool#5  0x00007e7ec9a54d26 in __assert_fail (
      assertion=0x63ad06748400 "pm_atomic_load(&cord_count) == 0",
      file=0x63ad067478b8 "./src/lib/core/fiber.c", line=2290,
      function=0x63ad06748968 <__PRETTY_FUNCTION__.6> "fiber_free") at assert.c:101
  tarantool#6  0x000063ad061a6a91 in fiber_free ()
      at /home/shiny/dev/tarantool/src/lib/core/fiber.c:2290
  tarantool#7  0x000063ad05edc216 in tarantool_free ()
      at /home/shiny/dev/tarantool/src/main.cc:632
  tarantool#8  0x000063ad05edd144 in main (argc=1, argv=0x63ad079ca3b0)
```

Part of tarantool#8423

NO_CHANGELOG=internal
NO_DOC=internal
nshy added a commit to nshy/tarantool that referenced this pull request Feb 16, 2024
We create a snapshot on SIGUSR1 signal in a newly spawned system fiber.
It can interfere with Tarantool shutdown. In particular there is an
assertion on shutdown during such a snapshot because of the cord making
the snapshot. Let's shut down this fiber too.

```
  tarantool#5  0x00007e7ec9a54d26 in __assert_fail (
      assertion=0x63ad06748400 "pm_atomic_load(&cord_count) == 0",
      file=0x63ad067478b8 "./src/lib/core/fiber.c", line=2290,
      function=0x63ad06748968 <__PRETTY_FUNCTION__.6> "fiber_free") at assert.c:101
  tarantool#6  0x000063ad061a6a91 in fiber_free ()
      at /home/shiny/dev/tarantool/src/lib/core/fiber.c:2290
  tarantool#7  0x000063ad05edc216 in tarantool_free ()
      at /home/shiny/dev/tarantool/src/main.cc:632
  tarantool#8  0x000063ad05edd144 in main (argc=1, argv=0x63ad079ca3b0)
```

Part of tarantool#8423

NO_CHANGELOG=internal
NO_DOC=internal
nshy added a commit to nshy/tarantool that referenced this pull request Feb 16, 2024
We create a snapshot on SIGUSR1 signal in a newly spawned system fiber.
It can interfere with Tarantool shutdown. In particular there is an
assertion on shutdown during such a snapshot because of the cord making
the snapshot. Let's just trigger snapshot creation in the gc subsystem, in
its own worker fiber.

```
  tarantool#5  0x00007e7ec9a54d26 in __assert_fail (
      assertion=0x63ad06748400 "pm_atomic_load(&cord_count) == 0",
      file=0x63ad067478b8 "./src/lib/core/fiber.c", line=2290,
      function=0x63ad06748968 <__PRETTY_FUNCTION__.6> "fiber_free") at assert.c:101
  tarantool#6  0x000063ad061a6a91 in fiber_free ()
      at /home/shiny/dev/tarantool/src/lib/core/fiber.c:2290
  tarantool#7  0x000063ad05edc216 in tarantool_free ()
      at /home/shiny/dev/tarantool/src/main.cc:632
  tarantool#8  0x000063ad05edd144 in main (argc=1, argv=0x63ad079ca3b0)
```

Part of tarantool#8423

NO_CHANGELOG=internal
NO_DOC=internal
nshy added a commit to nshy/tarantool that referenced this pull request Feb 19, 2024
We create a snapshot on SIGUSR1 signal in a newly spawned system fiber.
It can interfere with Tarantool shutdown. In particular there is an
assertion on shutdown during such a snapshot because of the cord making
the snapshot. Let's just trigger snapshot creation in the gc subsystem, in
its own worker fiber.

```
  tarantool#5  0x00007e7ec9a54d26 in __assert_fail (
      assertion=0x63ad06748400 "pm_atomic_load(&cord_count) == 0",
      file=0x63ad067478b8 "./src/lib/core/fiber.c", line=2290,
      function=0x63ad06748968 <__PRETTY_FUNCTION__.6> "fiber_free") at assert.c:101
  tarantool#6  0x000063ad061a6a91 in fiber_free ()
      at /home/shiny/dev/tarantool/src/lib/core/fiber.c:2290
  tarantool#7  0x000063ad05edc216 in tarantool_free ()
      at /home/shiny/dev/tarantool/src/main.cc:632
  tarantool#8  0x000063ad05edd144 in main (argc=1, argv=0x63ad079ca3b0)
```

Part of tarantool#8423

NO_CHANGELOG=internal
NO_DOC=internal