Skip to content

test: flaky replication/election_qsync.test.lua test #5430

@avtikhon

Description

@avtikhon

Tarantool version:
Tarantool 2.6.0-208-ga20a04cba
Target: FreeBSD-amd64-RelWithDebInfo
Build options: cmake . -DCMAKE_INSTALL_PREFIX=/usr/local -DENABLE_BACKTRACE=OFF
Compiler: /usr/bin/cc /usr/bin/c++
C_FLAGS: -Wno-unknown-pragmas -fexceptions -funwind-tables -fno-common -std=c11 -Wall -Wextra -Wno-strict-aliasing -Wno-char-subscripts -Wno-gnu-alignof-expression -Werror
CXX_FLAGS: -Wno-unknown-pragmas -fexceptions -funwind-tables -fno-common -std=c++11 -Wall -Wextra -Wno-strict-aliasing -Wno-char-subscripts -Wno-invalid-offsetof -Wno-gnu-alignof-expression -Werror

OS version:
FreeBSD 12

Bug description:

  1. https://gitlab.com/tarantool/tarantool/-/jobs/795916093#L5385

artifacts.zip

results file checksum: afaa5d0f392c8de5420a05b268d04741

[013] --- replication/election_qsync.result	Fri Oct 16 19:25:53 2020
[013] +++ replication/election_qsync.reject	Fri May  8 08:21:42 2020
[013] @@ -127,14 +127,15 @@
[013]   | ...
[013]  _ = box.space.test:replace{2}
[013]   | ---
[013] + | - error: Found uncommitted sync transactions from other instance with id 2
[013]   | ...
[013]  box.space.test:select{}
[013]   | ---
[013]   | - - [1]
[013] - |   - [2]
[013]   | ...
[013]  box.space.test:drop()
[013]   | ---
[013] + | - error: Found uncommitted sync transactions from other instance with id 2
[013]   | ...
[013]  
[013]  test_run:cmd('delete server replica')
[013] 
  1. manually reproduced - check below (happens as 4th error)

results file checksum: 1c68d778b6e78dff37609b55d536cfd8

[001] replication/election_qsync.test.lua             vinyl           [ fail ]
[001]
[001] Test failed! Result content mismatch:
[001] --- replication/election_qsync.result     Fri May  8 08:56:08 2020
[001] +++ var/rejects/replication/election_qsync.reject Thu Jun 17 08:46:44 2021
[001] @@ -145,8 +145,7 @@
[001]   | ...
[001]  box.space.test:select{}
[001]   | ---
[001] - | - - [1]
[001] - |   - [2]
[001] + | - - [2]
[001]   | ...
[001]  box.space.test:drop()
[001]   | ---
[001]
  1. manually run on FreeBSD VMware image found that it fails running the test alone (happens very rare):
[001] --- replication/election_qsync.result     Fri May  8 08:56:08 2020
[001] +++ var/rejects/replication/election_qsync.reject Sat Jun 19 12:00:47 2021
[001] @@ -142,11 +142,11 @@
[001]   | ...
[001]  _ = box.space.test:replace{2}
[001]   | ---
[001] + | - error: A rollback for a synchronous transaction is received
[001]   | ...
[001]  box.space.test:select{}
[001]   | ---
[001] - | - - [1]
[001] - |   - [2]
[001] + | - []
[001]   | ...
[001]  box.space.test:drop()
[001]   | ---
[001] 
  1. manually run on FreeBSD VMware image found that it fails running the test alone with assert (happens in *0%):
    [001] Assertion failed: (lsn > prev_lsn), function vclock_follow, file /home/vagrant/tarantool/src/lib/vclock/vclock.c, line 46.
[001] replication/election_qsync.test.lua             memtx           
[001] 
[001] [Instance "replica" killed by signal: 6 (SIGABRT)]
[001] 
[001] Last 15 lines of Tarantool Log file [Instance "replica"][/home/vagrant/tarantool/test/var/001_replication/replica.log]:
[001] 2021-06-19 09:52:59.876 [67205] main/112/applier/unix/:/home/vagrant/tarantool/test/var/001_replication/master.socket-iproto I> RAFT: message {term: 1, state: follower} from 1
[001] 2021-06-19 09:52:59.892 [67205] main/127/console/unix/: I> set 'replication_synchro_timeout' configuration option to 1000000
[001] 2021-06-19 09:52:59.892 [67205] main/127/console/unix/: I> set 'replication_synchro_quorum' configuration option to 3
[001] 2021-06-19 09:52:59.892 [67205] main/127/console/unix/: I> RAFT: start state machine
[001] 2021-06-19 09:52:59.892 [67205] main/127/console/unix/: I> set 'election_mode' configuration option to "candidate"
[001] 2021-06-19 09:52:59.892 [67205] main/127/console/unix/: I> set 'election_timeout' configuration option to 1000000
[001] 2021-06-19 09:53:00.295 [67205] main I> RAFT: begin new election round
[001] 2021-06-19 09:53:00.295 [67205] main I> RAFT: bump term to 2, follow
[001] 2021-06-19 09:53:00.295 [67205] main I> RAFT: vote for 2, follow
[001] 2021-06-19 09:53:00.296 [67205] main/128/raft_worker I> RAFT: persisted state {term: 2, vote: 2}
[001] 2021-06-19 09:53:00.296 [67205] main/128/raft_worker I> RAFT: enter candidate state with 1 self vote
[001] 2021-06-19 09:53:00.296 [67205] main/112/applier/unix/:/home/vagrant/tarantool/test/var/001_replication/master.socket-iproto I> RAFT: message {term: 2, vote: 2, state: follower} from 1
[001] 2021-06-19 09:53:00.296 [67205] main/112/applier/unix/:/home/vagrant/tarantool/test/var/001_replication/master.socket-iproto I> RAFT: enter leader state with quorum 2
[001] 2021-06-19 09:53:00.307 [67205] relay/unix/:(socket)/101/main I> recover from `/home/vagrant/tarantool/test/var/001_replication/replica/00000000000000000002.xlog'
[001] Assertion failed: (lsn > prev_lsn), function vclock_follow, file /home/vagrant/tarantool/src/lib/vclock/vclock.c, line 46.
[001] [ fail ]

Main log:

[2021-06-19 12:05:36.277204] ESC[0;34mDEBUG: sending command: test_run:wait_cond(function() return box.info.lsn > lsn end)ESC[0m
[2021-06-19 12:05:40.958230] ESC[0;34mDEBUG: tarantool's response for [test_run:wait_cond(function() return box.info.lsn > lsn end)]
[2021-06-19 12:05:40.958230]  | 
[2021-06-19 12:05:40.958230] ESC[0m
[2021-06-19 12:05:40.959094] ESC[0;34mDEBUG: sending command: test_run:wait_lsn('default', 'replica')ESC[0m
[2021-06-19 12:05:40.959682] ESC[0;34mDEBUG: tarantool's response for [test_run:wait_lsn('default', 'replica')]
[2021-06-19 12:05:40.959682]  | [Lost current connection]
[2021-06-19 12:05:40.959682] ESC[0m
[2021-06-19 12:05:40.960030] ESC[0;34mDEBUG: sending command: test_run:switch('default')ESC[0m
[2021-06-19 12:05:40.960443] ESC[0;34mDEBUG: tarantool's response for [test_run:switch('default')]
[2021-06-19 12:05:40.960443]  | [Lost current connection]
[2021-06-19 12:05:40.960443] ESC[0m
[2021-06-19 12:05:40.960792] ESC[0;34mDEBUG: sending command: test_run:cmd('stop server replica')ESC[0m
[2021-06-19 12:05:40.961197] ESC[0;34mDEBUG: tarantool's response for [test_run:cmd('stop server replica')]
[2021-06-19 12:05:40.961197]  | [Lost current connection]
[2021-06-19 12:05:40.961197] ESC[0m
[2021-06-19 12:05:40.961532] ESC[0;34mDEBUG: sending command: box.space.test:replace{2}ESC[0m
[2021-06-19 12:05:41.038775] ESC[0;31m
[2021-06-19 12:05:41.038775] 
[2021-06-19 12:05:41.038775] [Instance "replica" killed by signal: 6 (SIGABRT)]ESC[0m

Steps to reproduce:
To reproduce errors 2, 3, 4 just run single test in loop on FreeBSD VMware with command:

c=0 ; while ./test-run.py replication/election_qsync.test.lua ; do c=$(($c+1)) ; echo "ALX ================================= $c" | tee c.count ; done ; echo "ALX ================================= $c"

Optional (but very desirable):

  • coredump
  • backtrace
  • netstat

Metadata

Metadata

Assignees

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions