multi-node serializability test failures #8057
Description
The partestdata/subquery_retry_multinode test normally takes ~3s, but it occasionally times out.
In the log for one such instance, the actual workload completes in under 3 seconds and the timeout occurs during quiescing.
Cluster Size: 5
I160726 18:43:25.060016 storage/engine/rocksdb.go:353 opening in memory rocksdb instance
...
--- done: partestdata/subquery_retry_multinode/final: 2 tests, 0 failures
I160726 18:43:27.670494 stopper.go:408 quiesceing; tasks left:
...
I160726 18:44:20.452298 kv/txn_coord_sender.go:246 txn coordinator: 0.31 txn/sec, 100.00/34.92/0.00/0.00 %cmmt/cmmt1pc/abrt/abnd, 37ms/104ms/511ms avg/σ/max duration, 0.7/1.2/5 avg/σ/max restarts (27 samples)
panic: test timed out after 1m0s
Full log here: https://gist.githubusercontent.com/RaduBerinde/0d6a37b7c6ae8cf1c3112f64dbb0f252/raw/bf55fa3f2e3fa115fc6527173f162c78baee4db8/subquery_retry_multinode.log
The messages suggest that some node(s) may be waiting for other nodes that have already shut down; we may have a problem in the quiescing/stopping logic for test clusters.
For comparison, this is a log file for a "good" instance where the test completes in 3 seconds: https://gist.githubusercontent.com/RaduBerinde/71adf59997fe8c294985278c34c2b8a3/raw/8611b513db55c1042bfb8e623e3c3a90070a2936/subquery_retry_multinode_ok.log
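To make the suspected failure mode concrete, here is a minimal sketch of how a task that waits on a peer which has already shut down could keep quiescing from ever finishing. This is a toy model, not the actual stopper code: `toyStopper`, `runTask`, `quiesce`, `waitForPeer`, and `quiesceDrains` are all hypothetical names introduced for illustration, and the 100ms timeout is arbitrary.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// toyStopper is a hypothetical, simplified model of a stopper.
// It is NOT the real stopper API; it only tracks tasks and a quiesce signal.
type toyStopper struct {
	tasks    sync.WaitGroup
	quiesced chan struct{}
}

func newToyStopper() *toyStopper {
	return &toyStopper{quiesced: make(chan struct{})}
}

// runTask runs f in a goroutine and tracks it until it returns.
func (s *toyStopper) runTask(f func(quiesced <-chan struct{})) {
	s.tasks.Add(1)
	go func() {
		defer s.tasks.Done()
		f(s.quiesced)
	}()
}

// quiesce signals all tasks to finish and reports whether they
// drained before the timeout expired.
func (s *toyStopper) quiesce(timeout time.Duration) bool {
	close(s.quiesced)
	done := make(chan struct{})
	go func() {
		s.tasks.Wait()
		close(done)
	}()
	select {
	case <-done:
		return true
	case <-time.After(timeout):
		return false
	}
}

// waitForPeer models a task blocked on a reply from another node.
// With honorQuiesce == false the task ignores the quiesce signal.
func waitForPeer(reply <-chan struct{}, honorQuiesce bool) func(<-chan struct{}) {
	return func(quiesced <-chan struct{}) {
		if !honorQuiesce {
			<-reply // blocks forever if the peer is already gone
			return
		}
		select {
		case <-reply:
		case <-quiesced:
		}
	}
}

// quiesceDrains starts one task waiting on a peer that never replies
// and reports whether quiescing completes.
func quiesceDrains(honorQuiesce bool) bool {
	s := newToyStopper()
	deadPeer := make(chan struct{}) // never signaled: the peer already shut down
	s.runTask(waitForPeer(deadPeer, honorQuiesce))
	return s.quiesce(100 * time.Millisecond)
}

func main() {
	fmt.Println(quiesceDrains(true))  // task observes the quiesce signal
	fmt.Println(quiesceDrains(false)) // task is stuck on the dead peer
}
```

In this model, quiescing only completes when every task that waits on a remote node also watches the quiesce signal. Whether the real hang has this shape would need to be confirmed from the "tasks left" output in the failing log.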
Description of the workload:
- we have a cluster of 5 nodes
- range split size is set to 16384
- each node repeats this statement 25 times:
INSERT INTO T VALUES ((SELECT MAX(k+1) FROM T), REPEAT('x', 500))
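The contention pattern in this workload can be sketched with an in-memory model: every writer reads the current maximum key and then tries to insert the next one, so concurrent writers collide and must retry. This is only an illustration under stated assumptions: `table`, `readNext`, `insert`, and `runWorkload` are hypothetical names, the table is assumed to start with a seed row at k=0, and a failed insert stands in for a transaction restart. It does not model ranges, splits, or the real SQL layer.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
	"sync/atomic"
)

// table is a toy in-memory stand-in for T (hypothetical, for illustration).
type table struct {
	mu   sync.Mutex
	rows map[int]string
	max  int
}

// readNext models the subquery SELECT MAX(k+1) FROM T.
func (t *table) readNext() int {
	t.mu.Lock()
	defer t.mu.Unlock()
	return t.max + 1
}

// insert models the INSERT; it fails if another writer already used key k,
// which stands in for a conflict that forces a retry.
func (t *table) insert(k int, v string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	if _, exists := t.rows[k]; exists {
		return false
	}
	t.rows[k] = v
	if k > t.max {
		t.max = k
	}
	return true
}

// runWorkload starts one goroutine per node, each inserting perNode rows,
// and returns the number of inserted rows and the total number of retries.
func runWorkload(nodes, perNode int) (rows int, retries int64) {
	t := &table{rows: map[int]string{0: "seed"}} // assumed seed row at k=0
	payload := strings.Repeat("x", 500)
	var wg sync.WaitGroup
	for n := 0; n < nodes; n++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < perNode; i++ {
				// The read and the write are separate steps, so another
				// writer can claim the key in between.
				for !t.insert(t.readNext(), payload) {
					atomic.AddInt64(&retries, 1)
				}
			}
		}()
	}
	wg.Wait()
	return len(t.rows) - 1, atomic.LoadInt64(&retries)
}

func main() {
	rows, retries := runWorkload(5, 25)
	fmt.Printf("rows=%d retries=%d\n", rows, retries)
}
```

With 5 nodes and 25 statements each, the model always ends with 125 inserted rows; the retry count varies from run to run depending on scheduling.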
Steps to reproduce:
- apply this diff: https://gist.github.com/RaduBerinde/eaff883c276060a19c6e21b80eccfea4 (edit: onto more recent base https://gist.github.com/RaduBerinde/010ace90cc22a1bfcfe654ff341350c5)
- run
while make test PKG=./sql TESTS=TestParallel TESTTIMEOUT=1m TESTFLAGS='-v --verbosity 1 --partestdata partestdata/subquery_retry_multinode --show-sql' >/tmp/log 2>&1; do date; done