Description
I found this issue in #970
Topology
Version: nightly
+------+--------------------+--------------------+--------------------------+------------------------------------------+---------------------------+---------------------+-----------+
| TYPE | INSTANCE | STATUS_ADDRESS | VERSION | GIT_HASH | START_TIME | UPTIME | SERVER_ID |
+------+--------------------+--------------------+--------------------------+------------------------------------------+---------------------------+---------------------+-----------+
| tidb | 192.168.78.3:23400 | 192.168.78.3:23080 | 7.4.0-alpha-386-gf1067a1 | f1067a1b761d5bae74510fc715f28814a633352d | 2023-09-13T11:23:19+08:00 | 1h8m5.142420463s | 634 |
| tidb | 192.168.78.4:23400 | 192.168.78.4:23080 | 7.4.0-alpha-386-gf1067a1 | f1067a1b761d5bae74510fc715f28814a633352d | 2023-09-13T11:23:42+08:00 | 1h7m42.142422497s | 970 |
| tidb | 192.168.78.2:23400 | 192.168.78.2:23080 | 7.4.0-alpha-386-gf1067a1 | f1067a1b761d5bae74510fc715f28814a633352d | 2023-09-13T11:22:58+08:00 | 1h8m26.1424236s | 559 |
| pd | 192.168.78.3:23379 | 192.168.78.3:23379 | 7.4.0-alpha | 58113f8d17ccec4690445dad2e47cadbacd42da3 | 2023-09-12T15:45:50+08:00 | 20h45m34.142424456s | 0 |
| pd | 192.168.78.2:23379 | 192.168.78.2:23379 | 7.4.0-alpha | 58113f8d17ccec4690445dad2e47cadbacd42da3 | 2023-09-12T15:45:50+08:00 | 20h45m34.142425424s | 0 |
| pd | 192.168.78.4:23379 | 192.168.78.4:23379 | 7.4.0-alpha | 58113f8d17ccec4690445dad2e47cadbacd42da3 | 2023-09-12T15:45:50+08:00 | 20h45m34.142426465s | 0 |
| tikv | 192.168.80.4:23360 | 192.168.80.4:23380 | 7.4.0-alpha | 98eb383b41695b11a03e3d1ce471181f02bfc741 | 2023-09-12T15:45:52+08:00 | 20h45m32.142427254s | 0 |
| tikv | 192.168.80.3:23360 | 192.168.80.3:23380 | 7.4.0-alpha | 98eb383b41695b11a03e3d1ce471181f02bfc741 | 2023-09-12T15:45:52+08:00 | 20h45m32.142427995s | 0 |
| tikv | 192.168.80.2:23361 | 192.168.80.2:23381 | 7.4.0-alpha | 98eb383b41695b11a03e3d1ce471181f02bfc741 | 2023-09-12T15:45:52+08:00 | 20h45m32.142428715s | 0 |
| tikv | 192.168.80.2:23360 | 192.168.80.2:23380 | 7.4.0-alpha | 98eb383b41695b11a03e3d1ce471181f02bfc741 | 2023-09-12T15:45:52+08:00 | 20h45m32.142429364s | 0 |
| tikv | 192.168.80.3:23361 | 192.168.80.3:23381 | 7.4.0-alpha | 98eb383b41695b11a03e3d1ce471181f02bfc741 | 2023-09-12T15:45:52+08:00 | 20h45m32.142431041s | 0 |
| tikv | 192.168.80.4:23361 | 192.168.80.4:23381 | 7.4.0-alpha | 98eb383b41695b11a03e3d1ce471181f02bfc741 | 2023-09-12T15:45:52+08:00 | 20h45m32.142431848s | 0 |
+------+--------------------+--------------------+--------------------------+------------------------------------------+---------------------------+---------------------+-----------+
Inject network latency
Inject 1000ms of network delay from all TiDB nodes to 192.168.80.2, which hosts the TiKV instances 192.168.80.2:23360 and 192.168.80.2:23361:
[root@192.168.78.2]# ping 192.168.80.2
PING 192.168.80.2 (192.168.80.2) 56(84) bytes of data.
64 bytes from 192.168.80.2: icmp_seq=1 ttl=62 time=546 ms
64 bytes from 192.168.80.2: icmp_seq=2 ttl=62 time=217 ms
64 bytes from 192.168.80.2: icmp_seq=3 ttl=62 time=145 ms
64 bytes from 192.168.80.2: icmp_seq=4 ttl=62 time=500 ms
64 bytes from 192.168.80.2: icmp_seq=5 ttl=62 time=1091 ms
[root@192.168.78.2]# ping 192.168.80.3
PING 192.168.80.3 (192.168.80.3) 56(84) bytes of data.
64 bytes from 192.168.80.3: icmp_seq=1 ttl=62 time=0.086 ms
64 bytes from 192.168.80.3: icmp_seq=2 ttl=62 time=0.079 ms
64 bytes from 192.168.80.3: icmp_seq=3 ttl=62 time=0.082 ms
[root@192.168.78.2]# ping 192.168.80.4
PING 192.168.80.4 (192.168.80.4) 56(84) bytes of data.
64 bytes from 192.168.80.4: icmp_seq=1 ttl=62 time=0.088 ms
64 bytes from 192.168.80.4: icmp_seq=2 ttl=62 time=0.072 ms
64 bytes from 192.168.80.4: icmp_seq=3 ttl=62 time=0.058 ms
The latency from 192.168.78.3 and 192.168.78.4 shows the same pattern as above.
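For reference, this kind of targeted one-way delay can be reproduced with tc/netem. A minimal sketch, assuming the traffic egresses via eth0 (the interface name is an assumption) and run as root on each TiDB node:

```shell
# Attach a prio qdisc so only traffic to 192.168.80.2 can be singled out.
tc qdisc add dev eth0 root handle 1: prio
# Add ~1000ms delay on band 3 of the prio qdisc.
tc qdisc add dev eth0 parent 1:3 handle 30: netem delay 1000ms
# Steer packets destined to 192.168.80.2 into the delayed band.
tc filter add dev eth0 parent 1: protocol ip prio 1 u32 \
  match ip dst 192.168.80.2/32 flowid 1:3
# Cleanup when done:
#   tc qdisc del dev eth0 root
```

Note that the default prio priomap may also route some low-priority traffic into band 3, so a dedicated band or a separate htb class is cleaner for longer-running tests.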
Test
Load: stale-read request
tidb_kv_read_timeout: 100
max_execution_time: 400
sysbench --config-file=sysbench.conf mussel_minimal.lua --threads=40 --read_staleness=-10 --kv_read_timeout=100 --max_execution_time=400 --mysql-ignore-errors=all run
And run more load on TiDB 192.168.78.2:
sysbench --config-file=sysbench.conf2 mussel_minimal.lua --threads=100 --read_staleness=-10 --kv_read_timeout=100 --max_execution_time=400 --mysql-ignore-errors=all run
$ cat sysbench.conf
mysql-host=192.168.78.2,192.168.78.3,192.168.78.4
...
...
$ cat sysbench.conf2
mysql-host=192.168.78.2
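The two timeouts above can also be set per session directly in TiDB. A hedged sketch, assuming 23400 is the SQL port from the topology and that the build supports the tidb_kv_read_timeout session variable (both are assumptions):

```shell
# Hypothetical session setup; values are in milliseconds.
mysql -h 192.168.78.2 -P 23400 -u root -e "
  SET SESSION tidb_kv_read_timeout = 100;  -- per-KV-request read timeout
  SET SESSION max_execution_time  = 400;   -- overall statement timeout
"
```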
The following are the related metrics; some of them are not as expected:
- QPS jitters frequently. This is caused by DataNotReady backoff.
- DataNotReady is caused by a lock. The following is the TiKV log:
- Chose the ts 444257405948395529 from the log above to search the TiDB slow query log; from statistics, it is a TiDB internal query.
- This Insert statement executed too slowly, and the transaction prewrite (27.7s) and commit (1.2min) were very slow too.
The related slow log is as follows:
# Time: 2023-09-15T00:08:07.852305922+08:00
# Txn_start_ts: 444257405948395529
# Query_time: 176.126226988
# Parse_time: 0
# Compile_time: 0.000114942
# Rewrite_time: 0.000026448
# Optimize_time: 0.000279455
# Wait_TS: 0.000249822
# Prewrite_time: 27.688009498 Commit_time: 74.951987366 Get_commit_ts_time: 0.000345107 Commit_backoff_time: 0.002 Commit_Backoff_types: [regionMiss] Slowest_prewrite_rpc_detail: {total:27.688s, region_id: 10837, store: 192.168.80.2:23360, tikv_wall_time: 862.5µs, scan_detail: {get_snapshot_time: 14.5µs, rocksdb: {block: {cache_hit_count: 4}}}, write_detail: {store_batch_wait: 33.5µs, propose_send_wait: 0s, persist_log: {total: 106.3µs, write_leader_wait: 221ns, sync_log: 53.5µs, write_memtable: 2.76µs}, commit_log: 400.7µs, apply_batch_wait: 30.8µs, apply: {total:129.7µs, mutex_lock: 0s, write_leader_wait: 0s, write_wal: 31.3µs, write_memtable: 19.2µs}}} Commit_primary_rpc_detail: {total:27.688s, region_id: 10837, store: 192.168.80.2:23360, tikv_wall_time: 862.5µs, scan_detail: {get_snapshot_time: 14.5µs, rocksdb: {block: {cache_hit_count: 4}}}, write_detail: {store_batch_wait: 33.5µs, propose_send_wait: 0s, persist_log: {total: 106.3µs, write_leader_wait: 221ns, sync_log: 53.5µs, write_memtable: 2.76µs}, commit_log: 400.7µs, apply_batch_wait: 30.8µs, apply: {total:129.7µs, mutex_lock: 0s, write_leader_wait: 0s, write_wal: 31.3µs, write_memtable: 19.2µs}}} Write_keys: 1 Write_size: 66 Prewrite_region: 1
# Is_internal: true
# Digest: 372e2010ba3b46891b9c8ec8b2264af8cd2d88aa657b800cece6decc96e1d917
# Num_cop_tasks: 0
# Mem_max: 8447
# Prepared: false
# Plan_from_cache: false
# Plan_from_binding: false
# Has_more_results: false
# KV_total: 176.122251768
# PD_total: 0.000593377
# Backoff_total: 0.002
# Write_sql_response_total: 0
# Result_rows: 0
# Succ: true
# IsExplicitTxn: false
# IsSyncStatsFailed: false
# Plan: tidb_decode_plan('4AnwTDAJMjdfMQkwCTAJTi9BCTAJdGltZToxbTEzLjVzLCBsb29wczoxLCBwcmVwYXJlOiAxN8K1cywgY2hlY2tfaW5zZXJ0OiB7dG90YWxfBTwAIBU9CG1lbQ0iAF8NGgwwMS43BT8ccHJlZmV0Y2gdLkhycGM6e0JhdGNoR2V0OntudW1fARMIMiwgHV0NXCx9LCB0aWt2X3dhbGwNXEQyLjltcywgc2Nhbl9kZXRhaWwVlzRwcm9jZXNzX2tleXM6IBVRLhcAJF9zaXplOiAxNjcRbxUtKGdldF9zbmFwc2hvFcgIOS4yBcfwRnJvY2tzZGI6IHtibG9jazoge2NhY2hlX2hpdF9jb3VudDogN319fX19LCBjb21taXRfdHhuOiB7cHJld3JpdGU6MjcuN3MsBWQRIRhzOjM0NS4xBWAJFAH2IY0YYmFja29mZgHYFGltZTogMgHxCSN4IHR5cGU6IFtyZWdpb25NaXNzXX0sIHNsb3dlc3RfcA1vJVEEIHslUCg6IDI3LjY4OHMsIAk1OF9pZDogMTA4MzcsIHN0byXsQDkyLjE2OC44MC4yOjIzMzYwRngBEDg2Mi41Ba02ewFONwEANA0qijcBBDR9ITUlJTnPBZcEX2JBJCBfd2FpdDogMzMNUyxwcm9wb3NlX3NlbmQNHDgwcywgcGVyc2lzdF9sb2cZ+RAxMDYuMwWzCWAUbGVhZGVyDTYMMjIxbkFICHluYwk2ADURYwktGG1lbXRhYmwhfxAuNzbCtUGMKYQJLwg0MDBN5BBhcHBseTauAAgwLjgFeAUbFZEIMTI5DTMUbXV0ZXhfSUsBvwlzMqAAGRcYd2FsOiAzMTLLABmeTaQlRw2gGHByaW1hcnk2BgIQMTQuOTRBhPIGAhAxOTguNAXbhgYCADYts6IGAgB9VaMUbnVtOjEsLRdlohUOWGJ5dGU6NjZ9CTI1NSBCeXRlcwlOL0EK')
# Binary_plan: tidb_decode_binary_plan('6gnwRgrlCQoISW5zZXJ0XzE4AUABUgNOL0FaFXRpbWU6MW0xMy41cywgbG9vcHM6MWK/AnByZXBhcmU6IDE3wrVzLCBjaGVja19pBUEgOiB7dG90YWxfBT0AIBU+EG1lbV9pBSIRGgwwMS43BT8ccHJlZmV0Y2gdLkhycGM6e0JhdGNoR2V0OntudW1fARMIMiwgHV0NXCh9LCB0aWt2X3dhbBF2RDIuOW1zLCBzY2FuX2RldGFpbBWXNHByb2Nlc3Nfa2V5czogFVEuFwAkX3NpemU6IDE2NxFvFS0oZ2V0X3NuYXBzaG8VyAg5LjIFx/BPcm9ja3NkYjoge2Jsb2NrOiB7Y2FjaGVfaGl0X2NvdW50OiA3fX19fX1i6AZjb21taXRfdHhuOiB7cHJld3JpdGU6MjcuN3MsIGdldF9jb20FIRhzOjM0NS4xBWEJFAH3KDVzLCBiYWNrb2ZmAdkUaW1lOiAyAfIJI3QgdHlwZTogW3JlZ2lvbk1pc3NdfSwgc2xvd2VzdF8RbyVSLa4oOiAyNy42ODhzLCAJNThfaWQ6IDEwODM3LCBzdG8l7UA5Mi4xNjguODAuMjoyMzM2MEZ5ARA4NjIuNQWtNnwBTjgBADQNKoo4ARg0fX19LCB3ISU50AWXBF9iQSUgX3dhaXQ6IDMzDVMscHJvcG9zZV9zZW5kDRw4MHMsIHBlcnNpc3RfbG9nGfkQMTA2LjMFswlgFGxlYWRlcg02DDIyMW5BSQh5bmMJNgA1EWMJLRhtZW10YWJsIX8QLjc2wrVBjSmECS8INDAwTeUQYXBwbHk2rgAIMC44BXgFGxWRCDEyOQ0zFG11dGV4X0lMAb8JczKgABkXGHdhbDogMzEyywAZnk2lJUcNoBhwcmltYXJ5NgYCGDE0Ljk0N3P6BgIQMTk4LjQF24YGAgA2LbOiBgIAfRWdEG51bToxURdloxUOXGJ5dGU6NjZ9cP8BeP///////////wEYAQ==')
insert into mysql.tidb (variable_name, variable_value) values ('tidb_stats_gc_last_ts', 444257236350140416) on duplicate key update variable_value = 444257236350140416;
The slow query's execution plan is as follows:
id task estRows operator info actRows execution info memory disk
Insert_1 root 0 N/A 0 time:1m13.5s, loops:1, prepare: 17µs, check_insert: {total_time: 1m13.5s, mem_insert_time: 101.7µs, prefetch: 1m13.5s, rpc:{BatchGet:{num_rpc:2, total_time:1m13.5s}, tikv_wall_time: 2.9ms, scan_detail: {total_process_keys: 2, total_process_keys_size: 167, total_keys: 2, get_snapshot_time: 19.2µs, rocksdb: {block: {cache_hit_count: 7}}}}}, commit_txn: {prewrite:27.7s, get_commit_ts:345.1µs, commit:1m15s, backoff: {time: 2ms, commit type: [regionMiss]}, slowest_prewrite_rpc: {total: 27.688s, region_id: 10837, store: 192.168.80.2:23360, tikv_wall_time: 862.5µs, scan_detail: {get_snapshot_time: 14.5µs, rocksdb: {block: {cache_hit_count: 4}}}, write_detail: {store_batch_wait: 33.5µs, propose_send_wait: 0s, persist_log: {total: 106.3µs, write_leader_wait: 221ns, sync_log: 53.5µs, write_memtable: 2.76µs}, commit_log: 400.7µs, apply_batch_wait: 30.8µs, apply: {total:129.7µs, mutex_lock: 0s, write_leader_wait: 0s, write_wal: 31.3µs, write_memtable: 19.2µs}}}, commit_primary_rpc: {total: 14.947s, region_id: 10837, store: 192.168.80.2:23360, tikv_wall_time: 198.4µs, scan_detail: {get_snapshot_time: 16.5µs, rocksdb: {block: {cache_hit_count: 4}}}, }, region_num:1, write_keys:1, write_byte:66} 255 Bytes N/A
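A ts-based lookup like the one used above can be done through the slow-query memory table instead of grepping log files. A minimal sketch, assuming 23400 is the SQL port from the topology (an assumption):

```shell
# information_schema.slow_query exposes the slow log of the connected TiDB
# instance as a table; filter by the transaction start ts found in the TiKV log.
mysql -h 192.168.78.2 -P 23400 -u root -e "
  SELECT time, query_time, prewrite_time, commit_time, query
  FROM information_schema.slow_query
  WHERE txn_start_ts = 444257405948395529;
"
```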
Then I added some metrics ( #973 ) to the batch-client; they reveal the root cause of the slow Insert statement execution.
In the following metric, although I injected only around 1 second of network latency between 192.168.78.2 and 192.168.80.2, the max batch conn send duration is 15s.
In the following metric, the max got-resp is 1.14min and the max wait-to-send is 16s, which is unexpected.
