improve region request log for diagnose #1300
Conversation
Signed-off-by: crazycs520 <crazycs520@gmail.com>
force-pushed from d1670c1 to d6c8e6e
ekexium left a comment:
LGTM except one question: does it introduce observable performance overhead? Do we have experiments for that?
Nice catch. Since replica access paths are only collected when an error is met, most normal cases won't introduce any performance overhead. Anyway, I triggered a sysbench performance test; let's check out the results later.
internal/locate/region_request.go
Outdated
// ReplicaAccessStats records the first 20 access infos; after more than 20 records, it counts them in the `OverflowCount` field.
type ReplicaAccessStats struct {
	AccessInfos   []ReplicaAccessInfo
	OverflowCount int
How about merging the access info by peer when there are many subsequent accesses? OverflowCount alone could not provide useful information.
Since ReplicaAccessStats currently only records accesses for this round, and in the normal case each replica should be accessed only once (the leader replica can be accessed up to 10 times), OverflowCount should normally always be 0.
What about the follower read or stale read cases? The OverflowCount still does not seem very helpful.
Consider such a case: the NotLeader error is encountered several times because of a region election; after the new leader information is returned to the kv-client, new errors caused by disk IO problems are encountered, but the related information is recorded only as OverflowCount.
great, done. PTAL
if replicaRead {
	tp = ReqReplicaRead
} else if staleRead {
Should replicaRead take priority over staleRead? Or is it guaranteed that replicaRead and staleRead are mutually exclusive?
According to the current tests, the replicaRead and staleRead flags should normally be mutually exclusive.
* improve region request log for diagnose
* rename struct
* refine region error metric with store id label and add rpc error metric
* refine comment
* refine code
* restrict log
* refine code
* refine
* refine
* refine log
* refine code
* fix test
* address comment
* refine
* refine
* refine log

Signed-off-by: crazycs520 <crazycs520@gmail.com>

Close #1315.
What changed and how does it work?
1. Reduce the region_request log.

Before this PR: after 1 TiKV server went down during a benchmark test, TiDB would output tens of thousands of logs about region_request in a matter of minutes.

After this PR: the region_request log output is restricted.
2. Make the log more friendly for diagnosing slow queries.

For example, use the Connection ID and txn timestamp to grep the TiDB log, which yields the following log:

With the above log information, we can learn the following about this query:
3. Add RPC error stats.

Before this PR, the plan with execution info may look like the following, where the TableReader_17 execution info is:

time:502.4ms, loops:1, cop_task: {num: 1, max: 0s, proc_keys: 0, rpc_num: 6, rpc_time: 485.7ms, copr_cache_hit_ratio: 0.00, build_task_duration: 7.34µs, max_distsql_concurrency: 1}, backoff{regionMiss: 2ms, dataNotReady: 6ms, regionScheduling: 6ms}

After this PR, the TableReader_17 execution info is:

time:502.4ms, loops:1, cop_task: {num: 1, max: 0s, proc_keys: 0, rpc_num: 6, rpc_time: 485.7ms, copr_cache_hit_ratio: 0.00, build_task_duration: 7.34µs, max_distsql_concurrency: 1}, rpc_errors:{data_is_not_ready:2, not_leader:2, context deadline exceeded:1, context canceled:1}, backoff{regionMiss: 2ms, dataNotReady: 6ms, regionScheduling: 6ms}

From the rpc_errors info, we can see the cop task met the data_is_not_ready error 2 times, the not_leader error 2 times, context deadline exceeded 1 time, and context canceled 1 time.

4. Refine the Region Error metric and add an RPC error metric.