DNM Add metrics about the last time a region is sent a snapshot#9429
DNM Add metrics about the last time a region is sent a snapshot#9429CalvinNeo wants to merge 10 commits intopingcap:masterfrom
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
| tmt.getRegionTable().shrinkRegionRange(*new_region); | ||
| } | ||
|
|
||
| new_region->updateSnapshotAppliedTime(); |
There was a problem hiding this comment.
We always create a new_region instance to apply snapshot instead of modifying the old_region in-place. So I think the tiflash_raft_long_term_event_duration_seconds, type_apply_snapshot_gap is not get reported at all?
There was a problem hiding this comment.
Yes, I have tried to fix this by fetch the lastSNapshotAppliedTime() from the old_region(if any) to compute. However, it is still hard to observe the metric in real tests. Because the only stable way to make TiFlash lag is to kill it. However, if TiFlash restarts, the lastSNapshotAppliedTime() will be set to 0.
Signed-off-by: Calvin Neo <calvinneo1995@gmail.com>
Co-authored-by: jinhelin <linjinhe33@gmail.com>
| is_running = true; | ||
| } | ||
|
|
||
| UInt64 getStartMillis() { return start_ns / 1000000UL; } |
There was a problem hiding this comment.
Abbreviating milliseconds to millis seem a bit strange.
getStartMS may be more clear.
There was a problem hiding this comment.
I would just use Milliseconds because other methods use this too...
| UInt64 regionId() const { return region_id; } | ||
| UInt64 peerId() const { return peer_id; } | ||
| UInt64 beginTime() const { return begin_time; } | ||
| UInt64 createdTime() const { return begin_time; } |
There was a problem hiding this comment.
Should be return created_time?
There was a problem hiding this comment.
And add comments about what is the difference between beginTime and createdTime
There was a problem hiding this comment.
Changed this PR back to DNM, because some new functionalities is added into this PR.
|
After #9434 |
Signed-off-by: Calvin Neo <calvinneo1995@gmail.com>
|
PR needs rebase. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
@CalvinNeo: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |


What problem does this PR solve?
Issue Number: ref #9241
Problem Summary:
The idea is when a region got stuck, TiKV leader could send it a snapshot. We will record how many time has passed since the last snapshot. If the time is short, then the region may have some problems here.
What is changed and how it works?
Check List
Tests
Side effects
Documentation
Release note