fix(tikv): adjust timeout to get store in prestop hook#6534
fix(tikv): adjust timeout to get store in prestop hook#6534ti-chi-bot[bot] merged 2 commits intopingcap:feature/v2from
Conversation
Signed-off-by: liubo02 <liubo02@pingcap.com>
Signed-off-by: liubo02 <liubo02@pingcap.com>
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## feature/v2 #6534 +/- ##
===========================================
Coverage 40.99% 40.99%
===========================================
Files 366 366
Lines 20295 20295
===========================================
Hits 8320 8320
Misses 11975 11975
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
|
/lgtm |
[LGTM Timeline notifier]Timeline:
|
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: fgksgf The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Pull Request Overview
This PR increases the grace period timeout for TiKV pod deletion and adds retry logic to handle transient errors during the prestop hook execution. The changes aim to provide more time for TiKV to gracefully shut down by allocating 30 seconds each for the prestop hook and TiKV itself.
- Doubled the default grace period from 30s to 60s to accommodate both prestop hook (30s) and TiKV shutdown (30s)
- Reduced the timeout for finding the store ID from 30s to 15s in the prestop hook
- Added retry logic with a maximum of 5 retries for transient errors when fetching leader count
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| pkg/controllers/tikv/tasks/finalizer.go | Increased defaultGracePeriod from 30 to 60 seconds with updated comment explaining the allocation |
| cmd/prestop-checker/main.go | Added maxRetryTimes constant, reduced store lookup timeout from 30s to 15s, and implemented retry logic for transient leader count fetch errors |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| leaderCount, err := cfg.Client.GetLeaderCount() | ||
| if err != nil { | ||
| fmt.Printf("cannot get leader count, try again: %v\n", err) | ||
| fmt.Printf("cannot get leader count, try again, retryTimes %v: %v\n", retryTimes, err) |
There was a problem hiding this comment.
The log message displays retryTimes before incrementing it on line 192, which will show the count starting from 0 instead of 1. Consider moving the increment before the log statement or adjusting the display to show retryTimes+1 for clarity.
| fmt.Printf("cannot get leader count, try again, retryTimes %v: %v\n", retryTimes, err) | |
| fmt.Printf("cannot get leader count, try again, retryTimes %v: %v\n", retryTimes+1, err) |
| if err != nil { | ||
| fmt.Printf("cannot get leader count, try again: %v\n", err) | ||
| fmt.Printf("cannot get leader count, try again, retryTimes %v: %v\n", retryTimes, err) | ||
| retryTimes += 1 |
There was a problem hiding this comment.
[nitpick] Use the increment operator retryTimes++ instead of retryTimes += 1 for consistency with Go idiomatic style.
| retryTimes += 1 | |
| retryTimes++ |
| fmt.Printf("cannot get leader count, try again, retryTimes %v: %v\n", retryTimes, err) | ||
| retryTimes += 1 | ||
| if retryTimes > maxRetryTimes { | ||
| return false, fmt.Errorf("retry too many times") |
There was a problem hiding this comment.
The error message 'retry too many times' is vague and doesn't provide context about what operation failed. Consider a more descriptive message like 'failed to get leader count after %d retries' to help with debugging.
| return false, fmt.Errorf("retry too many times") | |
| return false, fmt.Errorf("failed to get leader count after %d retries", retryTimes) |
When the whole cluster is deleted, the default graceful terminating seconds(30s) are all wasted by getting store.