Skip to content

fix(tikv): adjust timeout to get store in prestop hook#6534

Merged
ti-chi-bot[bot] merged 2 commits intopingcap:feature/v2from
liubog2008:liubo02/fix-tikv-prestop
Nov 5, 2025
Merged

fix(tikv): adjust timeout to get store in prestop hook#6534
ti-chi-bot[bot] merged 2 commits intopingcap:feature/v2from
liubog2008:liubo02/fix-tikv-prestop

Conversation

@liubog2008
Copy link
Member

When the whole cluster is deleted, the default graceful terminating seconds(30s) are all wasted by getting store.

  • adjust default terminationGracePeriod to 60s
  • adjust default get store timeout to 15s
  • limit retry times of getting leader count to 5

Signed-off-by: liubo02 <liubo02@pingcap.com>
Signed-off-by: liubo02 <liubo02@pingcap.com>
@github-actions github-actions bot added the v2 for operator v2 label Nov 5, 2025
@ti-chi-bot ti-chi-bot bot added the size/S label Nov 5, 2025
@codecov-commenter
Copy link

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 40.99%. Comparing base (6f11e0d) to head (9999278).

Additional details and impacted files
@@             Coverage Diff             @@
##           feature/v2    #6534   +/-   ##
===========================================
  Coverage       40.99%   40.99%           
===========================================
  Files             366      366           
  Lines           20295    20295           
===========================================
  Hits             8320     8320           
  Misses          11975    11975           
Flag Coverage Δ
unittest 40.99% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@fgksgf fgksgf requested a review from Copilot November 5, 2025 08:00
@fgksgf
Copy link
Member

fgksgf commented Nov 5, 2025

/lgtm

@ti-chi-bot ti-chi-bot bot added the lgtm label Nov 5, 2025
@ti-chi-bot
Copy link
Contributor

ti-chi-bot bot commented Nov 5, 2025

[LGTM Timeline notifier]

Timeline:

  • 2025-11-05 08:02:43.296891829 +0000 UTC m=+257212.739921708: ☑️ agreed by fgksgf.

@ti-chi-bot
Copy link
Contributor

ti-chi-bot bot commented Nov 5, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: fgksgf

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added the approved label Nov 5, 2025
@ti-chi-bot ti-chi-bot bot merged commit d3543d0 into pingcap:feature/v2 Nov 5, 2025
13 checks passed
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR increases the grace period timeout for TiKV pod deletion and adds retry logic to handle transient errors during the prestop hook execution. The changes aim to provide more time for TiKV to gracefully shut down by allocating 30 seconds each for the prestop hook and TiKV itself.

  • Doubled the default grace period from 30s to 60s to accommodate both prestop hook (30s) and TiKV shutdown (30s)
  • Reduced the timeout for finding the store ID from 30s to 15s in the prestop hook
  • Added retry logic with a maximum of 5 retries for transient errors when fetching leader count

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
pkg/controllers/tikv/tasks/finalizer.go Increased defaultGracePeriod from 30 to 60 seconds with updated comment explaining the allocation
cmd/prestop-checker/main.go Added maxRetryTimes constant, reduced store lookup timeout from 30s to 15s, and implemented retry logic for transient leader count fetch errors

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

leaderCount, err := cfg.Client.GetLeaderCount()
if err != nil {
fmt.Printf("cannot get leader count, try again: %v\n", err)
fmt.Printf("cannot get leader count, try again, retryTimes %v: %v\n", retryTimes, err)
Copy link

Copilot AI Nov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The log message displays retryTimes before incrementing it on line 192, which will show the count starting from 0 instead of 1. Consider moving the increment before the log statement or adjusting the display to show retryTimes+1 for clarity.

Suggested change
fmt.Printf("cannot get leader count, try again, retryTimes %v: %v\n", retryTimes, err)
fmt.Printf("cannot get leader count, try again, retryTimes %v: %v\n", retryTimes+1, err)

Copilot uses AI. Check for mistakes.
if err != nil {
fmt.Printf("cannot get leader count, try again: %v\n", err)
fmt.Printf("cannot get leader count, try again, retryTimes %v: %v\n", retryTimes, err)
retryTimes += 1
Copy link

Copilot AI Nov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] Use the increment operator retryTimes++ instead of retryTimes += 1 for consistency with Go idiomatic style.

Suggested change
retryTimes += 1
retryTimes++

Copilot uses AI. Check for mistakes.
fmt.Printf("cannot get leader count, try again, retryTimes %v: %v\n", retryTimes, err)
retryTimes += 1
if retryTimes > maxRetryTimes {
return false, fmt.Errorf("retry too many times")
Copy link

Copilot AI Nov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error message 'retry too many times' is vague and doesn't provide context about what operation failed. Consider a more descriptive message like 'failed to get leader count after %d retries' to help with debugging.

Suggested change
return false, fmt.Errorf("retry too many times")
return false, fmt.Errorf("failed to get leader count after %d retries", retryTimes)

Copilot uses AI. Check for mistakes.
liubog2008 added a commit to liubog2008/tidb-operator that referenced this pull request Nov 20, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants