Skip to content

status server: add /ready API#18237

Merged
ti-chi-bot[bot] merged 9 commits intotikv:masterfrom
hbisheng:ready-api
Mar 5, 2025
Merged

status server: add /ready API#18237
ti-chi-bot[bot] merged 9 commits intotikv:masterfrom
hbisheng:ready-api

Conversation

@hbisheng
Copy link
Member

@hbisheng hbisheng commented Feb 21, 2025

What is changed and how it works?

Issue Number: Close #18244

What's Changed:

This PR introduces a /ready API to check TiKV's readiness after 
startup, similar to Kubernetes' readyz API. During rolling upgrades, 
this API helps control plane services determine whether a TiKV instance 
is fully ready to serve, facilitating the decision on whether to 
proceed with upgrading the next instance.

The API currently checks two conditions:
1. The server has successfully uploaded a heartbeat to PD.
2. The server is no longer in the busy apply stage, which means a 
   sufficient number of Raft peers have caught up applying Raft logs.

The `/ready` API supports a `verbose` parameter to return a detailed 
JSON response that shows the status of each condition.

Note: PD has already implemented a similar API in #8749. By adding this
unified interface across TiDB components (TiKV, PD, TiDB), we make the 
readiness check process more consistent and easier for the control plane 
service.

Related changes

  • PR to update pingcap/docs/pingcap/docs-cn:
  • Need to cherry-pick to the release branch

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Manual test steps:

  1. Started a cluster locally and ran sysbench traffic.
  2. Stopped a tikv server for a minute and then restarted it.
  3. In the meantime, query the ready API every second.

Observation:
The ready API began returning status 200 after the Raft logs have caught up, with the time aligned with the tikv logs and the "Server is busy" metrics.

[2025/02/28 15:18:26.711 +08:00] [INFO] [lib.rs:89] ["Welcome to TiKV"] [thread_id=1]
[2025/02/28 15:18:36.839 +08:00] [INFO] [pd.rs:1291] ["ServerReadiness: connected to PD"] [thread_id=31]
[2025/02/28 15:19:56.856 +08:00] [INFO] [store.rs:2948] ["ServerReadiness: Raft peers have caught up applying logs"] [thread_id=101]
image

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Release note

Add /ready API to indicate TiKV startup readiness, supporting safer rolling upgrades.

Signed-off-by: Bisheng Huang <hbisheng@gmail.com>
@ti-chi-bot ti-chi-bot bot added do-not-merge/needs-linked-issue dco-signoff: yes Indicates the PR's author has signed the dco. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Feb 21, 2025
@hbisheng hbisheng marked this pull request as draft February 21, 2025 01:49
@ti-chi-bot ti-chi-bot bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 21, 2025
Signed-off-by: Bisheng Huang <hbisheng@gmail.com>
@hbisheng hbisheng changed the title Add /ready API to check server readiness after startup status server: add /ready API Feb 24, 2025
@ti-chi-bot ti-chi-bot bot added release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/needs-linked-issue do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Feb 24, 2025
Signed-off-by: Bisheng Huang <hbisheng@gmail.com>
Signed-off-by: Bisheng Huang <hbisheng@gmail.com>
@hbisheng hbisheng marked this pull request as ready for review February 24, 2025 06:57
@ti-chi-bot ti-chi-bot bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 24, 2025

// A reference for updating the server readiness state, which is queried by
// the status server's /ready API.
pub server_readiness: Arc<ServerReadiness>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just want to start a discussion.
Seems that, this SeverReadiness can be a global variable instead of a state being propagated through different modules.
What do you think?

Copy link
Member Author

@hbisheng hbisheng Feb 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using a global variable sounds feasible because this ServerReadiness struct should only be created once. It will also make the implementation easier by avoiding the propagation. But I'm not sure about the idiomatic way of using global variables in Rust; need to check some examples first.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the code to use a global variable—no need to pass it around any more! Thanks for the suggestion

Signed-off-by: Bisheng Huang <hbisheng@gmail.com>
Signed-off-by: Bisheng Huang <hbisheng@gmail.com>
@hbisheng
Copy link
Member Author

@v01dstar @LykxSassinator PTAL again, thanks!

@ti-chi-bot ti-chi-bot bot added needs-1-more-lgtm Indicates a PR needs 1 more LGTM. approved labels Feb 27, 2025
@hbisheng
Copy link
Member Author

/hold
will add some logs indicating state changes

@ti-chi-bot ti-chi-bot bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 28, 2025
Signed-off-by: Bisheng Huang <hbisheng@gmail.com>
@hbisheng
Copy link
Member Author

hbisheng commented Mar 5, 2025

/unhold

@ti-chi-bot ti-chi-bot bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 5, 2025
Copy link
Contributor

@LykxSassinator LykxSassinator left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rest LGTM

Co-authored-by: lucasliang <nkcs_lykx@hotmail.com>
Signed-off-by: Bisheng Huang <hbisheng@gmail.com>
@hbisheng
Copy link
Member Author

hbisheng commented Mar 5, 2025

/retest

@ti-chi-bot
Copy link
Contributor

ti-chi-bot bot commented Mar 5, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: LykxSassinator, v01dstar

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [LykxSassinator,v01dstar]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added lgtm and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Mar 5, 2025
@ti-chi-bot
Copy link
Contributor

ti-chi-bot bot commented Mar 5, 2025

[LGTM Timeline notifier]

Timeline:

  • 2025-02-27 03:30:08.451317153 +0000 UTC m=+499356.404475411: ☑️ agreed by v01dstar.
  • 2025-03-05 07:38:15.120813859 +0000 UTC m=+427208.249733602: ☑️ agreed by LykxSassinator.

@ti-chi-bot
Copy link
Contributor

ti-chi-bot bot commented Mar 5, 2025

@hbisheng: Your PR was out of date, I have automatically updated it for you.

If the CI test fails, you just re-trigger the test that failed and the bot will merge the PR for you after the CI passes.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@ti-chi-bot ti-chi-bot bot merged commit 335e3b8 into tikv:master Mar 5, 2025
8 checks passed
@ti-chi-bot ti-chi-bot bot added this to the Pool milestone Mar 5, 2025
@ti-chi-bot ti-chi-bot bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels May 28, 2025
@hbisheng hbisheng deleted the ready-api branch July 14, 2025 07:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved dco-signoff: yes Indicates the PR's author has signed the dco. lgtm release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Implement a /ready API to check server readiness post startup

3 participants