Skip to content

fix tiflash resource control low token signal miss#10140

Merged
ti-chi-bot[bot] merged 23 commits intopingcap:masterfrom
guo-shaoge:fix_token_miss
May 20, 2025
Merged

fix tiflash resource control low token signal miss#10140
ti-chi-bot[bot] merged 23 commits intopingcap:masterfrom
guo-shaoge:fix_token_miss

Conversation

@guo-shaoge
Copy link
Contributor

@guo-shaoge guo-shaoge commented Apr 27, 2025

What problem does this PR solve?

Issue Number: close #10137

Problem Summary:
before

  1. low token notify doesn't work becuase wait_for ignore low token signal
  2. 5s for computing avg speed is too long, which may not reflect the latest ru speed.
  3. acquire token from GAC every 5s

after:

  1. mainLoop thread will compute ru speed every 1s
  2. requestGACLoop thread is dedicated to sending gac thread
  3. only acquire token from GAC when got low token signal
  4. change min_wait_dura_ms from 100ms to 10ms.

What is changed and how it works?


Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  1. create tidb cluster with 2 tiflash(16c)
  2. load tpch50g workload
  3. run the following scripts: https://github.com/guo-shaoge/tiflash_resource_control_test_tool, detailed test report is internal
    1. high.change.sh

      image

    2. periodically.change.sh

      image

    3. priority.change.sh
      image

    4. burstable.change.sh
      image

  • No code

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

fix tiflash resource control low token signal miss

@ti-chi-bot ti-chi-bot bot added do-not-merge/needs-linked-issue release-note-none Denotes a PR that doesn't merit a release note. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Apr 27, 2025
Signed-off-by: guo-shaoge <shaoge1994@163.com>
Signed-off-by: guo-shaoge <shaoge1994@163.com>
Signed-off-by: guo-shaoge <shaoge1994@163.com>
@ti-chi-bot ti-chi-bot bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Apr 30, 2025
Signed-off-by: guo-shaoge <shaoge1994@163.com>
Signed-off-by: guo-shaoge <shaoge1994@163.com>
Signed-off-by: guo-shaoge <shaoge1994@163.com>
Signed-off-by: guo-shaoge <shaoge1994@163.com>
Signed-off-by: guo-shaoge <shaoge1994@163.com>
guo-shaoge added 2 commits May 9, 2025 11:01
Signed-off-by: guo-shaoge <shaoge1994@163.com>
Signed-off-by: guo-shaoge <shaoge1994@163.com>
@guo-shaoge guo-shaoge changed the title fix resource control token miss because overwrite tokens in local bucket fix resource control token miss because low token signal is ignored May 9, 2025
@guo-shaoge guo-shaoge changed the title fix resource control token miss because low token signal is ignored fix tiflash resource control low token signal is missed May 9, 2025
guo-shaoge added 2 commits May 9, 2025 15:06
Signed-off-by: guo-shaoge <shaoge1994@163.com>
Signed-off-by: guo-shaoge <shaoge1994@163.com>
@guo-shaoge guo-shaoge requested a review from Copilot May 9, 2025 08:02
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR fixes the missed low token signal in TiFlash resource control by updating the TokenBucket low token threshold calculation, refining time-based functions with explicit timestamps, and reducing the minimum wait duration.

  • Update low token threshold to use initial tokens instead of capacity.
  • Revise TokenBucket functions (consume, peek, reConfig, compact, getDynamicTokens) to accept an explicit timestamp parameter.
  • Reduce min_wait_dura_ms from 100ms to 10ms and update wait duration usage in the scheduling queue.

Reviewed Changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
dbms/src/Flash/ResourceControl/TokenBucket.h Adjusted initialization of low token threshold and updated function signatures to accept timestamps.
dbms/src/Flash/ResourceControl/TokenBucket.cpp Refactored functions to use provided timestamps; removed unnecessary iostream include.
dbms/src/Flash/ResourceControl/MockLocalAdmissionController.cpp Updated reConfig call to include current time.
dbms/src/Flash/Pipeline/Schedule/TaskQueues/tests/gtest_resource_control_queue.cpp Updated reConfig calls to include timestamps.
dbms/src/Flash/Pipeline/Schedule/TaskQueues/ResourceControlQueue.cpp Updated wait duration variable and comments to reflect new timeout logic.
dbms/src/Common/TiFlashMetrics.h Added new metrics for low token threshold, GAC requests, and degradation mode.

@guo-shaoge guo-shaoge changed the title fix tiflash resource control low token signal is missed fix tiflash resource control low token signal miss May 9, 2025
Signed-off-by: guo-shaoge <shaoge1994@163.com>
@guo-shaoge guo-shaoge requested review from Copilot and windtalker May 9, 2025 08:16
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR fixes the low token signal issue in the TiFlash resource control by adjusting how the low token threshold is computed, refactoring time-related parameters in token bucket methods, and reducing the minimal wait duration. Key changes include:

  • Changing the low token threshold calculation in TokenBucket to use init_tokens instead of capacity.
  • Updating related APIs (consume, reConfig, peek, compact, getDynamicTokens) to require an explicit timestamp.
  • Adjusting the minimal wait duration from 100ms to 10ms and updating related configurations and metrics.

Reviewed Changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated no comments.

Show a summary per file
File Description
dbms/src/Flash/ResourceControl/TokenBucket.h Updated threshold calculation and API signatures to include time.
dbms/src/Flash/ResourceControl/TokenBucket.cpp Modified function implementations to use the new time parameter.
dbms/src/Flash/ResourceControl/MockLocalAdmissionController.cpp Updated reConfig call to include a timestamp.
dbms/src/Flash/Pipeline/Schedule/TaskQueues/tests/gtest_resource_control_queue.cpp Adjusted reConfig invocation with proper timestamp.
dbms/src/Flash/Pipeline/Schedule/TaskQueues/ResourceControlQueue.cpp Changed wait duration assignment for scheduling tasks.
dbms/src/Common/TiFlashMetrics.h Added new metrics for low token threshold and GAC-related counts.
Comments suppressed due to low confidence (2)

dbms/src/Flash/ResourceControl/TokenBucket.h:47

  • [nitpick] Please confirm that computing low_token_threshold using init_tokens_ instead of capacity_ is the intended behavior, as this change may affect the signaling logic.
        , low_token_threshold(LOW_TOKEN_THRESHOLD_RATE * init_tokens_)

dbms/src/Flash/Pipeline/Schedule/TaskQueues/ResourceControlQueue.cpp:117

  • [nitpick] Verify that using DEFAULT_MAX_EST_WAIT_DURATION for wait_dura achieves the desired scheduling responsiveness compared to the previous DEFAULT_FETCH_GAC_INTERVAL_MS.
        UInt64 wait_dura = LocalAdmissionController::DEFAULT_MAX_EST_WAIT_DURATION.count();

@guo-shaoge guo-shaoge requested a review from JaySon-Huang May 20, 2025 11:46
JaySon-Huang and others added 6 commits May 20, 2025 20:34
Signed-off-by: JaySon-Huang <tshent@qq.com>
Signed-off-by: JaySon-Huang <tshent@qq.com>
Signed-off-by: JaySon-Huang <tshent@qq.com>
Signed-off-by: JaySon-Huang <tshent@qq.com>
Signed-off-by: JaySon-Huang <tshent@qq.com>
Refine some error message for troubleshooting
Copy link
Contributor

@JaySon-Huang JaySon-Huang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ti-chi-bot
Copy link
Contributor

ti-chi-bot bot commented May 20, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JaySon-Huang, solotzg

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [JaySon-Huang,solotzg]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added lgtm and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels May 20, 2025
@ti-chi-bot
Copy link
Contributor

ti-chi-bot bot commented May 20, 2025

[LGTM Timeline notifier]

Timeline:

  • 2025-05-14 09:22:01.86890777 +0000 UTC m=+523388.016868989: ☑️ agreed by solotzg.
  • 2025-05-20 13:09:44.999224331 +0000 UTC m=+94422.100391553: ☑️ agreed by JaySon-Huang.

Signed-off-by: JaySon-Huang <tshent@qq.com>
@ti-chi-bot ti-chi-bot bot merged commit 3e1aa83 into pingcap:master May 20, 2025
5 checks passed
@JaySon-Huang JaySon-Huang deleted the fix_token_miss branch May 20, 2025 14:09
@guo-shaoge
Copy link
Contributor Author

/run-cherry-picker

@ti-chi-bot ti-chi-bot bot added needs-cherry-pick-release-8.5 Should cherry pick this PR to release-8.5 branch. needs-cherry-pick-release-8.1 Should cherry pick this PR to release-8.1 branch. needs-cherry-pick-release-7.5 Should cherry pick this PR to release-7.5 branch. labels May 21, 2025
@ti-chi-bot
Copy link
Member

In response to a cherrypick label: new pull request created to branch release-8.5: #10193.

@ti-chi-bot
Copy link
Member

In response to a cherrypick label: new pull request created to branch release-8.1: #10194.

@ti-chi-bot
Copy link
Member

In response to a cherrypick label: new pull request created to branch release-7.5: #10195.
But this PR has conflicts, please resolve them!

ti-chi-bot pushed a commit to ti-chi-bot/tiflash that referenced this pull request May 21, 2025
Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
ti-chi-bot bot pushed a commit that referenced this pull request May 21, 2025
close #10137

Signed-off-by: JaySon-Huang <tshent@qq.com>

Co-authored-by: JaySon-Huang <tshent@qq.com>
ti-chi-bot bot pushed a commit that referenced this pull request May 22, 2025
close #10137

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
Signed-off-by: guo-shaoge <shaoge1994@163.com>

Co-authored-by: guo-shaoge <shaoge1994@163.com>
@ti-chi-bot ti-chi-bot bot removed the needs-cherry-pick-release-8.5 Should cherry pick this PR to release-8.5 branch. label Jun 12, 2025
ti-chi-bot bot pushed a commit that referenced this pull request Jun 13, 2025
close #10137

Signed-off-by: JaySon-Huang <tshent@qq.com>

Co-authored-by: JaySon-Huang <tshent@qq.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved lgtm needs-cherry-pick-release-7.5 Should cherry pick this PR to release-7.5 branch. needs-cherry-pick-release-8.1 Should cherry pick this PR to release-8.1 branch. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

tiflash resource control triggers unexpected throttling for periodically high-traffic workload

5 participants