Skip to content

Yield CPU for concurrent flush and concurrent mergeDelta#5410

Merged
ti-chi-bot merged 6 commits intomasterfrom
wenxuan/compact/idle_wait
Jul 20, 2022
Merged

Yield CPU for concurrent flush and concurrent mergeDelta#5410
ti-chi-bot merged 6 commits intomasterfrom
wenxuan/compact/idle_wait

Conversation

@breezewish
Copy link
Member

@breezewish breezewish commented Jul 19, 2022

What problem does this PR solve?

Issue Number: close #5409

What is changed and how it works?

Add sleep for flushCache and mergeDeltaBySegment:

  • When flushCache is retrying, wait backoff will be 5ms ~ 100ms (considering that existing flushCache usually takes short time to finish).

  • When mergeDeltaBySegment is retrying, wait backoff will be 50ms ~ 1s (considering that split-prepare could take several seconds to finish).

Check List

Tests

  • Unit test
  • Integration test
  • Manual test

To test with the fix, I introduced a splitEachSegment debug function locally to manually trigger a split:

void DeltaMergeStore::splitEachSegment(const Context & db_context)
{
    auto dm_context = newDMContext(db_context, db_context.getSettingsRef(), /*tracing_id*/ "split_each_segment");
    std::vector<SegmentPtr> current_segments;
    {
        std::shared_lock lock(read_write_mutex);
        for (const auto & segment_it : segments)
        {
            current_segments.push_back(segment_it.second);
        }
    }
    for (const auto & segment : current_segments)
    {
        segmentSplit(*dm_context, segment, true);
    }
}

The test case is to trigger the split for a 1GB segment, and then perform a mergeDelta at the same time.

Before the fix (using release v6.1):

when there are both split (takes 10s) and mergeDelta (takes 20s in total, blocked by split for 10s), there are 211K retries in 10s when the mergeDelta is blocked:

❯ cat ~/.tiup/data/my_test_r61_no_fix/tiflash-0/log/tiflash.log | grep "Give up merge delta, segment" | head
[2022/07/19 21:55:42.057 +08:00] [DEBUG] [DeltaMergeStore.cpp:2162] ["DeltaMergeStore:db_1.t_69 Give up merge delta, segment [1]"] [thread_id=30]
[2022/07/19 21:55:42.057 +08:00] [DEBUG] [DeltaMergeStore.cpp:2162] ["DeltaMergeStore:db_1.t_69 Give up merge delta, segment [1]"] [thread_id=30]
[2022/07/19 21:55:42.057 +08:00] [DEBUG] [DeltaMergeStore.cpp:2162] ["DeltaMergeStore:db_1.t_69 Give up merge delta, segment [1]"] [thread_id=30]
[2022/07/19 21:55:42.057 +08:00] [DEBUG] [DeltaMergeStore.cpp:2162] ["DeltaMergeStore:db_1.t_69 Give up merge delta, segment [1]"] [thread_id=30]
[2022/07/19 21:55:42.057 +08:00] [DEBUG] [DeltaMergeStore.cpp:2162] ["DeltaMergeStore:db_1.t_69 Give up merge delta, segment [1]"] [thread_id=30]
[2022/07/19 21:55:42.057 +08:00] [DEBUG] [DeltaMergeStore.cpp:2162] ["DeltaMergeStore:db_1.t_69 Give up merge delta, segment [1]"] [thread_id=30]
[2022/07/19 21:55:42.057 +08:00] [DEBUG] [DeltaMergeStore.cpp:2162] ["DeltaMergeStore:db_1.t_69 Give up merge delta, segment [1]"] [thread_id=30]
[2022/07/19 21:55:42.057 +08:00] [DEBUG] [DeltaMergeStore.cpp:2162] ["DeltaMergeStore:db_1.t_69 Give up merge delta, segment [1]"] [thread_id=30]
[2022/07/19 21:55:42.057 +08:00] [DEBUG] [DeltaMergeStore.cpp:2162] ["DeltaMergeStore:db_1.t_69 Give up merge delta, segment [1]"] [thread_id=30]
[2022/07/19 21:55:42.057 +08:00] [DEBUG] [DeltaMergeStore.cpp:2162] ["DeltaMergeStore:db_1.t_69 Give up merge delta, segment [1]"] [thread_id=30]

❯ cat ~/.tiup/data/my_test_r61_no_fix/tiflash-0/log/tiflash.log | grep "Give up merge delta, segment" | tail
[2022/07/19 21:55:52.171 +08:00] [DEBUG] [DeltaMergeStore.cpp:2162] ["DeltaMergeStore:db_1.t_69 Give up merge delta, segment [1]"] [thread_id=30]
[2022/07/19 21:55:52.171 +08:00] [DEBUG] [DeltaMergeStore.cpp:2162] ["DeltaMergeStore:db_1.t_69 Give up merge delta, segment [1]"] [thread_id=30]
[2022/07/19 21:55:52.171 +08:00] [DEBUG] [DeltaMergeStore.cpp:2162] ["DeltaMergeStore:db_1.t_69 Give up merge delta, segment [1]"] [thread_id=30]
[2022/07/19 21:55:52.171 +08:00] [DEBUG] [DeltaMergeStore.cpp:2162] ["DeltaMergeStore:db_1.t_69 Give up merge delta, segment [1]"] [thread_id=30]
[2022/07/19 21:55:52.171 +08:00] [DEBUG] [DeltaMergeStore.cpp:2162] ["DeltaMergeStore:db_1.t_69 Give up merge delta, segment [1]"] [thread_id=30]
[2022/07/19 21:55:52.171 +08:00] [DEBUG] [DeltaMergeStore.cpp:2162] ["DeltaMergeStore:db_1.t_69 Give up merge delta, segment [1]"] [thread_id=30]
[2022/07/19 21:55:52.171 +08:00] [DEBUG] [DeltaMergeStore.cpp:2162] ["DeltaMergeStore:db_1.t_69 Give up merge delta, segment [1]"] [thread_id=30]
[2022/07/19 21:55:52.171 +08:00] [DEBUG] [DeltaMergeStore.cpp:2162] ["DeltaMergeStore:db_1.t_69 Give up merge delta, segment [1]"] [thread_id=30]
[2022/07/19 21:55:52.171 +08:00] [DEBUG] [DeltaMergeStore.cpp:2162] ["DeltaMergeStore:db_1.t_69 Give up merge delta, segment [1]"] [thread_id=30]
[2022/07/19 21:55:52.171 +08:00] [DEBUG] [DeltaMergeStore.cpp:2162] ["DeltaMergeStore:db_1.t_69 Give up merge delta, segment [1]"] [thread_id=30]

❯ cat ~/.tiup/data/my_test_r61_no_fix/tiflash-0/log/tiflash.log | grep "Give up merge delta, segment" | wc -l
  211848

The CPU usage is around 200% during the split+mergeDelta:

5.7   tiflash
128.0 tiflash  <--- split begin, mergeDelta begin
196.4 tiflash
185.1 tiflash
152.7 tiflash
184.9 tiflash
202.0 tiflash
191.3 tiflash
197.7 tiflash
192.1 tiflash
202.6 tiflash  <--- split is finished
115.7 tiflash
102.6 tiflash
103.0 tiflash
101.3 tiflash
101.0 tiflash
103.9 tiflash
100.8 tiflash
102.0 tiflash
102.5 tiflash
68.2  tiflash
4.7   tiflash
0.0   tiflash

After the fix:

there are only 14 retry attempts with exp backoff:

❯ cat ~/.tiup/data/my_test_fix/tiflash-0/log/tiflash.log | grep "Give up merge delta, segment"
[2022/07/19 20:58:44.240 +08:00] [DEBUG] [DeltaMergeStore.cpp:2218] ["DeltaMergeStore:db_1.t_72 Give up merge delta, segment [1]"] [thread_id=25]
[2022/07/19 20:58:44.295 +08:00] [DEBUG] [DeltaMergeStore.cpp:2218] ["DeltaMergeStore:db_1.t_72 Give up merge delta, segment [1]"] [thread_id=25]
[2022/07/19 20:58:44.399 +08:00] [DEBUG] [DeltaMergeStore.cpp:2218] ["DeltaMergeStore:db_1.t_72 Give up merge delta, segment [1]"] [thread_id=25]
[2022/07/19 20:58:44.604 +08:00] [DEBUG] [DeltaMergeStore.cpp:2218] ["DeltaMergeStore:db_1.t_72 Give up merge delta, segment [1]"] [thread_id=25]
[2022/07/19 20:58:45.007 +08:00] [DEBUG] [DeltaMergeStore.cpp:2218] ["DeltaMergeStore:db_1.t_72 Give up merge delta, segment [1]"] [thread_id=25]
[2022/07/19 20:58:45.812 +08:00] [DEBUG] [DeltaMergeStore.cpp:2218] ["DeltaMergeStore:db_1.t_72 Give up merge delta, segment [1]"] [thread_id=25]
[2022/07/19 20:58:46.815 +08:00] [DEBUG] [DeltaMergeStore.cpp:2218] ["DeltaMergeStore:db_1.t_72 Give up merge delta, segment [1]"] [thread_id=25]
[2022/07/19 20:58:47.821 +08:00] [DEBUG] [DeltaMergeStore.cpp:2218] ["DeltaMergeStore:db_1.t_72 Give up merge delta, segment [1]"] [thread_id=25]
[2022/07/19 20:58:48.825 +08:00] [DEBUG] [DeltaMergeStore.cpp:2218] ["DeltaMergeStore:db_1.t_72 Give up merge delta, segment [1]"] [thread_id=25]
[2022/07/19 20:58:49.826 +08:00] [DEBUG] [DeltaMergeStore.cpp:2218] ["DeltaMergeStore:db_1.t_72 Give up merge delta, segment [1]"] [thread_id=25]
[2022/07/19 20:58:50.831 +08:00] [DEBUG] [DeltaMergeStore.cpp:2218] ["DeltaMergeStore:db_1.t_72 Give up merge delta, segment [1]"] [thread_id=25]
[2022/07/19 20:58:51.832 +08:00] [DEBUG] [DeltaMergeStore.cpp:2218] ["DeltaMergeStore:db_1.t_72 Give up merge delta, segment [1]"] [thread_id=25]
[2022/07/19 20:58:52.838 +08:00] [DEBUG] [DeltaMergeStore.cpp:2218] ["DeltaMergeStore:db_1.t_72 Give up merge delta, segment [1]"] [thread_id=25]
[2022/07/19 20:58:53.844 +08:00] [DEBUG] [DeltaMergeStore.cpp:2218] ["DeltaMergeStore:db_1.t_72 Give up merge delta, segment [1]"] [thread_id=25]

The CPU usage keeps around 100% (first 11s for split, next 10s for mergeDelta):

4.3  tiflash
21.4 tiflash  <--- split begin, mergeDelta begin
100.2 tiflash
103.1 tiflash
95.2  tiflash
89.1  tiflash
99.4  tiflash
100.8 tiflash
101.7 tiflash
103.2 tiflash
102.4 tiflash
97.8  tiflash  <--- split is finished
71.7  tiflash
102.8 tiflash
99.1  tiflash
102.6 tiflash
102.5 tiflash
95.5  tiflash
100.7 tiflash
105.1 tiflash
101.5 tiflash
103.4 tiflash
49.9  tiflash
4.2   tiflash

Note: As there is maximum 1s backoff, the CPU usage dropped for a short while when split was finished and the mergeDelta was not yet started.

  • No code

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Reduce unnecessary CPU usage in some edge cases

Signed-off-by: Wish <breezewish@outlook.com>
Signed-off-by: Wish <breezewish@outlook.com>
@ti-chi-bot
Copy link
Member

ti-chi-bot commented Jul 19, 2022

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • JaySon-Huang
  • flowbehappy

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Details

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@ti-chi-bot ti-chi-bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/needs-triage-completed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. needs-cherry-pick-release-5.4 Should cherry pick this PR to release-5.4 branch. needs-cherry-pick-release-6.1 Should cherry pick this PR to release-6.1 branch. needs-cherry-pick-release-6.2 and removed do-not-merge/needs-triage-completed labels Jul 19, 2022
@breezewish
Copy link
Member Author

/run-all-tests

@sre-bot
Copy link
Collaborator

sre-bot commented Jul 20, 2022

Coverage for changed files

Filename                                                                                                        Regions    Missed Regions     Cover   Functions  Missed Functions  Executed       Lines      Missed Lines     Cover    Branches   Missed Branches     Cover
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
/home/jenkins/agent/workspace/tiflash-build-common/tiflash/dbms/src/Storages/DeltaMerge/DeltaMergeStore.cpp        1471               531    63.90%          67                 5    92.54%        2066               498    75.90%         856               398    53.50%
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
TOTAL                                                                                                              1471               531    63.90%          67                 5    92.54%        2066               498    75.90%         856               398    53.50%

Coverage summary

Functions  MissedFunctions  Executed  Lines   MissedLines  Cover
18625      9392             49.57%    212355  94880        55.32%

full coverage report (for internal network access only)

@ti-chi-bot ti-chi-bot added the status/LGT1 Indicates that a PR has LGTM 1. label Jul 20, 2022
Copy link
Contributor

@JaySon-Huang JaySon-Huang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ti-chi-bot ti-chi-bot added status/LGT2 Indicates that a PR has LGTM 2. and removed status/LGT1 Indicates that a PR has LGTM 1. labels Jul 20, 2022
@JaySon-Huang
Copy link
Contributor

/run-all-tests

@sre-bot
Copy link
Collaborator

sre-bot commented Jul 20, 2022

Coverage for changed files

Filename                                                                                                        Regions    Missed Regions     Cover   Functions  Missed Functions  Executed       Lines      Missed Lines     Cover    Branches   Missed Branches     Cover
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
/home/jenkins/agent/workspace/tiflash-build-common/tiflash/dbms/src/Storages/DeltaMerge/DeltaMergeStore.cpp        1471               531    63.90%          67                 5    92.54%        2066               498    75.90%         856               400    53.27%
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
TOTAL                                                                                                              1471               531    63.90%          67                 5    92.54%        2066               498    75.90%         856               400    53.27%

Coverage summary

Functions  MissedFunctions  Executed  Lines   MissedLines  Cover
18625      9392             49.57%    212355  94845        55.34%

full coverage report (for internal network access only)

@breezewish
Copy link
Member Author

/merge

@ti-chi-bot
Copy link
Member

@breezewish: It seems you want to merge this PR, I will help you trigger all the tests:

/run-all-tests

You only need to trigger /merge once, and if the CI test fails, you just re-trigger the test that failed and the bot will merge the PR for you after the CI passes.

If you have any questions about the PR merge process, please refer to pr process.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@ti-chi-bot
Copy link
Member

This pull request has been accepted and is ready to merge.

DetailsCommit hash: 68ee85f

@ti-chi-bot ti-chi-bot added the status/can-merge Indicates a PR has been approved by a committer. label Jul 20, 2022
@sre-bot
Copy link
Collaborator

sre-bot commented Jul 20, 2022

Coverage for changed files

Filename                                                                                                        Regions    Missed Regions     Cover   Functions  Missed Functions  Executed       Lines      Missed Lines     Cover    Branches   Missed Branches     Cover
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
/home/jenkins/agent/workspace/tiflash-build-common/tiflash/dbms/src/Storages/DeltaMerge/DeltaMergeStore.cpp        1471               531    63.90%          67                 5    92.54%        2066               498    75.90%         856               399    53.39%
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
TOTAL                                                                                                              1471               531    63.90%          67                 5    92.54%        2066               498    75.90%         856               399    53.39%

Coverage summary

Functions  MissedFunctions  Executed  Lines   MissedLines  Cover
18625      9392             49.57%    212355  94869        55.33%

full coverage report (for internal network access only)

@ti-chi-bot ti-chi-bot merged commit 9ccfc4b into master Jul 20, 2022
ti-chi-bot pushed a commit to ti-chi-bot/tiflash that referenced this pull request Jul 20, 2022
Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
@ti-chi-bot
Copy link
Member

In response to a cherrypick label: new pull request created: #5423.

ti-chi-bot pushed a commit to ti-chi-bot/tiflash that referenced this pull request Jul 20, 2022
Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
@ti-chi-bot
Copy link
Member

In response to a cherrypick label: new pull request created: #5424.

@ti-chi-bot
Copy link
Member

In response to a cherrypick label: cannot checkout release-6.2: error checking out release-6.2: exit status 1. output: error: pathspec 'release-6.2' did not match any file(s) known to git

@breezewish breezewish deleted the wenxuan/compact/idle_wait branch August 17, 2022 16:48
@breezewish
Copy link
Member Author

/run-cherry-picker

breezewish added a commit to ti-chi-bot/tiflash that referenced this pull request Aug 17, 2022
breezewish added a commit to ti-chi-bot/tiflash that referenced this pull request Aug 17, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

needs-cherry-pick-release-5.4 Should cherry pick this PR to release-5.4 branch. needs-cherry-pick-release-6.1 Should cherry pick this PR to release-6.1 branch. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. status/can-merge Indicates a PR has been approved by a committer. status/LGT2 Indicates that a PR has LGTM 2.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Concurrent flush lead to CPU waste

5 participants