
Fix the bug that duplicated page file block GC #2170

Merged
ti-srebot merged 7 commits into pingcap:master from JaySon-Huang:fix_page_file_duplicated on Jun 17, 2021

@JaySon-Huang (Contributor) commented Jun 16, 2021

What problem does this PR solve?

Issue Number: close #2169

Problem Summary:
In DataCompactor::migratePages, we avoid generating a PageFile that already exists, but we did not check whether a "Legacy" PageFile with the same ID and level already exists.
https://github.com/pingcap/tics/blob/74c69fb1d35da3582cb9279ecb4d8597e4a78d00/dbms/src/Storages/Page/gc/DataCompactor.cpp#L150-L158
https://github.com/pingcap/tics/blob/74c69fb1d35da3582cb9279ecb4d8597e4a78d00/dbms/src/Storages/Page/PageStorage.cpp#L1137-L1145

For example,

  1. We generate a PageFile "page_1000_1" for storing GC data
  2. The data in "page_1000_1" is later migrated to another file, and "page_1000_1" becomes "legacy.page_1000_1"
  3. If some old files are held by a snapshot for a long time, a later GC round may happen to generate a PageFile "page_1000_1" again, so "page_1000_1" and "legacy.page_1000_1" exist at the same time
  4. After the "page_1000_1" generated in step 3 becomes useless, we want to set it to "legacy" and remove its data, but "legacy.page_1000_1" already exists, so an exception is thrown and it stops us from GCing useless data
  5. Eventually, the TiFlash node fills up with data in "t_{table_id}/log" (almost 1 TiB in our case), which skews the load balance across TiFlash nodes
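The steps above can be replayed with a small in-memory model. This is only an illustrative sketch: `PageFileDir`, `createFormal`, `setLegacy`, and `replayDuplicateScenario` are hypothetical names, not TiFlash APIs, and the directory is modelled as a plain set of file names.

```cpp
#include <cassert>
#include <set>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the file names in one PageFile directory.
struct PageFileDir
{
    std::set<std::string> names;

    // Create a Formal file "page_<id>_<level>". Like the behaviour described
    // above, this only checks the Formal name, not the Legacy one.
    bool createFormal(const std::string & id_level)
    {
        return names.insert("page_" + id_level).second;
    }

    // Turn a Formal file into its Legacy form; throws if the Legacy name is taken.
    void setLegacy(const std::string & id_level)
    {
        assert(names.count("page_" + id_level) == 1);
        if (names.count("legacy.page_" + id_level) > 0)
            throw std::runtime_error("legacy.page_" + id_level + " already exists");
        names.erase("page_" + id_level);
        names.insert("legacy.page_" + id_level);
    }
};

// Replays steps 1 to 4 and reports whether step 4 is blocked by an exception.
inline bool replayDuplicateScenario()
{
    PageFileDir dir;
    dir.createFormal("1000_1"); // step 1: GC data file is generated
    dir.setLegacy("1000_1");    // step 2: it becomes legacy.page_1000_1
    dir.createFormal("1000_1"); // step 3: succeeds, only the Formal name was checked
    try
    {
        dir.setLegacy("1000_1"); // step 4: legacy name is already taken
    }
    catch (const std::runtime_error &)
    {
        return true; // GC of this file is blocked
    }
    return false;
}
```

In this model, `replayDuplicateScenario()` reports that step 4 is blocked, which mirrors the exception described in step 4.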

What is changed and how it works?

Before generating a PageFile for GC data, check whether a PageFile with the same <id, level> already exists in either the Formal or the Legacy state
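A minimal sketch of that check, assuming the existing PageFiles can be listed as <id, level, type> triples. `FileType`, `ExistingFiles`, `existsAsFormalOrLegacy`, and `canGenerateGcFile` are hypothetical names used for illustration; the real change lives in DataCompactor::migratePages and may be structured differently.

```cpp
#include <cstdint>
#include <set>
#include <tuple>
#include <utility>

// Only the two states named in this PR are modelled here.
enum class FileType { Formal, Legacy };

using FileIdAndLevel = std::pair<uint64_t, uint32_t>;
using ExistingFiles = std::set<std::tuple<uint64_t, uint32_t, FileType>>;

// True if a PageFile with this <id, level> exists as either Formal or Legacy.
inline bool existsAsFormalOrLegacy(const ExistingFiles & files, const FileIdAndLevel & id_lvl)
{
    return files.count(std::make_tuple(id_lvl.first, id_lvl.second, FileType::Formal)) > 0
        || files.count(std::make_tuple(id_lvl.first, id_lvl.second, FileType::Legacy)) > 0;
}

// A GC target may only be generated when neither form is present.
inline bool canGenerateGcFile(const ExistingFiles & files, const FileIdAndLevel & target)
{
    return !existsAsFormalOrLegacy(files, target);
}
```

What the caller does when the check fails (for example, skipping the migration for this round) is not specified here and is left to the actual implementation.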

Related changes

  • Need to cherry-pick to the release branch: 5.1, 5.0, 4.0

Check List

Tests

  • Unit test

Side effects

Release note

  • Fix the bug that TiFlash cannot GC delta data in rare cases

@JaySon-Huang JaySon-Huang self-assigned this Jun 16, 2021
@JaySon-Huang JaySon-Huang added the labels needs-cherry-pick-release-4.0, needs-cherry-pick-release-5.0, needs-cherry-pick-release-5.1, and type/bugfix on Jun 16, 2021
@JaySon-Huang
Contributor Author

/run-all-tests

@flowbehappy
Contributor

Why can "Maybe some old files are held by snapshot for a long time" cause us to "generate a PageFile "page_1000_1" again"?

Signed-off-by: JaySon-Huang <jayson.hjs@gmail.com>
@JaySon-Huang JaySon-Huang force-pushed the fix_page_file_duplicated branch from f2908a6 to 9c23003 Compare June 17, 2021 03:18
Comment on lines +150 to +167
// In case those files are held by a snapshot and we run migratePages to the same `migrate_file_id` again, we need to check
// whether gc_file (and its legacy file) already exists.
//
// For example:
// First round:
//   PageFile_998_0, PageFile_999_0, PageFile_1000_0
//      ^                                ^
//      └────────────────────────────────┘
//   Only PageFile_998_0 and PageFile_1000_0 are picked as candidates, so this round generates PageFile_1000_1 for
//   storing GC data.
//
// Second round:
//   PageFile_998_0, PageFile_999_0, PageFile_1000_0
//      ^                ^               ^
//      └────────────────┵───────────────┘
//   Somehow PageFile_1000_0 does not get deleted (maybe a snapshot still needs to read Pages inside it) and
//   we start a new round of GC. PageFile_998_0 (again), PageFile_999_0 (new), and PageFile_1000_0 (again) are picked as
//   candidates, and 1000_0 is the largest file_id.
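The two rounds in the comment above can be sketched as follows. The rule used here (GC data goes to the largest candidate file_id with its level plus one) is an assumption read off the example, and `pickGcTarget` is a hypothetical helper, not a function from DataCompactor.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

using FileIdAndLevel = std::pair<uint64_t, uint32_t>;

// Assumed rule: the GC target is <largest candidate file_id, its level + 1>.
inline FileIdAndLevel pickGcTarget(const std::vector<FileIdAndLevel> & candidates)
{
    assert(!candidates.empty());
    // Pairs compare by file_id first, then by level.
    const FileIdAndLevel largest = *std::max_element(candidates.begin(), candidates.end());
    return FileIdAndLevel(largest.first, largest.second + 1);
}
```

Under this assumption, both the first-round candidates {998_0, 1000_0} and the second-round candidates {998_0, 999_0, 1000_0} map to the same target <1000, 1>, which is why the second round can try to generate PageFile_1000_1 again.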
@JaySon-Huang (Contributor Author) commented Jun 17, 2021

Why can "Maybe some old files are held by snapshot for a long time" cause us to "generate a PageFile "page_1000_1" again"?

I've added an example for it.

Adding PageFiles with no valid pages to the candidates is useless, but I still want to log them in DataCompactor::logMigrationDetails so that we can easily spot the unexpected situation where some PageFiles are not actually removed for several GC rounds.
Maybe we can refactor DataCompactor::selectCandidateFiles to avoid adding PageFiles with no valid pages to the candidates, but let's do that in a separate PR.
@flowbehappy

@JaySon-Huang
Contributor Author

/run-all-tests

@flowbehappy flowbehappy self-requested a review June 17, 2021 06:39
@flowbehappy (Contributor) left a comment

LGTM

@ti-srebot ti-srebot added the status/LGT1 label (indicates that a PR has LGTM 1) on Jun 17, 2021
@JaySon-Huang
Contributor Author

/merge

@ti-srebot ti-srebot added the status/can-merge label (indicates a PR has been approved by a committer) on Jun 17, 2021
@ti-srebot
Collaborator

/run-all-tests

@ti-srebot
Collaborator

cherry pick to release-4.0 in PR #2183

@ti-srebot
Collaborator

cherry pick to release-5.0 in PR #2185

@ti-srebot
Collaborator

cherry pick to release-5.1 in PR #2186

@JaySon-Huang JaySon-Huang deleted the fix_page_file_duplicated branch June 17, 2021 07:44
JaySon-Huang added a commit that referenced this pull request Jun 18, 2021
Signed-off-by: ti-srebot <ti-srebot@pingcap.com>

Co-authored-by: JaySon <tshent@qq.com>
JaySon-Huang pushed a commit that referenced this pull request Jul 6, 2021
* cherry pick #2170 to release-4.0

Signed-off-by: ti-srebot <ti-srebot@pingcap.com>

Linked issue (#2169): Duplicate PageFile make TiFlash can not GC delta data