ddl: recover to the correct partition from checkpoint by tangenta · Pull Request #44024 · pingcap/tidb

tangenta · 2023-05-19T14:46:01Z

What problem does this PR solve?

Issue Number: close #43997

Problem Summary:

The basic idea of checkpoint is to recover the progress:

| add index for partition 1
|  [ local checkpoint ]
| add index for partition 2
|  [ local checkpoint ]
|  [ global checkpoint ]
| ...
| add index for partition k
|  [ local checkpoint ]
| add index for partition k+1
v  (TiDB 1) crash
  | (TiDB 2) get DDL owner
  | add index for partition 2
  | ...

Note that we can only begin with partition 2 because the local checkpoint is lost when TiDB 1 crashes.
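
The timeline above can be modelled with a small Go sketch. Every name in it (`owner`, `store`, `run`, `resumeFrom`) is hypothetical and not part of TiDB. It assumes only that local checkpoints live in the owner's memory, and that a global checkpoint persists the partition being processed when it is flushed.

```go
package main

import "fmt"

// owner models the DDL owner's in-memory view of add-index progress.
// All names here are illustrative, not TiDB identifiers.
type owner struct {
	localPartition int // in-memory only; lost when the owner crashes
}

// store models persistent storage visible to every TiDB instance.
type store struct {
	globalPartition int // partition recorded by the last global checkpoint
}

// run processes partitions first..last, taking a local checkpoint after each
// partition and flushing a global checkpoint whenever flushAt reports true.
func (o *owner) run(s *store, first, last int, flushAt func(p int) bool) {
	for p := first; p <= last; p++ {
		o.localPartition = p // local checkpoint
		if flushAt(p) {
			s.globalPartition = o.localPartition // global checkpoint
		}
	}
}

// resumeFrom returns the partition a new owner must start with: only the
// persisted global checkpoint is visible to it.
func resumeFrom(s *store) int {
	return s.globalPartition
}

func main() {
	s := &store{globalPartition: 1}
	tidb1 := &owner{}
	// TiDB 1 works through partitions 1..5, but only the progress at
	// partition 2 reaches the global checkpoint before the crash.
	tidb1.run(s, 1, 5, func(p int) bool { return p == 2 })

	// After the crash, tidb1.localPartition is unreachable for TiDB 2.
	fmt.Println("TiDB 2 resumes from partition", resumeFrom(s))
}
```

In this model the new owner redoes everything after the last global checkpoint, which is why the work on partitions 3 to 5 has to be repeated.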

To record which partition we should begin with, the reorg meta is used. The reorg meta contains a tuple: (partition ID or physical table ID, start key, end key). Every time TiDB restarts in the middle of adding an index, it tries to reset the reorg meta to the state exactly before the last global checkpoint.

Previously, we stored the reorg meta in the checkpoint manager. However, we did not distinguish the "local" reorg meta from the "global" reorg meta. When a partition was complete, the reorg meta was updated immediately, so a new TiDB owner could reset to the wrong partition. As a result, the index data from some of the partitions was lost.
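
A simplified model of the bug, written as a hedged sketch rather than the actual code in `ddl/ingest/checkpoint.go`: the `reorgMeta` struct mirrors the tuple described above, while `checkpointManager`, `finishPartition`, `flushGlobal`, `recoverBuggy` and `recoverFixed` are hypothetical names chosen for illustration. The partition IDs and keys are made up.

```go
package main

import "fmt"

// reorgMeta mirrors the tuple from the PR description:
// (partition ID or physical table ID, start key, end key).
type reorgMeta struct {
	PhysicalID int64
	StartKey   string
	EndKey     string
}

// checkpointManager is a hypothetical stand-in for the real manager.
type checkpointManager struct {
	local  reorgMeta // advanced as soon as a partition completes
	global reorgMeta // advanced only when a global checkpoint is persisted
}

// finishPartition records that work has moved on to the next partition.
// Only the local copy changes here.
func (m *checkpointManager) finishPartition(next reorgMeta) {
	m.local = next
}

// flushGlobal persists the current local state as the global checkpoint.
func (m *checkpointManager) flushGlobal() {
	m.global = m.local
}

// recoverBuggy models the old behaviour: a single undifferentiated reorg
// meta, so the new owner sees whatever was written last.
func (m *checkpointManager) recoverBuggy() reorgMeta { return m.local }

// recoverFixed returns the state as of the last global checkpoint.
func (m *checkpointManager) recoverFixed() reorgMeta { return m.global }

func main() {
	p2 := reorgMeta{PhysicalID: 102, StartKey: "a", EndKey: "z"}
	p3 := reorgMeta{PhysicalID: 103, StartKey: "a", EndKey: "z"}

	m := &checkpointManager{local: p2}
	m.flushGlobal()       // global checkpoint taken while on partition 2
	m.finishPartition(p3) // partition 2 completes; no global checkpoint yet

	fmt.Println("buggy recovery starts at:", m.recoverBuggy().PhysicalID)
	fmt.Println("fixed recovery starts at:", m.recoverFixed().PhysicalID)
}
```

With a single shared reorg meta, the new owner would skip partition 102 even though its unflushed local progress died with the old owner. Keeping the two copies separate makes recovery fall back to the persisted position.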

What is changed and how it works?

Distinguish the "local" reorg meta from the "global" one.
When mysql.ddl_reorg_meta is initialized, we also initialize the checkpoint.
Move the creation of the checkpoint manager to a proper place (which needs the info from mysql.ddl_reorg_meta).
After the checkpoint manager is created, we try to recover the global checkpoint and overwrite the reorg info.
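
The ordering in the list above can be sketched as follows. This is an illustration under stated assumptions, not the code of this PR: an in-memory map (`reorgTable`) stands in for `mysql.ddl_reorg_meta`, and `initReorgMeta`, `newCheckpointManager`, `recoverGlobal` and `reorgInfo` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

// checkpoint is a hypothetical persisted global checkpoint.
type checkpoint struct {
	PhysicalID int64
	StartKey   string
}

// reorgRow stands in for one row of mysql.ddl_reorg_meta, keyed by job ID.
type reorgRow struct {
	PhysicalID int64
	StartKey   string
	Checkpoint *checkpoint // initialized together with the row
}

// reorgTable is an in-memory substitute for the system table.
type reorgTable map[int64]*reorgRow

// initReorgMeta creates the row and its initial checkpoint in one step.
func (t reorgTable) initReorgMeta(jobID, physicalID int64, startKey string) {
	t[jobID] = &reorgRow{
		PhysicalID: physicalID,
		StartKey:   startKey,
		Checkpoint: &checkpoint{PhysicalID: physicalID, StartKey: startKey},
	}
}

// checkpointManager can only be built once the row exists.
type checkpointManager struct{ row *reorgRow }

func newCheckpointManager(t reorgTable, jobID int64) (*checkpointManager, error) {
	row, ok := t[jobID]
	if !ok {
		return nil, errors.New("reorg meta not initialized")
	}
	return &checkpointManager{row: row}, nil
}

// reorgInfo is the position the backfill would start from.
type reorgInfo struct {
	PhysicalID int64
	StartKey   string
}

// recoverGlobal overwrites info with the last global checkpoint, if any.
func (m *checkpointManager) recoverGlobal(info *reorgInfo) {
	if cp := m.row.Checkpoint; cp != nil {
		info.PhysicalID, info.StartKey = cp.PhysicalID, cp.StartKey
	}
}

func main() {
	table := reorgTable{}
	table.initReorgMeta(1, 102, "k0")

	// A stale reorg info that already points at a later partition.
	info := &reorgInfo{PhysicalID: 105, StartKey: "k9"}

	mgr, err := newCheckpointManager(table, 1)
	if err != nil {
		panic(err)
	}
	mgr.recoverGlobal(info)
	fmt.Printf("resume at partition %d from key %q\n", info.PhysicalID, info.StartKey)
}
```

Returning an error when the row is missing makes the dependency explicit: the manager cannot be created before the reorg meta exists, which is the reason for moving its creation.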

Check List

Tests

Unit test
Integration test

Manual test (add detailed scripts or steps below)

1. Start two TiDB instances.
2. Prepare a table with a lot of partitions.
3. Add an index.
4. Kill the TiDB owner.
5. Check whether the other TiDB resets the reorg meta to the correct partition ID.

No code

Side effects

Performance regression: Consumes more CPU
Performance regression: Consumes more Memory
Breaking backward compatibility

Documentation

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

ti-chi-bot · 2023-05-19T14:46:03Z

[REVIEW NOTIFICATION]

This pull request has been approved by:

Benjamin2037
zimulala

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Details

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

Benjamin2037

LGTM

ddl/job_table.go

ddl/reorg.go

ddl/ingest/checkpoint.go

ddl/job_table.go

zimulala

LGTM

tangenta · 2023-05-22T07:25:12Z

/merge

ti-chi-bot · 2023-05-22T07:25:16Z

This pull request has been accepted and is ready to merge.

Details

Commit hash: fc2e157

tangenta · 2023-05-22T08:06:50Z

/retest

ti-chi-bot · 2023-05-22T08:16:22Z

In response to a cherrypick label: new pull request created to branch release-7.1: #44050.

close pingcap#43997

close #43997

Add test cases for the following previously uncovered scenarios:

1. TestIngestOwnerTransferEmptyPartition (pingcap#44265): Tests owner transfer with empty partitions ensures checkpoint contains partition ID.
2. TestIngestPartitionCheckpointRecovery (pingcap#43997/pingcap#44024): Tests that checkpoint correctly saves partition info for recovery.
3. TestIngestConcurrentJobCleanupRace (pingcap#44137/pingcap#44140): Tests parallel add index jobs don't cause panic from cleanup race.
4. TestIngestCancelCleanupOrder (pingcap#43323/pingcap#43326): Tests cancel during execution doesn't cause nil pointer panic.
5. TestIngestGCSafepointBlocking (pingcap#40074/pingcap#40081): Tests add index uses correct TS and blocks GC safepoint advancement.
6. TestCheckpointInstanceAddrValidation (pingcap#43983/pingcap#43957): Tests checkpoint instance address validation works correctly.
7. TestCheckpointPhysicalIDValidation: Tests checkpoint physical table ID validation during recovery.

Co-Authored-By: Warp <agent@warp.dev>

ddl: recover to the correct partition from checkpoint

311a10b

Benjamin2037 approved these changes May 19, 2023

ti-chi-bot bot added the status/LGT1 Indicates that a PR has LGTM 1. label May 19, 2023

zimulala reviewed May 22, 2023

ddl/job_table.go

wjhuang2016 reviewed May 22, 2023

ddl/reorg.go

address comment

fc2e157

zimulala reviewed May 22, 2023

ddl/ingest/checkpoint.go

ddl/job_table.go

zimulala approved these changes May 22, 2023

ti-chi-bot bot added status/LGT2 Indicates that a PR has LGTM 2. and removed status/LGT1 Indicates that a PR has LGTM 1. labels May 22, 2023

wjhuang2016 approved these changes May 22, 2023

ti-chi-bot bot added the status/can-merge Indicates a PR has been approved by a committer. label May 22, 2023

ti-chi-bot bot merged commit 5652f2c into pingcap:master May 22, 2023

ti-chi-bot mentioned this pull request May 22, 2023

ddl: recover to the correct partition from checkpoint (#44024) #44050

Merged

12 tasks

tangenta added a commit to ti-chi-bot/tidb that referenced this pull request May 22, 2023

ddl: recover to the correct partition from checkpoint (pingcap#44024)

8c88b4c

close pingcap#43997

ti-chi-bot bot pushed a commit that referenced this pull request May 22, 2023

ddl: recover to the correct partition from checkpoint (#44024) (#44050)

afebf8a

close #43997

ti-chi-bot[bot] merged 2 commits into pingcap:master from tangenta:add-index-restore-correct-partition

5 participants
