ddl: recover to the correct partition from checkpoint #44024
ti-chi-bot[bot] merged 2 commits into pingcap:master
Conversation
[REVIEW NOTIFICATION] This pull request has been approved by:
To complete the pull request process, please ask the reviewers in the list to review. The full list of commands accepted by this bot can be found here. Reviewers can indicate their review by submitting an approval review.

/merge

This pull request has been accepted and is ready to merge. Commit hash: fc2e157

/retest

In response to a cherry-pick label: new pull request created to branch
Add test cases for the following previously uncovered scenarios:

1. TestIngestOwnerTransferEmptyPartition (pingcap#44265): tests that owner transfer with empty partitions ensures the checkpoint contains the partition ID.
2. TestIngestPartitionCheckpointRecovery (pingcap#43997/pingcap#44024): tests that the checkpoint correctly saves partition info for recovery.
3. TestIngestConcurrentJobCleanupRace (pingcap#44137/pingcap#44140): tests that parallel add-index jobs don't panic from a cleanup race.
4. TestIngestCancelCleanupOrder (pingcap#43323/pingcap#43326): tests that cancel during execution doesn't cause a nil pointer panic.
5. TestIngestGCSafepointBlocking (pingcap#40074/pingcap#40081): tests that add index uses the correct TS and blocks GC safepoint advancement.
6. TestCheckpointInstanceAddrValidation (pingcap#43983/pingcap#43957): tests that checkpoint instance address validation works correctly.
7. TestCheckpointPhysicalIDValidation: tests checkpoint physical table ID validation during recovery.

Co-Authored-By: Warp <agent@warp.dev>
What problem does this PR solve?
Issue Number: close #43997
Problem Summary:
The basic idea of the checkpoint is to recover the reorganization progress after a restart.
Note that we can only begin with partition 2, because the local checkpoint is lost when TiDB 1 crashes.
To represent which partition we should begin with, the reorg meta is used. The reorg meta contains a tuple: (partition ID or physical table ID, start key, end key). Every time TiDB restarts in the middle of adding an index, it tries to reset the reorg meta to the state exactly before the last global checkpoint.
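The tuple described above can be sketched as a small Go struct. This is a minimal illustration, not TiDB's actual definitions: `ReorgMeta`, `recoverFrom`, and the field names are hypothetical stand-ins for the real reorg meta and recovery logic.

```go
package main

import "fmt"

// ReorgMeta sketches the tuple persisted for an add-index reorg job:
// which physical table to backfill and the remaining key range.
type ReorgMeta struct {
	PhysicalID int64  // partition ID or physical table ID being backfilled
	StartKey   string // first key of the remaining range
	EndKey     string // last key of the remaining range
}

// recoverFrom returns the meta a restarted TiDB should resume from: the
// state exactly before the last global checkpoint. Any local (in-memory)
// progress is lost on crash, so only the global checkpoint can be trusted.
func recoverFrom(globalCheckpoint ReorgMeta) ReorgMeta {
	return globalCheckpoint
}

func main() {
	// Suppose the last global checkpoint recorded partition 2 at key "a"
	// before TiDB crashed; the restarted owner resumes exactly there.
	cp := ReorgMeta{PhysicalID: 2, StartKey: "a", EndKey: "z"}
	resume := recoverFrom(cp)
	fmt.Printf("resume backfill at partition %d from key %q\n", resume.PhysicalID, resume.StartKey)
}
```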
Previously, we stored the reorg meta in the checkpoint manager, but we did not distinguish the "local" reorg meta from the "global" reorg meta. When a partition completed, the reorg meta was updated immediately, so a restarted TiDB could reset to the wrong partition. As a result, the index data for some of the partitions was lost.
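The fix can be illustrated with a toy checkpoint manager that keeps the two metas separate, advancing the global one only after data is durably persisted. All names here (`checkpointManager`, `finishPartitionLocally`, `flushGlobal`) are hypothetical sketches, not TiDB's actual implementation.

```go
package main

import "fmt"

// checkpointManager keeps a local meta (advanced eagerly as workers make
// progress) separate from a global meta (advanced only after the ingested
// data is durably persisted). Recovery reads only the global meta.
type checkpointManager struct {
	localMeta  int64 // partition the backfill workers are currently on
	globalMeta int64 // partition a restarted owner may safely resume from
}

// finishPartitionLocally records local progress only; recovery must not
// observe this until the engine data has been imported.
func (m *checkpointManager) finishPartitionLocally(nextPartition int64) {
	m.localMeta = nextPartition
}

// flushGlobal is called after the ingested data is persisted; only now is
// it safe for recovery to skip the finished partition.
func (m *checkpointManager) flushGlobal() {
	m.globalMeta = m.localMeta
}

func main() {
	m := &checkpointManager{localMeta: 1, globalMeta: 1}
	m.finishPartitionLocally(2) // partition 1 done in memory only
	// If TiDB crashes here, recovery uses globalMeta and redoes partition 1,
	// instead of incorrectly skipping to partition 2 (the old bug).
	fmt.Println("recover from partition", m.globalMeta)
	m.flushGlobal()
	fmt.Println("recover from partition", m.globalMeta)
}
```

With the old behavior, `finishPartitionLocally` would have updated the recovery point directly, so a crash before the flush would skip a partition whose index data was never persisted.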
What is changed and how it works?
When mysql.ddl_reorg_meta is initialized, we also initialize the checkpoint (the global reorg meta is persisted in mysql.ddl_reorg_meta).
Check List
Tests
Side effects
Documentation
Release note
Please refer to the Release Notes Language Style Guide to write a quality release note.