
ddl: Fix the physically drop storage instance may block removing regions (#9442)#9811

Merged
ti-chi-bot[bot] merged 2 commits into pingcap:release-6.5 from ti-chi-bot:cherry-pick-9442-to-release-6.5 on Jan 22, 2025

Conversation

@ti-chi-bot (Member)

This is an automated cherry-pick of #9442

What problem does this PR solve?

Issue Number: close #9437

Problem Summary:

After a table is dropped in TiDB and the drop exceeds the GC safepoint, TiFlash generates an InterpreterDropQuery to physically remove the data from the TiFlash instance. The "DropQuery" calls DeltaMergeStore::dropAllSegments to remove the segments from DeltaTree storage one by one. Because the table has many segments, running DeltaMergeStore::dropAllSegments is slow.
https://github.com/pingcap/tiflash/blob/v7.1.3/dbms/src/TiDB/Schema/SchemaSyncService.cpp#L216-L227
https://github.com/pingcap/tiflash/blob/v7.1.3/dbms/src/Storages/DeltaMerge/DeltaMergeStore.cpp#L342-L349
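The one-by-one removal can be sketched roughly as follows (illustrative names, not the real TiFlash code): the drop walks the segment chain and deletes one segment per iteration, so a table with ~30,000 segments keeps the drop thread busy for the whole walk.

```cpp
// Hypothetical sketch of the segment-by-segment drop loop. SegmentMap and
// dropAllSegments are toy stand-ins, not the real DeltaMergeStore API.
#include <cassert>
#include <iterator>
#include <map>

using SegmentMap = std::map<int, int>; // toy stand-in: range_end -> segment_id

int dropAllSegments(SegmentMap & segments)
{
    int dropped = 0;
    // Drop from the last segment backwards, one at a time, mirroring the
    // "Finish segment drop its next segment" log lines in the report.
    while (!segments.empty())
    {
        segments.erase(std::prev(segments.end()));
        ++dropped;
    }
    return dropped;
}
```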

[2024/09/16 21:29:29.604 +00:00] [INFO] [SchemaSyncService.cpp:170] ["Performing GC using safe point 452597098965106688"] [source="keyspace=4294967295"] [thread_id=4314]
[2024/09/16 21:29:29.610 +00:00] [INFO] [SchemaSyncService.cpp:216] ["Physically dropping table test_db(694).test_tbl(563080)"] [source="keyspace=4294967295"] [thread_id=4314]
[2024/09/16 21:29:30.446 +00:00] [INFO] [DeltaMergeStore.cpp:413] ["Clear DeltaMerge segments data"] [source="keyspace_id=4294967295 table_id=563080"] [thread_id=4314]
[2024/09/16 21:29:30.465 +00:00] [INFO] [Segment.cpp:2004] ["Finish segment drop its next segment, segment=<segment_id=292469 epoch=2 range=[?,?) next_segment_id=292472 delta_rows=0 delta_bytes=0 delta_deletes=0 stable_file=dmf_1174749 stable_rows=98304 stable_bytes=576748236 dmf_rows=98304 dmf_bytes=576748236 dmf_packs=12>"] [source="keyspace_id=4294967295 table_id=563080 segment_id=292469 epoch=2"] [thread_id=4314]
[2024/09/16 21:29:30.553 +00:00] [INFO] [Segment.cpp:2004] ["Finish segment drop its next segment, segment=<segment_id=292223 epoch=3 range=[?,?) next_segment_id=292469 delta_rows=0 delta_bytes=0 delta_deletes=0 stable_file=dmf_1205953 stable_rows=86440 stable_bytes=507435226 dmf_rows=86440 dmf_bytes=507435226 dmf_packs=11>"] [source="keyspace_id=4294967295 table_id=563080 segment_id=292223 epoch=3"] [thread_id=4314]
...

However, DeltaMergeStore::dropAllSegments does not check that all Regions have been removed from TiFlash before the physical drop is performed, so Region removal actions can run concurrently with it.
All raftstore threads that try to remove a Region run into removeObsoleteDataInStorage, which calls DeltaMergeStore::flushCache and tries to acquire a read latch on table_id=563080. In the end, all raftstore threads are blocked by the thread executing DeltaMergeStore::dropAllSegments.
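A minimal sketch of the contention, assuming the per-table latch behaves like a std::shared_mutex (the real TiFlash latch API differs): while the drop thread holds the write latch, every flushCache call needing the read latch must wait. Here try_lock_shared makes the blocking observable without spawning threads.

```cpp
// TableLatch, beginPhysicalDrop, and tryFlushCache are illustrative names.
#include <cassert>
#include <shared_mutex>

struct TableLatch
{
    std::shared_mutex mtx; // stand-in for the per-table latch
};

// dropAllSegments path: hold the write latch for the whole physical drop.
void beginPhysicalDrop(TableLatch & latch) { latch.mtx.lock(); }
void endPhysicalDrop(TableLatch & latch) { latch.mtx.unlock(); }

// flushCache path (raftstore threads): returns false where the real code
// would block waiting for the read latch.
bool tryFlushCache(TableLatch & latch)
{
    if (!latch.mtx.try_lock_shared())
        return false;
    latch.mtx.unlock_shared();
    return true;
}
```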


Meanwhile, as more Raft messages come into the TiFlash instance, the memory usage grows and causes OOM kills. After restarting, the TiFlash instance runs into the same blocking again. Eventually, all the segments (around 30,000 in total) are removed from TiFlash, and TiFlash begins to catch up on the Raft messages.

[2024/09/16 23:25:37.681 +00:00] [INFO] [DeltaMergeStore.cpp:413] ["Clear DeltaMerge segments data"] [source="keyspace_id=4294967295 table_id=563080"] [thread_id=4143]
...
[2024/09/16 23:46:11.570 +00:00] [INFO] [Segment.cpp:2004] ["Finish segment drop its next segment, segment=<segment_id=281216 epoch=2 range=[?,?) next_segment_id=281738 delta_rows=0 delta_bytes=0 delta_deletes=1 stable_file=dmf_1205957 stable_rows=59514 stable_bytes=348855913 dmf_rows=59514 dmf_bytes=348855913 dmf_packs=8>"] [source="keyspace_id=4294967295 table_id=563080 segment_id=281216 epoch=2"] [thread_id=4143]
[2024/09/16 23:46:11.714 +00:00] [INFO] [DeltaMergeStore.cpp:419] ["Clear DeltaMerge segments data done"] [source="keyspace_id=4294967295 table_id=563080"] [thread_id=4143]


What is changed and how it works?

ddl: Fix the issue that physically dropping a storage instance may block removing regions
Ensure the physical drop of a storage instance happens only after all related regions have been removed
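The new ordering can be sketched as follows (hedged, with illustrative names; `regions_for_table` and the function names are not the real TiFlash API): the GC routine issues the physical drop only once no region of the table remains, and otherwise skips this GC round and retries later.

```cpp
// Hypothetical sketch of deferring the physical drop until raftstore has
// removed all regions of the dropped table.
#include <cassert>

struct KVStoreStub
{
    int regions_for_table; // regions still mapped to the dropped table
};

// Returns true if the physical drop was performed; false means "skip this GC
// round", so the heavy dropAllSegments never races with region removal.
bool tryPhysicallyDropTable(const KVStoreStub & kvstore, bool & dropped)
{
    if (kvstore.regions_for_table > 0)
        return false; // raftstore threads may still need flushCache
    dropped = true;   // safe now: nothing can contend on the table latch
    return true;
}
```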

Pick PRs:

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Fix the issue that TiFlash may OOM after running `DROP TABLE` on a table with a large amount of rows

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
@ti-chi-bot ti-chi-bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. type/cherry-pick-for-release-6.5 This PR is cherry-picked to release-6.5 from a source PR. labels Jan 22, 2025
Signed-off-by: JaySon-Huang <tshent@qq.com>
@JaySon-Huang JaySon-Huang force-pushed the cherry-pick-9442-to-release-6.5 branch from 4de1448 to 9afabdf Compare January 22, 2025 09:23
@ti-chi-bot ti-chi-bot bot added needs-1-more-lgtm Indicates a PR needs 1 more LGTM. approved labels Jan 22, 2025
@ti-chi-bot ti-chi-bot bot added the lgtm label Jan 22, 2025
@ti-chi-bot (Contributor) commented Jan 22, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull request has been approved by: JaySon-Huang, Lloyd-Pottiger

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [JaySon-Huang,Lloyd-Pottiger]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot removed the needs-1-more-lgtm Indicates a PR needs 1 more LGTM. label Jan 22, 2025
@ti-chi-bot (Contributor) commented Jan 22, 2025

[LGTM Timeline notifier]

Timeline:

  • 2025-01-22 09:51:20.274987796 +0000 UTC m=+260807.605907200: ☑️ agreed by JaySon-Huang.
  • 2025-01-22 10:36:13.171343443 +0000 UTC m=+263500.502262844: ☑️ agreed by Lloyd-Pottiger.


Labels

approved cherry-pick-approved Cherry pick PR approved by release team. lgtm release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. type/cherry-pick-for-release-6.5 This PR is cherry-picked to release-6.5 from a source PR.
