Ignore non-continuous sequence number among PageFile meta#2312

Merged
ti-srebot merged 4 commits into pingcap:master from JaySon-Huang:fix_ps_gc on Jul 6, 2021

Conversation

@JaySon-Huang
Contributor

@JaySon-Huang JaySon-Huang commented Jul 2, 2021

Signed-off-by: JaySon-Huang jayson.hjs@gmail.com

What problem does this PR solve?

close #2317
Related PRs: #2187, #1552

  1. If some pages are migrated into a PageFile and are then not updated or deleted for a long time, they block the compaction of legacy files.
  2. A crash in the middle of a write may leave sequence holes among PageFile meta, which blocks GC from running normally.

Here is an example log from a case we encountered. After the data part is compacted, 22854 PageFiles are still left on disk, and 22838 of them are legacy PageFiles. Those legacy files are not compacted in the next round either.

[2021/06/30 04:31:18.934 +08:00] [INFO] [<unknown>] ["PageStorage: db_45.t_154.log restore 0 puts and 11868 refs and 0 deletes and 11867 upserts from checkpoint PageFile_29670_0 sequence: 29391228"] [thread_id=5147]
[2021/06/30 04:31:18.936 +08:00] [DEBUG] [<unknown>] ["PageStorage: db_45.t_154.log collectPageFilesToCompact stop on PageFile_29674_0, type: Legacy, sequence: 29391230 last sequence: 29391228"] [thread_id=5147]
[2021/06/30 04:31:26.260 +08:00] [DEBUG] [<unknown>] ["PageStorage: db_45.t_154.log LegacyCompactor::tryCompact exit without compaction, candidates size: 0, compact_legacy_min_num: 3"] [thread_id=5147]
[2021/06/30 04:31:26.291 +08:00] [DEBUG] [<unknown>] ["PageStorage: db_45.t_154.log PageFile_49627_0 is full, create new PageFile_49631_0 for write [path=/data1/tidb-data/tiflash-9000/data/t_154/log]"] [thread_id=6565]
[2021/06/30 04:31:26.663 +08:00] [INFO] [<unknown>] ["PageStorage: db_45.t_154.log GC decide to migrate 7 files, containing 14479 pages to PageFile_49619_1, path /data1/tidb-data/tiflash-9000/data/t_154/log"] [thread_id=5147]
[2021/06/30 04:31:32.237 +08:00] [DEBUG] [<unknown>] ["PageStorage: db_45.t_154.log Migrate pages to PageFile_49619_1, migrate: [((49602,1),13353),((49614,1),840),((49615,0),51),((49616,0),61),((49617,0),51),((49618,0),53),((49619,0),70),], remove: [], Config{ PageStorage::Config {gc_min_files:3, gc_min_bytes:67108864, gc_max_valid_rate:0.950, gc_min_legacy_num:3, gc_max_expect_legacy: 100, gc_max_valid_rate_bound: 0.950, prob_do_gc_when_write_is_low:100, open_file_max_idle_time:15} }"] [thread_id=5147]
[2021/06/30 04:31:32.237 +08:00] [INFO] [<unknown>] ["PageStorage: db_45.t_154.log GC have migrated 14479 Pages to PageFile_49619_1"] [thread_id=5147]
[2021/06/30 04:31:32.283 +08:00] [DEBUG] [<unknown>] ["PageStorage: db_45.t_154.log gcApply remove 20 invalid snapshots, 2 snapshots left, longest lifetime 0.061 seconds, created from thread_id 6548"] [thread_id=5147]
[2021/06/30 04:31:32.353 +08:00] [INFO] [<unknown>] ["PageStorage: db_45.t_154.log GC exit within 113.44 sec. PageFiles from [29670,0,Checkpoint] to [49619,0,Formal], num files: 22845, num legacy:22838, compact legacy archive files: 0, remove data files: 0, gc apply: 0 puts and 0 refs and 0 deletes and 14479 upserts"] [thread_id=5147]

[2021/06/30 04:33:26.014 +08:00] [INFO] [<unknown>] ["PageStorage: db_45.t_154.log restore 0 puts and 11868 refs and 0 deletes and 11867 upserts from checkpoint PageFile_29670_0 sequence: 29391228"] [thread_id=6370]
[2021/06/30 04:33:26.017 +08:00] [DEBUG] [<unknown>] ["PageStorage: db_45.t_154.log collectPageFilesToCompact stop on PageFile_29674_0, type: Legacy, sequence: 29391230 last sequence: 29391228"] [thread_id=6370]
[2021/06/30 04:33:33.326 +08:00] [DEBUG] [<unknown>] ["PageStorage: db_45.t_154.log LegacyCompactor::tryCompact exit without compaction, candidates size: 0, compact_legacy_min_num: 3"] [thread_id=6370]
[2021/06/30 04:33:33.963 +08:00] [INFO] [<unknown>] ["PageStorage: db_45.t_154.log GC decide to migrate 16 files, containing 14182 pages to PageFile_49627_1, path /data1/tidb-data/tiflash-9000/data/t_154/log"] [thread_id=6370]
[2021/06/30 04:33:39.122 +08:00] [DEBUG] [<unknown>] ["PageStorage: db_45.t_154.log Migrate pages to PageFile_49627_1, migrate: [((49619,1),13690),((49620,0),50),((49621,0),47),((49622,0),55),((49623,0),61),((49624,0),68),((49625,0),68),((49626,0),68),((49627,0),75),], remove: [(49602,1),(49614,1),(49615,0),(49616,0),(49617,0),(49618,0),(49619,0),], Config{ PageStorage::Config {gc_min_files:3, gc_min_bytes:67108864, gc_max_valid_rate:0.950, gc_min_legacy_num:3, gc_max_expect_legacy: 100, gc_max_valid_rate_bound: 0.950, prob_do_gc_when_write_is_low:100, open_file_max_idle_time:15} }"] [thread_id=6370]
[2021/06/30 04:33:39.122 +08:00] [INFO] [<unknown>] ["PageStorage: db_45.t_154.log GC have migrated 14182 Pages to PageFile_49627_1"] [thread_id=6370]
[2021/06/30 04:33:39.156 +08:00] [DEBUG] [<unknown>] ["PageStorage: db_45.t_154.log gcApply remove 218 invalid snapshots, 2 snapshots left, longest lifetime 6.971 seconds, created from thread_id 6677"] [thread_id=6370]
[2021/06/30 04:33:39.580 +08:00] [INFO] [<unknown>] ["PageStorage: db_45.t_154.log GC exit within 116.03 sec. PageFiles from [29670,0,Checkpoint] to [49627,0,Formal], num files: 22854, num legacy:22838, compact legacy archive files: 0, remove data files: 7, gc apply: 0 puts and 0 refs and 0 deletes and 14182 upserts"] [thread_id=6370]
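When triaging logs like the ones above, it can help to pull the file counts out of the "GC exit" lines. The following is only a minimal sketch: the helper name `parse_gc_exit` is hypothetical, and the regular expression is based solely on the line format shown in this log, which may differ in other TiFlash versions.

```python
import re

# Matches the counters in a "GC exit" line as it appears in the log above.
# The pattern is an assumption derived from this excerpt only.
GC_EXIT_RE = re.compile(
    r"GC exit within (?P<seconds>[\d.]+) sec\..*?"
    r"num files: (?P<num_files>\d+), num legacy:\s*(?P<num_legacy>\d+)"
)


def parse_gc_exit(line):
    """Return a dict of counters from a 'GC exit' log line, or None if no match."""
    match = GC_EXIT_RE.search(line)
    if match is None:
        return None
    return {
        "seconds": float(match.group("seconds")),
        "num_files": int(match.group("num_files")),
        "num_legacy": int(match.group("num_legacy")),
    }


# Sample taken from the second GC round in the log above (shortened).
SAMPLE = (
    "PageStorage: db_45.t_154.log GC exit within 116.03 sec. "
    "PageFiles from [29670,0,Checkpoint] to [49627,0,Formal], "
    "num files: 22854, num legacy:22838, compact legacy archive files: 0"
)

if __name__ == "__main__":
    print(parse_gc_exit(SAMPLE))
```

A legacy count that stays flat across rounds, as it does here, is the symptom this PR addresses.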

What is changed and how it works?

  1. Change the default value of dt_page_num_max_gc_valid_rate to 1.0.
  2. Log a warning instead of stopping the GC of legacy files when a non-continuous sequence number is found.
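The second change can be illustrated with a simplified sketch. This is not the TiFlash implementation: the `LegacyFile` type, the `collect_candidates` function, the one-sequence-per-file model, and the file IDs after 29674 are all assumptions made for illustration. Only the sequence numbers 29391228 and 29391230 come from the log above.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("page_storage_sketch")


@dataclass(frozen=True)
class LegacyFile:
    file_id: int
    sequence: int  # simplified model: one sequence number per file


def collect_candidates(files, last_sequence, stop_on_hole):
    """Collect legacy files to compact, walking them in order.

    stop_on_hole=True mimics the old behaviour (stop at the first gap);
    stop_on_hole=False mimics the new one (warn and keep going).
    """
    candidates = []
    for f in files:
        if f.sequence != last_sequence + 1:
            if stop_on_hole:
                break
            logger.warning(
                "non-continuous sequence on PageFile_%d_0: "
                "sequence %d, last sequence %d",
                f.file_id, f.sequence, last_sequence,
            )
        candidates.append(f)
        last_sequence = f.sequence
    return candidates


# Checkpoint ends at 29391228 and the next legacy file starts at 29391230,
# as in the log; the files after 29674 are hypothetical.
FILES = [
    LegacyFile(29674, 29391230),
    LegacyFile(29675, 29391231),
    LegacyFile(29676, 29391232),
]

if __name__ == "__main__":
    print(len(collect_candidates(FILES, 29391228, stop_on_hole=True)))
    print(len(collect_candidates(FILES, 29391228, stop_on_hole=False)))
```

In this model, stopping on the hole yields no candidates at all, which corresponds to the "candidates size: 0" message in the log, while the tolerant variant still collects every file after the gap.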

Related changes

  • PR to update pingcap/docs/pingcap/docs-cn:
  • Need to cherry-pick to the release branch:

Check List

Tests

  • Manual test (add detailed scripts or steps below)
    • Run page_stress_testing for a while, then truncate the meta file of a PageFile to simulate a crash. GC can still run on those legacy PageFiles.
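
The truncation step of the manual test could be scripted roughly as follows. This is a sketch under assumptions: the helper name `truncate_tail` and the number of bytes to drop are hypothetical, and the demo runs on a temporary file rather than a real PageFile meta, whose path depends on the deployment.

```python
import os
import tempfile


def truncate_tail(path, drop_bytes):
    """Cut the last drop_bytes bytes off a file and return its new size.

    This imitates a write that was interrupted partway through.
    """
    size = os.path.getsize(path)
    new_size = max(0, size - drop_bytes)
    os.truncate(path, new_size)
    return new_size


if __name__ == "__main__":
    # Demo on a throwaway file; point this at a copy of a meta file,
    # never at live data.
    with tempfile.TemporaryDirectory() as tmp_dir:
        demo_path = os.path.join(tmp_dir, "meta")
        with open(demo_path, "wb") as fh:
            fh.write(b"\0" * 128)
        print(truncate_tail(demo_path, 16))
```

Stop the stress test process before truncating, so that the file is not being written concurrently.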

Side effects

Release note

  • Fix a potential issue where TiFlash cannot GC delta data after a crash

@JaySon-Huang JaySon-Huang added type/bugfix This PR fixes a bug. needs-cherry-pick-release-4.0 PR which needs to be cherry-picked to release-4.0 needs-cherry-pick-release-5.0 PR which needs to be cherry-picked to release-5.0 needs-cherry-pick-release-5.1 PR which needs to be cherry-picked to release-5.1 labels Jul 2, 2021
@JaySon-Huang JaySon-Huang changed the title from "Fix some param of GC" to "Fix some param of PageStorage GC" Jul 2, 2021
@JaySon-Huang JaySon-Huang changed the title from "Fix some param of PageStorage GC" to "[DNM] Fix some param of PageStorage GC" Jul 3, 2021
Signed-off-by: JaySon-Huang <jayson.hjs@gmail.com>
Signed-off-by: JaySon-Huang <jayson.hjs@gmail.com>
Signed-off-by: JaySon-Huang <jayson.hjs@gmail.com>
Signed-off-by: JaySon-Huang <jayson.hjs@gmail.com>
@JaySon-Huang JaySon-Huang changed the title from "[DNM] Fix some param of PageStorage GC" to "Ignore sequence hole among PageFile meta" Jul 6, 2021
@flowbehappy flowbehappy self-requested a review July 6, 2021 08:06
Contributor

@flowbehappy flowbehappy left a comment

LGTM

@ti-srebot ti-srebot added the status/LGT1 Indicates that a PR has LGTM 1. label Jul 6, 2021
@JaySon-Huang
Contributor Author

/merge

@ti-srebot ti-srebot added the status/can-merge Indicates a PR has been approved by a committer. label Jul 6, 2021
@ti-srebot
Collaborator

/run-all-tests

@ti-srebot ti-srebot merged commit f9245fe into pingcap:master Jul 6, 2021
@ti-srebot
Collaborator

cherry pick to release-4.0 in PR #2335

@ti-srebot
Collaborator

cherry pick to release-5.0 in PR #2336

@ti-srebot
Collaborator

cherry pick to release-5.1 in PR #2337

JaySon-Huang added a commit to JaySon-Huang/tiflash that referenced this pull request Aug 4, 2021
@JaySon-Huang JaySon-Huang changed the title from "Ignore sequence hole among PageFile meta" to "Ignore non-continuous sequence number among PageFile meta" Aug 4, 2021
JaySon-Huang added a commit to JaySon-Huang/tiflash that referenced this pull request Aug 4, 2021
flowbehappy pushed a commit that referenced this pull request Aug 4, 2021
* Ignore sequence hole among PageFile meta (#2312)

* Fix bug for GC may skip unexpected WriteBatches (#2356)

* Add length check while running PageStorage GC (#2394)

* PageStorage skip non continuous sequence safely (#2435)

* Fix PageStorage GC with high valid rate PageFile (#2436)

* More debug info for DeltaTree (query_id, snapshot lifetime) (#2431)

* Fix deadlock on `removeExpiredSnapshots` (#2461)

* Add grafana panels for write throughput per instance (#2524)
JaySon-Huang pushed a commit that referenced this pull request Aug 5, 2021
* Ignore sequence hole among PageFile meta (#2312)
* Fix bug for GC may skip unexpected WriteBatches (#2356)
* Add length check while running PageStorage GC (#2394)
* PageStorage skip non continuous sequence safely (#2435)
* Fix PageStorage GC with high valid rate PageFile (#2436)

Signed-off-by: JaySon-Huang <jayson.hjs@gmail.com>

Labels

needs-cherry-pick-release-4.0: PR which needs to be cherry-picked to release-4.0
needs-cherry-pick-release-5.0: PR which needs to be cherry-picked to release-5.0
needs-cherry-pick-release-5.1: PR which needs to be cherry-picked to release-5.1
status/can-merge: Indicates a PR has been approved by a committer.
status/LGT1: Indicates that a PR has LGTM 1.
type/bugfix: This PR fixes a bug.

Development

Successfully merging this pull request may close these issues.

PageStorage can not GC legacy files due to non-continuous sequence number

3 participants