Skip to content

PageStorage: Fix tiflash wn oom issue by introducing DeltaTreeOnlySnapshot#10410

Merged
ti-chi-bot[bot] merged 14 commits intopingcap:masterfrom
JaySon-Huang:ps_gc_raftlog
Sep 9, 2025
Merged

PageStorage: Fix tiflash wn oom issue by introducing DeltaTreeOnlySnapshot#10410
ti-chi-bot[bot] merged 14 commits intopingcap:masterfrom
JaySon-Huang:ps_gc_raftlog

Conversation

@JaySon-Huang
Copy link
Contributor

@JaySon-Huang JaySon-Huang commented Sep 5, 2025

What problem does this PR solve?

Issue Number: close #10382, close #10278

Problem Summary:

Slow query requests prevent the release of PageStorage snapshots, causing persistent issues where multi-versioned data in PageStorage memory and multi-versioned columnar storage data on disk cannot be compacted. However, under the storage-compute separation architecture, TiFlash aims to support the capability of restoring new Region replicas from S3 , which motivates storing high-frequency write-and-delete data such as RaftLog into PageStorage. As a result, with Raft messages sustained at 12K QPS, a 10-minute-long query causes the number of Pages in PageStorage to increase by 7.2 million and the number of multi-versions to rise by 17.8 million. Due to the design of PageStorage, this leads to an additional 6 GB of memory consumption, and may also be accompanied by degraded read and write performance in PageStorage.

What is changed and how it works?

* UniversalPageStorage Provide two kinds of snapshot
  * GeneralSnapshot: protect all data, used for general-purpose reading on UniversalPageStorage
  * DeltaTreeOnlySnapshot: protect only data in DeltaTree engine, used for handling reading requests
* UniversalPageStorage GC
  * The page_ids from raft layer is not protected by `DeltaTreeOnlySnapshot`
  * If the page_id is from raft layer and there is no general snapshot, clean up entries more aggressively to reduce memory usage.
* WriteNode snapshot GC
  * The write node snapshot timeout is defined by compute node `EstablishDisaggTaskRequest`
  * The default snapshot timeout increase from 5 minutes to 30 minutes

Manual Testing result

# use br to restore a dataset with chbenchmark 1500
tiup br:v8.5.2 restore db --db=chbenchmark -s s3://benchmark/ch-1500 --check-requirements=false \
 --pd 10.2.12.81:6580 --keyspace-name ks1
# start tp workload with 32 tp threads, last for 90 minutes
tiup bench ch --host 10.2.12.81 -P 8081 --warehouses 1500 run -D chbenchmark -T 32 -t 0 --time 90m
# hack the code to mock that a reading request last for 15 minutes without releasing snapshot on write node, and `dt_snapshot_delta_tree_only = false`
https://github.com/pingcap/tiflash/pull/10410/commits/13e79da504d720506ea52ba52851b4128a022e14
# checkout the memory usage of tiflash write node
# update `dt_snapshot_delta_tree_only = false` and mock another reading request last for 15 minutes without releasing snapshot on write node

According to the test result

  • 21:09~21:25, acquire a GeneralSnapshot as before this PR does.
    • The memory usage peak reach 17.4GB. Increase more than 12.7GB for the frequent insert and delete raft pages
    • The number of pages stored in UniPageStorage reach 19.6 million, and versioned entries reach 49.5 million
    • The disk usage of UniPageStorage increase from 6GB to 24GB
  • 21:34~21:50, acquire a DeltaTreeOnlySnapshot
    • The memory usage peak remain less than 5.8GB. Increase only 1.7GB from the lower watermark
    • The number of pages stored in UniPageStorage is only 3.4 million, and versioned entries is 5.7 million
    • The disk usage of UniPageStorage increase from 6GB to 16GB
image

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
    described as above
  • No code

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Fix a OOM issue when there is slow query under compute and storage disagg arch

@ti-chi-bot ti-chi-bot bot added do-not-merge/needs-triage-completed do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note-none Denotes a PR that doesn't merit a release note. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed do-not-merge/needs-triage-completed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Sep 5, 2025
Signed-off-by: JaySon-Huang <tshent@qq.com>
Signed-off-by: JaySon-Huang <tshent@qq.com>
@JaySon-Huang JaySon-Huang force-pushed the ps_gc_raftlog branch 3 times, most recently from 5478317 to aafdf08 Compare September 8, 2025 15:27
Signed-off-by: JaySon-Huang <tshent@qq.com>
Signed-off-by: JaySon-Huang <tshent@qq.com>
Signed-off-by: JaySon-Huang <tshent@qq.com>
Signed-off-by: JaySon-Huang <tshent@qq.com>
Signed-off-by: JaySon-Huang <tshent@qq.com>
Signed-off-by: JaySon-Huang <tshent@qq.com>
…ogic

Signed-off-by: JaySon-Huang <tshent@qq.com>
@ti-chi-bot ti-chi-bot bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Sep 8, 2025
@JaySon-Huang JaySon-Huang removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 8, 2025
@JaySon-Huang
Copy link
Contributor Author

/test all

@JaySon-Huang JaySon-Huang changed the title [WIP] PageStorage: Two kinds of snapshot PageStorage: Fix tiflash wn oom issue by DeltaTreeOnlySnapshot Sep 8, 2025
@JaySon-Huang
Copy link
Contributor Author

/test all

@JaySon-Huang
Copy link
Contributor Author

/test all

@wuhuizuo
Copy link
Contributor

wuhuizuo commented Sep 9, 2025

/retest

@ti-chi-bot ti-chi-bot bot added needs-1-more-lgtm Indicates a PR needs 1 more LGTM. approved labels Sep 9, 2025
@JaySon-Huang
Copy link
Contributor Author

JaySon-Huang commented Sep 9, 2025

/hold
for adding metrics

@ti-chi-bot ti-chi-bot bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 9, 2025
Signed-off-by: JaySon-Huang <tshent@qq.com>
Copy link
Member

@CalvinNeo CalvinNeo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@ti-chi-bot
Copy link
Contributor

ti-chi-bot bot commented Sep 9, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: CalvinNeo, JinheLin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added lgtm and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Sep 9, 2025
@ti-chi-bot
Copy link
Contributor

ti-chi-bot bot commented Sep 9, 2025

[LGTM Timeline notifier]

Timeline:

  • 2025-09-09 04:02:01.752307367 +0000 UTC m=+333988.312188854: ☑️ agreed by JinheLin.
  • 2025-09-09 07:35:52.466191874 +0000 UTC m=+346819.026073391: ☑️ agreed by CalvinNeo.

Signed-off-by: JaySon-Huang <tshent@qq.com>
Signed-off-by: JaySon-Huang <tshent@qq.com>
@JaySon-Huang JaySon-Huang changed the title PageStorage: Fix tiflash wn oom issue by DeltaTreeOnlySnapshot PageStorage: Fix tiflash wn oom issue by introducing DeltaTreeOnlySnapshot Sep 9, 2025
Signed-off-by: JaySon-Huang <tshent@qq.com>
Signed-off-by: JaySon-Huang <tshent@qq.com>
@JaySon-Huang
Copy link
Contributor Author

/unhold

@ti-chi-bot ti-chi-bot bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 9, 2025
@ti-chi-bot ti-chi-bot bot merged commit 34a302d into pingcap:master Sep 9, 2025
7 checks passed
@JaySon-Huang JaySon-Huang deleted the ps_gc_raftlog branch September 9, 2025 14:53
@ti-chi-bot ti-chi-bot bot added the needs-cherry-pick-release-8.5 Should cherry pick this PR to release-8.5 branch. label Sep 19, 2025
@ti-chi-bot
Copy link
Member

In response to a cherrypick label: new pull request created to branch release-8.5: #10448.
But this PR has conflicts, please resolve them!

ti-chi-bot pushed a commit to ti-chi-bot/tiflash that referenced this pull request Sep 19, 2025
Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
ti-chi-bot bot pushed a commit that referenced this pull request Sep 22, 2025
…pshot (#10410) (#10448)

close #10278, close #10382

* UniversalPageStorage Provide two kinds of snapshot
  * GeneralSnapshot: protect all data, used for general-purpose reading on UniversalPageStorage
  * DeltaTreeOnlySnapshot: protect only data in DeltaTree engine, used for handling reading requests
* UniversalPageStorage GC
  * The page_ids from raft layer is not protected by `DeltaTreeOnlySnapshot`
  * If the page_id is from raft layer and there is no general snapshot, clean up entries more aggressively to reduce memory usage.
* WriteNode snapshot GC
  * The write node snapshot timeout is defined by compute node `EstablishDisaggTaskRequest`
  * The default snapshot timeout increase from 5 minutes to 30 minutes

Signed-off-by: JaySon-Huang <tshent@qq.com>

Co-authored-by: JaySon-Huang <tshent@qq.com>
Co-authored-by: JaySon <tshent@qq.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved lgtm needs-cherry-pick-release-8.5 Should cherry pick this PR to release-8.5 branch. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[next gen on cloud] tiflash wn oom during running simulated workload [next gen on clould] one of tiflash wn continuous oom

5 participants