Bug #71167

open

mds: scrub pins more inodes than the mds_cache_memory_limit

Added by Dan van der Ster 11 months ago. Updated 6 days ago.

Status:
Pending Backport
Priority:
Normal
Category:
fsck/damage handling
Target version:
% Done:

0%

Source:
Community (user)
Backport:
tentacle,squid
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
scrub
Pull request ID:
Tags (freeform):
backport_processed
Fixed In:
v20.3.0-5885-g1800aea772
Released In:
Upkeep Timestamp:
2026-03-09T07:54:52+00:00

Description

Scrub apparently has no limit on how many inodes it'll put into the mds_co cache.
E.g.: on a scrubbing FS with the default (4GB) mds_cache_memory_limit, it's currently using >20GB of memory:

    "mds_mem": {
        "ino": 4247738,
        "ino+": 58781699,
        "ino-": 54533961,
        "dir": 1024607,
        "dir+": 7181076,
        "dir-": 6156469,
        "dn": 4247738,
        "dn+": 58781697,
        "dn-": 54533959,
        "cap": 0,
        "cap+": 0,
        "cap-": 0,
        "rss": 21863676,
        "heap": 223516
    },
    "mds": {
...
        "inodes": 4247487,
        "inodes_top": 200,
        "inodes_bottom": 173,
        "inodes_pin_tail": 4247114,
        "inodes_pinned": 4247200,
        "inodes_expired": 54535013,
        "inodes_with_caps": 0,
...
        "root_rfiles": 63455005,
        "root_rbytes": 3033087187972,
        "root_rsnaps": 0,
        "scrub_backtrace_fetch": 55276912,
        "scrub_set_tag": 0,
        "scrub_backtrace_repaired": 1463,
        "scrub_inotable_repaired": 0,
        "scrub_dir_inodes": 7180661,
        "scrub_dir_base_inodes": 1,
        "scrub_dirfrag_rstats": 7180660,
        "scrub_file_inodes": 48096246,
...
    "mempool": {
        "bloom_filter_bytes": 30427825,
        "bloom_filter_items": 30427825,
...
        "buffer_anon_bytes": 180319577,
        "buffer_anon_items": 4026703,
        "buffer_meta_bytes": 88,
        "buffer_meta_items": 1,
...
        "osdmap_bytes": 30528,
        "osdmap_items": 579,
        "osdmap_mapping_bytes": 0,
        "osdmap_mapping_items": 0,
...
        "mds_co_bytes": 14074648762,
        "mds_co_items": 683410497,
...
    },

LRU trim log, an hour or so later than the above:

May 01 16:00:16 ceph-prod-mon1 ceph-mds[2715380]: mds.0.cache trim_lru trimming 0 items from LRU size=3744326 mid=610 pintail=3741618 pinned=3743454
May 01 16:00:16 ceph-prod-mon1 ceph-mds[2715380]: mds.0.cache trim_lru trimmed 919 items
May 01 16:00:17 ceph-prod-mon1 ceph-mds[2715380]: mds.0.cache trim_lru trimming 0 items from LRU size=3745100 mid=551 pintail=3742615 pinned=3744312
May 01 16:00:17 ceph-prod-mon1 ceph-mds[2715380]: mds.0.cache trim_lru trimmed 805 items
May 01 16:00:18 ceph-prod-mon1 ceph-mds[2715380]: mds.0.cache trim_lru trimming 0 items from LRU size=3746133 mid=584 pintail=3743456 pinned=3745297
May 01 16:00:18 ceph-prod-mon1 ceph-mds[2715380]: mds.0.cache trim_lru trimmed 855 items
May 01 16:00:19 ceph-prod-mon1 ceph-mds[2715380]: mds.0.cache trim_lru trimming 0 items from LRU size=3746501 mid=557 pintail=3744478 pinned=3745705


Related issues 2 (2 open, 0 closed)

Copied to CephFS - Backport #75398: tentacle: mds: scrub pins more inodes than the mds_cache_memory_limit (Fix Under Review, Md Mahamudur Rahaman Sajib)
Copied to CephFS - Backport #75512: squid: mds: scrub pins more inodes than the mds_cache_memory_limit (Fix Under Review, Md Mahamudur Rahaman Sajib)
Actions #1

Updated by Dan van der Ster 11 months ago

  • Affected Versions v17.2.6 added
Actions #2

Updated by Venky Shankar 11 months ago

  • Subject changed from Scrub pins more inodes than the mds_cache_memory_limit to mds: scrub pins more inodes than the mds_cache_memory_limit
  • Assignee set to Venky Shankar
  • Target version set to v21.0.0
  • Source set to Community (user)
Actions #3

Updated by Greg Farnum 11 months ago

What’s the folder structure look like on this file system, Dan? Eg depth and number of files per folder?

Unless things have changed, we do a depth-first search and we pin every dentry in the ancestor folders along the way. I don’t remember why exactly but it was necessary for the scrubbing guarantees to hold. :/

Looks like that somehow adds up to 4 million inodes, which is possible but definitely surprises me — usually the very large folders are leaf nodes.

Or there’s something completely different happening here. (I also remember flagging the pins as a potential problem so someone may have made a change to work around that which I missed — there were a lot of scrub changes for a couple years.)

Actions #4

Updated by Venky Shankar 10 months ago

Greg Farnum wrote in #note-3:

What’s the folder structure look like on this file system, Dan? Eg depth and number of files per folder?

Unless things have changed, we do a depth-first search and we pin every dentry in the ancestor folders along the way. I don’t remember why exactly but it was necessary for the scrubbing guarantees to hold. :/

That definitely changed to breadth-first search to support multi-mds scrub. See: https://github.com/ceph/ceph/commit/b43af152bab2c9f67fe311ba9450e06fd41e82e4

Looks like that somehow adds up to 4million inodes, which is possible but definitely surprises me — usually the very large folders are leaf nodes.

I don't think it's possible to add up to 4M pinned inodes with BFS traversal, unless an entire level is kept pinned until scrub moves on to the next level. I will have to look up the scrub code to see what's done.
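As a toy illustration of the "whole level stays pinned" scenario (this is not Ceph code; the uniform tree shape and names are hypothetical), a plain breadth-first walk keeps the entire next level enqueued by the time it finishes the current one, so its peak queue length equals the width of the widest level:

```python
from collections import deque

def make_tree(fanout: int, depth: int) -> dict:
    """A uniform directory tree as nested dicts (a hypothetical shape)."""
    if depth == 0:
        return {}
    return {f"d{i}": make_tree(fanout, depth - 1) for i in range(fanout)}

def bfs_peak_queue(root: dict) -> int:
    """Peak queue length for a plain breadth-first walk.

    By the time the last directory of one level is dequeued, all of the
    next level has already been enqueued, so the peak is the width of
    the widest level -- the whole-level-pinned scenario."""
    queue, peak = deque([root]), 1
    while queue:
        node = queue.popleft()
        queue.extend(node.values())
        peak = max(peak, len(queue))
    return peak

# 100 dirs x 100 subdirs: the BFS queue peaks at the full 10,000-entry level.
print(bfs_peak_queue(make_tree(fanout=100, depth=2)))  # -> 10000
```

With the 10,000 x 10,000 layout described later in this ticket, the same logic would put the entire second level (100M entries) in the queue at once, which is consistent with millions of pinned inodes.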

Actions #5

Updated by Dan van der Ster 10 months ago

Greg Farnum wrote in #note-3:

What’s the folder structure look like on this file system, Dan? Eg depth and number of files per folder?

Sorry, I don't have any insight like that. It was an FS with a bunch of k8s PVCs managed as CephFS subvolumes. I don't know how deep/broad it was.

Actions #6

Updated by Md Mahamudur Rahaman Sajib 5 months ago · Edited

It is possible to pin the whole level.

Generate the file system layout below and set mds_max_scrub_ops_in_progress = 1:

/ (root)
├── dir00001/
│   ├── subdir00001/
│   ├── subdir00002/
│   ├── subdir00003/
│   ├── ...
│   └── subdir10000/
├── dir00002/
│   ├── subdir00001/
│   ├── subdir00002/
│   ├── ...
│   └── subdir10000/
├── ...
└── dir10000/
    ├── subdir00001/
    ├── subdir00002/
    ├── ...
    └── subdir10000/

I think we can still reduce the memory usage by tweaking the BFS, and I think it won't be an issue even for multi-MDS scrub.

While scrubbing a dirfrag, instead of pushing children to the back of the scrub_stack we can push them to the front, and start iterating from the front of the scrub_stack whenever a child is pushed. That makes it DFS-ish; in a true DFS, memory would be bounded by the tree's height, but compare with a DFS stack:

Let's say at some intermediate point a DFS stack contains the path (dir1)-->(dir2)-->(dir3)-->(dir4)-->(dir5) of a tree, where dir_(i + 1) is a subdirectory of dir_i.

In this approach the stack would instead contain (children of dir1), (children of dir2), (children of dir3), (children of dir4), (children of dir5), which pins far fewer inodes than pinning a whole level, while retaining the BFS-ish approach needed for multi-MDS scrub. Also, I am not sure we can actually do a true DFS here, given the iterator for (auto it = dir->begin(); it != dir->end(); /* nop */) held in the scrub stack.

But for now, I think we can do this minimal change.
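The push-to-front idea above can be sketched with a toy model (again not the actual scrub_stack code; names and the uniform tree are hypothetical): resuming from the front after pushing children bounds the stack at roughly fanout x height — the children of each directory on the current path — instead of a whole level:

```python
from collections import deque

def make_tree(fanout: int, depth: int) -> dict:
    """A uniform directory tree as nested dicts (a hypothetical shape)."""
    if depth == 0:
        return {}
    return {f"d{i}": make_tree(fanout, depth - 1) for i in range(fanout)}

def push_front_peak(root: dict) -> int:
    """Peak stack length when children are pushed to the FRONT of the
    stack and work resumes from the front (the DFS-ish ordering
    described above).  The stack then holds at most the children of
    each directory on the current path -- roughly fanout x height --
    rather than the width of a whole level."""
    stack, peak = deque([root]), 1
    while stack:
        node = stack.popleft()
        stack.extendleft(node.values())  # children go to the front
        peak = max(peak, len(stack))
    return peak

# Same 100 x 100 tree as the plain-BFS sketch: peak drops from 10,000 to 199.
print(push_front_peak(make_tree(fanout=100, depth=2)))  # -> 199
```

For the uniform case the peak is fanout + (fanout - 1) x (depth - 1), so on the 10,000 x 10,000 layout it would be on the order of 20,000 entries rather than 100M.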

Actions #7

Updated by Md Mahamudur Rahaman Sajib 5 months ago

  • Pull request ID set to 65858
Actions #8

Updated by Md Mahamudur Rahaman Sajib 5 months ago

  • Assignee changed from Venky Shankar to Md Mahamudur Rahaman Sajib
Actions #9

Updated by Venky Shankar 5 months ago

  • Status changed from New to Fix Under Review
  • Component(FS) MDS added
Actions #10

Updated by Venky Shankar 3 months ago

  • Backport set to tentacle
Actions #11

Updated by Venky Shankar 13 days ago

  • Status changed from Fix Under Review to Pending Backport
Actions #12

Updated by Upkeep Bot 13 days ago

  • Copied to Backport #75398: tentacle: mds: scrub pins more inodes than the mds_cache_memory_limit added
Actions #13

Updated by Upkeep Bot 13 days ago

  • Tags (freeform) set to backport_processed
Actions #14

Updated by Upkeep Bot 13 days ago

  • Merge Commit set to 1800aea7724c1ad21a7a140993c4a4cddf4c86cf
  • Fixed In set to v20.3.0-5885-g1800aea772
  • Upkeep Timestamp set to 2026-03-09T07:54:52+00:00
Actions #15

Updated by Md Mahamudur Rahaman Sajib 6 days ago

@Venky Shankar Aren't we creating a backport for squid?

Actions #16

Updated by Venky Shankar 6 days ago

  • Backport changed from tentacle to tentacle,squid
  • Tags (freeform) deleted (backport_processed)

Md Mahamudur Rahaman Sajib wrote in #note-15:

@Venky Shankar Aren't we creating a backport for squid?

Yes, we should. I possibly missed including it when updating the field.

Actions #17

Updated by Upkeep Bot 6 days ago

  • Copied to Backport #75512: squid: mds: scrub pins more inodes than the mds_cache_memory_limit added
Actions #18

Updated by Upkeep Bot 6 days ago

  • Tags (freeform) set to backport_processed