
quincy: mds: skip sr moves when target is an unlinked dir#56673

Merged
vshankar merged 7 commits into ceph:quincy from batrick:wip-65293-quincy
Nov 28, 2024


@batrick batrick commented Apr 3, 2024

backport tracker: https://tracker.ceph.com/issues/65293


backport of #55768
parent tracker: https://tracker.ceph.com/issues/53192

this backport was staged using ceph-backport.sh version 16.0.0.6848
find the latest version at https://github.com/ceph/ceph/blob/main/src/script/ceph-backport.sh

batrick added 7 commits April 3, 2024 13:02
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
(cherry picked from commit 7c1823a)
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
(cherry picked from commit 2f93777)
This can print a ludicrous number of lines for large cache sizes.

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
(cherry picked from commit 681169c)
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
(cherry picked from commit 54c20f5)
It can dominate logs when large splits occur.

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
(cherry picked from commit 34cb630)
This change uses an unordered_map to memoize results of CInode::is_ancestor_of
so that subsequent invocations can skip directory inodes which are already
known not to be descendants of the target directory.

In the worst case, this unordered_map can grow to the number of inodes in
memory when all inodes are directories and at least one client has a cap for
each inode. However, in general this will not be the case. The size of each
entry in the map will be a 64-bit pointer and bool. The total size will vary
across platforms but we can say that with a conservative estimate of 192 bits /
entry overhead (including the entry linked list pointer in the bucket), the map
will grow to ~24MB / 1M inodes.
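
As an illustration of the memoization idea described above, here is a minimal sketch. It is not the code from this PR: `Dir`, `AncestorMemo`, `is_under`, and `estimated_memo_bytes` are hypothetical names, and the real CInode hierarchy is far richer. The sketch walks up the parent chain until it reaches the target, the root, or a directory whose answer is already cached, then records the answer for every directory it visited. The last helper reproduces the 192 bits / entry estimate.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a directory inode; not the real CInode.
struct Dir {
  const Dir* parent;  // nullptr for the root
  explicit Dir(const Dir* p = nullptr) : parent(p) {}
};

// Memo: directory -> "is it the target or a descendant of the target?"
using AncestorMemo = std::unordered_map<const Dir*, bool>;

// Returns true if `dir` is `target` or lies beneath it. The walk stops early
// at any directory whose answer is already memoized, and the final answer is
// recorded for every directory visited on the way up.
bool is_under(const Dir* dir, const Dir* target, AncestorMemo& memo) {
  assert(target != nullptr);
  std::vector<const Dir*> visited;
  bool result = false;
  for (const Dir* d = dir; d != nullptr; d = d->parent) {
    if (d == target) {
      result = true;
      break;
    }
    auto it = memo.find(d);
    if (it != memo.end()) {
      result = it->second;
      break;
    }
    visited.push_back(d);
  }
  for (const Dir* d : visited) {
    memo[d] = result;
  }
  return result;
}

// Size estimate using the conservative per-entry figure from the text
// (192 bits by default); actual overhead varies by platform.
constexpr std::size_t estimated_memo_bytes(std::size_t entries,
                                           std::size_t bits_per_entry = 192) {
  return entries * bits_per_entry / 8;
}
```

With 1,000,000 entries, `estimated_memo_bytes` gives 24,000,000 bytes, matching the ~24MB / 1M inodes figure. The memo is only valid for one target, so a caller would create a fresh map for each split.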

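The result of this change is not eye-popping, but the performance advantage is significant: about 2x for a split of 1 inode (around 750ms down to about 370ms) and about 1.6x for a 100k split (about 1,338ms down to about 840ms).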

For an unpatched MDS with 1M inodes with caps in the global snaprealm (with debugging commits preceding this one):

    2024-02-27T01:08:53.247+0000 7f4be40ec700  2 mds.0.cache Memory usage:  total 6037860, rss 5710800, heap 215392, baseline 199008, 1000251 / 1000323 inodes have caps, 1000251 caps, 0.999928 caps per inode
    ...
    2024-02-27T01:08:54.000+0000 7f4be18e7700 10  mds.0.cache.snaprealm(0x1 seq 3 0x55feaf85ad80) split_at: snaprealm(0x1000000043b seq 0 lc 0 cr 0 cps 1 snaps={} last_modified 0.000000 change_attr 0 0x55feb986b200) on [inode 0x1000000043b [...4,head] ~mds0/stray8/1000000043b/ auth v152 pv153 ap=3 snaprealm=0x55feb986b200 f() n(v0 1=0+1) old_inodes=1 (ilink xlockdone x=1) (isnap xlockdone x=1) (ifile excl) (iversion lock w=1 last_client=4361) caps={4361=pAsXsFs/-@6},l=4361 | request=1 lock=3 caps=1 authpin=1 0x56000423d180]
    2024-02-27T01:08:54.649+0000 7f4be18e7700 10 mds.0.cache.ino(0x1000000043b) move_to_realm joining realm snaprealm(0x1000000043b seq 0 lc 0 cr 0 cps 1 snaps={} last_modified 0.000000 change_attr 0 0x55feb986b200), leaving realm snaprealm(0x1 seq 3 lc 3 cr 3 cps 1 snaps={2=snap(2 0x1 'one' 2024-02-27T01:06:29.440802+0000),3=snap(3 0x1 'two' 2024-02-27T01:06:43.209349+0000)} last_modified 2024-02-27T01:06:43.209349+0000 change_attr 2 0x55feaf85ad80)
    2024-02-27T01:08:54.750+0000 7f4be18e7700 10  mds.0.cache.snaprealm(0x1 seq 3 0x55feaf85ad80) split_at: split 1 inodes

So it takes around 750ms to check all inodes_with_caps (1M) in the global snaprealm. This result was fairly consistent across multiple tries.

For a 100k split:

    2024-02-27T04:12:27.548+0000 7f2da9dbe700 10 mds.0.cache.ino(0x1000000000f) open_snaprealm snaprealm(0x1000000000f seq 0 lc 0 cr 0 cps 1 snaps={} last_modified 0.000000 change_attr 0 0x563553c92900) parent is snaprealm(0x1 seq 2 lc 2 cr 2 cps 1 snaps={2=snap(2 0x1 '1' 2024-02-27T04:12:13.803030+0000)} last_modified 2024-02-27T04:12:13.803030+0000 change_attr 1 0x563553abed80)
    2024-02-27T04:12:27.548+0000 7f2da9dbe700 10  mds.0.cache.snaprealm(0x1 seq 2 0x563553abed80) split_at: snaprealm(0x1000000000f seq 0 lc 0 cr 0 cps 1 snaps={} last_modified 0.000000 change_attr 0 0x563553c92900) on [inode 0x1000000000f [...3,head] /tmp.K9bdjohIVa/ auth v10972 ap=2 snaprealm=0x563553c92900 f(v0 m2024-02-27T04:03:37.953918+0000 1=0+1) n(v106 rc2024-02-27T04:12:27.544141+0000 rs1 99755=0+99755) old_inodes=1 (isnap xlock x=1 by 0x5636a6372900) (inest lock dirty) (ifile excl) (iversion lock w=1 last_client=20707) caps={20707=pAsLsXsFsx/AsLsXsFsx@8},l=20707 | dirtyscattered=1 request=1 lock=2 dirfrag=1 caps=1 dirtyrstat=0 dirtyparent=0 dirty=1 waiter=0 authpin=1 0x563553cfd180]
    2024-02-27T04:12:28.886+0000 7f2da9dbe700 10  mds.0.cache.snaprealm(0x1 seq 2 0x563553abed80) split_at: split 100031 inodes

or about 1,338ms. This caused a split of 100k inodes. This takes more time
because directories are actually moved to the snaprealm with a lot of list
twiddling for caps.

With this patch, we bring that down. For a split of 1 inode:

    2024-02-27T02:09:48.549+0000 7ff854ad4700  2 mds.0.cache Memory usage:  total 5859852, rss 4290012, heap 231776, baseline 190816, 1000312 / 1000327 inodes have caps, 1000312 caps, 0.999985 caps per inode
    ...
    2024-02-27T02:09:48.550+0000 7ff8522cf700 10 mds.0.cache.ino(0x100000f456f) open_snaprealm snaprealm(0x100000f456f seq 0 lc 0 cr 0 cps 1 snaps={} last_modified 0.000000 change_attr 0 0x559e2b4fab40) parent is snaprealm(0x1 seq 9 lc 9 cr 9 cps 1 snaps={2=snap(2 0x1 'one' 2024-02-27T01:34:36.001053+0000),3=snap(3 0x1 'two' 2024-02-27T01:34:48.623349+0000),6=snap(6 0x1 'six' 2024-02-27T02:03:51.619896+0000),7=snap(7 0x1 'seven' 2024-02-27T02:04:28.375336+0000),8=snap(8 0x1 '1' 2024-02-27T02:06:14.170884+0000),9=snap(9 0x1 '2' 2024-02-27T02:09:47.158624+0000)} last_modified 2024-02-27T02:09:47.158624+0000 change_attr 6 0x559dfd4ad8c0)
    2024-02-27T02:09:48.550+0000 7ff8522cf700 10  mds.0.cache.snaprealm(0x1 seq 9 0x559dfd4ad8c0) split_at: snaprealm(0x100000f456f seq 0 lc 0 cr 0 cps 1 snaps={} last_modified 0.000000 change_attr 0 0x559e2b4fab40) on [inode 0x100000f456f [...a,head] ~mds0/stray2/100000f456f/ auth v1164 pv1165 ap=3 snaprealm=0x559e2b4fab40 DIRTYPARENT f() n(v0 1=0+1) old_inodes=1 (ilink xlockdone x=1) (isnap xlockdone x=1) (inest lock) (ifile excl) (iversion lock w=1 last_client=4365) caps={4365=pAsLsXsFsx/AsLsXsFsx@6},l=4365 | request=1 lock=3 dirfrag=1 caps=1 dirtyparent=1 dirty=1 waiter=0 authpin=1 0x559e8a8bd600]
    2024-02-27T02:09:48.550+0000 7ff8522cf700 10  mds.0.cache.snaprealm(0x1 seq 9 0x559dfd4ad8c0)  open_children are 0x559dfd4add40,0x559e1cca1d40
    2024-02-27T02:09:48.919+0000 7ff8522cf700 10  mds.0.cache.snaprealm(0x1 seq 9 0x559dfd4ad8c0) split_at: split 1 inodes

or about 370ms. This was also fairly consistent across multiple tries.

For a 100k split:

    2024-02-27T01:52:24.500+0000 7ff8522cf700 10  mds.0.cache.snaprealm(0x1 seq 3 0x559dfd4ad8c0) split_at: snaprealm(0x10000000013 seq 0 lc 0 cr 0 cps 1 snaps={} last_modified 0.000000 change_attr 0 0x559e1cca1d40) on [inode 0x10000000013 [...5,head] /tmp.RIUAaU5wuE/ auth v10499 ap=2 snaprealm=0x559e1cca1d40 f(v0 m2024-02-27T01:16:04.611198+0000 1=0+1) n(v122 rc2024-02-27T01:52:24.495465+0000 rs1 100031=0+100031) old_inodes=1 (isnap xlock x=1 by 0x559ef038a880) (inest lock) (ifile excl) (iversion lock w=1 last_client=4365) caps={4365=pAsLsXsFsx/-@11},l=4365 | dirtyscattered=0 request=1 lock=2 dirfrag=1 caps=1 dirty=1 waiter=0 authpin=1 0x559e0238c580]
    2024-02-27T01:52:24.500+0000 7ff8522cf700 10  mds.0.cache.snaprealm(0x1 seq 3 0x559dfd4ad8c0)  open_children are 0x559dfd4add40
    2024-02-27T01:52:25.338+0000 7ff8522cf700 10  mds.0.cache.snaprealm(0x1 seq 3 0x559dfd4ad8c0) split_at: split 100031 inodes

or about 840ms. This can be easily done by making a directory in one of the
trees created (see reproducer below).
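
The elapsed times quoted above are the difference between the timestamp of the first `split_at:` log line and that of the final `split N inodes` line. A small helper makes the arithmetic explicit. This is only a sketch: `to_millis` and `elapsed_ms` are hypothetical names, and it assumes both timestamps fall on the same day and use the `HH:MM:SS.mmm` form seen in the logs.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Milliseconds since midnight for a "HH:MM:SS.mmm" time-of-day string.
// Returns -1 if the string does not match that form.
long to_millis(const std::string& t) {
  int h = 0, m = 0, s = 0, ms = 0;
  if (std::sscanf(t.c_str(), "%d:%d:%d.%3d", &h, &m, &s, &ms) != 4) {
    return -1;
  }
  return ((h * 60L + m) * 60L + s) * 1000L + ms;
}

// Elapsed milliseconds between two time-of-day strings on the same day.
long elapsed_ms(const std::string& start, const std::string& end) {
  const long a = to_millis(start);
  const long b = to_millis(end);
  assert(a >= 0 && b >= 0);  // both inputs must parse
  return b - a;
}
```

For example, `elapsed_ms("04:12:27.548", "04:12:28.886")` covers the unpatched 100k split, and `elapsed_ms("01:52:24.500", "01:52:25.338")` covers the patched one.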

The test tree can be created with:

    for ((i =0; i < 10; i++)); do (pushd $(mktemp -d -p . ); for ((j = 0; j < 30; ++j)); do mkdir "$j"; pushd "$j"; done; for ((j = 0; j < 10; ++j)); do for ((k = 0; k < 10000; ++k)); do mkdir $j.$k; done & done) & done

This makes 1M directories. The majority of them sit beneath a 30-deep nesting
so that CInode::is_ancestor_of is exercised under something close to a worst case.

Make sure all debugging configs are disabled for the MDS/clients. Make sure the
client has a cache size to accommodate 1M caps. Make at least one snapshot:

    mkdir .snap/one

Then reproduction can be done with:

    $ mkdir tmp.qQNsTpxpvh/dir; mkdir .snap/$((++i)); rmdir tmp.qQNsTpxpvh/dir

It is not necessary to delete any snapshots to reproduce this behavior. It's
only necessary to have a lot of inodes_with_caps in a snaprealm and effect a
split.

Fixes: https://tracker.ceph.com/issues/53192
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
(cherry picked from commit a0ccb79)
A directory in the stray directory cannot have any HEAD inodes with caps so
there is no need to move anything to the snaprealm opened for the unlinked
directory.
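
The short-circuit described above can be sketched as follows. This is a simplified illustration, not the MDS code: `Node`, `is_self_or_descendant`, and `inodes_to_move` are hypothetical names. Treating the unlinked directory inode itself as the only inode to move is an assumption, based on the "moving unlinked directory inode" log line.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified inode; the real MDS structures are far richer.
struct Node {
  const Node* parent;  // nullptr at the top of a tree
  bool unlinked;       // true if the inode lives in a stray directory
  Node(const Node* p, bool u) : parent(p), unlinked(u) {}
};

// True if `n` is `target` or lies beneath it.
bool is_self_or_descendant(const Node* n, const Node* target) {
  for (; n != nullptr; n = n->parent) {
    if (n == target) {
      return true;
    }
  }
  return false;
}

// Returns the inodes that would move to the realm opened on `target`.
// `examined` reports how many ancestor checks were performed.
std::vector<const Node*> inodes_to_move(
    const Node* target,
    const std::vector<const Node*>& inodes_with_caps,
    std::size_t& examined) {
  assert(target != nullptr);
  examined = 0;
  std::vector<const Node*> moved;
  if (target->unlinked) {
    // Assumption: nothing beneath an unlinked directory holds caps at HEAD,
    // so only the directory inode itself moves and the scan is skipped.
    moved.push_back(target);
    return moved;
  }
  for (const Node* in : inodes_with_caps) {
    ++examined;
    if (is_self_or_descendant(in, target)) {
      moved.push_back(in);
    }
  }
  return moved;
}
```

In the unlinked case `examined` stays at zero regardless of how many inodes hold caps, which is the point of the change: the cost no longer scales with the size of the parent snaprealm.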

Following the parent commit's reproducer, the behavior now looks as expected:

    2024-02-27T02:26:59.049+0000 7f5b095f3700 10 mds.0.cache.ino(0x100000f4575) open_snaprealm snaprealm(0x100000f4575 seq 0 lc 0 cr 0 cps 1 snaps={} last_modified 0.000000 change_attr 0 0x5632a57f9680) parent is snaprealm(0x1 seq e lc e cr e cps 1 snaps={2=snap(2 0x1 'one' 2024-02-27T01:34:36.001053+0000),3=snap(3 0x1 'two' 2024-02-27T01:34:48.623349+0000),6=snap(6 0x1 'six' 2024-02-27T02:03:51.619896+0000),7=snap(7 0x1 'seven' 2024-02-27T02:04:28.375336+0000),8=snap(8 0x1 '1' 2024-02-27T02:06:14.170884+0000),9=snap(9 0x1 '2' 2024-02-27T02:09:47.158624+0000),a=snap(a 0x1 '3' 2024-02-27T02:18:24.666934+0000),b=snap(b 0x1 '4' 2024-02-27T02:18:38.268874+0000),c=snap(c 0x1 '5' 2024-02-27T02:23:13.183995+0000),d=snap(d 0x1 '6' 2024-02-27T02:25:25.593014+0000),e=snap(e 0x1 '7' 2024-02-27T02:26:55.184945+0000)} last_modified 2024-02-27T02:26:55.184945+0000 change_attr 11 0x5632861c5680)
    2024-02-27T02:26:59.049+0000 7f5b095f3700 10  mds.0.cache.snaprealm(0x1 seq 14 0x5632861c5680) split_at: snaprealm(0x100000f4575 seq 0 lc 0 cr 0 cps 1 snaps={} last_modified 0.000000 change_attr 0 0x5632a57f9680) on [inode 0x100000f4575 [...f,head] ~mds0/stray0/100000f4575/ auth v1199 pv1200 ap=3 snaprealm=0x5632a57f9680 DIRTYPARENT f() n(v0 1=0+1) old_inodes=1 (ilink xlockdone x=1) (isnap xlockdone x=1) (inest lock) (ifile excl) (iversion lock w=1 last_client=4365) caps={4365=pAsLsXsFsx/AsLsXsFsx@6},l=4365 | request=1 lock=3 dirfrag=1 caps=1 dirtyparent=1 dirty=1 waiter=0 authpin=1 0x563385e94000]
    2024-02-27T02:26:59.049+0000 7f5b095f3700 10  mds.0.cache.snaprealm(0x1 seq 14 0x5632861c5680)  moving unlinked directory inode

Discussions with Dan van der Ster led to the creation of this patch.

Fixes: https://tracker.ceph.com/issues/53192
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Signed-off-by: Dan van der Ster <dan.vanderster@clyso.com>
(cherry picked from commit c190a3f)
@batrick batrick added this to the quincy milestone Apr 3, 2024
@batrick batrick added the cephfs Ceph File System label Apr 3, 2024
@joscollin joscollin requested a review from a team June 20, 2024 08:55
@joscollin

This PR is under test in https://tracker.ceph.com/issues/66597.

@github-actions

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@github-actions github-actions bot added the stale label Nov 26, 2024
@batrick batrick added stale and removed stale labels Nov 26, 2024

@vshankar vshankar left a comment


@vshankar vshankar merged commit c5dd5c9 into ceph:quincy Nov 28, 2024
@batrick batrick deleted the wip-65293-quincy branch December 3, 2024 10:32