
mds: Don't use global snaprealm's seq number for subvolume snapshots #62682

Merged
vshankar merged 6 commits into ceph:main from kotreshhr:wip-cow-old-inodes
Jul 24, 2025

Conversation

@kotreshhr
Contributor

@kotreshhr kotreshhr commented Apr 4, 2025

mds: Don't use global snaprealm seq for subvolumes

Don't use the global snaprealm seq number while doing CoW
on old inodes that belong to a subvolume or live under one, i.e.,
directories marked with the 'ceph.dir.subvolume' vxattr. This is
safe because hardlinks and renames are not allowed beyond
these subvolume directories, i.e., all hardlinks/renames
are contained within the same subvolume snaprealm and don't
cross subvolume snaprealms.

Fixes: https://tracker.ceph.com/issues/70794
Signed-off-by: Kotresh HR khiremat@redhat.com
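The idea behind the change can be sketched as a small model (a hypothetical simplification in Python — class and function names are illustrative, not the actual Ceph MDS C++ code):

```python
# Hypothetical model of the seq-number choice for copy-on-write (CoW);
# names are illustrative, not the actual Ceph MDS implementation.

class SnapRealm:
    def __init__(self, seq=0):
        self.seq = seq  # bumped every time a snapshot is taken in this realm

GLOBAL_REALM = SnapRealm()

class Inode:
    def __init__(self, parent=None, is_subvolume=False):
        self.parent = parent
        self.is_subvolume = is_subvolume  # 'ceph.dir.subvolume' vxattr
        self.realm = SnapRealm() if is_subvolume else None
        self.first = 1  # first snapid for which this inode version is live

    def subvolume_realm(self):
        """Walk up to the nearest ancestor marked as a subvolume."""
        node = self
        while node is not None:
            if node.is_subvolume:
                return node.realm
            node = node.parent
        return None

def cow_seq(inode):
    """Seq used when deciding whether to preserve an old copy of the inode."""
    realm = inode.subvolume_realm()
    # Inside a subvolume: use that subvolume's own realm seq, so snapshots
    # taken elsewhere (which bump the global seq) don't trigger CoW here.
    return realm.seq if realm is not None else GLOBAL_REALM.seq

def needs_cow(inode):
    # An old inode is preserved only if a snapshot happened since 'first'.
    return inode.first <= cow_seq(inode)
```

Because hardlinks/renames cannot cross a subvolume boundary, the subvolume realm's own seq is sufficient for inodes under it.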


@github-actions github-actions bot added cephfs Ceph File System tests labels Apr 4, 2025
@kotreshhr kotreshhr requested review from a team and vshankar April 4, 2025 13:32
Member

@gregsfortytwo gregsfortytwo left a comment


I thought we'd talked about having a config option to turn this behavior on or off? Looks like it's just making an unconditional change to all filesystems (including existing ones!) with the current code?

@kotreshhr
Contributor Author

I thought we'd talked about having a config option to turn this behavior on or off? Looks like it's just making an unconditional change to all filesystems (including existing ones!) with the current code?

Yes, that's pending. I will add it.

@kotreshhr
Contributor Author

@vshankar the commit e9285e0 (marking 'ceph.dir.subvolume' only if the directory is empty) breaks the subvolume operations mentioned below.

  1. Subvolume create sets 'ceph.dir.subvolume' after creating '.meta', I think.
    • This can be fixed by marking it before the creation of the '.meta' file.
  2. Idempotency of subvolume operations like create: the subvolume is not empty on a second create.
    • This is taken care of, since the 'non-empty dir' check comes after the 'subvolume already marked' check.
  3. All subvolumes are marked with 'ceph.dir.subvolume' on subvolume open, to address upgrades.
    • This needs a little more thought on how to proceed.
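The ordering in points 1 and 2 can be sketched as follows (a minimal, hypothetical dict-based model — not the actual mgr/volumes plugin or MDS vxattr handler code):

```python
# Hypothetical model of the subvolume-create ordering; the real logic
# lives in the mgr/volumes plugin and the MDS vxattr handler.
import errno

def set_subvolume_vxattr(dirnode):
    """Model of setting 'ceph.dir.subvolume' with the empty-dir check."""
    if dirnode.get("subvolume"):
        return 0                    # already marked: keeps create idempotent
    if dirnode.get("entries"):
        return -errno.ENOTEMPTY     # non-empty check comes *after* the above
    dirnode["subvolume"] = True
    return 0

def create_subvolume(dirnode):
    rc = set_subvolume_vxattr(dirnode)  # mark BEFORE creating the '.meta' file
    if rc != 0:
        return rc
    entries = dirnode.setdefault("entries", [])
    if ".meta" not in entries:
        entries.append(".meta")
    return 0
```

A second create on the same directory succeeds because the 'already marked' check short-circuits before the non-empty check is reached.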

@vshankar
Contributor

vshankar commented Apr 5, 2025

Thanks for the quick turnaround @kotreshhr. I will have a look at the change.

@kotreshhr
Contributor Author

  1. Added a config option to use the global snaprealm's seq number
  2. Mark the subvolume vxattr before creating the uuid or config files

Contributor

@vshankar vshankar left a comment


Mostly looks fine. @kotreshhr We need to do targeted testing for this change. What options do we have?

@kotreshhr
Contributor Author

Mostly looks fine. @kotreshhr We need to do targeted testing for this change. What options do we have?

I will explore that today.

But I didn't find a good solution for the 3rd point mentioned in comment #62682 (comment). @vshankar Do you have any thoughts?

@vshankar
Contributor

vshankar commented Apr 7, 2025

All subvolumes are marked with 'ceph.dir.subvolume' on subvolume open, to address upgrades.

  • This needs a little more thought on how to proceed.

How would this be done, since the constraint introduced in this change disallows marking a dir as a subvolume if it's not empty? Also, this needs to be done for all existing subvolumes, which involves scanning the volumes directory?

EDIT: fixed typos

@kotreshhr
Contributor Author

kotreshhr commented Apr 7, 2025

All subvolumes are marked with 'ceph.dir.subvolume' on subvolume open, to address upgrades.

  • This needs a little more thought on how to proceed.

How would this be done since the constraint introduced in this change disallows marking a dir as subvolume if it's not empty?

Exactly the problem: we have introduced a contradictory fix with this PR. The question is, how do we handle upgrades of subvolumes (legacy, v1) which don't have this xattr set?

Also, this needs to be done for all existing subvolumes which involves scanning the volumes directory??

I didn't quite get this question. This is required for existing subvolumes which didn't have the xattr set.
The xattr is currently set during subvolume open. If there were legacy, v1, or v2 subvolumes prior to the introduction of this xattr, those subvolumes would get marked with it during open.

EDIT: Answered the second question; I had pressed the comment button midway.

@kotreshhr
Contributor Author


@vshankar and I discussed this in a meeting. The summary of the discussion follows.

Enforcing that 'ceph.dir.subvolume' can only be set on an empty directory (e9285e0) is logically correct, but it's not useful in the world of subvolumes: each PVC is backed by a subvolume, and it's not possible to create a hardlink/rename outside it.
To allow legacy/v1/v2 subvolumes created before the introduction of the 'ceph.dir.subvolume' xattr to still get marked with the xattr, we decided to drop the empty-directory enforcement for now.

@kotreshhr
Contributor Author

So I did preliminary targeted testing with the fix.

Since the current fix uses the subvolume realm's seq number only for the subvolume inode and inodes under it, it prevents the blow-up of cowed old inodes only for those. But for '/volumes/' and '/volumes/<group_dir>/', the fix still uses the global snaprealm's seq number, so the number of cowed old inodes explodes there. So we are not fully solving the issue even for the subvolume use case.

@vshankar and I discussed this in the meeting and decided to use the corresponding realm's seq number everywhere and make the fix optional.

I will refresh the PR with the following:

  1. Remove the enforcement that 'ceph.dir.subvolume' can only be set on an empty directory.
  2. Use the corresponding realm's seq number everywhere, not restricting it to the subvolume directory and below.

@kotreshhr kotreshhr force-pushed the wip-cow-old-inodes branch from bb6573c to 1cbce89 Compare April 7, 2025 13:44
@kotreshhr
Contributor Author


Done. Yet to test this. I will test and update.

@kotreshhr kotreshhr force-pushed the wip-cow-old-inodes branch from 1cbce89 to 8186ad0 Compare April 7, 2025 19:40
@kotreshhr
Contributor Author


Tested the fix locally and it works fine. We need to highlight the restriction of this fix: it is controlled by the config 'mds_use_global_snaprealm_seq_for_subvol' and should be disabled only on CephFS volumes that are dedicated to subvolumes and not used for any other purpose.

I have done this in the option description here c85a6fa#diff-2c5822fc1f3196e9d93cbfe46ccdc0796a449323d075cc562332fc3b51e974eaR1572

@kotreshhr
Contributor Author

jenkins test make check arm64

@kotreshhr kotreshhr force-pushed the wip-cow-old-inodes branch from 8186ad0 to 99efb90 Compare April 9, 2025 10:37
@kotreshhr
Contributor Author

jenkins test make check

@kotreshhr
Contributor Author

jenkins test make check arm64

kotreshhr added 2 commits May 9, 2025 14:42
Removing the vxattr 'ceph.dir.subvolume' from a directory on
which it was never set causes the MDS to crash. This is because
the snaprealm is null for such a directory and the null check
is missing. Setting the vxattr creates the snaprealm for
the directory as part of the operation; hence the MDS doesn't
crash when the vxattr is set and then removed. This patch
fixes the same.
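The shape of the fix can be sketched in a few lines (hypothetical names in Python; the real change adds a null check in the C++ Server::handle_client_setvxattr path shown in the traceback below):

```python
# Hypothetical sketch of the missing null check on removexattr;
# not the actual Ceph MDS C++ code.

def remove_subvolume_vxattr(inode):
    realm = inode.get("snaprealm")  # None if 'ceph.dir.subvolume' was never set
    if realm is None:
        # Previously the realm was dereferenced unconditionally -> SIGSEGV.
        return 0                    # nothing to do: the vxattr isn't set
    realm["subvolume"] = False
    return 0
```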

Reproducer:
$ mkdir /mnt/dir1
$ setfattr -x "ceph.dir.subvolume" /mnt/dir1

Traceback:
-------
Core was generated by `./ceph/build/bin/ceph-mds -i a -c ./ceph/build/ceph.conf'.
Program terminated with signal SIGSEGV, Segmentation fault.
(gdb) bt
 #0  0x00007f33f1aa8034 in __pthread_kill_implementation () from /lib64/libc.so.6
 #1  0x00007f33f1a4ef1e in raise () from /lib64/libc.so.6
 #2  0x0000562b148a6fd0 in reraise_fatal (signum=signum@entry=11) at /ceph/src/global/signal_handler.cc:88
 #3  0x0000562b148a83d9 in handle_oneshot_fatal_signal (signum=11) at /ceph/src/global/signal_handler.cc:367
 #4  <signal handler called>
 #5  Server::handle_client_setvxattr (this=0x562b4ee3f800, mdr=..., cur=0x562b4ef9cc00) at /ceph/src/mds/Server.cc:6406
 #6  0x0000562b145fadc2 in Server::handle_client_removexattr (this=0x562b4ee3f800, mdr=...) at /ceph/src/mds/Server.cc:7022
 #7  0x0000562b145fbff0 in Server::dispatch_client_request (this=0x562b4ee3f800, mdr=...) at /ceph/src/mds/Server.cc:2825
 #8  0x0000562b145fcfa2 in Server::handle_client_request (this=0x562b4ee3f800, req=...) at /ceph/src/mds/Server.cc:2676
 #9  0x0000562b1460063c in Server::dispatch (this=0x562b4ee3f800, m=...) at /ceph/src/mds/Server.cc:382
 #10 0x0000562b1450eb22 in MDSRank::handle_message (this=this@entry=0x562b4ef42008, m=...) at /ceph/src/mds/MDSRank.cc:1222
 #11 0x0000562b14510c93 in MDSRank::_dispatch (this=this@entry=0x562b4ef42008, m=..., new_msg=new_msg@entry=true)
     at /ceph/src/mds/MDSRank.cc:1045
 #12 0x0000562b14511620 in MDSRankDispatcher::ms_dispatch (this=this@entry=0x562b4ef42000, m=...) at /ceph/src/mds/MDSRank.cc:1019
 #13 0x0000562b144ff117 in MDSDaemon::ms_dispatch2 (this=0x562b4ee64000, m=...) at /ceph/src/common/RefCountedObj.h:56
 #14 0x00007f33f2f4974a in Messenger::ms_deliver_dispatch (this=0x562b4ee70000, m=...) at /ceph/src/msg/Messenger.h:746
 #15 0x00007f33f2f467e2 in DispatchQueue::entry (this=0x562b4ee703b8) at /ceph/src/msg/DispatchQueue.cc:202
 #16 0x00007f33f2ff61cb in DispatchQueue::DispatchThread::entry (this=<optimized out>) at /ceph/src/msg/DispatchQueue.h:101
 #17 0x00007f33f2df4b5d in Thread::entry_wrapper (this=0x562b4ee70518) at /ceph/src/common/Thread.cc:87
 #18 0x00007f33f2df4b6f in Thread::_entry_func (arg=<optimized out>) at /ceph/src/common/Thread.cc:74
 #19 0x00007f33f1aa6088 in start_thread () from /lib64/libc.so.6
 #20 0x00007f33f1b29f8c in clone3 () from /lib64/libc.so.6
---------

Fixes: https://tracker.ceph.com/issues/70794
Signed-off-by: Kotresh HR <khiremat@redhat.com>
Don't use the global snaprealm seq number while doing CoW
on old inodes for a subvolume inode and inodes under it,
i.e., for directories marked with the 'ceph.dir.subvolume'
vxattr. This is safe because all hardlinks/renames
are contained within the same subvolume snaprealm and
don't cross subvolume snaprealms.

For the directories between / and the subvolume snapshot
directory, use the global snaprealm seq to CoW the
old inodes only if at least one snapshot has been taken.

The above behavior is made optional with the MDS config
'mds_use_global_snaprealm_seq_for_subvol'. The option
is enabled by default, which means the new behaviour
is disabled by default. It is suggested to disable the
option only on CephFS volumes used purely for subvolumes.

Fixes: https://tracker.ceph.com/issues/70794
Signed-off-by: Kotresh HR <khiremat@redhat.com>
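The final seq selection described in this commit message can be modelled as follows (a hypothetical simplification in Python, not the C++ implementation; only the config name comes from the change itself):

```python
# Hypothetical model of the final CoW seq selection.

def cow_seq(inode, global_seq, global_has_snaps,
            use_global_seq_for_subvol=True):
    """Pick the seq used when deciding whether to CoW an old inode.

    use_global_seq_for_subvol mirrors the MDS config
    'mds_use_global_snaprealm_seq_for_subvol' (enabled by default,
    i.e. the legacy behaviour; disable only on volumes used purely
    for subvolumes).
    """
    if use_global_seq_for_subvol:
        return global_seq                 # legacy: always the global seq
    realm = inode.get("subvol_realm")     # realm of the containing subvolume
    if realm is not None:
        return realm["seq"]               # inside a subvolume: its own seq
    # Directories between / and the subvolume: the global seq matters only
    # if at least one (root-level) snapshot was actually taken.
    return global_seq if global_has_snaps else 0
```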
@kotreshhr kotreshhr force-pushed the wip-cow-old-inodes branch from 91913da to d83b589 Compare May 9, 2025 09:12
@kotreshhr
Contributor Author

Rebased

@kotreshhr
Contributor Author

jenkins test api

@kotreshhr
Contributor Author

@vshankar @batrick Tried reproducing the original issue - https://tracker.ceph.com/issues/67102 locally by creating and deleting thousands of subvolume snapshots. The old inodes on the root and /volumes directories do increase, but come down to zero after a journal flush. The tracker information isn't helping much to reproduce it. I will sync up with QE on this to reproduce it separately. I think the optimization done as part of this PR can go in while I work on the original issue separately.

@vshankar
Contributor

@vshankar @batrick Tried reproducing the original issue - https://tracker.ceph.com/issues/67102 locally by creating and deleting thousands of subvolume snapshots. The old inodes on the root and /volumes directories do increase, but come down to zero after a journal flush. The tracker information isn't helping much to reproduce it. I will sync up with QE on this to reproduce it separately. I think the optimization done as part of this PR can go in while I work on the original issue separately.

That makes sense. I will have a look again and put the change through testing.

@vshankar vshankar mentioned this pull request May 19, 2025
14 tasks
@vshankar
Contributor

This PR is under test in https://tracker.ceph.com/issues/71431.

@kotreshhr kotreshhr force-pushed the wip-cow-old-inodes branch from d83b589 to cddbc55 Compare May 29, 2025 05:47
@kotreshhr
Contributor Author

This PR is under test in https://tracker.ceph.com/issues/71431.

@vshankar Fixed the snapshot cleanup failure discussed in the above QA tracker - https://tracker.ceph.com/issues/71431?issue_count=61&issue_position=3&next_issue_id=71428&prev_issue_id=71441#note-3

kotreshhr added 3 commits May 29, 2025 14:03
The snapshot data on any part of the subtree should be correct
irrespective of enabling/disabling 'mds_use_global_snaprealm_seq_for_subvol',
even with a directory marked as a subvolume.

Fixes: https://tracker.ceph.com/issues/70794
Signed-off-by: Kotresh HR <khiremat@redhat.com>
@kotreshhr kotreshhr force-pushed the wip-cow-old-inodes branch from cddbc55 to 57e1408 Compare May 29, 2025 08:33
@kotreshhr
Contributor Author

jenkins test make check

1 similar comment
@kotreshhr
Contributor Author

jenkins test make check

@vshankar
Contributor

vshankar commented Jun 2, 2025

@vshankar vshankar left a comment


Nice work @kotreshhr

@kotreshhr
Contributor Author
I thought we'd talked about having a config option to turn this behavior on or off? Looks like it's just making an unconditional change to all filesystems (including existing ones!) with the current code?

@gregsfortytwo This is now made configurable. Re-requesting review.

@kotreshhr
Contributor Author

@batrick ptal

@gregsfortytwo gregsfortytwo dismissed their stale review July 23, 2025 15:59

My concern about configuration was addressed!

Member

@batrick batrick left a comment


@batrick @vshankar I added the tests with journal flush. The results are a little surprising: it's not always a fixed number of old_inodes for /volumes or /volumes/group when root_snaps are involved. Please check the note here 8b60e67#diff-d3a7e3f3f24fff510b4d2a562b2093257b20b3908748c3432d14e460c449186bR513 where I tried to explain.

Okay, so this is purely an optimization at this point. I'm fine with it.

Contributor

@vshankar vshankar left a comment


Well done @kotreshhr

@vshankar vshankar merged commit 82603e9 into ceph:main Jul 24, 2025
12 checks passed