mds: Don't use global snaprealm's seq number for subvolume snapshots#62682
Conversation
gregsfortytwo
left a comment
I thought we'd talked about having a config option to turn this behavior on or off? Looks like it's just making an unconditional change to all filesystems (including existing ones!) with the current code?
Yes, that's pending. I will add it.
@vshankar the commit e9285e0 (marking 'ceph.dir.subvolume' only if the directory is empty) breaks the subvolume operations mentioned below.
Thanks for the quick turnaround @kotreshhr. I will have a look at the change.
Force-pushed 418691f to bb6573c.
vshankar
left a comment
Mostly looks fine. @kotreshhr We need to do a targeted test for this change. What options do we have?
I will explore that today. But I didn't find a good solution for the 3rd point mentioned in the comment #62682 (comment). @vshankar, do you have any thoughts?
How would this be done, since the constraint introduced in this change disallows marking a dir as a subvolume if it's not empty? Also, this needs to be done for all existing subvolumes, which involves scanning the volumes directory?
Exactly the problem — we have introduced a contradictory fix with this PR. The question is: how do we handle upgrades of subvolumes (legacy, v1) which don't have this xattr set?
I didn't quite get this question. This is required for existing subvolumes which didn't have the xattr set.
@vshankar and I discussed this in a meeting. Please find the discussion here. The enforcement of setting 'ceph.dir.subvolume' only on an empty directory (e9285e0) is logically correct, but it's not useful in the world of subvolumes. Each PVC is backed by a subvolume, and it's not possible to create a hardlink/rename outside it.
So I did the preliminary targeted testing with the fix. Since the current fix uses the subvolume realm's seq number for the subvolume inode and inodes under it, it prevents the blow-up of CoW'ed old inodes only for those. But for '/volumes/' and '/volumes/<group_dir>/', since the fix uses the global snaprealm's seq number, the number of CoW'ed old inodes explodes. So we are not fully solving the issue for the subvolume use case either. @vshankar and I discussed this in the meeting and decided to fully use the corresponding realm's sequence number and make the fix optional. I will refresh the PR with the following.
Force-pushed bb6573c to 1cbce89.
Done. Yet to test this. I will test and update.
Force-pushed 1cbce89 to 8186ad0.
Tested the fix locally and it works fine. We need to highlight the restriction of this fix: it is controlled by the config mds_use_global_snaprealm_seq_for_subvol and needs to be disabled only on cephfs volumes which are dedicated to subvolumes and not used for any other purposes. I have documented this in the option description here: c85a6fa#diff-2c5822fc1f3196e9d93cbfe46ccdc0796a449323d075cc562332fc3b51e974eaR1572
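For reference, toggling an MDS option like this is normally done with `ceph config` (a hedged sketch: the option name is taken from this PR, the commands assume a running cluster, and `mds.a` is a hypothetical daemon name):

```shell
# Disable the global-seq behaviour (i.e. enable the per-subvolume-realm
# seq optimization from this PR). Per the option description, do this
# only on filesystems dedicated to subvolume usage.
ceph config set mds mds_use_global_snaprealm_seq_for_subvol false

# Check the effective value on a specific MDS daemon:
ceph config get mds.a mds_use_global_snaprealm_seq_for_subvol
```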
jenkins test make check arm64
Force-pushed 8186ad0 to 99efb90.
jenkins test make check

jenkins test make check arm64
qa/suites/fs/workload/tasks/7-mds-use-global-snaprealm-seq-config (outdated, resolved review thread)
Removing the vxattr 'ceph.dir.subvolume' on a directory without it being set causes the MDS to crash. This is because the snaprealm would be null for the directory and the null check is missing. Setting the vxattr creates the snaprealm for the directory as part of the operation; hence the MDS doesn't crash when the vxattr is set and then removed. This patch fixes that.

Reproducer:

$ mkdir /mnt/dir1
$ setfattr -x "ceph.dir.subvolume" /mnt/dir1

Traceback:
-------
Core was generated by `./ceph/build/bin/ceph-mds -i a -c ./ceph/build/ceph.conf'.
Program terminated with signal SIGSEGV, Segmentation fault.
(gdb) bt
#0  0x00007f33f1aa8034 in __pthread_kill_implementation () from /lib64/libc.so.6
#1  0x00007f33f1a4ef1e in raise () from /lib64/libc.so.6
#2  0x0000562b148a6fd0 in reraise_fatal (signum=signum@entry=11) at /ceph/src/global/signal_handler.cc:88
#3  0x0000562b148a83d9 in handle_oneshot_fatal_signal (signum=11) at /ceph/src/global/signal_handler.cc:367
#4  <signal handler called>
#5  Server::handle_client_setvxattr (this=0x562b4ee3f800, mdr=..., cur=0x562b4ef9cc00) at /ceph/src/mds/Server.cc:6406
#6  0x0000562b145fadc2 in Server::handle_client_removexattr (this=0x562b4ee3f800, mdr=...) at /ceph/src/mds/Server.cc:7022
#7  0x0000562b145fbff0 in Server::dispatch_client_request (this=0x562b4ee3f800, mdr=...) at /ceph/src/mds/Server.cc:2825
#8  0x0000562b145fcfa2 in Server::handle_client_request (this=0x562b4ee3f800, req=...) at /ceph/src/mds/Server.cc:2676
#9  0x0000562b1460063c in Server::dispatch (this=0x562b4ee3f800, m=...) at /ceph/src/mds/Server.cc:382
#10 0x0000562b1450eb22 in MDSRank::handle_message (this=this@entry=0x562b4ef42008, m=...) at /ceph/src/mds/MDSRank.cc:1222
#11 0x0000562b14510c93 in MDSRank::_dispatch (this=this@entry=0x562b4ef42008, m=..., new_msg=new_msg@entry=true) at /ceph/src/mds/MDSRank.cc:1045
#12 0x0000562b14511620 in MDSRankDispatcher::ms_dispatch (this=this@entry=0x562b4ef42000, m=...) at /ceph/src/mds/MDSRank.cc:1019
#13 0x0000562b144ff117 in MDSDaemon::ms_dispatch2 (this=0x562b4ee64000, m=...) at /ceph/src/common/RefCountedObj.h:56
#14 0x00007f33f2f4974a in Messenger::ms_deliver_dispatch (this=0x562b4ee70000, m=...) at /ceph/src/msg/Messenger.h:746
#15 0x00007f33f2f467e2 in DispatchQueue::entry (this=0x562b4ee703b8) at /ceph/src/msg/DispatchQueue.cc:202
#16 0x00007f33f2ff61cb in DispatchQueue::DispatchThread::entry (this=<optimized out>) at /ceph/src/msg/DispatchQueue.h:101
#17 0x00007f33f2df4b5d in Thread::entry_wrapper (this=0x562b4ee70518) at /ceph/src/common/Thread.cc:87
#18 0x00007f33f2df4b6f in Thread::_entry_func (arg=<optimized out>) at /ceph/src/common/Thread.cc:74
#19 0x00007f33f1aa6088 in start_thread () from /lib64/libc.so.6
#20 0x00007f33f1b29f8c in clone3 () from /lib64/libc.so.6
---------

Fixes: https://tracker.ceph.com/issues/70794
Signed-off-by: Kotresh HR <khiremat@redhat.com>
Don't use the global snaprealm seq number while doing CoW on old inodes for the subvolume inode and inodes under it, i.e., for directories marked with the 'ceph.dir.subvolume' vxattr. This is safe because all hardlinks/renames are contained within the same subvolume snaprealm and don't cross subvolume snaprealms.

For the directories between / and the subvolume snapshot directory, use the global snaprealm seq to CoW the old inodes only if there is at least one snapshot taken.

The above behavior is made optional with the mds config 'mds_use_global_snaprealm_seq_for_subvol'. The option is enabled by default, which means the above behavior is disabled by default. The option should be disabled only on cephfs volumes used for the pure subvolume use case.

Fixes: https://tracker.ceph.com/issues/70794
Signed-off-by: Kotresh HR <khiremat@redhat.com>
Force-pushed 91913da to d83b589.
Rebased
jenkins test api
@vshankar @batrick Tried reproducing the original issue - https://tracker.ceph.com/issues/67102 locally by creating and deleting thousands of subvolume snapshots. The old inodes on the root and /volumes directory do increase, but come down to zero after a journal flush. The tracker information isn't helping much to reproduce. I will sync up with QE separately on reproducing this. I think the optimization done as part of this PR can go in while I work on the original issue separately.
That makes sense. I will have a look again and put the change through testing.
This PR is under test in https://tracker.ceph.com/issues/71431.
d83b589 to
cddbc55
Compare
@vshankar Fixed the snapshot cleanup failure discussed in the above QA tracker - https://tracker.ceph.com/issues/71431?issue_count=61&issue_position=3&next_issue_id=71428&prev_issue_id=71441#note-3
Fixes: https://tracker.ceph.com/issues/70794
Signed-off-by: Kotresh HR <khiremat@redhat.com>

Fixes: https://tracker.ceph.com/issues/70794
Signed-off-by: Kotresh HR <khiremat@redhat.com>

The snapshot data on any part of the subtree should be correct irrespective of enabling/disabling 'mds_use_global_snaprealm_seq_for_subvol', even with a directory marked as a subvolume.

Fixes: https://tracker.ceph.com/issues/70794
Signed-off-by: Kotresh HR <khiremat@redhat.com>
cddbc55 to
57e1408
Compare
jenkins test make check
@gregsfortytwo This is made configurable. Re-requesting review.

@batrick ptal
My concern about configuration was addressed!
batrick
left a comment
@batrick @vshankar I added the tests with journal flush. The results are a little surprising: it's not always a fixed number of old_inodes for /volumes or /volumes/group when root_snaps are involved. Please check the note here 8b60e67#diff-d3a7e3f3f24fff510b4d2a562b2093257b20b3908748c3432d14e460c449186bR513 where I tried to explain.
Okay, so this is purely an optimization at this point. I'm fine with it.
mds: Don't use global snaprealm seq for subvolumes
Don't use the global snaprealm seq number while doing CoW on old inodes which belong to a subvolume or are under it, i.e., for directories marked with the 'ceph.dir.subvolume' vxattr. This is safe because hardlinks and renames are not allowed beyond these subvolume directories, i.e., all hardlinks/renames are contained within the same subvolume snaprealm and don't cross subvolume snaprealms.
Fixes: https://tracker.ceph.com/issues/70794
Signed-off-by: Kotresh HR <khiremat@redhat.com>