mds: just skip checking the replicated ancestor inodes #57962
Conversation
batrick
left a comment
I think a test for this is actually simple to construct:
- revert venky's PR
- Test:
-- mkdir foo/ && pushd foo && for i in $(seq 1 1000); do mkdir $i ; done
-- flush all mds journals
-- trigger nest updates on the rank auth for dirfrag foo (making files in a subdir it's also auth for should work)
-- flush journal for the rank that is auth for the foo dirfrag, then verify the scrub error is generated
- Then verify your changes post-revert fix it
Sounds good, let me do it. Thanks @batrick

@batrick Added the test case for it and I can reproduce it successfully by using the steps in https://tracker.ceph.com/issues/57656#note-20.
batrick
left a comment
I prefer for the commit ordering that it be:
- revert Venky's change (necessary to re-introduce the bug)
- add the qa test (it should fail when ceph is built with this commit)
- add the fix commits (test should pass when built with these commits)
Okay, will do it.

Updated it and simplified the test.
self.fs.rank_asok(["flush_path", f"/{test_file}"], rank=0)

# Export testdir3 to mds.1
self.fs.rank_asok(["export", "dir", f"/{test_path3}", "1"], rank=0)
this should not work because test_path2 is pinned to rank 0? You should pin test_path3 instead. Then call wait_for_subtrees before continuing.
If this test fails now despite this export failing -- and I think it should -- then that indicates this export is not necessary after all?
Pinning test_path3 doesn't work. And we need to export test_path3 to rank 1 anyway, otherwise I couldn't reproduce it.
I have tried pinning both test_path3 and test_path2 to rank 0, and both fail the export:
2024-06-20T09:01:24.731+0800 7f2a1b200640 7 mds.0.mig export_dir Cannot export to mds.1 [dir 0x1000000024f /mydir1/mydir2/mydir3/ [2,head] auth{1=1} v=35 cv=35/35 state=1074003969|complete f(v0 m2024-06-20T08:59:10.490762+0800 13=1+12) n(v0 rc2024-06-20T08:59:10.490762+0800 14=2+12) hs=13+0,ss=0+0 | child=1 replicated=1 dirty=0 authpin=0 0x5648800d0400]: dir is export pinned
That is, unless there is no pinning anywhere in the path testdir1/testdir2/testdir3.
The reason we need to pin here is to make sure that, before the export, the auth for the whole path testdir1/testdir2/testdir3 is rank 0.
I think we can just pin them to rank 0 first and then remove the pinning before exporting.
Does that make sense?
> Pinning test_path3 doesn't work. And we need to export test_path3 to rank 1 anyway, otherwise I couldn't reproduce it.
> I have tried pinning both test_path3 and test_path2 to rank 0, and both fail the export:
> 2024-06-20T09:01:24.731+0800 7f2a1b200640 7 mds.0.mig export_dir Cannot export to mds.1 [dir 0x1000000024f /mydir1/mydir2/mydir3/ [2,head] auth{1=1} v=35 cv=35/35 state=1074003969|complete f(v0 m2024-06-20T08:59:10.490762+0800 13=1+12) n(v0 rc2024-06-20T08:59:10.490762+0800 14=2+12) hs=13+0,ss=0+0 | child=1 replicated=1 dirty=0 authpin=0 0x5648800d0400]: dir is export pinned
> That is, unless there is no pinning anywhere in the path testdir1/testdir2/testdir3.
If you pin test_path3, it should override the pin on test_path1. So this is confusing. I see you're also unpinning test_path1 but I'm not sure why that is necessary for your test.
Perhaps the issue with pinning test_path3 is that it triggers some lock updates to rank 1?
Perhaps instead of using pinning then: turn off the balancer and use the export asok command to move things as needed.
> If you pin test_path3, it should override the pin on test_path1.
Yeah, this is correct.
> So this is confusing. I see you're also unpinning test_path1 but I'm not sure why that is necessary for your test.
This is because I need to dirty the metadata in rank 0 first and then export test_path3 to rank 1 later. If I do not remove the pinning from its ancestors the export will fail. For more detail please see export_dir() --> dir->is_exportable(dest) --> get_export_pin() --> inode->get_export_pin(inherit):
5485 mds_rank_t CInode::get_export_pin(bool inherit) const
5486 {
5487 auto&& balancer = mdcache->mds->balancer;
5488 auto export_pin = balancer->get_bal_export_pin();
5489 if (!export_pin)
5490 return MDS_RANK_NONE;
5491
5492 /* An inode that is export pinned may not necessarily be a subtree root, we
5493 * need to traverse the parents. A base or system inode cannot be pinned.
5494 * N.B. inodes not yet linked into a dir (i.e. anonymous inodes) will not
5495 * have a parent yet.
5496 */
5497 mds_rank_t r_target = MDS_RANK_NONE;
5498 const CInode *in = this;
5499 const CDir *dir = nullptr;
5500 while (true) {
5501 if (in->is_system())
5502 break;
5503 const CDentry *pdn = in->get_parent_dn();
5504 if (!pdn)
5505 break;
5506 if (in->get_inode()->nlink == 0) {
5507 // ignore export pin for unlinked directory
5508 break;
5509 }
5510
5511 if (in->get_inode()->export_pin >= 0) {
5512 return in->get_inode()->export_pin;
...
If any of its ancestors is pinned to rank 0 the export will not be allowed; please see Line#5511 and Line#5512.
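The ancestor walk above can be sketched in a few lines; this is a simplified Python model (the Inode class and names here are illustrative, not the real MDS types) of how the nearest explicit export pin on the path decides where a subtree may live:

```python
# Hypothetical, simplified model of the ancestor walk in
# CInode::get_export_pin(): walk from the inode up through its parents
# and return the first explicit export pin (>= 0), which is why a pin
# anywhere on the ancestry blocks exporting the subtree elsewhere.
MDS_RANK_NONE = -1

class Inode:
    def __init__(self, name, export_pin=MDS_RANK_NONE, parent=None):
        self.name = name
        self.export_pin = export_pin
        self.parent = parent

def get_export_pin(inode):
    node = inode
    while node is not None:
        if node.export_pin >= 0:
            # analogous to Line#5511/#5512: the nearest explicit pin wins
            return node.export_pin
        node = node.parent
    return MDS_RANK_NONE

# testdir1 pinned to rank 0: testdir3 inherits that pin, so an
# "export dir" of testdir3 to rank 1 is refused ("dir is export pinned").
testdir1 = Inode("testdir1", export_pin=0)
testdir2 = Inode("testdir2", parent=testdir1)
testdir3 = Inode("testdir3", parent=testdir2)
assert get_export_pin(testdir3) == 0
```

Note that in this model an explicit pin on testdir3 itself would win over testdir1's pin, which matches the "it should override the pin on test_path1" expectation discussed above.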
> Perhaps the issue with pinning test_path3 is that it triggers some lock updates to rank 1?
No, the pinned dir is not allowed to be exported to a different rank.
> Perhaps instead of using pinning then: turn off the balancer and use the export asok command to move things as needed.
self.mount_a.run_shell_payload(f"cd {test_path3} && for i in `seq 1 32`; do mkdir dir_$i; done;")

# Flush path 'testdir3/file_0' and this will persist the backtrace to disk
self.fs.rank_asok(["flush_path", f"/{test_file}"], rank=0)
Can we not just flush rank 0's journal? I'm dubious about only flushing a specific path.
No, this won't work. I only need to flush testdir3/file_0; if we flush the whole journal it will also flush the ancestors' metadata.
Putting a comment I've been working on here for the last several days but at this point I don't want to lose it as I've been working on other things:
So this seems to be a key issue here. I'm not sure how in the original issue that testdir3/file_0 would otherwise have its backtrace flushed. In general, journal trimming will unscatter the nestlock on parents as it commits to disk. That should flush the version to the replicas. I think we're missing a subtlety here.
This makes me think that the attractive option of "ignoring replicated metadata with out-of-date version" may actually be hiding a genuine bug.
Xiubo: I didn't get your test to run without some modification. I've since made numerous changes in an effort to simplify reproduction and forgot what needed adjustment in the first place.
> In general, journal trimming will unscatter the nestlock on parents as it commits to disk. That should flush the version to the replicas. I think we're missing a subtlety here.
How about when the journal events of testdir3 and file_0 are in different segments, and only file_0's events are flushed?
I don't think that's possible but if you can construct an example that may be it.
BTW, does this patch work? Could you see the scrub error?
No, I don't see it because the nestlock is unscattered (gathered) during journal trimming. (I was playing with journal configs to get it to trim faster.) For that reason, I'm not really sure how this bug is reproduced and unfortunately we've garbage collected the original logs in the tracker ticket.
Okay, maybe we should add some debug logs somewhere.
flush_path is a dirty hack for a test to reproduce a problem arising from normal MDS operations. To my knowledge, it's not normally possible to flush an inode this way without also triggering a gather for parent inode's ifile/etc.
Again, we're missing something here that can only be explained with new debug logs with the original problem.
> flush_path is a dirty hack for a test to reproduce a problem arising from normal MDS operations. To my knowledge, it's not normally possible to flush an inode this way without also triggering a gather for parent inode's ifile/etc. Again, we're missing something here that can only be explained with new debug logs with the original problem.
The flush_path is simple:
void MDCache::flush_dentry_work(const MDRequestRef& mdr)
{
  MutationImpl::LockOpVec lov;
  CInode *in = mds->server->rdlock_path_pin_ref(mdr, true);
  if (!in)
    return;

  ceph_assert(in->is_auth());
  in->flush(new C_FinishIOMDR(mds, mdr));
}
And in this case rdlock_path_pin_ref() won't acquire any ancestor inodes' locks unless the lock_cache is set, so there is no chance to trigger the gathers for the parent inodes.
Did I miss something here?
qa/tasks/cephfs/test_scrub_checks.py
Outdated
# Scrub 'testdir3/file_0'; it will detect that the backtrace in memory is older and
# metadata damage will be reported via cluster health warnings.
out_json = self.fs.run_scrub(["start", f"/{test_file}", "recursive"])
I think you could do a recursive,force scrub on / and reproduce?
If we only scrub / it can also be reproduced, but we must remove the force option. Otherwise I couldn't reproduce it even after trying many times.
jenkins retest this please
self.fs.wait_for_daemons()

test_path1 = "testdir1"
test_path2 = "testdir1/testdir2"
Suggested change:
- test_path2 = "testdir1/testdir2"
+ test_path2 = f"{test_path1}/testdir2"
qa/tasks/cephfs/test_scrub_checks.py
Outdated
test_file = f"{test_path3}/file_0"

# Make sure all the ancestors' auth is in rank 0
self.mount_a.run_shell(["mkdir", test_path1])
Suggested change:
- self.mount_a.run_shell(["mkdir", test_path1])
+ self.mount_a.run_shell(["mkdir", "-p", test_path4])
Let's try to simplify as much as possible.
# Make sure all the ancestors' auth is in rank 0
self.mount_a.run_shell(["mkdir", test_path1])
self.mount_a.setfattr(test_path1, "ceph.dir.pin", "0") # no surprise migrations
self.mount_a.run_shell(["mkdir", "-p", test_path4])
Suggested change (remove this line):
- self.mount_a.run_shell(["mkdir", "-p", test_path4])

# Will trigger to load ancestor inodes in mds.1
self.mount_a.setfattr(test_path4, "ceph.dir.pin", "1")
# Will trigger to load ancestor inodes in mds.1
I don't understand this comment. test_path4 will eventually be exported because it's pinned, but not right now because it's empty.
The above comments are duplicated. Actually the following touch will trigger the load here. I will fix the above comment.
…ancestor inodes"

This reverts commit b98bb86.
Fixes: https://tracker.ceph.com/issues/57656
Signed-off-by: Xiubo Li <xiubli@redhat.com>

When the on-disk backtrace is divergent it will miss setting 'divergent'. At the same time remove the stale 'equivalent' parameter.
Fixes: https://tracker.ceph.com/issues/57656
Signed-off-by: Xiubo Li <xiubli@redhat.com>

Fixes: https://tracker.ceph.com/issues/57656
Signed-off-by: Xiubo Li <xiubli@redhat.com>

For the replicated ancestor inodes the backtrace on disk may be newer than the one in the replica MDS' memory. We should just ignore these ancestor inodes and only compare the inodes whose auth is the local MDS.
Fixes: https://tracker.ceph.com/issues/57656
Signed-off-by: Xiubo Li <xiubli@redhat.com>
This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!
For the replicated ancestor inodes the backtrace on disk may be newer
than the one in the replica MDS' memory. We should just ignore these
ancestor inodes and only compare the inodes whose auth is the local
MDS.

Fixes: https://tracker.ceph.com/issues/57656
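The rule this fix boils down to can be sketched as a tiny model (the function name and verdict strings here are hypothetical, not the actual scrub code): only compare backtrace versions for inodes the local MDS is auth for, and skip replicated ancestors whose on-disk backtrace may legitimately be newer than the replica's in-memory copy:

```python
def check_backtrace(in_memory_version, ondisk_version, is_auth):
    """Return a scrub verdict for one inode's backtrace (simplified model).

    For a replicated (non-auth) ancestor the on-disk backtrace can be
    newer than the replica's in-memory copy, which is not damage, so it
    is skipped rather than reported.
    """
    if not is_auth:
        return "skipped"      # replicated ancestor: do not compare at all
    if ondisk_version > in_memory_version:
        return "divergent"    # genuine mismatch on the auth MDS
    return "ok"

# The false positive this PR fixes: a replica holding an older in-memory
# version than what was already flushed to disk by the auth MDS.
assert check_backtrace(in_memory_version=35, ondisk_version=36, is_auth=False) == "skipped"
assert check_backtrace(in_memory_version=35, ondisk_version=36, is_auth=True) == "divergent"
```

This is only a model of the skip-replicas rule; the real check lives in the MDS scrub path touched by the commits above.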