
mds: just skip checking the replicated ancestor inodes#57962

Closed
lxbsz wants to merge 4 commits into ceph:main from lxbsz:wip-57656

Conversation

@lxbsz
Member

@lxbsz lxbsz commented Jun 11, 2024

For the replicated ancestor inodes the backtrace on disk may be newer
than the one in the replica MDS' memory. We should just ignore these
ancestor inodes and only compare the inodes whose auth is the local
MDS.

Fixes: https://tracker.ceph.com/issues/57656
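The intended behavior can be modeled with a small sketch (illustrative only; the class and function names below are hypothetical, not the actual MDS data structures):

```python
from dataclasses import dataclass

@dataclass
class InodeState:
    name: str
    is_auth: bool      # True if the local MDS is the authoritative holder
    mem_version: int   # backtrace version cached in this MDS's memory
    disk_version: int  # backtrace version persisted on disk

def check_backtrace(ancestors):
    """Flag divergence only for inodes this MDS is auth for; for
    replicated ancestors the on-disk backtrace may legitimately be
    newer than the local replica's copy, so they are skipped."""
    damaged = []
    for ino in ancestors:
        if not ino.is_auth:
            continue  # replicated ancestor: skip the check
        if ino.mem_version != ino.disk_version:
            damaged.append(ino.name)
    return damaged
```

With this rule a replica whose cached backtrace lags the on-disk version is not reported, while the same mismatch on an auth inode still is.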

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows
  • jenkins test rook e2e

@lxbsz lxbsz added the cephfs Ceph File System label Jun 11, 2024
@lxbsz lxbsz requested review from a team, batrick and vshankar June 11, 2024 05:39
Member

@batrick batrick left a comment

I think a test for this is actually simple to construct:

  • revert venky's PR
  • Test:
    -- mkdir foo/ && pushd foo && for i in $(seq 1 1000); do mkdir $i ; done
    -- flush all mds journals
    -- trigger nest updates on the rank auth for dirfrag foo (making files in a subdir it's also auth for should work)
    -- flush journal for rank that is auth for foo dirfrag
  • verify the scrub error is generated
  • Then verify your changes post-revert fix it
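The steps above can be sketched as a script. Running it end to end requires a multi-rank Ceph cluster with CephFS mounted at `$MNT`; the `ceph tell` invocations are shown as comments (standard admin commands, but untested here), and only the directory setup runs locally:

```shell
#!/bin/sh
# Reproduction sketch for https://tracker.ceph.com/issues/57656.
# $MNT would be the CephFS mount point; a temp dir stands in for a dry run.
MNT="${MNT:-$(mktemp -d)}"

# 1. Create 1000 subdirectories under foo/.
mkdir -p "$MNT/foo"
(cd "$MNT/foo" && for i in $(seq 1 1000); do mkdir "$i"; done)

# 2. Flush all MDS journals, e.g.:
#      ceph tell mds.* flush journal
# 3. Trigger nest updates on the auth rank for dirfrag foo/ (creating
#    files in a subdirectory it is also auth for should work).
# 4. Flush the journal only for the rank that is auth for the foo/ dirfrag.
# 5. Start a scrub and verify the backtrace damage is reported:
#      ceph tell mds.0 scrub start / recursive

echo "created $(ls "$MNT/foo" | wc -l | tr -d ' ') dirs"
```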

@lxbsz
Member Author

lxbsz commented Jun 12, 2024

I think a test for this is actually simple to construct:

  • revert venky's PR
  • Test:
    -- mkdir foo/ && pushd foo && for i in $(seq 1 1000); do mkdir $i ; done
    -- flush all mds journals
    -- trigger nest updates on the rank auth for dirfrag foo (making files in a subdir it's also auth for should work)
    -- flush journal for rank that is auth for foo dirfrag
  • verify the scrub error is generated
  • Then verify your changes post-revert fix it

Sounds good, let me do it. Thanks @batrick

@github-actions github-actions bot added the tests label Jun 13, 2024
@lxbsz
Member Author

lxbsz commented Jun 13, 2024

I think a test for this is actually simple to construct:

  • revert venky's PR
  • Test:
    -- mkdir foo/ && pushd foo && for i in $(seq 1 1000); do mkdir $i ; done
    -- flush all mds journals
    -- trigger nest updates on the rank auth for dirfrag foo (making files in a subdir it's also auth for should work)
    -- flush journal for rank that is auth for foo dirfrag
  • verify the scrub error is generated
  • Then verify your changes post-revert fix it

@batrick Added the test case for it and I can reproduce it successfully by using the steps in https://tracker.ceph.com/issues/57656#note-20.

@lxbsz lxbsz force-pushed the wip-57656 branch 2 times, most recently from df14cf5 to 3457774 on June 13, 2024 at 12:33
Member

@batrick batrick left a comment

I prefer for the commit ordering that it be:

  • revert Venky's change (necessary to re-introduce the bug)
  • add the qa test (it should fail when ceph is built with this commit)
  • add the fix commits (test should pass when built with these commits)

@lxbsz
Member Author

lxbsz commented Jun 14, 2024

I prefer for the commit ordering that it be:

  • revert Venky's change (necessary to re-introduce the bug)
  • add the qa test (it should fail when ceph is built with this commit)
  • add the fix commits (test should pass when built with these commits)

Okay, will do it.

@lxbsz
Member Author

lxbsz commented Jun 14, 2024

Updated it and simplified the test.

@lxbsz lxbsz force-pushed the wip-57656 branch 5 times, most recently from 5b27b7c to 0a3ab89 on June 17, 2024 at 02:30
@lxbsz
Member Author

lxbsz commented Jun 17, 2024

self.fs.rank_asok(["flush_path", f"/{test_file}"], rank=0)

# Export testdir3 to mds.1
self.fs.rank_asok(["export", "dir", f"/{test_path3}", "1"], rank=0)
Member

this should not work because test_path2 is pinned to rank 0? You should pin test_path3 instead. Then call wait_for_subtrees before continuing.

Member

If this test fails now despite this export failing -- and I think it should -- then that indicates this export is not necessary after all?

Member Author

@lxbsz lxbsz Jun 20, 2024

Pinning test_path3 doesn't work. And we need to export test_path3 to rank 1 anyway; otherwise I couldn't reproduce it.

I have tried pinning both test_path3 and test_path2 to rank 0; in both cases the export fails:

2024-06-20T09:01:24.731+0800 7f2a1b200640 7 mds.0.mig export_dir Cannot export to mds.1 [dir 0x1000000024f /mydir1/mydir2/mydir3/ [2,head] auth{1=1} v=35 cv=35/35 state=1074003969|complete f(v0 m2024-06-20T08:59:10.490762+0800 13=1+12) n(v0 rc2024-06-20T08:59:10.490762+0800 14=2+12) hs=13+0,ss=0+0 | child=1 replicated=1 dirty=0 authpin=0 0x5648800d0400]: dir is export pinned

It only works when there is no pinning anywhere along the path testdir1/testdir2/testdir3.

The reason we need to pin here is to make sure that, before exporting, the whole path testdir1/testdir2/testdir3 is auth on rank 0.

I think we can just pin them to rank 0 first and then remove the pinning before exporting.

Does that make sense?

Member

Pinning test_path3 doesn't work. And we need to export test_path3 to rank 1 anyway; otherwise I couldn't reproduce it.

I have tried pinning both test_path3 and test_path2 to rank 0; in both cases the export fails:

2024-06-20T09:01:24.731+0800 7f2a1b200640 7 mds.0.mig export_dir Cannot export to mds.1 [dir 0x1000000024f /mydir1/mydir2/mydir3/ [2,head] auth{1=1} v=35 cv=35/35 state=1074003969|complete f(v0 m2024-06-20T08:59:10.490762+0800 13=1+12) n(v0 rc2024-06-20T08:59:10.490762+0800 14=2+12) hs=13+0,ss=0+0 | child=1 replicated=1 dirty=0 authpin=0 0x5648800d0400]: dir is export pinned

It only works when there is no pinning anywhere along the path testdir1/testdir2/testdir3.

If you pin test_path3, it should override the pin on test_path1. So this is confusing. I see you're also unpinning test_path1 but I'm not sure why that is necessary for your test.

Perhaps the issue with pinning test_path3 is that it triggers some lock updates to rank 1?

Perhaps instead of using pinning then: turn off the balancer and use the export asok command to move things as needed.

Member Author

Pinning test_path3 doesn't work. And we need to export test_path3 to rank 1 anyway; otherwise I couldn't reproduce it.
I have tried pinning both test_path3 and test_path2 to rank 0; in both cases the export fails:
2024-06-20T09:01:24.731+0800 7f2a1b200640 7 mds.0.mig export_dir Cannot export to mds.1 [dir 0x1000000024f /mydir1/mydir2/mydir3/ [2,head] auth{1=1} v=35 cv=35/35 state=1074003969|complete f(v0 m2024-06-20T08:59:10.490762+0800 13=1+12) n(v0 rc2024-06-20T08:59:10.490762+0800 14=2+12) hs=13+0,ss=0+0 | child=1 replicated=1 dirty=0 authpin=0 0x5648800d0400]: dir is export pinned
It only works when there is no pinning anywhere along the path testdir1/testdir2/testdir3.

If you pin test_path3, it should override the pin on test_path1.

Yeah, this is correct.

So this is confusing. I see you're also unpinning test_path1 but I'm not sure why that is necessary for your test.

This is because I need to dirty the metadata on rank 0 first and then export test_path3 to rank 1 later. If I do not remove the pin from its ancestor, the export will fail. For more detail see export_dir() --> dir->is_exportable(dest) --> get_export_pin() --> inode->get_export_pin(inherit):

5485 mds_rank_t CInode::get_export_pin(bool inherit) const
5486 { 
5487   auto&& balancer = mdcache->mds->balancer;
5488   auto export_pin = balancer->get_bal_export_pin();
5489   if (!export_pin)
5490     return MDS_RANK_NONE;
5491   
5492   /* An inode that is export pinned may not necessarily be a subtree root, we
5493    * need to traverse the parents. A base or system inode cannot be pinned.
5494    * N.B. inodes not yet linked into a dir (i.e. anonymous inodes) will not
5495    * have a parent yet.
5496    */
5497   mds_rank_t r_target = MDS_RANK_NONE;
5498   const CInode *in = this;
5499   const CDir *dir = nullptr;
5500   while (true) {
5501     if (in->is_system())
5502       break;
5503     const CDentry *pdn = in->get_parent_dn(); 
5504     if (!pdn)
5505       break;
5506     if (in->get_inode()->nlink == 0) {  
5507       // ignore export pin for unlinked directory
5508       break;
5509     }
5510   
5511     if (in->get_inode()->export_pin >= 0) {
5512       return in->get_inode()->export_pin; 
...

If any of its ancestors is pinned to rank 0 the export will not be allowed; see Line#5511 and Line#5512.
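The ancestor walk above can be modeled with a short sketch (a simplified, hypothetical model of CInode::get_export_pin(); the real function also handles is_system(), nlink == 0, and dirfrags):

```python
MDS_RANK_NONE = -1

def get_export_pin(inode):
    """Walk from the inode up through its ancestors (dicts with an
    'export_pin' rank and a 'parent' link) and return the first
    explicit export pin found, or MDS_RANK_NONE."""
    while inode is not None:
        if inode.get("export_pin", MDS_RANK_NONE) >= 0:
            return inode["export_pin"]
        inode = inode.get("parent")
    return MDS_RANK_NONE

# Pinning testdir1 to rank 0 makes every descendant resolve to rank 0,
# so exporting testdir3 to rank 1 is refused ("dir is export pinned")
# until the ancestor pin is removed.
testdir1 = {"export_pin": 0, "parent": None}
testdir2 = {"export_pin": MDS_RANK_NONE, "parent": testdir1}
testdir3 = {"export_pin": MDS_RANK_NONE, "parent": testdir2}
```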

Perhaps the issue with pinning test_path3 is that it triggers some lock updates to rank 1?

No, the pinned dir is not allowed to be exported to a different rank.

Perhaps instead of using pinning then: turn off the balancer and use the export asok command to move things as needed.

self.mount_a.run_shell_payload(f"cd {test_path3} && for i in `seq 1 32`; do mkdir dir_$i; done;")

# Flush path 'testdir3/file_0' and this will persist the backtrace to disk
self.fs.rank_asok(["flush_path", f"/{test_file}"], rank=0)
Member

Can we not just flush rank 0's journal? I'm dubious about only flushing a specific path.

Member Author

No, that won't work. I only need to flush testdir3/file_0; if we flush the whole journal it will also flush the ancestors' backtraces.
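The distinction can be sketched with a toy model (hypothetical names; not the MDS code -- just the difference between flushing the whole journal and flushing one path):

```python
def flush_journal(dirty, disk):
    """Flushing the whole journal persists every dirty inode's
    backtrace, including the ancestors'."""
    disk.update(dirty)
    dirty.clear()

def flush_path(path, dirty, disk):
    """flush_path persists only the named inode's backtrace, leaving
    the ancestors' dirty backtraces untouched -- which is what the
    test needs to create the version skew."""
    if path in dirty:
        disk[path] = dirty.pop(path)

# file_0 and its ancestor both have a newer in-memory backtrace (v2)
# than what is on disk (v1).
dirty = {"testdir3": 2, "file_0": 2}
disk = {"testdir3": 1, "file_0": 1}
flush_path("file_0", dirty, disk)  # only file_0 reaches disk
```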

Member

Putting a comment I've been working on here for the last several days but at this point I don't want to lose it as I've been working on other things:

So this seems to be a key issue here. I'm not sure how in the original issue that testdir3/file_0 would otherwise have its backtrace flushed. In general, journal trimming will unscatter the nestlock on parents as it commits to disk. That should flush the version to the replicas. I think we're missing a subtlety here.

This makes me think that the attractive option of "ignoring replicated metadata with out-of-date version" may actually be hiding a genuine bug.

Xiubo: I didn't get your test to run without some modification. I've since made numerous changes in an effort to simplify reproduction and forgot what needed adjustment in the first place.

Member Author

Putting a comment I've been working on here for the last several days but at this point I don't want to lose it as I've been working on other things:

So this seems to be a key issue here. I'm not sure how in the original issue that testdir3/file_0 would otherwise have its backtrace flushed. In general, journal trimming will unscatter the nestlock on parents as it commits to disk. That should flush the version to the replicas. I think we're missing a subtlety here.

This makes me think that the attractive option of "ignoring replicated metadata with out-of-date version" may actually be hiding a genuine bug.

Xiubo: I didn't get your test to run without some modification. I've since made numerous changes in an effort to simplify reproduction and forgot what needed adjustment in the first place.

What if the journal events for testdir3 and file_0 are in different segments, and only file_0's events are flushed?

Member

I don't think that's possible but if you can construct an example that may be it.

Member Author

@lxbsz lxbsz Jul 3, 2024

BTW, does this patch work? Could you see the scrub error?

Member

No, I don't see it because the nestlock is unscattered (gathered) during journal trimming. (I was playing with journal configs to get it to trim faster.) For that reason, I'm not really sure how this bug is reproduced and unfortunately we've garbage collected the original logs in the tracker ticket.

Member Author

Okay, maybe we should add some debug logs somewhere.

Member

flush_path is a dirty hack for a test to reproduce a problem arising from normal MDS operations. To my knowledge, it's not normally possible to flush an inode this way without also triggering a gather for parent inode's ifile/etc.

Again, we're missing something here that can only be explained with new debug logs with the original problem.

Member Author

flush_path is a dirty hack for a test to reproduce a problem arising from normal MDS operations. To my knowledge, it's not normally possible to flush an inode this way without also triggering a gather for parent inode's ifile/etc.

Again, we're missing something here that can only be explained with new debug logs with the original problem.

The flush_path is simple:

void MDCache::flush_dentry_work(const MDRequestRef& mdr)
{
  MutationImpl::LockOpVec lov;
  CInode *in = mds->server->rdlock_path_pin_ref(mdr, true);
  if (!in)
    return;

  ceph_assert(in->is_auth());
  in->flush(new C_FinishIOMDR(mds, mdr));
}

And in this case rdlock_path_pin_ref() won't acquire any of the ancestor inodes' locks unless lock_cache is set, so there is no chance of triggering the gathers for the parent inodes.

Did I miss something here?


# Scrub 'testdir3/file_0'; it will detect that the backtrace in memory is older, and
# metadata damage will be reported via cluster health warnings.
out_json = self.fs.run_scrub(["start", f"/{test_file}", "recursive"])
Member

I think you could do a recursive,force scrub on / and reproduce?

Member Author

Scrubbing only / can also reproduce it, but then the force option must be removed; with force I couldn't reproduce it despite many attempts.

@lxbsz
Member Author

lxbsz commented Jun 20, 2024

jenkins retest this please

self.fs.wait_for_daemons()

test_path1 = "testdir1"
test_path2 = "testdir1/testdir2"
Member

Suggested change
test_path2 = "testdir1/testdir2"
test_path2 = f"{test_path1}/testdir2"

Member Author

Will fix it.

test_file = f"{test_path3}/file_0"

# Make sure all the ancestors' auth is in rank 0
self.mount_a.run_shell(["mkdir", test_path1])
Member

Suggested change
self.mount_a.run_shell(["mkdir", test_path1])
self.mount_a.run_shell(["mkdir", "-p", test_path4])

Let's try to simplify as much as possible.

Member Author

Sure.

# Make sure all the ancestors' auth is in rank 0
self.mount_a.run_shell(["mkdir", test_path1])
self.mount_a.setfattr(test_path1, "ceph.dir.pin", "0") # no surprise migrations
self.mount_a.run_shell(["mkdir", "-p", test_path4])
Member

Suggested change
self.mount_a.run_shell(["mkdir", "-p", test_path4])


# Will trigger to load ancestor inodes in mds.1
self.mount_a.setfattr(test_path4, "ceph.dir.pin", "1")
# Will trigger to load ancestor inodes in mds.1
Member

I don't understand this comment. test_path4 will eventually be exported because it's pinned, but not right now because it's empty.

Member Author

The comments above are duplicated. Actually it is the following touch that triggers the load here. I will fix the comment.

lxbsz added 2 commits July 29, 2024 13:51
…ancestor inodes"

This reverts commit b98bb86.

Fixes: https://tracker.ceph.com/issues/57656
Signed-off-by: Xiubo Li <xiubli@redhat.com>
When the backtrace on disk is divergent, it misses setting 'divergent'.

At the same time, remove the stale 'equivalent' parameter.

Fixes: https://tracker.ceph.com/issues/57656
Signed-off-by: Xiubo Li <xiubli@redhat.com>
lxbsz added 2 commits July 29, 2024 14:03
For the replicated ancestor inodes the backtrace on disk may be newer
than the one in the replica MDS' memory. We should just ignore these
ancestor inodes and only compare the inodes whose auth is the local
MDS.

Fixes: https://tracker.ceph.com/issues/57656
Signed-off-by: Xiubo Li <xiubli@redhat.com>
@github-actions

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@github-actions github-actions bot added the stale label Dec 20, 2024
@vshankar vshankar requested a review from batrick January 15, 2025 16:45
@vshankar vshankar removed the stale label Jan 15, 2025
@github-actions

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@github-actions github-actions bot added the stale label Jun 21, 2025
@github-actions

This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!

@github-actions github-actions bot closed this Jul 21, 2025