
mds: for an irreparable hard link scrub is not able to identify the damage (#61173)

Merged: vshankar merged 3 commits into ceph:main from sajibreadd:wip-69345 on Sep 29, 2025

Conversation

@sajibreadd (Member) commented Dec 24, 2024

Fixes: https://tracker.ceph.com/issues/69345
Signed-off-by: Md Mahamudur Rahaman Sajib mahamudur.sajib@croit.io


@sajibreadd sajibreadd assigned vshankar and ifed01 and unassigned vshankar and ifed01 Dec 24, 2024
@github-actions github-actions bot added the cephfs Ceph File System label Dec 24, 2024
@sajibreadd sajibreadd force-pushed the wip-69345 branch 3 times, most recently from b960918 to f8348ed Compare December 28, 2024 00:26
@sajibreadd (Member Author)

jenkins test make check

@gregsfortytwo (Member)

Can you please explain (in the commit message) how you're cleaning up these remote dentries in scrub? How can you validate the link validity if it's linked to an inode where the auth is a remote server?

@sajibreadd (Member Author) commented Jan 7, 2025

> Can you please explain (in the commit message) how you're cleaning up these remote dentries in scrub? How can you validate the link validity if it's linked to an inode where the auth is a remote server?

@gregsfortytwo Commit message is updated. I mostly followed the code already written in the MDS, understanding it at a high level (I don't have much experience with CephFS). Feel free to share any insights you have.

@sajibreadd (Member Author)

@gregsfortytwo Are you reviewing it?

@vshankar vshankar requested a review from a team January 13, 2025 08:04
@vshankar (Contributor)

> Can you please explain (in the commit message) how you're cleaning up these remote dentries in scrub? How can you validate the link validity if it's linked to an inode where the auth is a remote server?
>
> @gregsfortytwo Commit message is updated. Mostly followed the code already written in mds by understanding it from high level (not much experience with cephFS). Feel free to share any insights you found.

So, the issue, if I understand it, is that once the primary inode is gone (corrupted, etc.), scrubbing the secondary inode isn't detecting the damage, right? If that's the case, there are two options: a) scrub the tree where the primary inode resides, or b) when scrub encounters a remote link, try to open the link, which would detect the faulty inode and mark it as damaged. What this change implements is option (b), right?

@sajibreadd (Member Author) commented Jan 13, 2025

> Can you please explain (in the commit message) how you're cleaning up these remote dentries in scrub? How can you validate the link validity if it's linked to an inode where the auth is a remote server?
>
> @gregsfortytwo Commit message is updated. Mostly followed the code already written in mds by understanding it from high level (not much experience with cephFS). Feel free to share any insights you found.
>
> So, the issue if I understand it is that once the primary inode is gone (corrupted, gone, etc..), scrubbing the secondary inode isn't detecting the damage, right? If that's the case, there are two options: a) scrub the tree where the primary inode resides or b) when scrub encounters a remote link, try to open the link which would detect the faulty inode and mark the inode as damaged. What this change implements, is option (b), right?

Yes, but if the repair flag is on, it will just clean up the dentry and clear the damage (I'm not absolutely confident about this part of my code; I just followed the server's code for removing a remote link). Otherwise, every time a user stats or otherwise hits that inode, it will create a damage message.

@vshankar (Contributor)

> Can you please explain (in the commit message) how you're cleaning up these remote dentries in scrub? How can you validate the link validity if it's linked to an inode where the auth is a remote server?
>
> @gregsfortytwo Commit message is updated. Mostly followed the code already written in mds by understanding it from high level (not much experience with cephFS). Feel free to share any insights you found.
>
> So, the issue if I understand it is that once the primary inode is gone (corrupted, gone, etc..), scrubbing the secondary inode isn't detecting the damage, right? If that's the case, there are two options: a) scrub the tree where the primary inode resides or b) when scrub encounters a remote link, try to open the link which would detect the faulty inode and mark the inode as damaged. What this change implements, is option (b), right?
>
> Yes, but if repair flag is on, it will just clean up the dentry and cleanup damage (This part I am not absolutely confident about my code, I just followed the server's code of removing remote link). Otherwise, everytime user stat or hit that inode it will create a damage message.

[cc @mchangir who has worked on cephfs scrub lately]

From the commit message of this change:

> for a remote link, tried to open the dentry(this part of code is copied from the mds path_traversal)
> which internally push the damage in the damage list if applicable. For damaged remote link, a callback
> context is called if repair flag is on, which clean up the dentry. To clean up the dentry unlinked it
> marked the dentry dirty and unlinked it from the mdcache(followed the code from Server::_link_remote_finish)

I'm not 100% sure what's being done here. From my understanding, the scrub code can (try to) look up the remote inode for any damage and add it to the damage table. To repair the damage, the scrub command should be forwarded to the inode's auth MDS.

@sajibreadd (Member Author) commented Jan 13, 2025

> I'm not 100% what's being done here. From my understanding, scrub code can (try to) lookup the remote inode for any damage and add it to the damage table. To repair a damage, the scrub command should be forwarded to the inode auth mds.

As for this dentry, it is not pointing to any inode (more specifically, dnl->get_inode() returns a null pointer), so how do I get the auth MDS? That's why I marked the dentry dirty, which eventually removes the omap entry (*_head) from the omap of the object associated with the parent directory. Can you also check the reproduction steps described in the tracker?
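For context, here is a small standalone sketch (not MDS code; the naming conventions follow the CephFS on-disk format) of the omap entry being discussed: a directory fragment lives as a RADOS object in the metadata pool, and each head dentry is one omap key of that object, so removing the key is what dropping the dentry amounts to on disk.

```python
def dirfrag_object_name(dir_ino: int, frag: int = 0) -> str:
    """Name of the metadata-pool RADOS object holding a dirfrag's dentries."""
    return "{:x}.{:08x}".format(dir_ino, frag)

def dentry_omap_key(name: str) -> str:
    """omap key under which the head (non-snapshot) dentry is stored."""
    return name + "_head"

# A first regular directory typically gets an inode like 0x10000000000:
print(dirfrag_object_name(0x10000000000))   # "10000000000.00000000"
print(dentry_omap_key("test_file.txt"))     # "test_file.txt_head"
```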

@vshankar (Contributor)

> I'm not 100% what's being done here. From my understanding, scrub code can (try to) lookup the remote inode for any damage and add it to the damage table. To repair a damage, the scrub command should be forwarded to the inode auth mds.
>
> As for this dentry, it is not pointing to any inode, more specifically dnl->get_inode() is returning null pointer, so how do I get the auth mds?

That's because it's a remote inode. If you look at the definitions in CDentry:

    // dentry type is primary || remote || null
    // inode ptr is required for primary, optional for remote, undefined for null
    bool is_primary() const { return remote_ino == 0 && inode != 0; }
    bool is_remote() const { return remote_ino > 0; }
    bool is_null() const { return remote_ino == 0 && inode == 0; }

A remote dentry will have its remote_ino field set to the inode number of the primary. So, to get the auth, you'd need to fetch the inode into the MDCache (it could be the auth inode or a replica) and then find the auth MDS.

> That's why I marked the dentry dirty which eventually remove omap entry(*_head) from the omap of the object associated with the parent directory.

This is the part which is not 100% clear to me. Why do we want to remove the omap entry? Is that the way the repair should be done? You are losing the linkage altogether by doing that.

> Can you also check the reproduction step described in the tracker?

Yes, I did. The steps show that once a scrub is done, the damage list is not populated. If the scrub is done on a directory where the secondary link is present and the primary link is in a different directory, then, as you pointed out, scrub does not do anything. But if scrub is run on root, then the damage list should be populated.

So, the question is: if scrub is run on a secondary hardlink directory with the repair flag, what should be done? Should it repair the damaged (primary) inode, which is in another directory? Scrub should at least identify the damage, IMO.

cc @mchangir

@sajibreadd (Member Author) commented Jan 15, 2025

> So, the question is, if scrub is run on a secondary hardlink directory with repair flag, what should be done? Should it repair the damaged (primary) inode which is in another directory? Scrub should at least identify the damage IMO.

@vshankar thanks for explaining. Even if we scrub the primary inode, it will not recover (in the test case I deleted the object; let me know if I am missing something). Then how should the user get out of this loop? Every time the user stats that hardlink (which they assume exists), it will create damage. Let's say we identify the damage; what should be the way to get rid of it? That's why I was talking about removing the dentry from the dentry list of its parent directory (I need to study a bit). If there is any manual way of getting out of this damage, can you share it?

@vshankar (Contributor)

> So, the question is, if scrub is run on a secondary hardlink directory with repair flag, what should be done? Should it repair the damaged (primary) inode which is in another directory? Scrub should at least identify the damage IMO.
>
> @vshankar thanks for explaining. Even we scrub the primary inode, it will not recover (As from the test case I deleted the object, let me know if I am missing something). Then how should user get out of this loop.

There are damages that scrub can fix and some that it cannot. When you remove the inode object itself, the only way to recover is to get the inode back by scanning the data pool, reading backtraces, and recreating the dentry linkages. This is performed using the cephfs-data-scan tool. See: https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/#recovery-from-missing-metadata-objects

> Because everytime user will stat that hardlink(which they assume it exists), it will create damage. Let's say we identify the damage, but what should be the way to get rid of this damage.

Right. So, it's either via scrub (w/ repair flag), or, if the damage isn't fixable by scrub, recover the metadata using the repair tools.

@sajibreadd (Member Author) commented Jan 15, 2025

> Right. So, its either via scrub (w/ repair flag), or if the damage ins't fixable by scrub, then recover the metadata using the repair tools.

So in your opinion, we shouldn't clean up/repair it this way, right? Then I will remove the repair part and keep the identification part.

@sajibreadd (Member Author)

> Right. So, its either via scrub (w/ repair flag), or if the damage ins't fixable by scrub, then recover the metadata using the repair tools.

Also, do you think rm <hardlink> should work here (without using the repair tool)?

@vshankar (Contributor) commented Jan 15, 2025

> Right. So, its either via scrub (w/ repair flag), or if the damage ins't fixable by scrub, then recover the metadata using the repair tools.
>
> Also do you think rm <hardlink> should work here?(without using repair tool)

That does a path-walk on the MDS, and if the inode is missing, then all bets are off...

@patrakov

> Right. So, its either via scrub (w/ repair flag), or if the damage ins't fixable by scrub, then recover the metadata using the repair tools.
>
> Also do you think rm <hardlink> should work here?(without using repair tool)
>
> That does a path-walk on the MDS and if the inode is missing, then all bets are off...

Speaking on behalf of the affected user.

We know where the original inode was, and that directory has been deleted, presumably by a script. So we know that the hardlink is truly dangling.

I personally don't care about the exact way we should follow to clear the damage, i.e., so that "cat log.run.1" says "No such file or directory", not "I/O error". It can be rm; it can be some other tool or procedure. But this procedure should exist.

If this helps, I would advise @sajibreadd to split this MR into two:

  1. This MR should add detection of this kind of damage so that the scrub finds all instances of it. Right now, without the patch, it finds none. No attempts to repair should be added to the scrub code, as the scrub cannot know whether the original inode exists or not.
  2. A separate MR (possibly to the documentation only) that outlines a path for cleaning out the known dangling hardlinks as if they never existed.

@sajibreadd (Member Author) commented Jan 15, 2025

@vshankar Should it repair the damaged (primary) inode which is in another directory (if possible, with the repair flag)? Should we do that, or keep just the identification part in this PR?

@mchangir (Contributor)

> @vshankar Should it repair the damaged (primary) inode which is in another directory(if possible with repair flag)? Should we do that or keep just the identification part in this PR?

In my opinion, scrub should not attempt to repair things from secondary access paths. If anything, only the path to the primary should be dumped. But if the inode itself is missing, then we won't have the backtrace to the original path either. So, Venky's suggestion of rm <hardlink> via an appropriate tool should suffice to heal the damage.

@sajibreadd (Member Author) commented Jan 16, 2025

> In my opinion, scrub should not attempt to repair things from secondary access paths. If at all, only the path to the primary path should be dumped. But if the inode itself is missing then we won't have the backtrace to the original path either. So, Venky's suggestion to rm <hardlink> via appropriate tool should suffice to heal the damage.

So, I will just keep the identification part; that means after a scrub, the user will get these broken links in the damage list.
According to this comment (#61173 (comment)), the recovery tool is not helping to remove that dangling link, as the primary inode's metadata object is deleted/lost. Then how can we clean up the damage? Are you referring to fixing the recovery tool so that we can heal this case? (Sorry, I don't know much about the recovery tool; I need to study.)

@sajibreadd (Member Author) commented Feb 13, 2025

@vshankar @mchangir @kotreshhr In general, I have a question:

void MDCache::open_remote_dentry(CDentry *dn, bool projected, MDSContext *fin, bool want_xlocked)
{
  dout(10) << "open_remote_dentry " << *dn << dendl;
  CDentry::linkage_t *dnl = projected ? dn->get_projected_linkage() : dn->get_linkage();
  inodeno_t ino = dnl->get_remote_ino();
  int64_t pool = dnl->get_remote_d_type() == DT_DIR ? mds->get_metadata_pool() : -1;
  open_ino(ino, pool,
      new C_MDC_OpenRemoteDentry(this, dn, ino, fin, want_xlocked), true, want_xlocked); // backtrace
}

You can see there is a call to open_ino.
As this is a file, we are passing pool = -1 as the parameter.

void MDCache::open_ino(inodeno_t ino, int64_t pool, MDSContext* fin,
		       bool want_replica, bool want_xlocked,
		       vector<inode_backpointer_t> *ancestors_hint,
		       mds_rank_t auth_hint)
{
  dout(10) << "open_ino " << ino << " pool " << pool << " want_replica "
	   << want_replica << dendl;

  auto it = opening_inodes.find(ino);
  if (it != opening_inodes.end()) {
    open_ino_info_t& info = it->second;
    if (want_replica) {
      info.want_replica = true;
      if (want_xlocked && !info.want_xlocked) {
	if (!info.ancestors.empty()) {
	  CInode *diri = get_inode(info.ancestors[0].dirino);
	  if (diri) {
	    frag_t fg = diri->pick_dirfrag(info.ancestors[0].dname);
	    CDir *dir = diri->get_dirfrag(fg);
	    if (dir && !dir->is_auth()) {
	      filepath path(info.ancestors[0].dname, 0);
	      discover_path(dir, CEPH_NOSNAP, path, NULL, true);
	    }
	  }
	}
	info.want_xlocked = true;
      }
    }
    info.waiters.push_back(fin);
  } else {
    open_ino_info_t& info = opening_inodes[ino];
    info.want_replica = want_replica;
    info.want_xlocked = want_xlocked;
    info.tid = ++open_ino_last_tid;
    info.pool = pool >= 0 ? pool : default_file_layout.pool_id; //------> why?
    info.waiters.push_back(fin);
    if (auth_hint != MDS_RANK_NONE)
      info.auth_hint = auth_hint;
    if (ancestors_hint) {
      info.ancestors = std::move(*ancestors_hint);
      info.fetch_backtrace = false;
      info.checking = mds->get_nodeid();
      _open_ino_traverse_dir(ino, info, 0);
    } else {
      do_open_ino(ino, info, 0);
    }
  }
}

If pool == -1, why are we only searching in the default pool, and not the other pools associated with the fs or the pool set as an xattr? For example, if ceph.layout.pool is explicitly set to a different pool for that specific file, then the default pool (e.g. cephfs_data) won't have the 0th object, right?

@vshankar (Contributor)

> if pool == -1 then why we are only searching in the default pool, why not other fs associated pools or the pool set as the xattr? For example, if for that specific file ceph.layout.pool attribute is explicitly set to a different pool, then the default pool(e.g. cephfs_data) won't have the 0'th object, right?

The default data pool will always have the 0th object, which holds the file backtrace and layout, even if a custom data pool layout is set.
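A small sketch of the naming this implies (the same "{ino:x}.00000000" format the test in this PR uses to build rados_obj_file): a file's data objects are named from the inode number and object index, and the 0th object, which carries the backtrace, is always looked up in the default data pool, which is why open_ino() falls back to default_file_layout.pool_id.

```python
def data_object_name(ino: int, index: int = 0) -> str:
    """Name of a file's RADOS data object: '<ino-hex>.<index-hex>'.
    Index 0 is the object that always exists in the default data pool
    and carries the backtrace, even with a custom layout pool."""
    return "{:x}.{:08x}".format(ino, index)

file_ino = 0x10000000001
print(data_object_name(file_ino))      # "10000000001.00000000"
print(data_object_name(file_ino, 3))   # "10000000001.00000003"
```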

@vshankar (Contributor)

@sajibreadd ping?

@github-actions

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@sajibreadd (Member Author)

@vshankar I added the test following the reproducer. Regarding notes in doc/cephfs/scrub.rst, I'm not sure it is a suitable place. Can you suggest what kind of note you want in this doc? A hint, maybe?

@sajibreadd (Member Author)

jenkins test api

@sajibreadd (Member Author)

@vshankar any thoughts?

@vshankar (Contributor)

> @vshankar I added the test following the reproducer. Regarding notes in doc/cephfs/scrub.rst, not sure it can be a suitable place. Can you suggest what kind of note you want in this doc, a hint may be?

OK. Let's start by breaking the change into separate commits: one for the fix itself, one for the test addition, and another for the doc changes (doc/cephfs/scrub.rst is the place for this). Also add a note in PendingReleaseNotes.

@mchangir PTAL when this is ready.

@sajibreadd (Member Author)

@vshankar @mchangir separated the commits. Let me know if any adjustment is required.

@sajibreadd (Member Author)

@vshankar @mchangir Any update?

@github-actions

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@vshankar (Contributor)

> @vshankar @mchangir Any update?

Sorry, no. I will have a look this week.

    def __str__(self):
        return "Admin socket: {command} failed with rc={rc} json output={json}, because '{es}'".format(
            command=self.command, rc=self.rc, json=self.json, es=self.errstring)

Contributor

unrelated change. please revert this hunk.

If scrub is able to repair the damage, the corresponding entry is automatically
removed from the damage table.

Note: Scrub with repair flag can idetify the hard link damage but can not repair it.
Contributor

nit: scrub invoked with repair option can identify a damaged hard link but not repair it.

rados_obj_file = "{ino:x}.00000000".format(ino=file_ino)
rados_obj_dir = "{ino:x}.00000000".format(ino=dir_ino)
self.fs.flush()
time.sleep(20)
Contributor

Why sleep?

self.fs.rados(["rm", rados_obj_file], pool=self.fs.get_data_pool_name())
self.fs.rados(["rm", rados_obj_dir], pool=self.fs.get_metadata_pool_name())
self.fs.fail()
time.sleep(10)
Contributor

Why sleep? If you need to verify that the MDSs are no longer active, then there are constructs in class Filesystem that have helpers for this purpose.

test_dir_path = "test_dir"
self.mount_a.run_shell(["mkdir", test_dir_path])

file_dir_path = test_dir_path + "/file_dir"
Contributor

use os.path.join()


file_dir_path = test_dir_path + "/file_dir"
self.mount_a.run_shell(["mkdir", file_dir_path])
file_path = file_dir_path + "/test_file.txt"
Contributor

ditto.

self.fs.rados(["rm", rados_obj_dir], pool=self.fs.get_metadata_pool_name())
self.fs.fail()
time.sleep(10)
self.fs.set_joinable()
Contributor

All this is doing is starting up the MDS with an empty cache. Why not fail over the MDS in that case?

dout(4) << "scrub: remote dentry points to damaged ino " << *dn << dendl;
return;
}
mdcache->open_remote_dentry(dn, true, nullptr);
@vshankar (Contributor) commented Aug 20, 2025

I revisited this today after a long delay. Sorry for that!

Doesn't passing nullptr to open_remote_dentry() result in a fault, since _open_remote_dentry_finish does fin->complete(...)?

If scrub is able to repair the damage, the corresponding entry is automatically
removed from the damage table.

Note: scrub invoked with ``repair`` option can identify an damaged hard link but not repair it.
Contributor

Note: A scrub invoked with the

@sajibreadd (Member Author) commented Sep 2, 2025

> Doesn't passing nullptr to open_remote_dentry() results in a fault since _open_remote_dentry_finish does fin->complete(...)?

@vshankar Initially I added nullptr checks, but now I've just replaced it with C_MDSInternalNoop.

@anthonyeleven (Contributor) left a comment

Docs lgtm, I'm not qualified to approve the code.

@sajibreadd (Member Author)

@vshankar a friendly ping

@vshankar (Contributor) commented Sep 9, 2025

> @vshankar a friendly ping

Sure 👍

@vshankar (Contributor) left a comment

LGTM. Will run this through tests.

@vshankar (Contributor)

This PR is under test in https://tracker.ceph.com/issues/73082.

For a remote link, try to open the dentry (this part of the code
is copied from the MDS path_traversal), which internally pushes
the dir_frag into the damage list if applicable.

Fixes: https://tracker.ceph.com/issues/69345
Signed-off-by: Md Mahamudur Rahaman Sajib <mahamudur.sajib@croit.io>
…crub

Also update ignorelist with the expected cluster warning.

Fixes: https://tracker.ceph.com/issues/69345
Signed-off-by: Md Mahamudur Rahaman Sajib <mahamudur.sajib@croit.io>
Fixes: https://tracker.ceph.com/issues/69345
Signed-off-by: Md Mahamudur Rahaman Sajib <mahamudur.sajib@croit.io>
@vshankar (Contributor)

Added the expected cluster warning [object missing on disk] to the ignore list and cleaned up the commit messages.

@vshankar (Contributor)

jenkins test windows

@vshankar (Contributor)

This is nearly good to go 👍

@vshankar (Contributor) left a comment

