Bug #67102
openmds: read-only file system due to large old_inodes
Description
This is being hit by downstream QE. During an OMAP commit, the MDS collects the set of items to update and remove for a directory inode. In one case, for the `volumes` directory, the number of old_inodes is large. E.g.:
```
2024-07-18T09:21:23.451+0000 7ff39d281640 10 mds.0.cache.dir(0x1) _omap_commit
2024-07-18T09:21:23.451+0000 7ff39d281640 10 mds.0.cache.dir(0x1) set volumes [dentry #0x1/volumes [2,head] auth (dversion lock) v=1602786 ino=0x10000000000 state=1610612736 | inodepin=1 dirty=1 0x5588d5be0500]
2024-07-18T09:21:23.451+0000 7ff39d281640 14 mds.0.cache.dir(0x1) dn 'volumes' inode [inode 0x10000000000 [...fa85a,head] /volumes/ auth v1602786 f(v0 m2024-07-12T21:27:05.326445+0000 86=82+4) n(v118053 rc2024-07-17T12:56:02.435948+0000 b994986246094 rs129 300972=98111+202861) old_inodes=671000 (inest lock dirty) | dirtyscattered=1 dirfrag=1 dirty=1 0x5588d5fe2b00]
```
As can be seen, `old_inodes=671000`. `CInode::old_inodes` is:
```
old_inode_map_const_ptr old_inodes; // key = last, value.first = first
```
where `old_inode_map_const_ptr` is:
```
using mempool_old_inode = old_inode_t<mempool::mds_co::pool_allocator>;
using mempool_old_inode_map = mempool::mds_co::map<snapid_t, mempool_old_inode>;
using old_inode_map_ptr = std::shared_ptr<mempool_old_inode_map>;
using old_inode_map_const_ptr = std::shared_ptr<const mempool_old_inode_map>;
```
Basically, it is a map between snapids and (old versions of) an inode. So this has to be related to snapshots, and my guess is that there are too many snapshots, causing the encoded inode (into a bufferlist) to blow up the osd_op size.
Updated by Venky Shankar over 1 year ago
In the cluster where this issue was observed, `dump snaps` showed the following. The snapid values look legit:
```
ceph tell mds.0 dump snaps
{
    "last_created": 1026138,
    "last_destroyed": 1026134,
```
From the inode dump:
```
inode [inode 0x10000000000 [...fa85a,head]
```
0xfa85a == 1026138, so the inode's `first` value is legit. The question is why old_inodes is such a huge value for the `volumes` directory, since snapshots are taken under the subvolume directories.
Updated by Venky Shankar over 1 year ago
So, this is what I see happening: if a snapshot is taken (say) at the root directory, followed by snapshots on directories under that tree, old_inodes continues to build up on the root directory's inode, increasing each time a snapshot is taken on any directory under the tree, due to copy-on-write, I believe.
Updated by Venky Shankar over 1 year ago
Finally, I got to digging into this today. The repetitive mutation of old_inodes is caused by the global snap realm. The relevant function that is involved with this is in CInode.cc:
```
void CInode::pre_cow_old_inode()
{
  snapid_t follows = mdcache->get_global_snaprealm()->get_newest_seq();
  dout(20) << __func__ << " follows " << follows << " on " << *this << dendl;
  if (first <= follows)
    cow_old_inode(follows, true);
}
```
Callers of `CInode::pre_cow_old_inode()` are:
```
src/mds/Locker.cc:   in->pre_cow_old_inode();  // avoid cow mayhem
src/mds/MDCache.cc:  pin->pre_cow_old_inode(); // avoid cow mayhem!
```
The interesting one here is the call in `MDCache::predirty_journal_parents()`, which invokes `CInode::pre_cow_old_inode()` for each ancestor on the path up to the root. That is fine; however, the condition that triggers COWing old_inodes in `CInode::pre_cow_old_inode()` is the `first <= follows` check, where `follows` is the sequence number (`SnapRealm::cached_seq`) of the global snap realm. So, when sub-directory snapshots are taken, the sequence number of the global snap realm gets incremented, and `MDCache::predirty_journal_parents()` then COWs old_inodes for any ancestors which are snapshotted.
I also think it has always been this way, at least from the time when global snap realm got introduced.
Updated by Venky Shankar over 1 year ago
So, there is special handling of the global snap realm in SnapRealm.cc, in `SnapRealm::check_cache()`:
```
  if (global || srnode.is_parent_global()) {
    last_created = mdcache->mds->snapclient->get_last_created();
    seq = std::max(last_created, last_destroyed);
  } else {
    last_created = srnode.last_created;
    seq = srnode.seq;
  }

  if (cached_seq >= seq &&
      cached_last_destroyed == last_destroyed)
    return;

  cached_snap_context.clear();
  cached_seq = seq;
  cached_last_created = last_created;
  cached_last_destroyed = last_destroyed;
```
For the global snap realm, the sequence number is the max of the last created or destroyed snapshot in the entire system, not of the snap realm itself. Creating and deleting snapshots anywhere in the tree therefore increments the sequence number of the global snap realm. This, along with the check mentioned in note-3 (in `CInode::pre_cow_old_inode()`), triggers the buildup of the old_inodes map.
Updated by Venky Shankar over 1 year ago
Greg nudged me on this recently. So, there seems to be some relation between the global snap realm and nested snapshots since the issue of old_inodes blowing up isn't seen when nested directory snapshots are not taken.
Updated by Venky Shankar over 1 year ago
Venky Shankar wrote in #note-4:

> So, there is special handling of global snap realm in SnapRealm.cc, in `SnapRealm::check_cache()` [...]
> For global snap realm, the sequence number is the max value of the last created or destroyed snapshot in the system and not for the snap realm itself. Creating and deleting snapshots anywhere in the tree is going to increment the sequence number of the global snap realm. This along with the check mentioned in note-3 (in `CInode::pre_cow_old_inode()`) is triggering buildup of the `old_inodes` map.
I dug around the commit history and found this:
```
commit 1bc6297cf85b7b9c287362be15cfa862a76685cc
Author: Yan, Zheng <ukernel@gmail.com>
Date:   Tue Feb 6 16:33:51 2018 +0800

    mds: update CInode/CDentry's first according to global snapshot seq

    This simplifies case that inode gets moved into different snaprealm

    Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
```
Basically, this commit introduces:
```
diff --git a/src/mds/CInode.cc b/src/mds/CInode.cc
index 45d90b464e4..eed85616910 100644
--- a/src/mds/CInode.cc
+++ b/src/mds/CInode.cc
@@ -2677,7 +2677,7 @@ void CInode::split_old_inode(snapid_t snap)
 void CInode::pre_cow_old_inode()
 {
-  snapid_t follows = find_snaprealm()->get_newest_seq();
+  snapid_t follows = mdcache->get_global_snaprealm()->get_newest_seq();
   if (first <= follows)
     cow_old_inode(follows, true);
 }
```
So, basically, this changed the check from comparing against the sequence number of the snap realm the inode is part of to using the global snap realm's sequence number (which is the max of last_created/last_destroyed), in favour of simplified movement of inodes into a different snap realm. Unfortunately, the commit message isn't very detailed, but the whole change is centred around using the global snap realm's sequence for the comparison.
Updated by Konstantin Shalygin about 1 year ago
- Backport changed from quincy,reef,squid to reef,squid
Updated by Venky Shankar about 1 year ago
This issue will be transparently fixed when the global snap realm is done away with, relying on the referent inode work.
Updated by Venky Shankar about 1 year ago
- Related to Feature #54205: hard links: explore using a "referent" inode whenever hard linking added
Updated by Venky Shankar about 1 year ago
- Priority changed from Normal to Immediate
- Severity changed from 2 - major to 1 - critical
Updated by Venky Shankar about 1 year ago
I've started digging into how the global snap realm is designed, to be able to come up with possible solutions for this. Just FYI, the global snap realm goes back as far as the mimic release, so technically this issue does too.
Updated by Venky Shankar 12 months ago
@Kotresh Hiremath Ravishankar and I spent some time today working out a solution for this that does not involve removing the global snap realm. I had initially proposed a solution where COW'ing the old_inodes could use the sequence number of the subvolume root's snap realm rather than the global snap realm, stopping the COW when the subvolume root is reached in MDCache::predirty_journal_parents. This would nicely avoid the COW mayhem, but it is really only valid when the file system is used purely via subvolumes.
So, we discussed various ways to work around the COW mayhem in general, one of which was to introduce a new sequence number in the snap realm for the sole purpose of deciding when to COW an inode (keeping it up-to-date with other snap realms for directories which may have hard links). When experimenting with this, we ran into surprises with how the old_inodes structure is maintained. Basically, the old_inodes would get COW'd (say, for /dir1) even when a different directory tree is snapshotted (say, /dir2), since currently the COW mechanism relies on the global snap realm sequence number; however, the inode (for /dir1) itself would not be flushed to the journal, so the on-disk inode would be out-of-date. Eventually, when an operation (e.g. mksnap /dir1/...) is done on this directory, the entire inode gets journaled and written to omap (eventually). Furthermore, there is also a case where the old inodes would get trimmed (we saw the old inode count in the inode log dump decrease), which was new to me.
So, basically, this needs more experimentation to be able to come up with a generic solution that works with snapshots possibly taken on any directory tree.
Updated by Venky Shankar 12 months ago
- Related to Bug #70794: mds: use subvolume directory snap realm for COW'ing old_inodes added