Bug #67102
openmds: read-only file system due to large old_inodes
Description
This is being hit by downstream QE. During an OMAP commit, the MDS collects the set of items to update and remove for a directory inode. In one case, for the `volumes` directory, the number of old_inodes is large. E.g.:
```
2024-07-18T09:21:23.451+0000 7ff39d281640 10 mds.0.cache.dir(0x1) _omap_commit
2024-07-18T09:21:23.451+0000 7ff39d281640 10 mds.0.cache.dir(0x1) set volumes [dentry #0x1/volumes [2,head] auth (dversion lock) v=1602786 ino=0x10000000000 state=1610612736 | inodepin=1 dirty=1 0x5588d5be0500]
2024-07-18T09:21:23.451+0000 7ff39d281640 14 mds.0.cache.dir(0x1) dn 'volumes' inode [inode 0x10000000000 [...fa85a,head] /volumes/ auth v1602786 f(v0 m2024-07-12T21:27:05.326445+0000 86=82+4) n(v118053 rc2024-07-17T12:56:02.435948+0000 b994986246094 rs129 300972=98111+202861) old_inodes=671000 (inest lock dirty) | dirtyscattered=1 dirfrag=1 dirty=1 0x5588d5fe2b00]
```
As can be seen, `old_inodes=671000`. `CInode::old_inodes` is:
```
old_inode_map_const_ptr old_inodes; // key = last, value.first = first
```
where `old_inode_map_const_ptr` is:
```
using mempool_old_inode = old_inode_t<mempool::mds_co::pool_allocator>;
using mempool_old_inode_map = mempool::mds_co::map<snapid_t, mempool_old_inode>;
using old_inode_map_ptr = std::shared_ptr<mempool_old_inode_map>;
using old_inode_map_const_ptr = std::shared_ptr<const mempool_old_inode_map>;
```
Basically, it is a map between snapids and (old versions of) an inode. So this has to be related to snapshots, and my guess is that there are too many snapshots, causing the encoded inode (into a bufferlist) to blow up the osd_op size.
Updated by Venky Shankar over 1 year ago
In the cluster where this issue was observed, `dump snaps` showed the following. The snapid values look legit:
```
ceph tell mds.0 dump snaps
{
    "last_created": 1026138,
    "last_destroyed": 1026134,
```
From the inode dump:
```
inode [inode 0x10000000000 [...fa85a,head]
```
0xfa85a == 1026138, so the inode's `first` value is legit. The question is why old_inodes is such a huge value for the `volumes` directory, since snapshots are taken under the subvolume directories.
Updated by Venky Shankar over 1 year ago
So, this is what I see happening: if a snapshot is taken (say) at the root directory, followed by snapshots on directories under that tree, old_inodes continues to build up on the root directory's inode, increasing each time a snapshot is taken on any directory under the tree, due to copy-on-write, I believe.
Updated by Venky Shankar over 1 year ago
Finally, I got to digging into this today. The repetitive mutation of old_inodes is caused by the global snap realm. The relevant function that is involved with this is in CInode.cc:
```
void CInode::pre_cow_old_inode()
{
  snapid_t follows = mdcache->get_global_snaprealm()->get_newest_seq();
  dout(20) << __func__ << " follows " << follows << " on " << *this << dendl;
  if (first <= follows)
    cow_old_inode(follows, true);
}
```
Callers of `CInode::pre_cow_old_inode()` are:
```
src/mds/Locker.cc:   in->pre_cow_old_inode();  // avoid cow mayhem
src/mds/MDCache.cc:  pin->pre_cow_old_inode(); // avoid cow mayhem!
```
The interesting one here is the call in `MDCache::predirty_journal_parents()`, which invokes `CInode::pre_cow_old_inode()` for each ancestor on the path up to the root. That is fine; however, the condition that triggers COWing old_inodes in `CInode::pre_cow_old_inode()` is the `first <= follows` check, where `follows` is the sequence number (`SnapRealm::cached_seq`) of the global snap realm. So, when sub-directory snapshots are taken, the sequence number of the global snap realm gets incremented, and `MDCache::predirty_journal_parents()` then COWs old_inodes for any ancestors which are snapshotted.
I also think it has always been this way, at least from the time when global snap realm got introduced.
Updated by Venky Shankar over 1 year ago
So, there is special handling of the global snap realm in SnapRealm.cc, in `SnapRealm::check_cache()`:
```
  if (global || srnode.is_parent_global()) {
    last_created = mdcache->mds->snapclient->get_last_created();
    seq = std::max(last_created, last_destroyed);
  } else {
    last_created = srnode.last_created;
    seq = srnode.seq;
  }

  if (cached_seq >= seq &&
      cached_last_destroyed == last_destroyed)
    return;

  cached_snap_context.clear();
  cached_seq = seq;
  cached_last_created = last_created;
  cached_last_destroyed = last_destroyed;
```
For the global snap realm, the sequence number is the max of the last created or destroyed snapshot in the entire system, not of the snap realm itself. Creating and deleting snapshots anywhere in the tree therefore increments the sequence number of the global snap realm. This, along with the check mentioned in note-3 (in `CInode::pre_cow_old_inode()`), triggers the buildup of the old_inodes map.
Updated by Venky Shankar over 1 year ago
Greg nudged me on this recently. So, there seems to be some relation between the global snap realm and nested snapshots since the issue of old_inodes blowing up isn't seen when nested directory snapshots are not taken.
Updated by Venky Shankar over 1 year ago
Venky Shankar wrote in #note-4:

> So, there is special handling of global snap realm in SnapRealm.cc, in `SnapRealm::check_cache()` [...]
> For global snap realm, the sequence number is the max value of the last created or destroyed snapshot in the system and not for the snap realm itself. Creating and deleting snapshots anywhere in the tree is going to increment the sequence number of the global snap realm. This along with the check mentioned in note-3 (in `CInode::pre_cow_old_inode()`) is triggering buildup of the `old_inodes` map.
I dug around the commit history and found this:
```
commit 1bc6297cf85b7b9c287362be15cfa862a76685cc
Author: Yan, Zheng <ukernel@gmail.com>
Date:   Tue Feb 6 16:33:51 2018 +0800

    mds: update CInode/CDentry's first according to global snapshot seq

    This simplifies case that inode gets moved into different snaprealm

    Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
```
Basically, this commit introduces:
```
diff --git a/src/mds/CInode.cc b/src/mds/CInode.cc
index 45d90b464e4..eed85616910 100644
--- a/src/mds/CInode.cc
+++ b/src/mds/CInode.cc
@@ -2677,7 +2677,7 @@ void CInode::split_old_inode(snapid_t snap)
 void CInode::pre_cow_old_inode()
 {
-  snapid_t follows = find_snaprealm()->get_newest_seq();
+  snapid_t follows = mdcache->get_global_snaprealm()->get_newest_seq();
   if (first <= follows)
     cow_old_inode(follows, true);
 }
```
So, basically, this changed the check from comparing against the sequence number of the snap realm the inode is part of to using the global snap realm's sequence number (which is the max of last_created/last_destroyed), in favour of simplified movement of inodes into a different snap realm. Unfortunately, the commit message isn't very detailed, but the whole change is centred around using the global snap realm's sequence for the comparison.
Updated by Konstantin Shalygin about 1 year ago
- Backport changed from quincy,reef,squid to reef,squid
Updated by Venky Shankar about 1 year ago
This issue will be transparently fixed when the global snap realm is done away with, relying on the referent inode work.
Updated by Venky Shankar about 1 year ago
- Related to Feature #54205: hard links: explore using a "referent" inode whenever hard linking added
Updated by Venky Shankar about 1 year ago
- Priority changed from Normal to Immediate
- Severity changed from 2 - major to 1 - critical
Updated by Venky Shankar about 1 year ago
I've started digging into how the global snap realm is designed, to be able to come up with possible solutions for this. Just FYI, the global snap realm goes back as far as the mimic release, so technically this issue does too.
Updated by Venky Shankar 12 months ago
@Kotresh Hiremath Ravishankar and I spent some time today working out a solution for this that does not involve removing the global snap realm. I had initially proposed a solution where COW'ing the old_inodes could use the sequence number of the subvolume root's snap realm rather than the global snap realm, stopping the COW when the subvolume root is reached in MDCache::predirty_journal_parents. This would nicely avoid the COW mayhem, but it is really only valid when the file system is used purely via subvolumes.
So, we discussed various ways to work around the COW mayhem in general, one of which was to introduce a new sequence number in the snap realm for the sole purpose of deciding when to COW an inode (keeping it up-to-date with other snap realms for directories which may have hard links). When experimenting with this, we ran into surprises with how the old_inodes structure is maintained. Basically, the old_inodes would get COW'd (say, for /dir1) even when a different directory tree is snapshotted (say, /dir2), since currently the COW mechanism relies on the global snap realm sequence number; however, the inode (for /dir1) itself would not be flushed to the journal, so the on-disk inode would be out-of-date. Eventually, when an operation (e.g. mksnap /dir1/...) is done on this directory, the entire inode gets journaled and written to omap (eventually). Furthermore, there is also a case where the old inodes would get trimmed (we saw the old inode count in the inode log dump decrease), which was new to me.
So, basically, this needs more experimentation to be able to come up with a generic solution that works with snapshots possibly taken on any directory tree.
Updated by Venky Shankar 12 months ago
- Related to Bug #70794: mds: use subvolume directory snap realm for COW'ing old_inodes added