
mds: async dir operation support#27866

Merged
batrick merged 19 commits into ceph:master from ukernel:wip-mds-locking-order
Dec 21, 2019

Conversation


@ukernel ukernel commented Apr 29, 2019

Change the order of taking locks to:

  1. Lock objects in fs tree first, then lock objects in mdsdir
  2. Lock objects according to their depth, in ascending order.
  3. Lock objects according to inodes' numbers or dentries' parent
    inode numbers, in ascending order.
  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug

@ukernel ukernel added the cephfs Ceph File System label Apr 29, 2019
@ukernel ukernel force-pushed the wip-mds-locking-order branch 3 times, most recently from 28eed40 to ea3ee58 Compare May 1, 2019 10:19
@ukernel ukernel marked this pull request as ready for review May 2, 2019 02:11
@ukernel ukernel force-pushed the wip-mds-locking-order branch 8 times, most recently from 7bd4181 to 0ae14d0 Compare May 4, 2019 05:33
@ukernel ukernel force-pushed the wip-mds-locking-order branch 2 times, most recently from 31a90ce to f2cb698 Compare May 13, 2019 07:29
@ukernel ukernel changed the title mds: take locks in top-down order mds: async dir operation support May 13, 2019

ukernel commented May 13, 2019

@ukernel ukernel changed the title mds: async dir operation support [DO NOT MERGE] mds: async dir operation support May 13, 2019
@ukernel ukernel force-pushed the wip-mds-locking-order branch 10 times, most recently from ffb17f3 to 3a7801b Compare May 17, 2019 00:02
@jtlayton jtlayton requested review from batrick and jtlayton May 20, 2019 07:05

ukernel commented Dec 11, 2019

An old run http://pulpito.ceph.com/zyan-2019-12-06_02:10:01-fs-wip-mds-locking-order-testing-basic-smithi/.

The failed snaptests are because the libc on RHEL7 has no syncfs syscall

  case CEPH_LOCK_IFLOCK: return "iflock";
  case CEPH_LOCK_IPOLICY: return "ipolicy";
- default: ceph_abort(); return std::string_view();
+ default: return "unkown";
nit: "unknown"

lock->get_parent()->auth_pin(lock);
}

void Locker::xlock_downgrade(SimpleLock *lock, MutationImpl *mut)
nit: "instrcuts" => "instructs" in commit message


ukernel commented Dec 12, 2019

http://pulpito.ceph.com/zyan-2019-12-10_08:41:40-fs-wip-mds-locking-order-testing-basic-smithi/

The failed snaptests are because the libc on RHEL7 has no syncfs syscall. No clue why the libcephfs interface test failed once

To take locks in top-down order for an MDRequest, we need to first take
snap/policy rdlocks on the ancestor inodes of the request's base inode.
It's not convenient to use Locker::acquire_locks() for this because the
path to the request's base inode can change before all of these locks
are rdlocked.

This patch introduces Locker::try_rdlock_snap_layout(), which tries
taking snap/policy rdlocks on the request's base inode and its ancestors
all at the same time. MDCache::path_traverse() calls this function
first, then uses Locker::acquire_locks() to take snaplocks on components
of the request's path.

This patch also reorders inode locks, putting snaplock and policylock at
the front, because some requests (such as setattr) may xlock other locks
after taking snaplock/policylock.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>

jtlayton commented Dec 12, 2019

The failed snaptests are because the libc on RHEL7 has no syncfs syscall. No clue why the libcephfs interface test failed once

Are you sure? syncfs has been around a long time. In fact, looking at my centos7.7 VM:

/usr/include/unistd.h:extern int syncfs (int __fd) __THROW;

Note that programs that use it do need to define _GNU_SOURCE.

Introduce the MDS_TRAVERSE_RDLOCK_PATH flag, which instructs
MDCache::path_traverse() to lock dentries during path traversal.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Introduce MDS_TRAVERSE_XLOCK_DENTRY, which instructs
MDCache::path_traverse() to take appropriate locks (xlock dentry,
wrlock directory's filelock/nestlock) for file create/deletion.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
The helper function is for requests that operate on two paths. It
ensures that locks on the two paths are taken in the proper order. The rule is:

 1. Lock directory inodes or dentries according to which trees they
    are under. Lock objects under fs root before objects under mdsdir.
 2. Lock directory inodes or dentries according to their depth, in
    ascending order.
 3. Lock directory inodes or dentries according to inode numbers or
    dentries' parent inode numbers, in ascending order.
 4. Lock dentries in the same directory in order of their keys.
 5. Lock non-directory inodes according to inode numbers, in ascending
    order.

This patch also makes handle_client_link() and handle_client_rename()
use this helper function.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
If an inode's snaplock is locked by an MDRequest, the inode cannot be
renamed by other MDRequests. In the rename case, the snaplocks of all
inodes in the paths are locked, so the paths are stable while handling a
slave rename. It's OK to discover all components in the paths (even if
they are xlocked).

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
The MDS now relies on snaplocks to ensure that paths for slave requests
are stable. MDCache::handle_find_ino_reply() may encounter an xlocked
dentry during path traversal.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Take snap rdlocks (instead of dentry locks) on the subtree's ancestor
inodes. The path to the subtree is stable after they are all locked.

For dirfragtree locks on subtree bounds, it's not convenient to rdlock
them in top-down order (paths to them are not stable). The solution is
to kick them to the SYNC state and then try taking rdlocks on all of them.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
The lock cache preserves the locks and auth pins required for a directory
operation. The MDS can create a lock cache once it has acquired all locks
of a directory operation. The lock cache can then be used for later
operations of the same type on the same directory.

For example, when the MDS has acquired all locks of an unlink operation,
it creates a lock cache, which holds wrlocks on the directory inode's
filelock and nestlock, and rdlocks on the ancestor inodes' snaplocks. For
later unlink operations on the same directory, the MDS only needs to
xlock the dentry to unlink and xlock the linklock of the inode to unlink.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Add a list to SimpleLock that tracks the lock caches holding locks on
the SimpleLock itself. When the MDS wants to change the lock's state, it
can find those lock caches and invalidate them.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Add a list to CDir that tracks the lock caches holding auth pins on
objects in the CDir. When the MDS wants to freeze a dirfrag or subtree,
it can find those lock caches and invalidate them.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
A frozen inode in a directory with lock caches may cause an ABBA
deadlock: the request that owns the frozen inode can't get locks because
a lock cache is holding conflicting locks, while the request that uses
the lock cache can't auth-pin the frozen inode.

The solution is to not create a lock cache when the directory contains a
freezing or frozen inode, and to invalidate all lock caches on an
inode's parent directory when the MDS starts freezing that inode.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Define cap bits for async dir operations. The lock cache for a given
type of dir operation can be delegated to a client through the cap
mechanism. As long as the client holds the corresponding cap, dir
operation requests from that client can use the lock cache. If the MDS
wants to invalidate a lock cache, it needs to first revoke the
corresponding cap from the client.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
A later patch will introduce new mask bits.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
When a client has the 'CEPH_CAP_FILE_EXCL|CEPH_CAP_DIR_UNLINK' caps on a
directory inode, it can asynchronously unlink a file from the directory
if the corresponding dentry is a primary link.

When a client has the 'CEPH_CAP_FILE_EXCL|CEPH_CAP_DIR_CREATE' caps on a
directory inode, it can asynchronously create a new file in the
directory as long as no file with the same name exists.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
The previous commit makes Server::rdlock_path_pin_ref() rdlock snaplock
and policylock. Deadlock happens if we use this function for requests
that xlock snaplock or policylock.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
@ukernel ukernel force-pushed the wip-mds-locking-order branch from 629e5e0 to 120d70d Compare December 12, 2019 18:06

ukernel commented Dec 13, 2019

You are right; I misread the log. The snaptest failures were caused by a bug in this PR.


ukernel commented Dec 19, 2019

@batrick batrick left a comment


Great work on this Zheng. I haven't looked at the code in detail yet but let's get this in now so you can continue to monitor nightly testing over the next few weeks.

batrick added a commit that referenced this pull request Dec 21, 2019
* refs/pull/27866/head:
	mds: fix deadlock when xlocking policylock
	mds: handle link request with zero depth filepath2
	mds: enable lock cache for openc/unlink requests
	mds: include linkage type in dentry lease
	mds: cleanup Server::set_trace_dist()
	mds: define lease mask bits
	mds: delegete lock cache to client
	mds: suppress frozen inode when locks of dir operation is cached.
	mds: invalidate lock caches when freezing dirfrag/subtree
	mds: invalidate lock caches if they hold conflicting locks
	mds: initial code for lock cache
	mds: adjust locking for subtree migration
	mds: add 'path_locked' flag to MDCache::find_ino_peers()
	mds: change MDCache::discover_path()'s wants_xlocked semantic
	mds: introduce Server::rdlock_two_paths_xlock_destdn()
	mds: make Server::rdlock_path_xlock_dentry take locks
	mds: make Server::rdlock_path_pin_ref() take dentry rdlocks
	mds: take snaplock and policylock during path traverse.
	mds: let Locker::acquire_locks()'s caller choose locking order

Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
@batrick batrick merged commit 120d70d into ceph:master Dec 21, 2019

Labels

cephfs Ceph File System needs-review

5 participants