
mds: adjust the locks order #56805

Closed
lxbsz wants to merge 2 commits into ceph:main from lxbsz:wip-62123

Conversation

@lxbsz
Member

@lxbsz lxbsz commented Apr 10, 2024

This will make the locks be acquired in the following order:

1. filelock
2. nestlock
3. authlock
4. linklock
5. snaplock
6. xattrlock
7. policylock
8. versionlock
9. flocklock
10. dirfragtreelock

Fixes: https://tracker.ceph.com/issues/62123
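For illustration only (this is not code from the PR), the proposed order could be captured in a rank table plus a helper that compares two lock kinds; all names below are hypothetical:

```cpp
#include <array>
#include <cstddef>

// Hypothetical encoding of the proposed order; names are illustrative only,
// not the MDS's actual lock types.
enum LockKind {
  FILELOCK, NESTLOCK, AUTHLOCK, LINKLOCK, SNAPLOCK, XATTRLOCK,
  POLICYLOCK, VERSIONLOCK, FLOCKLOCK, DIRFRAGTREELOCK, NUM_LOCK_KINDS
};

// Position in this array == acquisition rank (lower rank acquires first).
constexpr std::array<LockKind, NUM_LOCK_KINDS> kProposedOrder = {
  FILELOCK, NESTLOCK, AUTHLOCK, LINKLOCK, SNAPLOCK, XATTRLOCK,
  POLICYLOCK, VERSIONLOCK, FLOCKLOCK, DIRFRAGTREELOCK,
};

constexpr std::size_t rank_of(LockKind k) {
  std::size_t i = 0;
  while (i < kProposedOrder.size() && kProposedOrder[i] != k)
    ++i;
  return i;
}

// True iff 'a' must be acquired before 'b' under the proposed order.
constexpr bool must_precede(LockKind a, LockKind b) {
  return rank_of(a) < rank_of(b);
}
```

Any ordering table like this only prevents deadlocks if every code path acquires locks consistently with it, which is what the rest of the discussion is about.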

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows
  • jenkins test rook e2e

@lxbsz lxbsz requested a review from a team April 10, 2024 03:04
@github-actions github-actions bot added the cephfs Ceph File System label Apr 10, 2024
@leonid-s-usov
Contributor

Shouldn't the order be aligned with

#define CEPH_LOCK_DN          (1 << 0)
#define CEPH_LOCK_DVERSION    (1 << 1)
#define CEPH_LOCK_IQUIESCE    (1 << 4)  /* mds internal */
#define CEPH_LOCK_ISNAP       (1 << 5)  /* snapshot lock. MDS internal */
#define CEPH_LOCK_IPOLICY     (1 << 6)  /* policy lock on dirs. MDS internal */
#define CEPH_LOCK_IFILE       (1 << 7)
#define CEPH_LOCK_INEST       (1 << 8)  /* mds internal */
#define CEPH_LOCK_IDFT        (1 << 9)  /* dir frag tree */
#define CEPH_LOCK_IAUTH       (1 << 10)
#define CEPH_LOCK_ILINK       (1 << 11)
#define CEPH_LOCK_IXATTR      (1 << 12)
#define CEPH_LOCK_IFLOCK      (1 << 13)  /* advisory file locks */
#define CEPH_LOCK_IVERSION    (1 << 14)  /* mds internal */

?

if ((mask & CEPH_CAP_LINK_SHARED) && !(issued & CEPH_CAP_LINK_EXCL))
  lov.add_rdlock(&ref->linklock);
if ((mask & CEPH_CAP_XATTR_SHARED) && !(issued & CEPH_CAP_XATTR_EXCL))
  lov.add_rdlock(&ref->xattrlock);
Member

We don't need to actually sort these when adding to the LockOpVec:

ceph/src/mds/Mutation.cc

Lines 116 to 162 in 97cb38d

void MutationImpl::LockOpVec::sort_and_merge()
{
  // sort locks on the same object
  auto cmp = [](const LockOp &l, const LockOp &r) {
    ceph_assert(l.lock->get_parent() == r.lock->get_parent());
    return l.lock->type->type < r.lock->type->type;
  };
  for (auto i = begin(), j = i; ; ++i) {
    if (i == end()) {
      std::sort(j, i, cmp);
      break;
    }
    if (j->lock->get_parent() != i->lock->get_parent()) {
      std::sort(j, i, cmp);
      j = i;
    }
  }
  // merge ops on the same lock
  for (auto i = end() - 1; i > begin(); ) {
    auto j = i;
    while (--j >= begin()) {
      if (i->lock != j->lock)
        break;
    }
    if (i - j == 1) {
      i = j;
      continue;
    }
    // merge
    ++j;
    for (auto k = i; k > j; --k) {
      if (k->is_remote_wrlock()) {
        ceph_assert(!j->is_remote_wrlock());
        j->wrlock_target = k->wrlock_target;
      }
      j->flags |= k->flags;
    }
    if (j->is_xlock()) {
      // xlock overwrites other types
      ceph_assert(!j->is_remote_wrlock());
      j->flags = LockOp::XLOCK;
    }
    erase(j + 1, i + 1);
    i = j - 1;
  }
}

because the locks are sorted before acquisition.

I think what needs to happen for

https://tracker.ceph.com/issues/62123

is to modify Locker::acquire_locks to look at the locks held by the mutation to ensure that it does not already hold locks (not necessarily in the LockOpVec) that would reverse the locking order if the locks in the LockOpVec are acquired.

That is probably an expensive consistency check so it would need to be a debug config that turns it on.
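A minimal sketch of such a consistency check, using illustrative stand-in types rather than the real Locker/SimpleLock interfaces (none of these names exist in the MDS), might look like:

```cpp
#include <algorithm>
#include <vector>

// Stand-in for a held lock, ordered by its numeric type (hypothetical).
struct HeldLock { int type; };

// Returns true when acquiring 'wanted' would reverse the locking order:
// i.e. the mutation already holds a lock ordered *after* the lowest lock
// it is about to take. 'check_enabled' models the debug config gate.
bool violates_lock_order(const std::vector<HeldLock>& held,
                         const std::vector<HeldLock>& wanted,
                         bool check_enabled) {
  if (!check_enabled || wanted.empty())
    return false;
  // Lowest-ordered lock we are about to acquire.
  int lowest_wanted = std::min_element(
      wanted.begin(), wanted.end(),
      [](const HeldLock& a, const HeldLock& b) { return a.type < b.type; })->type;
  // Any already-held lock with a higher order would be a reversal.
  return std::any_of(held.begin(), held.end(),
                     [&](const HeldLock& h) { return h.type > lowest_wanted; });
}
```

The real check would also have to compare lock parents and handle remote/xlock flavors; this only shows the shape of the ordering test.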

Member Author

Yeah, you are correct. I missed this part. I will take another look at this. Thanks @batrick

Contributor

That is probably an expensive consistency check so it would need to be a debug config that turns it on.

Not just that, the check should be configurable with a flag on the acquire locks call to be able to suppress it for some edge cases where we know what we are doing.

Member Author

Actually, the issue here is that when a single request triggers two acquire_locks() calls, it can break the lock order and then cause lock-order deadlocks between multiple requests.
In any case, we should start the request over when the second acquire_locks() fails, as @batrick did in #52522, which I think fixed only one part of the problem, not all of it.

@batrick @leonid-s-usov I have pushed a new commit to fix the getattr and setattr only for now, please check and if this is the correct direction I will continue to fix all the other cases.

Thanks.

@lxbsz
Member Author

lxbsz commented Apr 10, 2024

Shouldn't the order be aligned with

#define CEPH_LOCK_DN          (1 << 0)
#define CEPH_LOCK_DVERSION    (1 << 1)
#define CEPH_LOCK_IQUIESCE    (1 << 4)  /* mds internal */
#define CEPH_LOCK_ISNAP       (1 << 5)  /* snapshot lock. MDS internal */
#define CEPH_LOCK_IPOLICY     (1 << 6)  /* policy lock on dirs. MDS internal */
#define CEPH_LOCK_IFILE       (1 << 7)
#define CEPH_LOCK_INEST       (1 << 8)  /* mds internal */
#define CEPH_LOCK_IDFT        (1 << 9)  /* dir frag tree */
#define CEPH_LOCK_IAUTH       (1 << 10)
#define CEPH_LOCK_ILINK       (1 << 11)
#define CEPH_LOCK_IXATTR      (1 << 12)
#define CEPH_LOCK_IFLOCK      (1 << 13)  /* advisory file locks */
#define CEPH_LOCK_IVERSION    (1 << 14)  /* mds internal */

?

This sounds like a good idea, but it would require a lot of changes to the path_traverse() code, which is very complicated. I do think it's risky.

@vshankar @batrick What do you think?

@lxbsz lxbsz marked this pull request as draft April 10, 2024 13:23
@batrick
Member

batrick commented Apr 10, 2024

This sounds like a good idea, but it would require a lot of changes to the path_traverse() code, which is very complicated. I do think it's risky.

Change path_traverse how?

@lxbsz
Member Author

lxbsz commented Apr 12, 2024

This sounds like a good idea, but it would require a lot of changes to the path_traverse() code, which is very complicated. I do think it's risky.

Change path_traverse how?

@batrick Please ignore this.


-  if (!mds->locker->acquire_locks(mdr, lov))
+  if (!mds->locker->acquire_locks(mdr, lov)) {
+    restart_over_the_request(mdr);
Member

This doesn't look right. Anytime the Locker is nudging a lock state (perhaps due to conflicting caps) then we will drop all locks. That will prevent the request from making progress.

Member Author

@lxbsz lxbsz Apr 15, 2024

Then I need to do this another way, maybe a more complex one. I am wondering whether we should drop the newer request's locks instead of always dropping the current one's.

#define CEPH_LOCK_DN          (1 << 0)
#define CEPH_LOCK_DVERSION    (1 << 1)
#define CEPH_LOCK_IQUIESCE    (1 << 4)  /* mds internal */
#define CEPH_LOCK_ISNAP       (1 << 5)  /* snapshot lock. MDS internal */
#define CEPH_LOCK_IPOLICY     (1 << 6)  /* policy lock on dirs. MDS internal */
#define CEPH_LOCK_IFILE       (1 << 7)
#define CEPH_LOCK_INEST       (1 << 8)  /* mds internal */
#define CEPH_LOCK_IDFT        (1 << 9)  /* dir frag tree */
#define CEPH_LOCK_IAUTH       (1 << 10)
#define CEPH_LOCK_ILINK       (1 << 11)
#define CEPH_LOCK_IXATTR      (1 << 12)
#define CEPH_LOCK_IFLOCK      (1 << 13)  /* advisory file locks */
#define CEPH_LOCK_IVERSION    (1 << 14)  /* mds internal */

For example:

Step   Older req1                                    Newer req2
 1     - comes
 2     - acquire_locks(ISNAP, IFILE and IAUTH),      - comes
         then for some reason goes to sleep
 3                                                   - acquire_locks(INEST and ILINK),
                                                       then also goes to sleep for some reason
 4     - wakes up and the second
         acquire_locks(INEST and IXATTR) fails
         since req2 already holds `INEST`

 5                                                   - wakes up and then the second
                                                       acquire_locks(IAUTH) fails since req1
                                                       holds `IAUTH`

Then req1 and req2 will deadlock.

I am wondering whether we should make the older request, req1, drop its locks instead of req2. In the above example, req1 will detect the possible deadlock in acquire_locks() in Step 4 first. So if we drop req1's locks, req2 could finish earlier than req1.

I find that the current code does this without considering the requests' initiation times; will that be a problem? @batrick
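The two-request conflict above can be checked mechanically. Here is a small sketch (my own illustration, with plain integers standing in for the CEPH_LOCK bit values and the per-object dimension elided):

```cpp
#include <algorithm>
#include <vector>

// A request "waits for" another if it wants a lock the other already holds.
struct ReqState {
  std::vector<int> held;
  std::vector<int> wanted;
};

bool waits_for(const ReqState& a, const ReqState& b) {
  for (int w : a.wanted)
    if (std::find(b.held.begin(), b.held.end(), w) != b.held.end())
      return true;
  return false;
}

// Two requests deadlock when each waits for a lock the other holds.
bool deadlocked(const ReqState& a, const ReqState& b) {
  return waits_for(a, b) && waits_for(b, a);
}
```

Plugging in the scenario above (req1 holds ISNAP/IFILE/IAUTH and wants INEST/IXATTR; req2 holds INEST/ILINK and wants IAUTH) reports a deadlock, since each request wants a lock the other holds.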

Contributor

@leonid-s-usov leonid-s-usov Apr 15, 2024

What you describe is a deadlock, though not because of a lack of interfaces, but rather due to the bad locking order by req1 and/or req2.

We should be careful updating lock orders in bulk, since the current stable state may be kicked out of balance, even if there are a few places where it would deadlock today.

In general, requests should try to take all the locks they need in a single call to acquire_locks, unless they have multiple phases. In the latter case, the request should be responsible for releasing locks from phase 1 before re-acquiring them in phase 2 along with other locks. The only case where a request may keep locks from phase 1 into phase 2 is if they are ordered before the new locks from phase 2.

req1 violates this by taking INEST while holding IFILE. Whether req1 can afford dropping and then re-acquiring INEST along with IFILE and IXATTR is not known, but it's req1's responsibility to never try to lock backwards in that defined lock order.
Note that it is OK for req1 to continue holding ISNAP and IFILE, since no new locks are ordered before those two

Member Author

What you describe is a deadlock, though not because of a lack of interfaces, but rather due to the bad locking order by req1 and/or req2.

We should be careful updating lock orders in bulk, since the current stable state may be kicked out of balance, even if there are a few places where it would deadlock today.

In general, requests should try to take all the locks they need in a single call to acquire_locks, unless they have multiple phases. In the latter case, the request should be responsible for releasing locks from phase 1 before re-acquiring them in phase 2 along with other locks. The only case where a request may keep locks from phase 1 into phase 2 is if they are ordered before the new locks from phase 2.

req1 violates this by taking INEST while holding IFILE. Whether req1 can afford dropping and then re-acquiring INEST along with IFILE and IXATTR is not known, but it's req1's responsibility to never try to lock backwards in that defined lock order. Note that it is OK for req1 to continue holding ISNAP and IFILE, since no new locks are ordered before those two

Unfortunately, there are many cases that call acquire_locks() twice or more. For example, a getattr request will do it first when traversing the path in path_traverse() and then again in the _getattr() func. And this really will mess up the lock order.

It's not that easy to merge them into a single acquire_locks() call.

Contributor

Yes, understood, which is why I mentioned those multi-phase requests, where each phase takes additional locks. Those should be designed to only acquire locks in ascending order.

@leonid-s-usov
Contributor

Following @batrick's suggestion, we should add a trap for requests that attempt to acquire locks out of order.
I'm not sure whether it should be an assert or a warning. My first thought is that a warning would be more useful, as we can then analyze logs from our regular FS suite to see how often this is violated in general before we prevent it in a hard way.

@lxbsz
Member Author

lxbsz commented Apr 15, 2024

Following @batrick's suggestion, we should add a trap for requests that attempt to acquire locks out of order. I'm not sure whether it should be an assert or a warning. My first thought is that a warning would be more useful, as we can then analyze logs from our regular FS suite to see how often this is violated in general before we prevent it in a hard way.

Yeah, counting this somewhere sounds reasonable.

Locally I have a patch to detect this, but it will simply drop the locks for the current request. I think the most ideal approach is to drop the locks for the older request when the deadlock is detected. I will push the patch first, but it hasn't been well tested yet.

Contributor

@leonid-s-usov leonid-s-usov left a comment

Yes, that check looks good.

My concern though is the drop_locks. I'm worried that this can lead to some unwanted side effects for requests that don't expect locks that were acquired to be dropped. Those requests may have some non-idempotent code re-executed when replayed. Also, requests may have some custom cached parameters which need to be invalidated before replaying.

I believe that our first step should be to detect as many flows that violate the strict locking order as possible, and then decide on the actions we should take.

Please note that your current check may give (potentially, many) false positives even with any_waiters in place.

  }

  if (order_mess && any_waiter) {
    dout(10) << " detected possible dead locks; dropping all locks" << dendl;
Contributor

I believe this should be level 3 or even less, dropping locks is a desperate measure on our part.

Member Author

Sounds reasonable.

    dout(15) << " lock " << *lock << " any_waiter " << any_waiter << " order_mess " << order_mess << dendl;
  }

  if (order_mess && any_waiter) {
Contributor

We should add another high-level warning (i.e. dout (1 < x < 5)) for order_mess alone, maybe even in health to make the corresponding FS runs fail and have our attention

Member Author

Sure, will fix it.

  if (order_mess && any_waiter) {
    dout(10) << " detected possible dead locks; dropping all locks" << dendl;
    drop_locks(mdr.get(), NULL);
    mdr->drop_local_auth_pins();
Contributor

@leonid-s-usov leonid-s-usov Apr 15, 2024

If we chose to drop locks on a request then we must terminate this acquire_locks call with a false and put the request on the replay queue

Member Author

For this, rdlock_start(), etc. have already done it. IMO we still need to wait for the corresponding locks to be ready in rdlock_start(), etc., else this request could be fired frequently in a loop by doing this check here.

Contributor

@leonid-s-usov leonid-s-usov Apr 18, 2024

It shouldn't go into an infinite loop, because we assume it has progressed partially (i.e. it holds some locks) and is now about to deadlock (or we think it might). When we drop the locks it holds now, it will most probably get stuck on those once it re-executes. We can't afford to progress with the current call to acquire_locks, because the locks we've dropped may not be present in the lov vector.

We can't add the locks back to lov and re-lock everything right here, because it will break the request's assumptions about the atomicity of its operation.

So if we drop the locks, we need to abort and re-try the request as a whole.

Contributor

For this, rdlock_start(), etc. have already done it.

Yes, you are right, the request is already on a waiting list. But it's on a waiting list for the wrong lock. We need to re-run the request from scratch so that it restarts acquiring locks in the correct order, waiting on the appropriate locks this time.

@leonid-s-usov
Contributor

leonid-s-usov commented Apr 18, 2024

There must be a safe way to drop locks and restart requests, but it may require significantly more effort to do so reliably. I think we should start slow and begin by identifying requests that take incremental locks in the wrong order, and then decide how to deal with them

@leonid-s-usov
Contributor

leonid-s-usov commented Apr 18, 2024

The only safe way to recover from a deadlock IMO is

  1. make sure to detect a deadlock without false positives. This will involve MDR cross-check.
  2. out of two requests that are about to deadlock, pick a client request, if both are client requests then pick the youngest
  3. drop the chosen MDR along with its locks and start a new one from the same MClientRequest

The key here is to re-initialize the new MDR from scratch, which is possible given the original MClientRequest we can extract from the dropped MDR.
It may not be possible to do this safely for internal requests, so if we have two internal requests that deadlock we should probably assert.
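Steps 2 and 3 above could be sketched as a victim-selection helper. This is my own illustration, not existing MDS code; the types and the "higher seq = younger" encoding are hypothetical:

```cpp
#include <cstdint>

struct Req {
  bool is_client;     // client request vs MDS-internal request
  std::uint64_t seq;  // higher = younger (illustrative encoding)
};

// Returns 0 or 1: which of the two deadlocking requests to drop and replay.
// Returns -1 when both are internal (the comment suggests asserting there).
int choose_victim(const Req& a, const Req& b) {
  if (a.is_client && b.is_client)
    return a.seq >= b.seq ? 0 : 1;  // pick the youngest client request
  if (a.is_client)
    return 0;                       // prefer dropping the client request
  if (b.is_client)
    return 1;
  return -1;                        // two internal requests: no safe victim
}
```

The hard part in practice is step 1 (deadlock detection without false positives); the selection itself is cheap once a real cycle is confirmed.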

Member

@batrick batrick left a comment

Otherwise, this looks clean and good.

  if (lock->is_waiter_for(SimpleLock::WAIT_ALL)) {
    any_waiter = true;
  }
  if (lock->get_type() > lock_type) {
Member

You need to check the lock parent is the same as well.

Member Author

@lxbsz lxbsz Apr 22, 2024

@batrick Will this make sense?

#define CEPH_LOCK_DN          (1 << 0)
#define CEPH_LOCK_DVERSION    (1 << 1)
#define CEPH_LOCK_IQUIESCE    (1 << 4)  /* mds internal */
#define CEPH_LOCK_ISNAP       (1 << 5)  /* snapshot lock. MDS internal */
#define CEPH_LOCK_IPOLICY     (1 << 6)  /* policy lock on dirs. MDS internal */
#define CEPH_LOCK_IFILE       (1 << 7)
#define CEPH_LOCK_INEST       (1 << 8)  /* mds internal */
#define CEPH_LOCK_IDFT        (1 << 9)  /* dir frag tree */
#define CEPH_LOCK_IAUTH       (1 << 10)
#define CEPH_LOCK_ILINK       (1 << 11)
#define CEPH_LOCK_IXATTR      (1 << 12)
#define CEPH_LOCK_IFLOCK      (1 << 13)  /* advisory file locks */
#define CEPH_LOCK_IVERSION    (1 << 14)  /* mds internal */

The _DN lock's parent will always be different from the others, right?

For example, for the deadlock issue in [ceph-users] MDS Behind on Trimming..., the current blocked-ops output shows a deadlock between a CInode's _IFILE or _INEST and a _DN.

Member

No, I'm referring to SimpleLock::parent which is the CInode/CDentry the lock is protecting. You need to ensure that the parent is the same when comparing the locks.

Member Author

Let me explain this in more detail.

For example, first a request acquires:

in->filelock
in->authlock

Then later, in another acquire_locks() call that tries to acquire dn->lock, if we check the parents of the locks, the order will be:

in->filelock
in->authlock
dn->lock

But actually it should be

#define CEPH_LOCK_DN (1 << 0) 
#define CEPH_LOCK_IFILE (1 << 7) 
#define CEPH_LOCK_IAUTH (1 << 10)

So IMO we should skip the parent check anyway.

Did I miss anything here? Hope we are on the same page :-)

Member

The locking order is not simply the lock type. It is documented here:

ceph/src/mds/MDCache.cc

Lines 8264 to 8281 in 367a849

/**
* In 246f647566095c173e5e0e54661696cea230f96e, an updated rule for locking order
* was established (differing from past strategies):
*
* [The helper function is for requests that operate on two paths. It
* ensures that the two paths get locks in proper order.] The rule is:
*
* 1. Lock directory inodes or dentries according to which trees they
* are under. Lock objects under fs root before objects under mdsdir.
* 2. Lock directory inodes or dentries according to their depth, in
* ascending order.
* 3. Lock directory inodes or dentries according to inode numbers or
* dentries' parent inode numbers, in ascending order.
* 4. Lock dentries in the same directory in order of their keys.
* 5. Lock non-directory inodes according to inode numbers, in ascending
* order.
*/

but I'm unsure if that captures all of the implicit (by code) rules.
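Rules 1-5 in the quoted comment amount to a lexicographic comparison. A hedged sketch with hypothetical field names (not the actual CInode/CDentry members) could look like this:

```cpp
#include <string>
#include <tuple>

// Illustrative stand-in for "the thing a lock protects"; field names are
// hypothetical, chosen to mirror the five documented rules.
struct LockTarget {
  bool under_mdsdir;       // rule 1: fs-root tree sorts before mdsdir tree
  int depth;               // rule 2: ascending depth
  unsigned long ino;       // rules 3/5: ascending inode (or parent inode) number
  std::string dentry_key;  // rule 4: dentries in the same directory by key
};

// True iff 'a' should be locked before 'b' under the documented rules.
bool lock_before(const LockTarget& a, const LockTarget& b) {
  return std::tie(a.under_mdsdir, a.depth, a.ino, a.dentry_key) <
         std::tie(b.under_mdsdir, b.depth, b.ino, b.dentry_key);
}
```

As the comment notes, this is only the ordering for objects; the per-object lock-type ordering discussed earlier applies on top of it, and the code may carry additional implicit rules not captured here.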

  }
  dout(15) << " lock " << *lock << " any_waiter " << any_waiter << " order_mess " << order_mess << dendl;

  if (order_mess && any_waiter) {
Member

This needs to be turned on via a config. I think it should be off by default.

QA should turn it on and this method should message the clog with an ERR that we are resolving a possible deadlock so it fails tests.

Member Author

This sounds reasonable and I will add it.

@lxbsz lxbsz marked this pull request as ready for review April 22, 2024 01:26
@lxbsz
Member Author

lxbsz commented Apr 22, 2024

There must be a safe way to drop locks and restart requests, but it may require significantly more effort to do so reliably. I think we should start slow and begin by identifying requests that take incremental locks in the wrong order, and then decide how to deal with them

Yeah, what I am thinking is to always make the newest request drop the locks instead. As a first step, let's just add the code to detect the deadlocks and then see how to improve it later.

@lxbsz
Member Author

lxbsz commented Apr 22, 2024

The only safe way to recover from a deadlock IMO is

  1. make sure to detect a deadlock without false positives. This will involve MDR cross-check.
  2. out of two requests that are about to deadlock, pick a client request, if both are client requests then pick the youngest
  3. drop the chosen MDR along with its locks and start a new one from the same mclient request

The key here is to re-initialize the new MDR from scratch, which is possible given the original MClientRequest we can extract from the dropped MDR. It may not be possible to do this safely for internal requests, so if we have two internal requests that deadlock we should probably assert.

Yeah, this is almost the same as what I mentioned in #56805 (comment). For now we need to detect where and how the deadlock could happen; some cases are really strange and hard to figure out, such as the one in [ceph-users] MDS Behind on Trimming.... Doing this will certainly involve a lot of effort, so let's do it after we can clearly figure it out with the debug logs.

- name: mds_detect_deadlocks
  type: bool
  level: advanced
  desc: flag to detect deadlocks bebtween requests.
Contributor

s/bebtween/between

  type: bool
  level: advanced
  desc: flag to detect deadlocks bebtween requests.
  fmt_desc: To enable it to detect deadlocks bebtween requests.
Contributor

ditto

Member Author

Fixed them both. Thanks @dparmar18

    any_waiter = true;
  }
  if (lock->get_type() > lock_type) {
    dout(1) << " detected mess lock order, locks: " << mdr->locks << dendl;
Contributor

Did you mean messy/messed? Also, I think the space after "detected" is not needed, unless you'd want to have __func__.

Member Author

I'd prefer to keep this space, because this is inside the loop and many similar items could be printed.

Comment on lines +248 to +251
if (lock->get_type() > lock_type) {
  dout(1) << " detected mess lock order, locks: " << mdr->locks << dendl;
  order_mess = true;
}
Contributor

@dparmar18 dparmar18 Apr 22, 2024

(Just trying to understand locks better.) This is the list I see in ceph_fs.h:

#define CEPH_LOCK_DN          (1 << 0)
#define CEPH_LOCK_DVERSION    (1 << 1)
#define CEPH_LOCK_IQUIESCE    (1 << 4)  /* mds internal */
#define CEPH_LOCK_ISNAP       (1 << 5)  /* snapshot lock. MDS internal */
#define CEPH_LOCK_IPOLICY     (1 << 6)  /* policy lock on dirs. MDS internal */
#define CEPH_LOCK_IFILE       (1 << 7)
#define CEPH_LOCK_INEST       (1 << 8)  /* mds internal */
#define CEPH_LOCK_IDFT        (1 << 9)  /* dir frag tree */
#define CEPH_LOCK_IAUTH       (1 << 10)
#define CEPH_LOCK_ILINK       (1 << 11)
#define CEPH_LOCK_IXATTR      (1 << 12)
#define CEPH_LOCK_IFLOCK      (1 << 13)  /* advisory file locks */
#define CEPH_LOCK_IVERSION    (1 << 14)  /* mds internal */

When we say lock->get_type() > lock_type is a mess, does it mean that the locks less than lock_type have already been acquired and are in order?

Contributor

@leonid-s-usov leonid-s-usov Apr 22, 2024

@dparmar18 when multiple locks need to be taken, they should be taken in a pre-defined order. If that order will be observed by all concurrent lockers, no deadlocks will occur.

The CephFS approach to this is to sort the locks in acquire_locks by the lock type:

bool Locker::acquire_locks(const MDRequestRef& mdr,
			   MutationImpl::LockOpVec& lov,
 ...
{
  ...
  lov.sort_and_merge();
  ...

This would be sufficient if all requests always specified the full list of locks they needed upfront, but that's not the case, and in many instances, it would be counter-productive.

Some flows start by taking a subset of locks and later add more locks as they progress. Such flows must be careful, because if they began by only taking, say, the xattr lock, and then they want to take the file lock, they would violate the said lock order, and acquire_locks wouldn't be able to help here, since the locks are spread across separate calls to the method.

Such multi-phase flows should take the responsibility to only acquire the locks in the correct order, and if they need to violate it by taking new locks, then they have to drop any locks that they already hold with the lock_type greater than the lowest lock_type of the new locks they need. In my example, the request would have to drop the xattr lock and then acquire both the file and the xattr locks in a single call to maintain the order and prevent potential deadlocks.
It's not always possible to do, because dropping xattr lock would break the atomicity of the flow wrt the values protected by the xattr lock, which may or may not be acceptable. If we can't afford it, the solution for that example flow would be to take file lock along with xattr lock from the beginning, for atomicity.
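The drop-and-reacquire rule described above can be sketched as follows, using plain lock-type integers in place of the real LockOp structures (my own illustration, not MDS code):

```cpp
#include <algorithm>
#include <vector>

// Before a later phase acquires new locks, any held lock ordered *after*
// the lowest new lock must be dropped and then re-acquired together with
// the new ones in a single ordered batch.
std::vector<int> locks_to_drop(const std::vector<int>& held_types,
                               const std::vector<int>& new_types) {
  std::vector<int> drop;
  if (new_types.empty())
    return drop;
  int lowest_new = *std::min_element(new_types.begin(), new_types.end());
  for (int t : held_types)
    if (t > lowest_new)
      drop.push_back(t);  // held out of order relative to the new locks
  return drop;
}
```

For the xattr/file example above, holding IXATTR (1 << 12) while wanting IFILE (1 << 7) means IXATTR must be dropped first; holding locks all ordered below the new ones requires no drops.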

Contributor

Hey @leonid-s-usov thanks for taking time on this, couple of questions here:

This would be sufficient if all requests always specified the full list of locks they needed upfront, but that's not the case, and in many instances, it would be counter-productive.

Why/how could it be counter-productive? An already known lock set would mean better clarity?

Some flows start by taking a subset of locks and later add more locks as they progress. Such flows must be careful, because if they began by only taking, say, xattr lock, and then they want to take the file lock, they would violate the said lock order, and acquire_locks wouldn't be able to help here, as separate calls to the method mention the locks.

Such multi-phase flows should take the responsibility to only acquire the locks in the correct order, and if they need to violate it by taking new locks, then they have to drop any locks that they already hold with the lock_type greater than the lowest lock_type of the new locks they need.

  • As you mentioned earlier, some code paths start with a subset and then add locks as per their need. Till now (since this patch's need came up recently) I assume things worked fine before? And maybe the quiesce protocol broke things? Or some other new stuff that was added?
  • Shouldn't LockOpVec be used to make sure there are no such deadlocks? Rationale for this statement: it is a sequence of the lock ops to be performed to ensure consistency, right? So if we're now double-checking this, it means we aren't confident of the seq in LockOpVec? i.e. we aren't sure the seq in the LockOpVec is in the pre-defined order? Shouldn't the LockOpVec have a valid seq then? Or am I misunderstanding the use case of LockOpVec and it is not that smart?
  • If we're dropping locks just like this, won't it be disastrous? There can be some I/O; we preempt it, the other thread/process that should've waited now holds the lock, and the earlier process has some unfinished work left. Doesn't this lead to data corruption/inconsistencies/loss?

In my example, the request would have to drop the xattr lock and then acquire both the file and the xattr locks in a single call to maintain the order and prevent potential deadlocks. It's not always possible to do, because dropping xattr lock would break the atomicity of the flow wrt the values protected by the xattr lock, which may or may not be acceptable. If we can't afford it, the solution for that example flow would be to take file lock along with xattr lock from the beginning, for atomicity.

So let's say a request held the xattr lock and now wants the file lock; the MDS would drop the xattr lock, then acquire the file lock and then the xattr lock? And in the meantime, what happens if something else took the xattr lock?

Member Author

Hey @leonid-s-usov thanks for taking time on this, couple of questions here:

This would be sufficient if all requests always specified the full list of locks they needed upfront, but that's not the case, and in many instances, it would be counter-productive.

Why/how could it be counter-productive? An already known lock set would mean better clarity?

Some flows start by taking a subset of locks and later add more locks as they progress. Such flows must be careful, because if they began by only taking, say, xattr lock, and then they want to take the file lock, they would violate the said lock order, and acquire_locks wouldn't be able to help here, since the locks are spread across separate calls to the method.

Such multi-phase flows should take the responsibility to only acquire the locks in the correct order, and if they need to violate it by taking new locks, then they have to drop any locks that they already hold with the lock_type greater than the lowest lock_type of the new locks they need.

  • As you mentioned earlier, some code paths start with a subset and then add locks as they need them. Until now (since the need for this patch came up only recently) I assume things worked fine? Did the newly added quiesce protocol break things, or some other new code?

Mostly it worked well before, but recently we have hit at least two bugs caused by the lock order issue, please see #52522 and https://tracker.ceph.com/issues/65607.

They were not introduced by the quiesce changes.

  • Shouldn't LockOpVec be used to make sure there are no such deadlocks? Rationale for this statement: it is a sequence of the lock ops to be performed to ensure consistency, right? So if we're now double-checking this, it means we aren't confident of the sequence in LockOpVec, i.e. we aren't sure the sequence in the LockOpVec is in the pre-defined order? Shouldn't the LockOpVec have a valid sequence then? Or am I misunderstanding the use case of LockOpVec and it is just not that smart?

The LockOpVec can only guarantee the lock order within each acquire_locks() call; it cannot guarantee the order across multiple acquire_locks() calls.
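To illustrate the within-one-call guarantee, here is a hedged sketch (not the real LockOpVec code) that sorts a single batch of lock ops into the order this PR proposes. The CEPH_LOCK_* masks are copied from the commit message below; note that the proposed order does not match the raw bit values, so an explicit rank table is needed:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Bit masks as defined in the commit message of this PR.
constexpr int CEPH_LOCK_ISNAP    = 1 << 5;
constexpr int CEPH_LOCK_IPOLICY  = 1 << 6;
constexpr int CEPH_LOCK_IFILE    = 1 << 7;
constexpr int CEPH_LOCK_INEST    = 1 << 8;
constexpr int CEPH_LOCK_IDFT     = 1 << 9;
constexpr int CEPH_LOCK_IAUTH    = 1 << 10;
constexpr int CEPH_LOCK_ILINK    = 1 << 11;
constexpr int CEPH_LOCK_IXATTR   = 1 << 12;
constexpr int CEPH_LOCK_IFLOCK   = 1 << 13;
constexpr int CEPH_LOCK_IVERSION = 1 << 14;

// Rank per the order proposed in this PR: filelock, nestlock, authlock,
// linklock, snaplock, xattrlock, policylock, versionlock, flocklock,
// dirfragtreelock.
int lock_rank(int type) {
  switch (type) {
    case CEPH_LOCK_IFILE:    return 1;
    case CEPH_LOCK_INEST:    return 2;
    case CEPH_LOCK_IAUTH:    return 3;
    case CEPH_LOCK_ILINK:    return 4;
    case CEPH_LOCK_ISNAP:    return 5;
    case CEPH_LOCK_IXATTR:   return 6;
    case CEPH_LOCK_IPOLICY:  return 7;
    case CEPH_LOCK_IVERSION: return 8;
    case CEPH_LOCK_IFLOCK:   return 9;
    case CEPH_LOCK_IDFT:     return 10;
    default:                 return 0;
  }
}

// Within one acquire_locks() call, the batch can be sorted into the
// global order before any lock is taken. Across two separate calls no
// such sort is possible, which is exactly the gap discussed here.
void sort_ops(std::vector<int>& ops) {
  std::stable_sort(ops.begin(), ops.end(),
                   [](int a, int b) { return lock_rank(a) < lock_rank(b); });
}
```

A second call that adds, say, IFILE after a first call already took IXATTR cannot be fixed by this sort, because the sort never sees both locks together.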

  • If we're dropping locks just like this, won't it be disastrous? There can be some I/O in flight; we preempt it, the other thread/process that should have waited now holds the lock, and the earlier process has unfinished work left. Doesn't this lead to data corruption/inconsistencies/loss?

There are already several places that simply drop the locks, and we haven't seen any report of data corruption/inconsistency/loss caused by it yet.

We have noticed the issue you mentioned, and @leonid-s-usov and I agreed that we should always drop the locks of the youngest request instead.

In my example, the request would have to drop the xattr lock and then acquire both the file and the xattr locks in a single call to maintain the order and prevent potential deadlocks. It's not always possible to do, because dropping xattr lock would break the atomicity of the flow wrt the values protected by the xattr lock, which may or may not be acceptable. If we can't afford it, the solution for that example flow would be to take file lock along with xattr lock from the beginning, for atomicity.

So let's say a request held the xattr lock and now wants the file lock; the MDS would drop the xattr lock, then acquire the file lock and then the xattr lock? And in the meantime, what happens if something else took the xattr lock?

No, for the next retry of the request the lock order won't change.

Contributor

☝🏻 What Xiubo said!

Why/how could it be counter-productive? An already known lock set would mean better clarity?

It depends on the particular request's logic. For a complicated process, it may be too expensive to take all locks at once when there are many different paths the logic could take.

till now (since this patch's need came up recently) I assume things worked fine before?

Violating the lock order doesn't guarantee a deadlock, it just makes one possible. We have very few known deadlock cases (see Xiubo's response for the recent ones), as the common ones have already been addressed. Nevertheless, we should know about any requests that break the locking order before slight changes to process timing, or new requests that are written correctly but execute at the "wrong" time, get us into more deadlocks.

If we're dropping locks just like this, won't it be disastrous?

This needs to be analyzed and concluded for every given flow, there's no generic response here.

Contributor

@dparmar18 dparmar18 Apr 24, 2024

@lxbsz

We have noticed the issue you mentioned and @leonid-s-usov and me agreed we should always drop the locks for the youngest request instead.

So we're assuming that old requests should have completed their workload/I/O? Is this always the case? If only the new requests are taken into consideration, does that mean we're okay with older requests holding out-of-order locks?

Contributor

@leonid-s-usov

Violating the lock order doesn't guarantee a deadlock, it just makes one possible. We have very few known deadlock cases (see Xiubo's response for the recent ones), as the common ones have already been addressed. Nevertheless, we should know about any requests that break the locking order before slight changes to process timing, or new requests that are written correctly but execute at the "wrong" time, get us into more deadlocks.

So a wrong timing would actually be a good timing, since it avoids the deadlocks; is this always guaranteed in distributed systems? Can we always be sure the timings are as intended?

Contributor

@leonid-s-usov leonid-s-usov Apr 24, 2024

So a wrong timing would actually be a good timing, since it avoids the deadlocks; is this always guaranteed in distributed systems? Can we always be sure the timings are as intended?

There are no timing guarantees for client-initiated processing in any system, let alone a distributed one. Even for internally scheduled processes there may be uncertain timing considerations, especially if we're running on a multi-CPU system with multiple worker threads.

While repeating usage patterns tend to produce similar timing of events, if something can occur out of order, it will do so sooner or later, subject to countless external conditions.

By taking locks in the correct order every time, we eliminate a whole family of potential deadlocks, reducing our reliance on "proper timing".
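The point about a global order eliminating a whole family of deadlocks can be shown with a generic (non-MDS) sketch: two resources with fixed ranks, and a helper that always locks the lower rank first, whichever order the caller names them in. This removes the classic ABBA deadlock between two threads:

```cpp
#include <cassert>
#include <mutex>

// A resource with a fixed rank in a global lock order. Illustrative
// only; the MDS uses per-lock-type ranks rather than per-object ones.
struct Ranked {
  int rank;
  std::mutex m;
};

// Locks both resources in rank order, regardless of argument order.
// Returns the rank locked first so the ordering is observable.
int lock_both(Ranked& a, Ranked& b) {
  Ranked* first = (a.rank < b.rank) ? &a : &b;
  Ranked* second = (first == &a) ? &b : &a;
  first->m.lock();
  second->m.lock();
  return first->rank;
}

void unlock_both(Ranked& a, Ranked& b) {
  a.m.unlock();
  b.m.unlock();
}
```

Without the rank check, one thread calling with (a, b) and another with (b, a) could each grab its first mutex and wait forever on the second; with the check, both threads contend on the same mutex first, so neither can hold the second while waiting for the first.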

So we're assuming that old requests should've completed their workload/IO? Is this always the case? If only the new requests are taken into consideration, that means we're okay with old requests out-of-order locks?

From the client's perspective, it doesn't make any difference which request got re-scheduled internally by the MDS as long as both requests finish eventually.

For one, it seems fair to let the older request proceed and the newer request retry.

Secondly, we need to better understand what it means to "retry". It's yet unclear whether all our requests are immune to a sudden replay, but I assume that should be the case given the journaling, the request identifiers, and the other recovery means that Ceph has.
If we're unsure, we could additionally argue that there are fewer chances for the new request to suffer from a retry just because it has probably done less by now, but that's a very shaky argument. However, it supports the approach of picking the newer request for the retry, letting the older one proceed as we drop locks.

Finally, we should mention that retry itself is a dangerous endeavor. I don't think we are ready to introduce it, because just dropping the locks and rescheduling may not work for all requests. We need to properly reinitialize internal request state to make sure it genuinely starts anew, avoiding some unexpected states. I think that the safest way to do this would be to destroy the MDR and re-create it from MClientRequest we are holding on to, as if we just received it from the client. This is only possible for client requests, apparently; I don't think we have a good restart solution for internal requests. With this in mind, it's possible that when we detect a deadlock between a client request and an internal request, we may be forced to retry the client request even if it's older.

Another point to note is that if we go for a retry to avoid a deadlock, we need to be sure it would be a real deadlock, which requires cross-checking all other MDRs that are waiting for any of the locks we already hold: a Cartesian product of checks.
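Confirming a real deadlock amounts to finding a cycle in a wait-for graph of requests. A minimal sketch of that check, under the assumption that such a graph could be built from the MDR wait lists (the MDS has no such structure today; names are hypothetical):

```cpp
#include <cassert>
#include <map>
#include <set>
#include <vector>

// A toy wait-for graph: an edge r1 -> r2 means request r1 is waiting
// for a lock currently held by request r2. A cycle means a deadlock.
using Graph = std::map<int, std::vector<int>>;

// Depth-first search; a node found on the current recursion stack
// means we followed a back edge, i.e. a cycle.
static bool dfs(const Graph& g, int node, std::set<int>& stack,
                std::set<int>& done) {
  if (stack.count(node)) return true;
  if (done.count(node)) return false;
  stack.insert(node);
  auto it = g.find(node);
  if (it != g.end())
    for (int next : it->second)
      if (dfs(g, next, stack, done)) return true;
  stack.erase(node);
  done.insert(node);
  return false;
}

bool has_deadlock(const Graph& g) {
  std::set<int> stack, done;
  for (const auto& kv : g)
    if (dfs(g, kv.first, stack, done)) return true;
  return false;
}
```

The cost of building and checking such a graph against every waiting MDR is exactly the Cartesian-product expense mentioned above, which is why merely warning about out-of-order acquisition is the cheaper first step.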

If we decide to postpone active deadlock evasion, and instead focus on collecting info about potential deadlocks, then we can simply warn about flows that take locks out of order, and then decide what to do with those flows.

Contributor

Finally, we should mention that retry itself is a dangerous endeavor. I don't think we are ready to introduce it, because just dropping the locks and rescheduling may not work for all requests. We need to properly reinitialize internal request state to make sure it genuinely starts anew, avoiding some unexpected states. I think that the safest way to do this would be to destroy the MDR and re-create it from MClientRequest we are holding on to, as if we just received it from the client. This is only possible for client requests, apparently; I don't think we have a good restart solution for internal requests. With this in mind, it's possible that when we detect a deadlock between a client request and an internal request, we may be forced to retry the client request even if it's older.

Won't adding more complexity to the flow incur additional turbulence? If a newer request gets retried internally and the client retries too, what happens? Do we stop the client from retrying momentarily?

Contributor

Won't adding more complexity to the flow incur additional turbulence? If a newer request gets retried internally and the client retries too, what happens? Do we stop the client from retrying momentarily?

I don't think so. The Server shouldn't care whether the request is retried by a client or internally. If we're prepared to serve the client at any time, we're prepared for everything. Idempotency is our foundation here; we need to inject the internally retried request early enough to go through all the required checks, e.g. the transaction id check.

lxbsz added 2 commits April 23, 2024 08:24
Some requests may call acquire_locks() more than once, which can break
the lock order:

    #define CEPH_LOCK_DN          (1 << 0)
    #define CEPH_LOCK_DVERSION    (1 << 1)
    #define CEPH_LOCK_IQUIESCE    (1 << 4)
    #define CEPH_LOCK_ISNAP       (1 << 5)
    #define CEPH_LOCK_IPOLICY     (1 << 6)
    #define CEPH_LOCK_IFILE       (1 << 7)
    #define CEPH_LOCK_INEST       (1 << 8)
    #define CEPH_LOCK_IDFT        (1 << 9)
    #define CEPH_LOCK_IAUTH       (1 << 10)
    #define CEPH_LOCK_ILINK       (1 << 11)
    #define CEPH_LOCK_IXATTR      (1 << 12)
    #define CEPH_LOCK_IFLOCK      (1 << 13)
    #define CEPH_LOCK_IVERSION    (1 << 14)

As a result, multiple requests could deadlock.

This commit tries to detect possible deadlocks and then drops the
locks already acquired by the current request.

Fixes: https://tracker.ceph.com/issues/62123
Signed-off-by: Xiubo Li <xiubli@redhat.com>
This option is disabled by default; it will be enabled for the
cephfs qa tests.

Fixes: https://tracker.ceph.com/issues/62123
Signed-off-by: Xiubo Li <xiubli@redhat.com>
@github-actions

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@github-actions github-actions bot added the stale label Jun 28, 2024
@github-actions
Copy link

This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!

@github-actions github-actions bot closed this Jul 28, 2024
@lxbsz lxbsz reopened this Jul 29, 2024
@github-actions github-actions bot added tests and removed stale labels Jul 29, 2024
@github-actions github-actions bot added the stale label Sep 27, 2024
@github-actions github-actions bot closed this Oct 27, 2024