
mds: adjust the locks order #56805

Closed
lxbsz wants to merge 2 commits into ceph:main from lxbsz:wip-62123

Conversation

@lxbsz
Member

@lxbsz lxbsz commented Apr 10, 2024

This will make the locks be acquired in the following order:

1. filelock
2. nestlock
3. authlock
4. linklock
5. snaplock
6. xattrlock
7. policylock
8. versionlock
9. flocklock
10. dirfragtreelock

Fixes: https://tracker.ceph.com/issues/62123
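For illustration only (this is not code from the PR), the proposed order could be captured in a rank table plus a helper that compares two lock kinds; all names below are hypothetical:

```cpp
#include <array>
#include <cstddef>

// Hypothetical encoding of the proposed order; names are illustrative only,
// not the MDS's actual lock types.
enum LockKind {
  FILELOCK, NESTLOCK, AUTHLOCK, LINKLOCK, SNAPLOCK, XATTRLOCK,
  POLICYLOCK, VERSIONLOCK, FLOCKLOCK, DIRFRAGTREELOCK, NUM_LOCK_KINDS
};

// Position in this array == acquisition rank (lower rank acquires first).
constexpr std::array<LockKind, NUM_LOCK_KINDS> kProposedOrder = {
  FILELOCK, NESTLOCK, AUTHLOCK, LINKLOCK, SNAPLOCK, XATTRLOCK,
  POLICYLOCK, VERSIONLOCK, FLOCKLOCK, DIRFRAGTREELOCK,
};

constexpr std::size_t rank_of(LockKind k) {
  std::size_t i = 0;
  while (i < kProposedOrder.size() && kProposedOrder[i] != k)
    ++i;
  return i;
}

// True iff 'a' must be acquired before 'b' under the proposed order.
constexpr bool must_precede(LockKind a, LockKind b) {
  return rank_of(a) < rank_of(b);
}
```

Any ordering table like this only prevents deadlocks if every code path acquires locks consistently with it, which is what the rest of the discussion is about.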

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows
  • jenkins test rook e2e

@lxbsz lxbsz requested a review from a team April 10, 2024 03:04
@github-actions github-actions bot added the cephfs Ceph File System label Apr 10, 2024
@leonid-s-usov
Contributor

Shouldn't the order be aligned with

#define CEPH_LOCK_DN          (1 << 0)
#define CEPH_LOCK_DVERSION    (1 << 1)
#define CEPH_LOCK_IQUIESCE    (1 << 4)  /* mds internal */
#define CEPH_LOCK_ISNAP       (1 << 5)  /* snapshot lock. MDS internal */
#define CEPH_LOCK_IPOLICY     (1 << 6)  /* policy lock on dirs. MDS internal */
#define CEPH_LOCK_IFILE       (1 << 7)
#define CEPH_LOCK_INEST       (1 << 8)  /* mds internal */
#define CEPH_LOCK_IDFT        (1 << 9)  /* dir frag tree */
#define CEPH_LOCK_IAUTH       (1 << 10)
#define CEPH_LOCK_ILINK       (1 << 11)
#define CEPH_LOCK_IXATTR      (1 << 12)
#define CEPH_LOCK_IFLOCK      (1 << 13)  /* advisory file locks */
#define CEPH_LOCK_IVERSION    (1 << 14)  /* mds internal */

?

if ((mask & CEPH_CAP_LINK_SHARED) && !(issued & CEPH_CAP_LINK_EXCL))
  lov.add_rdlock(&ref->linklock);
if ((mask & CEPH_CAP_XATTR_SHARED) && !(issued & CEPH_CAP_XATTR_EXCL))
  lov.add_rdlock(&ref->xattrlock);
Member

We don't need to actually sort these when adding to the LockOpVec:

ceph/src/mds/Mutation.cc

Lines 116 to 162 in 97cb38d

void MutationImpl::LockOpVec::sort_and_merge()
{
  // sort locks on the same object
  auto cmp = [](const LockOp &l, const LockOp &r) {
    ceph_assert(l.lock->get_parent() == r.lock->get_parent());
    return l.lock->type->type < r.lock->type->type;
  };
  for (auto i = begin(), j = i; ; ++i) {
    if (i == end()) {
      std::sort(j, i, cmp);
      break;
    }
    if (j->lock->get_parent() != i->lock->get_parent()) {
      std::sort(j, i, cmp);
      j = i;
    }
  }
  // merge ops on the same lock
  for (auto i = end() - 1; i > begin(); ) {
    auto j = i;
    while (--j >= begin()) {
      if (i->lock != j->lock)
        break;
    }
    if (i - j == 1) {
      i = j;
      continue;
    }
    // merge
    ++j;
    for (auto k = i; k > j; --k) {
      if (k->is_remote_wrlock()) {
        ceph_assert(!j->is_remote_wrlock());
        j->wrlock_target = k->wrlock_target;
      }
      j->flags |= k->flags;
    }
    if (j->is_xlock()) {
      // xlock overwrites other types
      ceph_assert(!j->is_remote_wrlock());
      j->flags = LockOp::XLOCK;
    }
    erase(j + 1, i + 1);
    i = j - 1;
  }
}

because the locks are sorted before acquisition.

I think what needs to happen for

https://tracker.ceph.com/issues/62123

is to modify Locker::acquire_locks to look at the locks held by the mutation to ensure that it does not already hold locks (not necessarily in the LockOpVec) that would reverse the locking order if the locks in the LockOpVec are acquired.

That is probably an expensive consistency check so it would need to be a debug config that turns it on.
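A minimal sketch of such a consistency check, using illustrative stand-in types rather than the real Locker/SimpleLock interfaces (none of these names exist in the MDS), might look like:

```cpp
#include <algorithm>
#include <vector>

// Stand-in for a held lock, ordered by its numeric type (hypothetical).
struct HeldLock { int type; };

// Returns true when acquiring 'wanted' would reverse the locking order:
// i.e. the mutation already holds a lock ordered *after* the lowest lock
// it is about to take. 'check_enabled' models the debug config gate.
bool violates_lock_order(const std::vector<HeldLock>& held,
                         const std::vector<HeldLock>& wanted,
                         bool check_enabled) {
  if (!check_enabled || wanted.empty())
    return false;
  // Lowest-ordered lock we are about to acquire.
  int lowest_wanted = std::min_element(
      wanted.begin(), wanted.end(),
      [](const HeldLock& a, const HeldLock& b) { return a.type < b.type; })->type;
  // Any already-held lock with a higher order would be a reversal.
  return std::any_of(held.begin(), held.end(),
                     [&](const HeldLock& h) { return h.type > lowest_wanted; });
}
```

The real check would also have to compare lock parents and handle remote/xlock flavors; this only shows the shape of the ordering test.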

Member Author

Yeah, you are correct. I missed this part. I will take another look at this. Thanks @batrick

Contributor

That is probably an expensive consistency check so it would need to be a debug config that turns it on.

Not just that, the check should be configurable with a flag on the acquire locks call to be able to suppress it for some edge cases where we know what we are doing.

Member Author

Actually, the issue here is that when a single request triggers two acquire_locks() calls, it can break the lock order and then cause lock-order deadlocks between multiple requests.
In any case, we should start the request over when the second acquire_locks() fails, as @batrick did in #52522, which I think fixed only one part of the problem, not all of it.

@batrick @leonid-s-usov I have pushed a new commit to fix the getattr and setattr only for now, please check and if this is the correct direction I will continue to fix all the other cases.

Thanks.

@lxbsz
Member Author

lxbsz commented Apr 10, 2024

Shouldn't the order be aligned with

#define CEPH_LOCK_DN          (1 << 0)
#define CEPH_LOCK_DVERSION    (1 << 1)
#define CEPH_LOCK_IQUIESCE    (1 << 4)  /* mds internal */
#define CEPH_LOCK_ISNAP       (1 << 5)  /* snapshot lock. MDS internal */
#define CEPH_LOCK_IPOLICY     (1 << 6)  /* policy lock on dirs. MDS internal */
#define CEPH_LOCK_IFILE       (1 << 7)
#define CEPH_LOCK_INEST       (1 << 8)  /* mds internal */
#define CEPH_LOCK_IDFT        (1 << 9)  /* dir frag tree */
#define CEPH_LOCK_IAUTH       (1 << 10)
#define CEPH_LOCK_ILINK       (1 << 11)
#define CEPH_LOCK_IXATTR      (1 << 12)
#define CEPH_LOCK_IFLOCK      (1 << 13)  /* advisory file locks */
#define CEPH_LOCK_IVERSION    (1 << 14)  /* mds internal */

?

This sounds like a good idea, but it would require a lot of changes to the path_traverse() code, which is very complicated. I do think it's risky.

@vshankar @batrick What do you think?

@lxbsz lxbsz marked this pull request as draft April 10, 2024 13:23
@batrick
Member

batrick commented Apr 10, 2024

This sounds like a good idea, but it would require a lot of changes to the path_traverse() code, which is very complicated. I do think it's risky.

Change path_traverse how?

@lxbsz
Member Author

lxbsz commented Apr 12, 2024

This sounds like a good idea, but it would require a lot of changes to the path_traverse() code, which is very complicated. I do think it's risky.

Change path_traverse how?

@batrick Please ignore this.


-  if (!mds->locker->acquire_locks(mdr, lov))
+  if (!mds->locker->acquire_locks(mdr, lov)) {
+    restart_over_the_request(mdr);
Member

This doesn't look right. Anytime the Locker is nudging a lock state (perhaps due to conflicting caps) then we will drop all locks. That will prevent the request from making progress.

Member Author

@lxbsz lxbsz Apr 15, 2024

Then I need to do this another way, maybe a more complex one. I am wondering whether we should drop the newer request's locks instead of always dropping the current one's.

#define CEPH_LOCK_DN          (1 << 0)
#define CEPH_LOCK_DVERSION    (1 << 1)
#define CEPH_LOCK_IQUIESCE    (1 << 4)  /* mds internal */
#define CEPH_LOCK_ISNAP       (1 << 5)  /* snapshot lock. MDS internal */
#define CEPH_LOCK_IPOLICY     (1 << 6)  /* policy lock on dirs. MDS internal */
#define CEPH_LOCK_IFILE       (1 << 7)
#define CEPH_LOCK_INEST       (1 << 8)  /* mds internal */
#define CEPH_LOCK_IDFT        (1 << 9)  /* dir frag tree */
#define CEPH_LOCK_IAUTH       (1 << 10)
#define CEPH_LOCK_ILINK       (1 << 11)
#define CEPH_LOCK_IXATTR      (1 << 12)
#define CEPH_LOCK_IFLOCK      (1 << 13)  /* advisory file locks */
#define CEPH_LOCK_IVERSION    (1 << 14)  /* mds internal */

For example:

Step   Older req1                                    Newer req2
 1     - comes
 2     - acquire_locks(ISNAP, IFILE and IAUTH),      - comes
         then for some reason goes to sleep
 3                                                   - acquire_locks(INEST and ILINK),
                                                       then also goes to sleep for some reason
 4     - wakes up and the second
         acquire_locks(INEST and IXATTR) fails
         since req2 already holds `INEST`

 5                                                   - wakes up and then the second
                                                       acquire_locks(IAUTH) fails since req1
                                                       holds `IAUTH`

Then req1 and req2 will deadlock.

I am wondering whether we should make the older request, req1, drop its locks instead of req2. In the above example, req1 will detect the possible deadlock in acquire_locks() in Step 4 first. So if we drop req1's locks, req2 could finish earlier than req1.

I find that the current code does this without considering the requests' initiation times; will that be a problem? @batrick
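The two-request conflict above can be checked mechanically. Here is a small sketch (my own illustration, with plain integers standing in for the CEPH_LOCK bit values and the per-object dimension elided):

```cpp
#include <algorithm>
#include <vector>

// A request "waits for" another if it wants a lock the other already holds.
struct ReqState {
  std::vector<int> held;
  std::vector<int> wanted;
};

bool waits_for(const ReqState& a, const ReqState& b) {
  for (int w : a.wanted)
    if (std::find(b.held.begin(), b.held.end(), w) != b.held.end())
      return true;
  return false;
}

// Two requests deadlock when each waits for a lock the other holds.
bool deadlocked(const ReqState& a, const ReqState& b) {
  return waits_for(a, b) && waits_for(b, a);
}
```

Plugging in the scenario above (req1 holds ISNAP/IFILE/IAUTH and wants INEST/IXATTR; req2 holds INEST/ILINK and wants IAUTH) reports a deadlock, since each request wants a lock the other holds.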

Contributor

@leonid-s-usov leonid-s-usov Apr 15, 2024

What you describe is a deadlock, though not because of a lack of interfaces, but rather due to the bad locking order by req1 and/or req2.

We should be careful updating lock orders in bulk, since the current stable state may be kicked out of balance, even if there are a few places where it would deadlock today.

In general, requests should try to take all the locks they need in a single call to acquire_locks, unless they have multiple phases. In the latter case, the request should be responsible for releasing locks from phase 1 before re-acquiring them in phase 2 along with other locks. The only case where a request may keep locks from phase 1 into phase 2 is if they are ordered before the new locks from phase 2.

req1 violates this by taking INEST while holding IFILE. Whether req1 can afford dropping and then re-acquiring INEST along with IFILE and IXATTR is not known, but it's req1's responsibility to never try to lock backwards in that defined lock order.
Note that it is OK for req1 to continue holding ISNAP and IFILE, since no new locks are ordered before those two

Member Author

What you describe is a deadlock, though not because of a lack of interfaces, but rather due to the bad locking order by req1 and/or req2.

We should be careful updating lock orders in bulk, since the current stable state may be kicked out of balance, even if there are a few places where it would deadlock today.

In general, requests should try to take all the locks they need in a single call to acquire_locks, unless they have multiple phases. In the latter case, the request should be responsible for releasing locks from phase 1 before re-acquiring them in phase 2 along with other locks. The only case where a request may keep locks from phase 1 into phase 2 is if they are ordered before the new locks from phase 2.

req1 violates this by taking INEST while holding IFILE. Whether req1 can afford dropping and then re-acquiring INEST along with IFILE and IXATTR is not known, but it's req1's responsibility to never try to lock backwards in that defined lock order. Note that it is OK for req1 to continue holding ISNAP and IFILE, since no new locks are ordered before those two

Unfortunately, there are many cases that call acquire_locks() twice or more. For example, a getattr request will do it first when traversing the path in path_traverse() and then again in the _getattr() func. And this really will mess up the lock order.

It's not that easy to merge them into a single acquire_locks() call.

Contributor

Yes, understood, which is why I mentioned those multi-phase requests, where each phase takes additional locks. Those should be designed to only acquire locks in ascending order.

@leonid-s-usov
Contributor

Following @batrick's suggestion, we should add a trap for requests that attempt to acquire locks out of order.
I'm not sure whether it should be an assert or a warning. My first thought is that a warning would be more useful, as we can then analyze logs from our regular FS suite to see how often this is violated in general before we prevent it in a hard way.

@lxbsz
Member Author

lxbsz commented Apr 15, 2024

Following @batrick's suggestion, we should add a trap for requests that attempt to acquire locks out of order. I'm not sure whether it should be an assert or a warning. My first thought is that a warning would be more useful, as we can then analyze logs from our regular FS suite to see how often this is violated in general before we prevent it in a hard way.

Yeah, counting this somewhere sounds reasonable.

Locally I have a patch to detect this, but it will simply drop the locks for the current request. I think the most ideal approach is to drop the locks for the older request when the deadlock is detected. I will push the patch first, but it hasn't been well tested yet.

Contributor

@leonid-s-usov leonid-s-usov left a comment

Yes, that check looks good.

My concern though is the drop_locks. I'm worried that this can lead to some unwanted side effects for requests that don't expect locks that were acquired to be dropped. Those requests may have some non-idempotent code re-executed when replayed. Also, requests may have some custom cached parameters which need to be invalidated before replaying.

I believe that our first step should be to detect as many flows that violate the strict locking order as possible, and then decide on the actions we should take.

Please note that your current check may give (potentially, many) false positives even with any_waiters in place.

  }

  if (order_mess && any_waiter) {
    dout(10) << " detected possible dead locks; dropping all locks" << dendl;
Contributor

I believe this should be level 3 or even less, dropping locks is a desperate measure on our part.

Member Author

Sounds reasonable.

    dout(15) << " lock " << *lock << " any_waiter " << any_waiter << " order_mess " << order_mess << dendl;
  }

  if (order_mess && any_waiter) {
Contributor

We should add another high-level warning (i.e. dout (1 < x < 5)) for order_mess alone, maybe even in health to make the corresponding FS runs fail and have our attention

Member Author

Sure, will fix it.

  if (order_mess && any_waiter) {
    dout(10) << " detected possible dead locks; dropping all locks" << dendl;
    drop_locks(mdr.get(), NULL);
    mdr->drop_local_auth_pins();
Contributor

@leonid-s-usov leonid-s-usov Apr 15, 2024

If we chose to drop locks on a request then we must terminate this acquire_locks call with a false and put the request on the replay queue

Member Author

For this, rdlock_start(), etc. have already done it. IMO we still need to wait for the corresponding locks to be ready in rdlock_start(), etc., else this request could be fired frequently in a loop by doing this check here.

Contributor

@leonid-s-usov leonid-s-usov Apr 18, 2024

It shouldn't go into an infinite loop, because we assume it has progressed partially (i.e. it holds some locks) and is now about to deadlock (or we think it might). When we drop the locks it holds now, it will most probably get stuck on those once it re-executes. We can't afford to progress with the current call to acquire_locks, because the locks we've dropped may not be present in the lov vector.

We can't add the locks back to lov and re-lock everything right here, because it will break the request's assumptions about the atomicity of its operation.

So if we drop the locks, we need to abort and re-try the request as a whole.

Contributor

For this, rdlock_start(), etc. have already done it.

Yes, you are right, the request is already on a waiting list. But it's on a waiting list for the wrong lock. We need to re-run the request from scratch so that it restarts acquiring locks in the correct order, waiting on the appropriate locks this time.

@leonid-s-usov
Contributor

leonid-s-usov commented Apr 18, 2024

There must be a safe way to drop locks and restart requests, but it may require significantly more effort to do so reliably. I think we should start slow and begin by identifying requests that take incremental locks in the wrong order, and then decide how to deal with them

@leonid-s-usov
Contributor

leonid-s-usov commented Apr 18, 2024

The only safe way to recover from a deadlock IMO is

  1. make sure to detect a deadlock without false positives. This will involve MDR cross-check.
  2. out of two requests that are about to deadlock, pick a client request, if both are client requests then pick the youngest
  3. drop the chosen MDR along with its locks and start a new one from the same MClientRequest

The key here is to re-initialize the new MDR from scratch, which is possible given the original MClientRequest we can extract from the dropped MDR.
It may not be possible to do this safely for internal requests, so if we have two internal requests that deadlock we should probably assert.
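Steps 2 and 3 above could be sketched as a victim-selection helper. This is my own illustration, not existing MDS code; the types and the "higher seq = younger" encoding are hypothetical:

```cpp
#include <cstdint>

struct Req {
  bool is_client;     // client request vs MDS-internal request
  std::uint64_t seq;  // higher = younger (illustrative encoding)
};

// Returns 0 or 1: which of the two deadlocking requests to drop and replay.
// Returns -1 when both are internal (the comment suggests asserting there).
int choose_victim(const Req& a, const Req& b) {
  if (a.is_client && b.is_client)
    return a.seq >= b.seq ? 0 : 1;  // pick the youngest client request
  if (a.is_client)
    return 0;                       // prefer dropping the client request
  if (b.is_client)
    return 1;
  return -1;                        // two internal requests: no safe victim
}
```

The hard part in practice is step 1 (deadlock detection without false positives); the selection itself is cheap once a real cycle is confirmed.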

Member

@batrick batrick left a comment

Otherwise, this looks clean and good.

  if (lock->is_waiter_for(SimpleLock::WAIT_ALL)) {
    any_waiter = true;
  }
  if (lock->get_type() > lock_type) {
Member

You need to check the lock parent is the same as well.

Member Author

@lxbsz lxbsz Apr 22, 2024

@batrick Will this make sense?

#define CEPH_LOCK_DN          (1 << 0)
#define CEPH_LOCK_DVERSION    (1 << 1)
#define CEPH_LOCK_IQUIESCE    (1 << 4)  /* mds internal */
#define CEPH_LOCK_ISNAP       (1 << 5)  /* snapshot lock. MDS internal */
#define CEPH_LOCK_IPOLICY     (1 << 6)  /* policy lock on dirs. MDS internal */
#define CEPH_LOCK_IFILE       (1 << 7)
#define CEPH_LOCK_INEST       (1 << 8)  /* mds internal */
#define CEPH_LOCK_IDFT        (1 << 9)  /* dir frag tree */
#define CEPH_LOCK_IAUTH       (1 << 10)
#define CEPH_LOCK_ILINK       (1 << 11)
#define CEPH_LOCK_IXATTR      (1 << 12)
#define CEPH_LOCK_IFLOCK      (1 << 13)  /* advisory file locks */
#define CEPH_LOCK_IVERSION    (1 << 14)  /* mds internal */

The _DN lock's parent will always be different from the others, right?

For example, for the deadlock issue in [ceph-users] MDS Behind on Trimming..., the current blocked-ops output shows a deadlock between a CInode's _IFILE or _INEST and a _DN.

Member

No, I'm referring to SimpleLock::parent which is the CInode/CDentry the lock is protecting. You need to ensure that the parent is the same when comparing the locks.

Member Author

Let me explain this in more detail.

For example, first a request acquires:

in->filelock
in->authlock

Then later, in another acquire_locks() call that tries to acquire dn->lock, if we check the parents of the locks, the order will be:

in->filelock
in->authlock
dn->lock

But actually it should be

#define CEPH_LOCK_DN (1 << 0) 
#define CEPH_LOCK_IFILE (1 << 7) 
#define CEPH_LOCK_IAUTH (1 << 10)

So IMO we should skip the parent check anyway.

Did I miss anything here? Hope we are on the same page :-)

Member

The locking order is not simply the lock type. It is documented here:

ceph/src/mds/MDCache.cc

Lines 8264 to 8281 in 367a849

/**
* In 246f647566095c173e5e0e54661696cea230f96e, an updated rule for locking order
* was established (differing from past strategies):
*
* [The helper function is for requests that operate on two paths. It
* ensures that the two paths get locks in proper order.] The rule is:
*
* 1. Lock directory inodes or dentries according to which trees they
* are under. Lock objects under fs root before objects under mdsdir.
* 2. Lock directory inodes or dentries according to their depth, in
* ascending order.
* 3. Lock directory inodes or dentries according to inode numbers or
* dentries' parent inode numbers, in ascending order.
* 4. Lock dentries in the same directory in order of their keys.
* 5. Lock non-directory inodes according to inode numbers, in ascending
* order.
*/

but I'm unsure if that captures all of the implicit (by code) rules.
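Rules 1-5 in the quoted comment amount to a lexicographic comparison. A hedged sketch with hypothetical field names (not the actual CInode/CDentry members) could look like this:

```cpp
#include <string>
#include <tuple>

// Illustrative stand-in for "the thing a lock protects"; field names are
// hypothetical, chosen to mirror the five documented rules.
struct LockTarget {
  bool under_mdsdir;       // rule 1: fs-root tree sorts before mdsdir tree
  int depth;               // rule 2: ascending depth
  unsigned long ino;       // rules 3/5: ascending inode (or parent inode) number
  std::string dentry_key;  // rule 4: dentries in the same directory by key
};

// True iff 'a' should be locked before 'b' under the documented rules.
bool lock_before(const LockTarget& a, const LockTarget& b) {
  return std::tie(a.under_mdsdir, a.depth, a.ino, a.dentry_key) <
         std::tie(b.under_mdsdir, b.depth, b.ino, b.dentry_key);
}
```

As the comment notes, this is only the ordering for objects; the per-object lock-type ordering discussed earlier applies on top of it, and the code may carry additional implicit rules not captured here.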

  }
  dout(15) << " lock " << *lock << " any_waiter " << any_waiter << " order_mess " << order_mess << dendl;

  if (order_mess && any_waiter) {
Member

This needs to be turned on via a config. I think it should be off by default.

QA should turn it on and this method should message the clog with an ERR that we are resolving a possible deadlock so it fails tests.

Member Author

This sounds reasonable and I will add it.

@lxbsz lxbsz marked this pull request as ready for review April 22, 2024 01:26
@lxbsz
Member Author

lxbsz commented Apr 22, 2024

There must be a safe way to drop locks and restart requests, but it may require significantly more effort to do so reliably. I think we should start slow and begin by identifying requests that take incremental locks in the wrong order, and then decide how to deal with them

Yeah, what I am thinking is to always make the newest request drop the locks instead. As a first step, let's just add the code to detect the deadlocks and then see how to improve it later.

@lxbsz
Member Author

lxbsz commented Apr 22, 2024

The only safe way to recover from a deadlock IMO is

  1. make sure to detect a deadlock without false positives. This will involve MDR cross-check.
  2. out of two requests that are about to deadlock, pick a client request, if both are client requests then pick the youngest
  3. drop the chosen MDR along with its locks and start a new one from the same mclient request

The key here is to re-initialize the new MDR from scratch, which is possible given the original MClientRequest we can extract from the dropped MDR. It may not be possible to do this safely for internal requests, so if we have two internal requests that deadlock we should probably assert.

Yeah, this is almost the same as what I mentioned in #56805 (comment). For now we need to detect where and how the deadlock could happen; some cases are really strange and hard to figure out, such as the one in [ceph-users] MDS Behind on Trimming.... Doing this will certainly involve a lot of effort, so let's do it after we can clearly figure it out with the debug logs.

- name: mds_detect_deadlocks
  type: bool
  level: advanced
  desc: flag to detect deadlocks bebtween requests.
Contributor

s/bebtween/between

  type: bool
  level: advanced
  desc: flag to detect deadlocks bebtween requests.
  fmt_desc: To enable it to detect deadlocks bebtween requests.
Contributor

ditto

Member Author

Fixed them both. Thanks @dparmar18

    any_waiter = true;
  }
  if (lock->get_type() > lock_type) {
    dout(1) << " detected mess lock order, locks: " << mdr->locks << dendl;
Contributor

Did you mean messy/messed? Also, I think the space after "detected" is not needed, unless you'd want to have __func__.

Member Author

I'd prefer to keep this space, because this is inside the loop and many similar items could be printed.

Comment on lines +248 to +251
if (lock->get_type() > lock_type) {
  dout(1) << " detected mess lock order, locks: " << mdr->locks << dendl;
  order_mess = true;
}
Contributor

@dparmar18 dparmar18 Apr 22, 2024

(Just trying to understand locks better.) This is the list I see in ceph_fs.h:

#define CEPH_LOCK_DN          (1 << 0)
#define CEPH_LOCK_DVERSION    (1 << 1)
#define CEPH_LOCK_IQUIESCE    (1 << 4)  /* mds internal */
#define CEPH_LOCK_ISNAP       (1 << 5)  /* snapshot lock. MDS internal */
#define CEPH_LOCK_IPOLICY     (1 << 6)  /* policy lock on dirs. MDS internal */
#define CEPH_LOCK_IFILE       (1 << 7)
#define CEPH_LOCK_INEST       (1 << 8)  /* mds internal */
#define CEPH_LOCK_IDFT        (1 << 9)  /* dir frag tree */
#define CEPH_LOCK_IAUTH       (1 << 10)
#define CEPH_LOCK_ILINK       (1 << 11)
#define CEPH_LOCK_IXATTR      (1 << 12)
#define CEPH_LOCK_IFLOCK      (1 << 13)  /* advisory file locks */
#define CEPH_LOCK_IVERSION    (1 << 14)  /* mds internal */

When we say lock->get_type() > lock_type is a mess, does it mean that the locks less than lock_type have already been acquired and are in order?

Contributor

@leonid-s-usov leonid-s-usov Apr 22, 2024

@dparmar18 when multiple locks need to be taken, they should be taken in a pre-defined order. If that order will be observed by all concurrent lockers, no deadlocks will occur.

The CephFS approach to this is to sort the locks in acquire_locks by the lock type:

bool Locker::acquire_locks(const MDRequestRef& mdr,
			   MutationImpl::LockOpVec& lov,
 ...
{
  ...
  lov.sort_and_merge();
  ...

This would be sufficient if all requests always specified the full list of locks they needed upfront, but that's not the case, and in many instances, it would be counter-productive.

Some flows start by taking a subset of locks and later add more locks as they progress. Such flows must be careful, because if they began by only taking, say, the xattr lock, and then they want to take the file lock, they would violate the said lock order, and acquire_locks wouldn't be able to help here, since the locks are spread across separate calls to the method.

Such multi-phase flows should take the responsibility to only acquire the locks in the correct order, and if they need to violate it by taking new locks, then they have to drop any locks that they already hold with the lock_type greater than the lowest lock_type of the new locks they need. In my example, the request would have to drop the xattr lock and then acquire both the file and the xattr locks in a single call to maintain the order and prevent potential deadlocks.
It's not always possible to do, because dropping xattr lock would break the atomicity of the flow wrt the values protected by the xattr lock, which may or may not be acceptable. If we can't afford it, the solution for that example flow would be to take file lock along with xattr lock from the beginning, for atomicity.
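The drop-and-reacquire rule described above can be sketched as follows, using plain lock-type integers in place of the real LockOp structures (my own illustration, not MDS code):

```cpp
#include <algorithm>
#include <vector>

// Before a later phase acquires new locks, any held lock ordered *after*
// the lowest new lock must be dropped and then re-acquired together with
// the new ones in a single ordered batch.
std::vector<int> locks_to_drop(const std::vector<int>& held_types,
                               const std::vector<int>& new_types) {
  std::vector<int> drop;
  if (new_types.empty())
    return drop;
  int lowest_new = *std::min_element(new_types.begin(), new_types.end());
  for (int t : held_types)
    if (t > lowest_new)
      drop.push_back(t);  // held out of order relative to the new locks
  return drop;
}
```

For the xattr/file example above, holding IXATTR (1 << 12) while wanting IFILE (1 << 7) means IXATTR must be dropped first; holding locks all ordered below the new ones requires no drops.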

Contributor

Hey @leonid-s-usov thanks for taking time on this, couple of questions here:

This would be sufficient if all requests always specified the full list of locks they needed upfront, but that's not the case, and in many instances, it would be counter-productive.

Why/how could it be counter-productive? An already known lock set would mean better clarity?

Some flows start by taking a subset of locks and later add more locks as they progress. Such flows must be careful, because if they began by only taking, say, xattr lock, and then they want to take the file lock, they would violate the said lock order, and acquire_locks wouldn't be able to help here, as separate calls to the method mention the locks.

Such multi-phase flows should take the responsibility to only acquire the locks in the correct order, and if they need to violate it by taking new locks, then they have to drop any locks that they already hold with the lock_type greater than the lowest lock_type of the new locks they need.

  • As you mentioned earlier, some code paths start with a subset and then add locks as per their need. Till now (since this patch's need came up recently) I assume things worked fine before? And maybe the quiesce protocol broke things? Or some other new stuff that was added?
  • Shouldn't LockOpVec be used to make sure there are no such deadlocks? Rationale for this statement: it is a sequence of the lock ops to be performed to ensure consistency, right? So if we're now double-checking this, it means we aren't confident of the seq in LockOpVec? i.e. we aren't sure the seq in the LockOpVec is in the pre-defined order? Shouldn't the LockOpVec have a valid seq then? Or am I misunderstanding the use case of LockOpVec and it is not that smart?
  • If we're dropping locks just like this, won't it be disastrous? There can be some I/O; we preempt it, the other thread/process that should've waited now holds the lock, and the earlier process has some unfinished work left. Doesn't this lead to data corruption/inconsistencies/loss?

In my example, the request would have to drop the xattr lock and then acquire both the file and the xattr locks in a single call to maintain the order and prevent potential deadlocks. It's not always possible to do, because dropping xattr lock would break the atomicity of the flow wrt the values protected by the xattr lock, which may or may not be acceptable. If we can't afford it, the solution for that example flow would be to take file lock along with xattr lock from the beginning, for atomicity.

So let's say a request held the xattr lock and now wants the file lock; the MDS would drop the xattr lock, then acquire the file lock and then the xattr lock? And in the meantime, what happens if something else took the xattr lock?

Member Author

Hey @leonid-s-usov thanks for taking time on this, couple of questions here:

This would be sufficient if all requests always specified the full list of locks they needed upfront, but that's not the case, and in many instances, it would be counter-productive.

Why/how could it be counter-productive? An already known lock set would mean better clarity?

Some flows start by taking a subset of locks and later add more locks as they progress. Such flows must be careful, because if they began by only taking, say, xattr lock, and then they want to take the file lock, they would violate the said lock order, and acquire_locks wouldn't be able to help here, since the locks are spread across separate calls to the method.

Such multi-phase flows should take the responsibility to only acquire the locks in the correct order, and if they need to violate it by taking new locks, then they have to drop any locks that they already hold with the lock_type greater than the lowest lock_type of the new locks they need.

  • As you mentioned earlier, some code paths start with a subset and then add locks as they need them. Until now (since the need for this patch came up only recently) I assume things worked fine? Did the newly added quiesce protocol break things, or some other new code?

Mostly it worked well before, but recently we have hit at least two bugs caused by the lock order issue, please see #52522 and https://tracker.ceph.com/issues/65607.

They were not introduced by the quiesce changes.

  • Shouldn't LockOpVec be used to make sure there are no such deadlocks? Rationale for this statement: it is a sequence of the lock ops to be performed to ensure consistency, right? So if we're now double-checking this, it means we aren't confident of the sequence in LockOpVec, i.e. we aren't sure the sequence in the LockOpVec is in the pre-defined order? Shouldn't the LockOpVec have a valid sequence then? Or am I misunderstanding the use case of LockOpVec and it is just not that smart?

The LockOpVec can only guarantee the lock order within each acquire_locks() call; it cannot guarantee the order across multiple acquire_locks() calls.
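To illustrate the within-one-call guarantee, here is a hedged sketch (not the real LockOpVec code) that sorts a single batch of lock ops into the order this PR proposes. The CEPH_LOCK_* masks are copied from the commit message below; note that the proposed order does not match the raw bit values, so an explicit rank table is needed:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Bit masks as defined in the commit message of this PR.
constexpr int CEPH_LOCK_ISNAP    = 1 << 5;
constexpr int CEPH_LOCK_IPOLICY  = 1 << 6;
constexpr int CEPH_LOCK_IFILE    = 1 << 7;
constexpr int CEPH_LOCK_INEST    = 1 << 8;
constexpr int CEPH_LOCK_IDFT     = 1 << 9;
constexpr int CEPH_LOCK_IAUTH    = 1 << 10;
constexpr int CEPH_LOCK_ILINK    = 1 << 11;
constexpr int CEPH_LOCK_IXATTR   = 1 << 12;
constexpr int CEPH_LOCK_IFLOCK   = 1 << 13;
constexpr int CEPH_LOCK_IVERSION = 1 << 14;

// Rank per the order proposed in this PR: filelock, nestlock, authlock,
// linklock, snaplock, xattrlock, policylock, versionlock, flocklock,
// dirfragtreelock.
int lock_rank(int type) {
  switch (type) {
    case CEPH_LOCK_IFILE:    return 1;
    case CEPH_LOCK_INEST:    return 2;
    case CEPH_LOCK_IAUTH:    return 3;
    case CEPH_LOCK_ILINK:    return 4;
    case CEPH_LOCK_ISNAP:    return 5;
    case CEPH_LOCK_IXATTR:   return 6;
    case CEPH_LOCK_IPOLICY:  return 7;
    case CEPH_LOCK_IVERSION: return 8;
    case CEPH_LOCK_IFLOCK:   return 9;
    case CEPH_LOCK_IDFT:     return 10;
    default:                 return 0;
  }
}

// Within one acquire_locks() call, the batch can be sorted into the
// global order before any lock is taken. Across two separate calls no
// such sort is possible, which is exactly the gap discussed here.
void sort_ops(std::vector<int>& ops) {
  std::stable_sort(ops.begin(), ops.end(),
                   [](int a, int b) { return lock_rank(a) < lock_rank(b); });
}
```

A second call that adds, say, IFILE after a first call already took IXATTR cannot be fixed by this sort, because the sort never sees both locks together.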

  • If we're dropping locks just like this, won't it be disastrous? There can be some I/O in flight; we preempt it, the other thread/process that should have waited now holds the lock, and the earlier process has unfinished work left. Doesn't this lead to data corruption/inconsistencies/loss?

There are already several places that simply drop the locks, and we haven't seen any report of data corruption/inconsistency/loss caused by it yet.

We have noticed the issue you mentioned, and @leonid-s-usov and I agreed that we should always drop the locks of the youngest request instead.

In my example, the request would have to drop the xattr lock and then acquire both the file and the xattr locks in a single call to maintain the order and prevent potential deadlocks. It's not always possible to do, because dropping xattr lock would break the atomicity of the flow wrt the values protected by the xattr lock, which may or may not be acceptable. If we can't afford it, the solution for that example flow would be to take file lock along with xattr lock from the beginning, for atomicity.

So let's say a request held the xattr lock and now wants the file lock; the MDS would drop the xattr lock, then acquire the file lock and then the xattr lock? And in the meantime, what happens if something else took the xattr lock?

No, for the next retry of the request the lock order won't change.

Contributor

☝🏻 What Xiubo said!

Why/how could it be counter-productive? An already known lock set would mean better clarity?

It depends on the particular request's logic. For a complicated process, it may be too expensive to take all locks at once when there are many different paths the logic could take.

till now (since this patch's need came up recently) I assume things worked fine before?

Violating the lock order doesn't guarantee a deadlock, it just makes one possible. We have very few known deadlock cases (see Xiubo's response for the recent ones), as the common ones have already been addressed. Nevertheless, we should know about any requests that break the locking order before slight changes to process timing, or new requests that are written correctly but execute at the "wrong" time, get us into more deadlocks.

If we're dropping locks just like this, won't it be disastrous?

This needs to be analyzed and concluded for every given flow, there's no generic response here.

Contributor

@dparmar18 dparmar18 Apr 24, 2024

@lxbsz

We have noticed the issue you mentioned and @leonid-s-usov and me agreed we should always drop the locks for the youngest request instead.

So we're assuming that old requests should have completed their workload/I/O? Is this always the case? If only the new requests are taken into consideration, does that mean we're okay with older requests holding out-of-order locks?

Contributor

@leonid-s-usov

Violating the lock order doesn't guarantee a deadlock, it just makes one possible. We have very few known deadlock cases (see Xiubo's response for the recent ones), as the common ones have already been addressed. Nevertheless, we should know about any requests that break the locking order before slight changes to process timing, or new requests that are written correctly but execute at the "wrong" time, get us into more deadlocks.

So a wrong timing would actually be a good timing, since it avoids the deadlocks; is this always guaranteed in distributed systems? Can we always be sure the timings are as intended?

Contributor

@leonid-s-usov leonid-s-usov Apr 24, 2024

So a wrong timing would actually be a good timing, since it avoids the deadlocks; is this always guaranteed in distributed systems? Can we always be sure the timings are as intended?

There are no timing guarantees for client-initiated processing in any system, let alone a distributed one. Even for internally scheduled processes there may be uncertain timing considerations, especially if we're running on a multi-CPU system with multiple worker threads.

While repeating usage patterns tend to produce similar timing of events, if something can occur out of order, it will do so sooner or later, subject to countless external conditions.

By taking locks in the correct order every time, we eliminate a whole family of potential deadlocks, reducing our reliance on "proper timing".
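The point about a global order eliminating a whole family of deadlocks can be shown with a generic (non-MDS) sketch: two resources with fixed ranks, and a helper that always locks the lower rank first, whichever order the caller names them in. This removes the classic ABBA deadlock between two threads:

```cpp
#include <cassert>
#include <mutex>

// A resource with a fixed rank in a global lock order. Illustrative
// only; the MDS uses per-lock-type ranks rather than per-object ones.
struct Ranked {
  int rank;
  std::mutex m;
};

// Locks both resources in rank order, regardless of argument order.
// Returns the rank locked first so the ordering is observable.
int lock_both(Ranked& a, Ranked& b) {
  Ranked* first = (a.rank < b.rank) ? &a : &b;
  Ranked* second = (first == &a) ? &b : &a;
  first->m.lock();
  second->m.lock();
  return first->rank;
}

void unlock_both(Ranked& a, Ranked& b) {
  a.m.unlock();
  b.m.unlock();
}
```

Without the rank check, one thread calling with (a, b) and another with (b, a) could each grab its first mutex and wait forever on the second; with the check, both threads contend on the same mutex first, so neither can hold the second while waiting for the first.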

So we're assuming that old requests should've completed their workload/IO? Is this always the case? If only the new requests are taken into consideration, that means we're okay with old requests out-of-order locks?

From the client's perspective, it doesn't make any difference which request got re-scheduled internally by the MDS as long as both requests finish eventually.

For one, it seems fair to let the older request proceed and the newer request retry.

Secondly, we need to better understand what it means to "retry". It's yet unclear whether all our requests are immune to a sudden replay, but I assume that should be the case given the journaling, the request identifiers, and the other recovery means that Ceph has.
If we're unsure, we could additionally argue that there are fewer chances for the new request to suffer from a retry just because it has probably done less by now, but that's a very shaky argument. However, it supports the approach of picking the newer request for the retry, letting the older one proceed as we drop locks.

Finally, we should mention that retry itself is a dangerous endeavor. I don't think we are ready to introduce it, because just dropping the locks and rescheduling may not work for all requests. We need to properly reinitialize internal request state to make sure it genuinely starts anew, avoiding some unexpected states. I think that the safest way to do this would be to destroy the MDR and re-create it from MClientRequest we are holding on to, as if we just received it from the client. This is only possible for client requests, apparently; I don't think we have a good restart solution for internal requests. With this in mind, it's possible that when we detect a deadlock between a client request and an internal request, we may be forced to retry the client request even if it's older.

Another point to note is that if we go for a retry to avoid a deadlock, we need to be sure it would be a real deadlock, which requires cross-checking all other MDRs that are waiting for any of the locks we already hold: a Cartesian product of checks.
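Confirming a real deadlock amounts to finding a cycle in a wait-for graph of requests. A minimal sketch of that check, under the assumption that such a graph could be built from the MDR wait lists (the MDS has no such structure today; names are hypothetical):

```cpp
#include <cassert>
#include <map>
#include <set>
#include <vector>

// A toy wait-for graph: an edge r1 -> r2 means request r1 is waiting
// for a lock currently held by request r2. A cycle means a deadlock.
using Graph = std::map<int, std::vector<int>>;

// Depth-first search; a node found on the current recursion stack
// means we followed a back edge, i.e. a cycle.
static bool dfs(const Graph& g, int node, std::set<int>& stack,
                std::set<int>& done) {
  if (stack.count(node)) return true;
  if (done.count(node)) return false;
  stack.insert(node);
  auto it = g.find(node);
  if (it != g.end())
    for (int next : it->second)
      if (dfs(g, next, stack, done)) return true;
  stack.erase(node);
  done.insert(node);
  return false;
}

bool has_deadlock(const Graph& g) {
  std::set<int> stack, done;
  for (const auto& kv : g)
    if (dfs(g, kv.first, stack, done)) return true;
  return false;
}
```

The cost of building and checking such a graph against every waiting MDR is exactly the Cartesian-product expense mentioned above, which is why merely warning about out-of-order acquisition is the cheaper first step.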

If we decide to postpone active deadlock evasion, and instead focus on collecting info about potential deadlocks, then we can simply warn about flows that take locks out of order, and then decide what to do with those flows.

Contributor

Finally, we should mention that retry itself is a dangerous endeavor. I don't think we are ready to introduce it, because just dropping the locks and rescheduling may not work for all requests. We need to properly reinitialize internal request state to make sure it genuinely starts anew, avoiding some unexpected states. I think that the safest way to do this would be to destroy the MDR and re-create it from MClientRequest we are holding on to, as if we just received it from the client. This is only possible for client requests, apparently; I don't think we have a good restart solution for internal requests. With this in mind, it's possible that when we detect a deadlock between a client request and an internal request, we may be forced to retry the client request even if it's older.

Won't adding more complexity to the flow incur additional turbulence? If a newer request gets retried internally and the client retries too, what happens? Do we stop the client from retrying momentarily?

Contributor

Won't adding more complexity to the flow incur additional turbulence? If a newer request gets retried internally and the client retries too, what happens? Do we stop the client from retrying momentarily?

I don't think so. The Server shouldn't care whether the request is retried by a client or internally. If we're prepared to serve the client at any time, we're prepared for everything. Idempotency is our foundation here; we need to inject the internally retried request early enough to go through all the required checks, e.g. the transaction id check.

lxbsz added 2 commits April 23, 2024 08:24
Some requests may call acquire_locks() more than once, which can break
the lock order:

    #define CEPH_LOCK_DN          (1 << 0)
    #define CEPH_LOCK_DVERSION    (1 << 1)
    #define CEPH_LOCK_IQUIESCE    (1 << 4)
    #define CEPH_LOCK_ISNAP       (1 << 5)
    #define CEPH_LOCK_IPOLICY     (1 << 6)
    #define CEPH_LOCK_IFILE       (1 << 7)
    #define CEPH_LOCK_INEST       (1 << 8)
    #define CEPH_LOCK_IDFT        (1 << 9)
    #define CEPH_LOCK_IAUTH       (1 << 10)
    #define CEPH_LOCK_ILINK       (1 << 11)
    #define CEPH_LOCK_IXATTR      (1 << 12)
    #define CEPH_LOCK_IFLOCK      (1 << 13)
    #define CEPH_LOCK_IVERSION    (1 << 14)

As a result, multiple requests could deadlock.

This commit tries to detect possible deadlocks and then drops the
locks already acquired by the current request.

Fixes: https://tracker.ceph.com/issues/62123
Signed-off-by: Xiubo Li <xiubli@redhat.com>
This option is disabled by default; it will be enabled for the
cephfs qa tests.

Fixes: https://tracker.ceph.com/issues/62123
Signed-off-by: Xiubo Li <xiubli@redhat.com>
@github-actions

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@github-actions github-actions bot added the stale label Jun 28, 2024
@github-actions
Copy link

This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!

@github-actions github-actions bot closed this Jul 28, 2024
@lxbsz lxbsz reopened this Jul 29, 2024
@github-actions github-actions bot added tests and removed stale labels Jul 29, 2024
@github-actions github-actions bot added the stale label Sep 27, 2024
@github-actions github-actions bot closed this Oct 27, 2024