Conversation
Shouldn't the order be aligned with ?
src/mds/Server.cc
Outdated
if ((mask & CEPH_CAP_LINK_SHARED) && !(issued & CEPH_CAP_LINK_EXCL))
  lov.add_rdlock(&ref->linklock);
if ((mask & CEPH_CAP_XATTR_SHARED) && !(issued & CEPH_CAP_XATTR_EXCL))
  lov.add_rdlock(&ref->xattrlock);
We don't need to actually sort these when adding to the LockOpVec:
Lines 116 to 162 in 97cb38d
because the locks are sorted before acquisition.
I think what needs to happen for
https://tracker.ceph.com/issues/62123
is to modify Locker::acquire_locks to look at the locks held by the mutation to ensure that it does not already hold locks (not necessarily in the LockOpVec) that would reverse the locking order if the locks in the LockOpVec are acquired.
That is probably an expensive consistency check so it would need to be a debug config that turns it on.
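A minimal sketch of such a debug-only consistency check (names and shapes are illustrative, not the actual Locker API): compare the lock types already held by the mutation against the new batch and flag any reversal.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Hypothetical sketch, not the MDS API: lock "types" stand in for the
// CEPH_LOCK_* values, where a smaller value must be acquired first.
using LockType = int;

// True when acquiring `wanted` would reverse the locking order against
// locks the mutation already holds (ignoring locks it would re-acquire).
bool would_reverse_order(const std::set<LockType>& held,
                         const std::vector<LockType>& wanted) {
  for (LockType w : wanted) {
    if (held.count(w))
      continue;  // already held, no new acquisition needed
    for (LockType h : held)
      if (h > w)
        return true;  // a held lock sorts after a new lock
  }
  return false;
}
```

As noted above, scanning every held lock for every wanted lock is O(held × wanted), which is why such a check would want to live behind a debug config.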
Yeah, you are correct. I missed this part. I will have a look at this again. Thanks @batrick
That is probably an expensive consistency check so it would need to be a debug config that turns it on.
Not just that, the check should be configurable with a flag on the acquire locks call to be able to suppress it for some edge cases where we know what we are doing.
Actually the issue here is that when a single request triggers two 'acquire_locks()' calls it can break the lock order and then cause lock-order deadlocks between multiple requests.
Anyway we should start the request over when the second 'acquire_locks()' fails, as @batrick did in #52522, though I think @batrick just fixed one part of it, not all.
@batrick @leonid-s-usov I have pushed a new commit to fix the getattr and setattr only for now, please check and if this is the correct direction I will continue to fix all the other cases.
Thanks.
This sounds like a good idea, but this will change the
@batrick Please ignore this.
src/mds/Server.cc
Outdated
if (!mds->locker->acquire_locks(mdr, lov))
if (!mds->locker->acquire_locks(mdr, lov)) {
  restart_over_the_request(mdr);
This doesn't look right. Anytime the Locker is nudging a lock state (perhaps due to conflicting caps) then we will drop all locks. That will prevent the request from making progress.
Then I need to do this in another, maybe more complex, way. I am wondering whether we should drop the newer request's locks instead of always dropping the current one.
#define CEPH_LOCK_DN (1 << 0)
#define CEPH_LOCK_DVERSION (1 << 1)
#define CEPH_LOCK_IQUIESCE (1 << 4) /* mds internal */
#define CEPH_LOCK_ISNAP (1 << 5) /* snapshot lock. MDS internal */
#define CEPH_LOCK_IPOLICY (1 << 6) /* policy lock on dirs. MDS internal */
#define CEPH_LOCK_IFILE (1 << 7)
#define CEPH_LOCK_INEST (1 << 8) /* mds internal */
#define CEPH_LOCK_IDFT (1 << 9) /* dir frag tree */
#define CEPH_LOCK_IAUTH (1 << 10)
#define CEPH_LOCK_ILINK (1 << 11)
#define CEPH_LOCK_IXATTR (1 << 12)
#define CEPH_LOCK_IFLOCK (1 << 13) /* advisory file locks */
#define CEPH_LOCK_IVERSION (1 << 14) /* mds internal */
For example:
| Step | Older req1 | Newer req2 |
| --- | --- | --- |
| 1 | comes | |
| 2 | acquire_locks(ISNAP, IFILE and IAUTH), then for some reason goes to sleep | comes |
| 3 | | acquire_locks(INEST and ILINK), then goes to sleep too for some reasons |
| 4 | wakes up and the second acquire_locks(INEST and IXATTR) fails since req2 has already held `INEST` | |
| 5 | | wakes up and then the second acquire_locks(IAUTH) will fail since req1 has held `IAUTH` |
Then req1 and req2 will be deadlocked.
I am wondering whether we should make the older request req1 drop the locks instead of req2? In the above example req1 will detect the possible deadlock in acquire_locks() in Step 4 first. So if we drop req1's locks, then req2 could finish earlier than req1.
I find the current code does this without considering the request's initiating time; will that be a problem? @batrick
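The wait cycle in the example above can be checked mechanically. A small self-contained sketch (the enum values are copied from ceph_fs.h; blocked_on is a hypothetical helper, not MDS code): each request's second acquire is blocked on a lock the other request holds, so neither can progress.

```cpp
#include <cassert>
#include <set>

// Lock type bits as in ceph_fs.h (subset used by the example).
enum : int {
  CEPH_LOCK_ISNAP  = 1 << 5,
  CEPH_LOCK_IFILE  = 1 << 7,
  CEPH_LOCK_INEST  = 1 << 8,
  CEPH_LOCK_IAUTH  = 1 << 10,
  CEPH_LOCK_ILINK  = 1 << 11,
  CEPH_LOCK_IXATTR = 1 << 12,
};

// True when a request wanting `wanted` cannot proceed because another
// request already holds one of those locks.
bool blocked_on(const std::set<int>& wanted, const std::set<int>& other_held) {
  for (int w : wanted)
    if (other_held.count(w))
      return true;
  return false;
}

// Step 4: req1 holds {ISNAP, IFILE, IAUTH} and wants {INEST, IXATTR};
// Step 5: req2 holds {INEST, ILINK} and wants {IAUTH}.
bool example_deadlocks() {
  std::set<int> req1_held = {CEPH_LOCK_ISNAP, CEPH_LOCK_IFILE, CEPH_LOCK_IAUTH};
  std::set<int> req2_held = {CEPH_LOCK_INEST, CEPH_LOCK_ILINK};
  return blocked_on({CEPH_LOCK_INEST, CEPH_LOCK_IXATTR}, req2_held) &&
         blocked_on({CEPH_LOCK_IAUTH}, req1_held);  // a wait cycle: deadlock
}
```

Both requests block on each other, so example_deadlocks() reports the cycle.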
What you describe is a deadlock, though not because of a lack of interfaces, but rather due to the bad locking order by req1 and/or req2.
We should be careful updating lock orders in bulk, since the current stable state may be kicked out of balance, even if there are a few places where it would deadlock today.
In general, requests should try and take all the locks they need in a single call to acquire_locks, unless they have multiple phases. In the latter case, the request should be responsible for releasing locks from phase 1 before re-acquiring them in phase 2 along with other locks. The only case when a request may keep locks from phase 1 to phase 2 is if they are ordered before the new locks from phase 2.
req1 violates this by taking INEST while holding IFILE. Whether req1 can afford dropping and then re-acquiring INEST along with IFILE and IXATTR is not known, but it's req1's responsibility to never try to lock backwards in that defined lock order.
Note that it is OK for req1 to continue holding ISNAP and IFILE, since no new locks are ordered before those two.
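The keep-or-drop rule can be sketched as follows (an illustrative helper, not the MDS API): before a phase-2 acquire, any held lock that does not sort strictly before every new lock must be dropped and re-acquired together with the new ones.

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <vector>

// Given the CEPH_LOCK_*-style types held from phase 1 and the types wanted
// in phase 2, return the held locks that must be dropped first: everything
// that would sort at or after the smallest new lock.
std::set<int> locks_to_drop(const std::set<int>& held,
                            const std::vector<int>& phase2) {
  int min_new = *std::min_element(phase2.begin(), phase2.end());
  std::set<int> drop;
  for (int h : held)
    if (h >= min_new)  // not strictly before every new lock
      drop.insert(h);
  return drop;
}
```

With held = {ISNAP, IFILE} (1<<5, 1<<7) and phase2 = {INEST, IXATTR} (1<<8, 1<<12) nothing needs dropping, matching the note about req1 being allowed to keep ISNAP and IFILE.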
What you describe is a deadlock, though not because of a lack of interfaces, but rather due to the bad locking order by req1 and/or req2.
We should be careful updating lock orders in bulk, since the current stable state may be kicked out of balance, even if there are a few places where it would deadlock today.
in general, requests should try and take all the locks they need in a single call to
acquire_locks, unless they have multiple phases. In the latter case, the request should be responsible for releasing locks from phase 1 before re-acquiring them in phase 2 along with other locks. The only case when a request may keep locks from phase 1 to phase 2 is if they are ordered before the new locks from phase 2.
req1 violates this by taking INEST while holding IFILE. Whether req1 can afford dropping and then re-acquiring INEST along with IFILE and IXATTR is not known, but it's req1's responsibility to never try to lock backwards in that defined lock order. Note that it is OK for req1 to continue holding ISNAP and IFILE, since no new locks are ordered before those two
Unfortunately there are many cases that call acquire_locks() twice or more. For example, the getattr request does it first when traversing the path in traverse_path() and then again in the _getattr() func. And this really will mess up the lock order.
It's not that easy to merge them into a single acquire_locks() call.
Yes, understood, which is why I mentioned those multi-phase requests, where each phase takes additional locks. Those should be designed to only acquire locks in ascending order
Following @batrick's suggestion, we should add a trap for requests that attempt to acquire locks out of order.
Yeah, counting this somewhere sounds reasonable. Locally I have a patch to detect this, but it will simply drop the locks for the current request. What I think the most ideal approach is to drop the locks for the older request when the deadlock is detected. I will push the patch first but it hasn't been well tested yet.
leonid-s-usov
left a comment
Yes, that check looks good.
My concern though is the drop_locks. I'm worried that this can lead to some unwanted side effects for requests that don't expect locks that were acquired to be dropped. Those requests may have some non-idempotent code re-executed when replayed. Also, requests may have some custom cached parameters which need to be invalidated before replaying.
I believe that our first step should be to detect as many flows that violate the strict locking order as possible, and then decide on the actions we should take.
Please note that your current check may give (potentially, many) false positives even with any_waiters in place.
src/mds/Locker.cc
Outdated
}

if (order_mess && any_waiter) {
  dout(10) << " detected possible dead locks; dropping all locks" << dendl;
I believe this should be level 3 or even less; dropping locks is a desperate measure on our part.
src/mds/Locker.cc
Outdated
dout(15) << " lock " << *lock << " any_waiter " << any_waiter << " order_mess " << order_mess << dendl;
}

if (order_mess && any_waiter) {
We should add another high-level warning (i.e. dout(x) with 1 < x < 5) for order_mess alone, maybe even in health, to make the corresponding fs runs fail and get our attention.
src/mds/Locker.cc
Outdated
if (order_mess && any_waiter) {
  dout(10) << " detected possible dead locks; dropping all locks" << dendl;
  drop_locks(mdr.get(), NULL);
  mdr->drop_local_auth_pins();
If we choose to drop locks on a request then we must terminate this acquire_locks call with a false and put the request on the replay queue.
For this the rdlock_start(), etc. have already done it. IMO we still need to wait for the corresponding locks to be ready in rdlock_start(), etc., else this request could be fired frequently in a loop by doing this check here.
it shouldn't go into an infinite loop because we assume it has progressed partially (i.e. holding some locks) and is now about to deadlock (or we think it might). When we drop the locks it has now, it will most probably get stuck on those once it re-executes. We can't afford to progress with the current call to acquire_locks, because the locks we've dropped may not be present in the lov vector.
We can't add the locks back to lov and re-lock everything right here, because it will break the request's assumptions about the atomicity of its operation.
So if we drop the locks we need to abort and re-try the request as a whole
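The abort-and-retry control flow described above could look roughly like this (a toy model; Request, acquire, and the requeued flag are invented for illustration, not the MDRequest/Locker API):

```cpp
#include <cassert>
#include <set>
#include <vector>

// Toy model of a request's lock state; not the MDRequest API.
struct Request {
  std::set<int> held;
  bool requeued = false;
};

// On a detected order reversal: drop everything the request holds and
// return false so the caller re-runs the whole request from scratch,
// rather than continuing this call with locks missing from `lov`.
bool acquire(Request& r, const std::vector<int>& lov) {
  for (int t : lov) {
    for (int h : r.held) {
      if (h > t) {          // held lock sorts after a new lock
        r.held.clear();     // drop_locks(): desperate measure
        r.requeued = true;  // replay the request as a whole
        return false;
      }
    }
    r.held.insert(t);
  }
  return true;
}
```

On the retry, the fresh request takes the locks in ascending order in one pass and succeeds.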
For this the rdlock_start(), etc has already done it.

Yes, you are right, the request is already on a waiting list. But it's on a waiting list for the wrong lock. We need to re-run the request from scratch now so that it will restart acquiring locks in the correct order, waiting on the appropriate locks this time.
There must be a safe way to drop locks and restart requests, but it may require significantly more effort to do so reliably. I think we should start slow and begin by identifying requests that take incremental locks in the wrong order, and then decide how to deal with them.
The only safe way to recover from a deadlock IMO is
The key here is to re-initialize the new MDR from scratch, which is possible given the original
batrick
left a comment
Otherwise, this looks clean and good.
if (lock->is_waiter_for(SimpleLock::WAIT_ALL)) {
  any_waiter = true;
}
if (lock->get_type() > lock_type) {
You need to check the lock parent is the same as well.
@batrick Will this make sense?
#define CEPH_LOCK_DN (1 << 0)
#define CEPH_LOCK_DVERSION (1 << 1)
#define CEPH_LOCK_IQUIESCE (1 << 4) /* mds internal */
#define CEPH_LOCK_ISNAP (1 << 5) /* snapshot lock. MDS internal */
#define CEPH_LOCK_IPOLICY (1 << 6) /* policy lock on dirs. MDS internal */
#define CEPH_LOCK_IFILE (1 << 7)
#define CEPH_LOCK_INEST (1 << 8) /* mds internal */
#define CEPH_LOCK_IDFT (1 << 9) /* dir frag tree */
#define CEPH_LOCK_IAUTH (1 << 10)
#define CEPH_LOCK_ILINK (1 << 11)
#define CEPH_LOCK_IXATTR (1 << 12)
#define CEPH_LOCK_IFLOCK (1 << 13) /* advisory file locks */
#define CEPH_LOCK_IVERSION (1 << 14) /* mds internal */
The _DN lock's parent will always be different from the others, right?
For example, for the deadlock issue in [ceph-users] MDS Behind on Trimming..., from the current block ops output it's a deadlock between a CInode's _IFILE or _INEST and a _DN.
No, I'm referring to SimpleLock::parent which is the CInode/CDentry the lock is protecting. You need to ensure that the parent is the same when comparing the locks.
Let me explain this in more detail.
For example, first a request acquires:
in->filelock
in->authlock
Then later in another acquire_locks() when trying to acquire the dn->lock, if we check the parents of the locks then the order will be:
in->filelock
in->authlock
dn->lock
But actually it should be
#define CEPH_LOCK_DN (1 << 0)
#define CEPH_LOCK_IFILE (1 << 7)
#define CEPH_LOCK_IAUTH (1 << 10)
So IMO we should ignore the parent check anyway.
Did I miss anything here? Hope we are on the same page :-)
The locking order is not simply the lock type. It is documented here:
Lines 8264 to 8281 in 367a849
but I'm unsure if that captures all of the implicit (by code) rules.
src/mds/Locker.cc
Outdated
}
dout(15) << " lock " << *lock << " any_waiter " << any_waiter << " order_mess " << order_mess << dendl;

if (order_mess && any_waiter) {
This needs to be turned on via a config. I think it should be off by default.
QA should turn it on and this method should message the clog with an ERR that we are resolving a possible deadlock so it fails tests.
This sounds reasonable and I will add it.
Yeah, what I am thinking is to always make the newest request drop the locks instead. For the first step let's just add the code to detect the deadlocks and then see how to improve it later.
Yeah, this is almost the same as what I mentioned in #56805 (comment). For now we need to detect where and how the deadlock could happen; for some cases it's really strange and hard to figure out, such as the one in
src/common/options/mds.yaml.in
Outdated
- name: mds_detect_deadlocks
  type: bool
  level: advanced
  desc: flag to detect deadlocks between requests.
src/common/options/mds.yaml.in
Outdated
type: bool
level: advanced
desc: flag to detect deadlocks between requests.
fmt_desc: To enable it to detect deadlocks between requests.
src/mds/Locker.cc
Outdated
any_waiter = true;
}
if (lock->get_type() > lock_type) {
  dout(1) << " detected mess lock order, locks: " << mdr->locks << dendl;
Did you mean messy/messed? Also I think the space before "detected" is not needed unless you'd want to have __func__ there.
I'd prefer to keep this space, because this will be in the loop and many similar items could be printed.
if (lock->get_type() > lock_type) {
  dout(1) << " detected mess lock order, locks: " << mdr->locks << dendl;
  order_mess = true;
}
(Just trying to understand locks better.) So this is the list I see in ceph_fs.h:
#define CEPH_LOCK_DN (1 << 0)
#define CEPH_LOCK_DVERSION (1 << 1)
#define CEPH_LOCK_IQUIESCE (1 << 4) /* mds internal */
#define CEPH_LOCK_ISNAP (1 << 5) /* snapshot lock. MDS internal */
#define CEPH_LOCK_IPOLICY (1 << 6) /* policy lock on dirs. MDS internal */
#define CEPH_LOCK_IFILE (1 << 7)
#define CEPH_LOCK_INEST (1 << 8) /* mds internal */
#define CEPH_LOCK_IDFT (1 << 9) /* dir frag tree */
#define CEPH_LOCK_IAUTH (1 << 10)
#define CEPH_LOCK_ILINK (1 << 11)
#define CEPH_LOCK_IXATTR (1 << 12)
#define CEPH_LOCK_IFLOCK (1 << 13) /* advisory file locks */
#define CEPH_LOCK_IVERSION (1 << 14) /* mds internal */
When we say lock->get_type() > lock_type is a mess, does it mean that the locks less than lock_type would already have been acquired, in order?
@dparmar18 when multiple locks need to be taken, they should be taken in a pre-defined order. If that order will be observed by all concurrent lockers, no deadlocks will occur.
CephFS approach to this is to sort locks in the acquire_locks by the lock_type:
bool Locker::acquire_locks(const MDRequestRef& mdr,
MutationImpl::LockOpVec& lov,
...
{
...
lov.sort_and_merge();
...
This would be sufficient if all requests always specified the full list of locks they needed upfront, but that's not the case, and in many instances, it would be counter-productive.
Some flows start by taking a subset of locks and later add more locks as they progress. Such flows must be careful, because if they began by only taking, say, the xattr lock, and then they want to take the file lock, they would violate the said lock order, and acquire_locks wouldn't be able to help here, as the locks are mentioned in separate calls to the method.
Such multi-phase flows should take the responsibility to only acquire the locks in the correct order, and if they need to violate it by taking new locks, then they have to drop any locks that they already hold with the lock_type greater than the lowest lock_type of the new locks they need. In my example, the request would have to drop the xattr lock and then acquire both the file and the xattr locks in a single call to maintain the order and prevent potential deadlocks.
It's not always possible to do, because dropping xattr lock would break the atomicity of the flow wrt the values protected by the xattr lock, which may or may not be acceptable. If we can't afford it, the solution for that example flow would be to take file lock along with xattr lock from the beginning, for atomicity.
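The per-call nature of the sorting can be shown concretely (a sketch, not the Locker code; sorted_call and cross_call_ordered are invented names):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Each acquire_locks call sorts its own LockOpVec, analogous to
// lov.sort_and_merge(), but nothing orders one call against the next.
std::vector<int> sorted_call(std::vector<int> lov) {
  std::sort(lov.begin(), lov.end());
  return lov;
}

// True if a second call's locks all sort after the first call's locks,
// i.e. the multi-phase flow never locks "backwards".
bool cross_call_ordered(const std::vector<int>& call1,
                        const std::vector<int>& call2) {
  if (call1.empty() || call2.empty())
    return true;
  return *std::max_element(call1.begin(), call1.end()) <=
         *std::min_element(call2.begin(), call2.end());
}
```

Taking the xattr lock (1 << 12) first and the file lock (1 << 7) in a later call is flagged as backwards, while the reverse split is fine, even though each call is individually sorted.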
Hey @leonid-s-usov thanks for taking time on this, couple of questions here:
This would be sufficient if all requests always specified the full list of locks they needed upfront, but that's not the case, and in many instances, it would be counter-productive.
Why/how could it be counter-productive? An already known lock set would mean better clarity?
Some flows start by taking a subset of locks and later add more locks as they progress. Such flows must be careful, because if they began by only taking, say,
xattr lock, and then they want to take the file lock, they would violate the said lock order, and acquire_locks wouldn't be able to help here, as the locks are mentioned in separate calls to the method.
Such multi-phase flows should take the responsibility to only acquire the locks in the correct order, and if they need to violate it by taking new locks, then they have to drop any locks that they already hold with the
lock_type greater than the lowest lock_type of the new locks they need.
- As you mentioned earlier, some code paths start with a subset and then add locks as per their need; till now (since this patch's need came up recently) I assume things worked fine before? And maybe the newly added quiesce protocol broke things? Or any new stuff added?
- Shouldn't LockOpVec be used to make sure there are no such deadlocks? Rationale for this statement: it is a sequence of the lock ops to be performed to ensure consistency, right? So if we're now double-checking this then it means we aren't confident of the sequence in LockOpVec? i.e. we aren't sure if the sequence in the LockOpVec is in the pre-defined order? Shouldn't the LockOpVec have a valid sequence then? Or am I misunderstanding the use case of LockOpVec and it is not that smart?
- If we're dropping locks just like this, won't it be disastrous? There can be some I/O, we preempt it, the other thread/process that should've waited now holds the lock, and the earlier process has some unfinished work left; doesn't this lead to data corruption/inconsistencies/loss?
In my example, the request would have to drop the
xattr lock and then acquire both the file and the xattr locks in a single call to maintain the order and prevent potential deadlocks. It's not always possible to do, because dropping the xattr lock would break the atomicity of the flow wrt the values protected by the xattr lock, which may or may not be acceptable. If we can't afford it, the solution for that example flow would be to take the file lock along with the xattr lock from the beginning, for atomicity.
So let's say a req had the xattr lock and now wants the file lock; the MDS would drop the xattr lock, then acquire the file lock and then the xattr lock? And in the meantime what happens if something else took up the xattr lock?
Hey @leonid-s-usov thanks for taking time on this, couple of questions here:
This would be sufficient if all requests always specified the full list of locks they needed upfront, but that's not the case, and in many instances, it would be counter-productive.
Why/how could it be counter-productive? An already known lock set would mean better clarity?
Some flows start by taking a subset of locks and later add more locks as they progress. Such flows must be careful, because if they began by only taking, say,
xattr lock, and then they want to take the file lock, they would violate the said lock order, and acquire_locks wouldn't be able to help here, as the locks are mentioned in separate calls to the method. Such multi-phase flows should take the responsibility to only acquire the locks in the correct order, and if they need to violate it by taking new locks, then they have to drop any locks that they already hold with the
lock_type greater than the lowest lock_type of the new locks they need.
- As you mentioned earlier some code paths start with a sub-set and then add locks as per their need, till now (since this patch's need came up recently) I assume things worked fine before? and maybe the quesce protocol added broke things up? or any new stuff added?
Mostly it worked well before, but recently we have hit at least two bugs caused by the lock order issue, please see #52522 and https://tracker.ceph.com/issues/65607.
Not introduced by the quesce changes.
- Shouldn't
LockOpVec be used to make sure there are no such deadlocks? Rationale for this statement: it is a sequence of the lock ops to be performed to ensure consistency, right? So if we're now double-checking this then it means we aren't confident of the sequence in LockOpVec? i.e. we aren't sure if the sequence in the LockOpVec is in the pre-defined order? Shouldn't the LockOpVec have a valid sequence then? Or am I misunderstanding the use case of LockOpVec and it is not that smart?
The LockOpVec can only guarantee the lock order within each acquire_locks() call; it cannot guarantee the order across multiple acquire_locks() calls.
- If we're dropping locks just like this, won't it be disastrous? There can be some I/O, we preempt it, the other thread/process that should've waited is now having the lock and the process earlier has some unfinished work left, doesn't this lead to data corruption/inconsistencies/loss?
There are already several places that simply drop the locks, and we haven't seen any reports of data corruption/inconsistency/loss caused by it yet.
We have noticed the issue you mentioned, and @leonid-s-usov and I agreed we should always drop the locks for the youngest request instead.
In my example, the request would have to drop the
xattr lock and then acquire both the file and the xattr locks in a single call to maintain the order and prevent potential deadlocks. It's not always possible to do, because dropping the xattr lock would break the atomicity of the flow wrt the values protected by the xattr lock, which may or may not be acceptable. If we can't afford it, the solution for that example flow would be to take the file lock along with the xattr lock from the beginning, for atomicity.

So let's say a req had the xattr lock and now wants the file lock; the MDS would drop the xattr lock, then acquire the file lock and then the xattr lock? And in the meantime what happens if something else took up the xattr lock?
No, for the next retry of the request the lock order won't change.
☝🏻 What Xiubo said!
Why/how could it be counter-productive? An already known lock set would mean better clarity?
It depends on the particular request's logic. For a complicated process, it may be too expensive to take all locks at once when there are many different paths the logic could take.
till now (since this patch's need came up recently) I assume things worked fine before?
Violating the lock order doesn't guarantee a deadlock, it just makes it possible. We have very few known deadlock cases (see Xiubo's response for the recent ones), as common ones have already been addressed. Nevertheless, we should know about any requests that break the locking order before we get into more deadlocks by slightly affecting the timings of processes, or adding new requests written correctly but executing at the "wrong" time.
If we're dropping locks just like this, won't it be disastrous?
This needs to be analyzed and concluded for every given flow, there's no generic response here.
We have noticed the issue you mentioned, and @leonid-s-usov and I agreed we should always drop the locks for the youngest request instead.
So we're assuming that old requests should've completed their workload/IO? Is this always the case? If only the new requests are taken into consideration, does that mean we're okay with the old requests' out-of-order locks?
Violating the lock order doesn't guarantee a deadlock, it just makes it possible. We have very few known deadlock cases (see Xiubo's response for the recent ones), as common ones have already been addressed. Nevertheless, we should know about any requests that break the locking order before we get into more deadlocks by slightly affecting the timings of processes, or adding new requests written correctly but executing at the "wrong" time.
So a wrong timing would actually be a good timing since it will avoid the deadlocks; is this always guaranteed in distributed systems? Can we always be sure of making the timings as-intended?
So a wrong timing would actually be a good timing since it will avoid the deadlocks; is this always guaranteed in distributed systems? Can we always be sure of making the timings as-intended?
There are no timing guarantees for client-initiated processing in any system, let alone a distributed one. Even for internally scheduled processes there may be uncertain timing considerations, especially if we're running on a multi-cpu system with multiple worker threads.
While repeating usage patterns tend to produce similar timing of events, if something can occur out of order, it will do so sooner or later, subject to countless external conditions.
By taking locks in the correct order every time, we eliminate the whole family of potential deadlocks, reducing our reliance on "proper timing".
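The same principle in miniature with plain C++ threads (a toy example, unrelated to the MDS code): both workers take the two mutexes in one fixed global order, which rules out the classic AB/BA deadlock by construction.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

std::mutex lock_a;  // globally ordered first
std::mutex lock_b;  // globally ordered second

int counter = 0;

// Every path takes lock_a before lock_b; with a single global order there
// can be no cycle in the waits-for graph, hence no deadlock.
void worker() {
  for (int i = 0; i < 1000; ++i) {
    std::lock_guard<std::mutex> ga(lock_a);
    std::lock_guard<std::mutex> gb(lock_b);
    ++counter;
  }
}

int run_two_workers() {
  std::thread t1(worker), t2(worker);
  t1.join();
  t2.join();
  return counter;
}
```

If one worker instead took lock_b before lock_a, the program could (depending on timing) hang forever, which mirrors the point above about timing-dependent deadlocks.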
So we're assuming that old requests should've completed their workload/IO? Is this always the case? If only the new requests are taken into consideration, that means we're okay with old requests out-of-order locks?
From the client's perspective, it doesn't make any difference which request got re-scheduled internally by the MDS as long as both requests finish eventually.
For one, it seems fair to let the older request proceed and the newer request retry.
Secondly, we need to better understand what it means to "retry". It's yet unclear whether all our requests are immune to a sudden replay, but I assume that should be the case given the journaling and request identifiers and other means of recoveries that Ceph is capable of.
If we're unsure, we could additionally argue that there is less chance for the new request to suffer from a retry just because it has probably done less by now, but that's a very shaky argument. However, it supports the approach of picking the newer request for the retry, letting the older proceed as we drop locks.
Finally, we should mention that retry itself is a dangerous endeavor. I don't think we are ready to introduce it, because just dropping the locks and rescheduling may not work for all requests. We need to properly reinitialize internal request state to make sure it genuinely starts anew, avoiding some unexpected states. I think that the safest way to do this would be to destroy the MDR and re-create it from MClientRequest we are holding on to, as if we just received it from the client. This is only possible for client requests, apparently; I don't think we have a good restart solution for internal requests. With this in mind, it's possible that when we detect a deadlock between a client request and an internal request, we may be forced to retry the client request even if it's older.
Another point to note is that if we go for a retry to avoid a deadlock, we need to be sure it would be a real deadlock, but it requires cross-checking all other MDRs that are waiting for any of the locks we already hold, so a cartesian product of checks.
If we decide to postpone active deadlock evasion, and instead focus on collecting info about potential deadlocks, then we can simply warn about flows that take locks out of order, and then decide what to do with those flows.
Finally, we should mention that retry itself is a dangerous endeavor. I don't think we are ready to introduce it, because just dropping the locks and rescheduling may not work for all requests. We need to properly reinitialize internal request state to make sure it genuinely starts anew, avoiding some unexpected states. I think that the safest way to do this would be to destroy the MDR and re-create it from MClientRequest we are holding on to, as if we just received it from the client. This is only possible for client requests, apparently; I don't think we have a good restart solution for internal requests. With this in mind, it's possible that when we detect a deadlock between a client request and an internal request, we may be forced to retry the client request even if it's older.
Won't adding more complexity to the flow incur additional turbulence? If a newer req gets retried internally and the client retries too, what happens? Do we stop the client from retrying momentarily?
Won't adding more complexity to the flow incur additional turbulence? If a newer req gets retried internally and the client retries too, what happens? Do we stop the client from retrying momentarily?
I don't think so. The Server shouldn't care whether the request is retried by a client or internally. If we're prepared to serve the client at any time, we're prepared for everything. Idempotency is our foundation here, we need to inject the internally-retried request early enough to go through all required checks, e.g. transaction id etc.
Some requests may call acquire_locks() more than once,
and this can break the lock order:
#define CEPH_LOCK_DN (1 << 0)
#define CEPH_LOCK_DVERSION (1 << 1)
#define CEPH_LOCK_IQUIESCE (1 << 4)
#define CEPH_LOCK_ISNAP (1 << 5)
#define CEPH_LOCK_IPOLICY (1 << 6)
#define CEPH_LOCK_IFILE (1 << 7)
#define CEPH_LOCK_INEST (1 << 8)
#define CEPH_LOCK_IDFT (1 << 9)
#define CEPH_LOCK_IAUTH (1 << 10)
#define CEPH_LOCK_ILINK (1 << 11)
#define CEPH_LOCK_IXATTR (1 << 12)
#define CEPH_LOCK_IFLOCK (1 << 13)
#define CEPH_LOCK_IVERSION (1 << 14)
Then multiple requests could deadlock.
This commit will try to detect possible deadlocks and then drop
the locks already acquired for the current request.
Fixes: https://tracker.ceph.com/issues/62123
Signed-off-by: Xiubo Li <xiubli@redhat.com>
This option will be disabled by default and enabled for the cephfs qa tests.

Fixes: https://tracker.ceph.com/issues/62123
Signed-off-by: Xiubo Li <xiubli@redhat.com>
This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.

This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!
This will just make the lock order:
1, filelock
2, nestlock
3, authlock
4, linklock
5, snaplock
6, xattrlock
7, policylock
8, versionlock
9, flocklock
10, dirfragtreelock
Fixes: https://tracker.ceph.com/issues/62123