mds: getattr just waits the xlock to be released by the previous client #56602
Conversation
also, possibly update the PR title and the commit title to: mds: always make getattr wait for xlock to be released by the previous client
This looks good to me and fixes them all. Thanks @mchangir
leonid-s-usov
left a comment
I'd like to discuss this. The code doesn't look right, it's trying to hack the locking system. We don't want to add such code outside of the Locker, and even there it's smelly.
Using the projected inode in requests from the holder of the xlock is a feature: it means that the client gets back the recent changes even if they haven't been committed yet.
Going back to the ticket, I'm not sure we're fixing the issue in the correct place. If chmod in POSIX is a synchronous operation, then we should have waited for the change to be committed before we allowed another query to propagate to the MDS.
I haven't yet found out whether fsync is required after a chmod, but in any case, I doubt that we need to fix this on the MDS side.
leonid-s-usov
left a comment
I forgot to request changes. See my previous review comment
Then where should it be?
Yeah, correct, but this will only be readable by the client holding the xlock.
Yeah, this change is obviously doing that. This should be opaque to users, and we shouldn't fail the request by telling users that the MDS is not ready yet and that they should wait a while before retrying the query.
No, this isn't a must; we shouldn't ask users to do this because POSIX doesn't mention it. Even if we do … @leonid-s-usov Any better approach to resolve this?
gregsfortytwo
left a comment
Hmm I agree with Leonid here, we should definitely not be poking into the locking system.
Looking at the tracker ticket, @lxbsz notes that we are batching the getattr from two separate clients, one of which has the xlock (so it can see the projected inode) and the other client can't. I suspect this is the root cause of the issue — a non-xlocker should not read out-of-date inodes; the locking system should block that data access until the inode is stable.
This sounds to me like a bug with the batching implementation, which I have not investigated. :( But I suspect we need to adjust the batching system so that it doesn't batch operations from clients with different caps?
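To make the failure mode concrete, here is a self-contained toy model of the behavior described in the ticket (all types are stand-ins for illustration, not the real CInode):

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

using client_t = int64_t;

// Toy stand-ins, not the real MDS classes.
struct InodeState { int mode; };

struct Inode {
  InodeState stable{0644};
  std::optional<InodeState> projected; // pending, not yet committed
  std::optional<client_t> xlocker;     // client whose update is in flight

  // The xlock holder is answered from the projected inode so it can see
  // its own uncommitted chmod; everyone else sees the stable inode.
  int mode_seen_by(client_t c) const {
    return (projected && xlocker == c) ? projected->mode : stable.mode;
  }
};

int main() {
  Inode in;
  in.projected = InodeState{0777}; // client 1's chmod, still in flight
  in.xlocker = 1;

  assert(in.mode_seen_by(1) == 0777); // xlocker: projected value
  assert(in.mode_seen_by(2) == 0644); // other client: stable value
  // If these two getattrs get batched together, a single reply built from
  // one client's view serves both — the inconsistency in issue 63906.
  return 0;
}
```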
If this isn't something caused by batching, then it is very scary. You can see down at https://github.com/ceph/ceph/pull/56602/files#diff-277dc6e796ccecb6aa14c9357f7a86898d6ddcf4113c110a4e00e0a10f6fefa7R4189 where we require the rdlock on the authlock, and I believe that should prevent exactly the read-stable-while-there-is-a-projected-inode issue described in the ticket?
Done, thanks @mchangir
It already does this: only requests having the same mask will be batched. Maybe we can just avoid batching non-xlocker requests together with the xlocked one.
@batrick there's a FIXME in handle_client_getattr stemming from #27866 — do you have any idea what's going on there? I don't really understand how what @lxbsz described in https://tracker.ceph.com/issues/63906#note-9 is possible — you can clearly see … So how can we possibly be returning data while there's a projected inode outstanding?
Oh, yes, I said "caps" but I suppose you can still have an xlock assigned to you without different issued caps. This is what I meant.
I just made a mistake and misled you in our …, so here we need to make sure …
I think this is because once the …
Once the …
Updated the patch and the new change will avoid batching the ops when any of the xlocks is held.
I think it all starts in the … The comment suggests that we should somehow expose projected changes. Now, the … and that field is what's causing the … That check doesn't comply with the comment in …
I think the comment is misleading and it should be exposing uncommitted changes between …
The client that holds the xlock will be able to read (projected) immediately in the XLOCKDONE state, while the others will have to wait. But both will return the new projected value. You are right, @lxbsz, preventing the batch should help. But I'm still not sure about the …
src/mds/Server.cc
Outdated
```cpp
if (((mask & CEPH_CAP_LINK_SHARED) && (in->linklock.is_xlocked())) ||
    ((mask & CEPH_CAP_AUTH_SHARED) && (in->authlock.is_xlocked())) ||
    ((mask & CEPH_CAP_XATTR_SHARED) && (in->xattrlock.is_xlocked())) ||
    ((mask & CEPH_CAP_FILE_SHARED) && (in->filelock.is_xlocked()))) {
```
We can reduce impact by making sure that the xlocker is never batched. This will allow batching multiple non-xlocker clients even if some of the xlocks are held, which should be safe as long as the xlocker isn't the batch head.
```diff
-if (((mask & CEPH_CAP_LINK_SHARED) && (in->linklock.is_xlocked())) ||
-    ((mask & CEPH_CAP_AUTH_SHARED) && (in->authlock.is_xlocked())) ||
-    ((mask & CEPH_CAP_XATTR_SHARED) && (in->xattrlock.is_xlocked())) ||
-    ((mask & CEPH_CAP_FILE_SHARED) && (in->filelock.is_xlocked()))) {
+if (((mask & CEPH_CAP_LINK_SHARED) && (in->linklock.is_xlocked_by_client(client))) ||
+    ((mask & CEPH_CAP_AUTH_SHARED) && (in->authlock.is_xlocked_by_client(client))) ||
+    ((mask & CEPH_CAP_XATTR_SHARED) && (in->xattrlock.is_xlocked_by_client(client))) ||
+    ((mask & CEPH_CAP_FILE_SHARED) && (in->filelock.is_xlocked_by_client(client)))) {
```
Yeah, this will be better.
Fixed it and thanks @leonid-s-usov @gregsfortytwo
Wait, please double check, my patch above is probably wrong, but the idea stays: don't batch when the xlocker client is the head.
This is fine. Requests from the xlocker client won't be batched and won't even be set as batch heads.
With this change all the non-xlocker client requests can be batched, and the batch head will acquire the rdlock and then wait.
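For illustration, a self-contained toy walk-through of the resulting behavior (stand-in types again; may_batch is a hypothetical distillation of the merged check, not the actual Server.cc code):

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

using client_t = int64_t;

// Stand-in for SimpleLock, just enough for the check.
struct Lock {
  std::optional<client_t> xlocker;
  bool is_xlocked_by_client(client_t c) const { return xlocker == c; }
};
struct Inode { Lock authlock; };

// Distilled form of the merged guard: only the xlocker itself is kept
// out of batching; everyone else may batch and simply waits.
bool may_batch(const Inode &in, client_t c) {
  return !in.authlock.is_xlocked_by_client(c);
}

int main() {
  Inode in;
  in.authlock.xlocker = 1;   // client 1's setattr holds the xlock

  assert(!may_batch(in, 1)); // client 1: handled alone, sees projected
  assert(may_batch(in, 2));  // client 2: may become the batch head...
  assert(may_batch(in, 3));  // ...and client 3 joins it; the head blocks
                             // in acquire_locks() until the xlock drops.
  return 0;
}
```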
mds: always make getattr wait for xlock to be released by the previous client

When the previous client's setattr request is still holding the xlock for the linklock/authlock/xattrlock/filelock locks, if the same client sends a getattr request it will use the projected inode to fill the reply, while for other clients the getattr requests will use the non-projected inode to fill replies. This causes an inconsistent file mode across multiple clients.

This will just skip batching the ops when any of the xlocks is held.

Fixes: https://tracker.ceph.com/issues/63906
Signed-off-by: Xiubo Li <xiubli@redhat.com>
gregsfortytwo
left a comment
We are rapidly approaching the point where I want us to move the batching into a proper interface instead of open-coding it, but this LGTM for now. :)
This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
jenkins test api
batrick
left a comment
Yet another example where synthetic cap requests would be really handy for testing.
```cpp
    ((mask & CEPH_CAP_AUTH_SHARED) && (in->authlock.is_xlocked_by_client(client))) ||
    ((mask & CEPH_CAP_XATTR_SHARED) && (in->xattrlock.is_xlocked_by_client(client))) ||
    ((mask & CEPH_CAP_FILE_SHARED) && (in->filelock.is_xlocked_by_client(client)))) {
  r = -1;
```
There is another bug I think. This won't handle the case where we're skipping e.g. a rdlock when the client is issued an exclusive cap, like Ax. Consider:
client 1: issued pAx
client 2: getattr pAsLsXsFs issued pLxXx
client 3: getattr pAsLsXsFs issued p
client 2 will skip acquiring the linklock/xattrlock because it's issued LxXx. It will block on recall of Ax from client 1.
Client 2's getattr cannot be a batch head for client 3's getattr.
This all exposes a problem I believe exists with where we're constructing the batch head: it should be constructed only after this acquire_locks fails:
(Server.cc lines 4217 to 4218 at 2c16096)
Before that point, record whether we've skipped any locks due to issued caps or (for this particular bug) because the lock is already xlocked by the client. The latter cannot be easily checked without looking at the inode's lock, because the Locker state machine hides why a rdlock succeeds (in this case, the client already has an xlock).
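A rough, self-contained sketch of the reordering being proposed (all names and types here are hypothetical stand-ins for illustration, not the real Server::handle_client_getattr):

```cpp
#include <vector>

// Hypothetical stand-ins for the MDS types and helpers.
struct MDRequest {};
struct LockAttempt {
  bool acquired_all = false;
  bool skipped_cap_or_xlock = false; // a rdlock "succeeded" only because of
                                     // an issued exclusive cap or our own xlock
};
struct Batch { std::vector<MDRequest*> queue; };

LockAttempt try_acquire_locks(MDRequest&) { return {}; } // placeholder
Batch *find_batch_head() { return nullptr; }             // placeholder
void reply_from_inode(MDRequest&) {}                     // placeholder
void drop_locks(MDRequest&) {}                           // placeholder
void make_batch_head(MDRequest&) {}                      // placeholder

void handle_getattr_sketch(MDRequest &mdr) {
  LockAttempt a = try_acquire_locks(mdr);
  if (a.acquired_all) {
    reply_from_inode(mdr); // fast path: nothing to wait for, no batching
    return;
  }
  // Do the batching bookkeeping only after the lock acquisition failed.
  if (Batch *head = find_batch_head()) {
    drop_locks(mdr);
    head->queue.push_back(&mdr); // piggyback on the existing head
  } else if (!a.skipped_cap_or_xlock) {
    make_batch_head(mdr); // safe head: not an xlocker/exclusive-caps client
  }
}
```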
> There is another bug I think. This won't handle the case where we're skipping e.g. a rdlock when the client is issued an exclusive cap, like Ax. Consider:
> client 1: issued pAx
> client 2: getattr pAsLsXsFs issued pLxXx
> client 3: getattr pAsLsXsFs issued p
@batrick Correct me if I am wrong here.
From mds/lock.c I can see that only the loner client could get the x caps. Could you point out in which case we could see the x in different locks being issued to different clients at the same time?
Different locks could have different loners at the same time?
> > There is another bug I think. This won't handle the case where we're skipping e.g. a rdlock when the client is issued an exclusive cap, like Ax. Consider:
> > client 1: issued pAx
> > client 2: getattr pAsLsXsFs issued pLxXx
> > client 3: getattr pAsLsXsFs issued p
>
> @batrick Correct me if I am wrong here.
> From mds/lock.c I can see that only the loner client could get the x caps. Could you point out in which case we could see the x in different locks being issued to different clients at the same time?
You might be right; I'm not sure. In principle I don't see why it wouldn't be allowed but the state diagram suggests it is not.
> Different locks could have different loners at the same time?
It doesn't look like it but it's worth checking.
I think the batch leader construction should still be moved, however. And that "detail" of the loner client shouldn't be relied on for the checks in any case.
> You might be right; I'm not sure. In principle I don't see why it wouldn't be allowed but the state diagram suggests it is not.
>
> It doesn't look like it but it's worth checking.
I can't remember ever seeing this case; just now I checked some debug logs, such as the ones from:
ceph-post-file: fb9a96f9-5f6d-46b4-b1fa-8580928b2241
I didn't find any case that does this. And also, going through the MDS code, the EXCL lock state can only be set once a loner has been successfully set, and all the locks in a CInode will be set to the same single loner.
Then perhaps the better check is: "is this client the loner for the CInode? In that case, do not make it a batch head".
Let's move the batch code below this call to acquire_locks on failure:
(Server.cc lines 4217 to 4218 at 2c16096)
If we fail to acquire the locks, then make it the batch head if one does not exist. If a batch head does exist already, then drop locks and add it to the batch queue.
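A minimal sketch of that loner check (a toy stand-in modeled on CInode, which tracks a single loner client via get_loner(); the guard itself is the proposal, not merged code):

```cpp
#include <cassert>
#include <cstdint>

using client_t = int64_t;

// Toy stand-in; modeled on CInode, which tracks a single loner client.
struct Inode {
  client_t loner = -1; // -1: no loner set
  client_t get_loner() const { return loner; }
};

// Proposed guard: the loner is the only client that can be issued the
// exclusive caps (and thus see projected state), so never let it become
// a batch head for other clients' getattrs.
bool may_head_batch(const Inode &in, client_t client) {
  return in.get_loner() != client;
}

int main() {
  Inode in;
  in.loner = 1;
  assert(!may_head_batch(in, 1)); // loner: handled alone
  assert(may_head_batch(in, 2));  // others: may head/join a batch
  return 0;
}
```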
> Let's move the batch code below this call to acquire_locks on failure:
> (Server.cc lines 4217 to 4218 at 2c16096)
> If we fail to acquire the locks, then make it the batch head if one does not exist. If a batch head does exist already, then drop locks and add it to the batch queue.
Sure.
I checked the code carefully again; if my understanding is correct, this won't work as expected?
For example, the first lookup request just wants to acquire the rdlock for the linklock and it succeeds; then later a second lookup request comes and also just wants the rdlock for the linklock. If both of these requests succeed there won't be any chance to batch them. Actually we should batch them, right?
There are no further waits after that acquire_locks so I don't think so?
Yeah, I think you're right here. Let me check it more.
@batrick If we move the batch code after acquire_locks, we also need to adjust acquire_locks and the other callers to make sure they won't add the current request to any waiter (which would retry the request later) and then try to batch this request.
Hm, that's right. Let's leave it this way then.
This PR is under test in https://tracker.ceph.com/issues/66850.
This PR is under test in https://tracker.ceph.com/issues/67089.
I'm seeing some new failures in the branch which this PR is a part of. Trying to isolate the problematic change. Will update when done.
This PR is under test in https://tracker.ceph.com/issues/67318.
* refs/pull/56602/head:
  mds: always make getattr wait for xlock to be released by the previous client

Reviewed-by: Leonid Usov <leonid.usov@ibm.com>
Reviewed-by: Greg Farnum <gfarnum@redhat.com>

* refs/pull/56602/head:
  mds: always make getattr wait for xlock to be released by the previous client

Reviewed-by: Greg Farnum <gfarnum@redhat.com>
Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
Reviewed-by: Leonid Usov <leonid.usov@ibm.com>
When the previous client's setattr request is still holding the xlock for the linklock/authlock/xattrlock/filelock locks, a getattr request from the same client will use the projected inode to fill the reply, while the getattr requests from other clients will use the non-projected inode to fill their replies. This causes an inconsistent file mode across multiple clients.
This will just let the getattr wait until the previous client releases the xlock.
Fixes: https://tracker.ceph.com/issues/63906