mds: ensure next replay is queued on req drop (#47121)
Conversation
Still testing this locally (my broken RHEL dev machine is killing me). Also thinking about a way to test this.
Force-pushed from 49b0911 to 314ba0f
Just leaving a note: I've been unable to reproduce the original problem, so there's no reason to believe this will fix it. I'll leave this open as a draft for now until I can think of a scenario causing this.
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved |
@batrick I've found two places that can cause the MDS to get stuck in the clientreplay state:

```cpp
void Server::dispatch_client_request(MDRequestRef& mdr)
{
  if (mdr->killed) {
    dout(10) << "request " << *mdr << " was killed" << dendl;
    return;
  ...
```

```cpp
void Server::handle_client_request(const MClientRequest::const_ref &req)
{
  ...
  MDRequestRef mdr = mdcache->request_start(req);
  if (!mdr.get())
    return;
  ...
```

Both of these early returns leave the MDS stuck in the clientreplay state, unable to continue.
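To make the failure mode concrete, here is a minimal, self-contained toy model (not Ceph code; all names are invented for illustration) of why an early return that skips queueing the next replay op stalls the whole clientreplay queue:

```cpp
#include <cassert>
#include <deque>
#include <string>
#include <vector>

// Toy model: each dispatched replay op is responsible for queueing the
// next one. If a dispatch path drops the request (killed request,
// failed request_start) without doing so, the remaining replay ops are
// never dispatched and the MDS appears stuck in clientreplay.
struct ReplayQueue {
  std::deque<std::string> pending;  // replay ops waiting to run
  std::vector<std::string> done;    // ops that were dispatched

  void dispatch(bool first_op_killed, bool queue_next_on_drop) {
    if (pending.empty())
      return;
    std::string op = pending.front();
    pending.pop_front();
    if (first_op_killed) {
      // Early-return path: without queue_next_on_drop, nobody ever
      // dispatches the next replay op.
      if (queue_next_on_drop)
        dispatch(false, queue_next_on_drop);
      return;
    }
    done.push_back(op);
    // Normal path: queue the next replay op after completing this one.
    dispatch(false, queue_next_on_drop);
  }
};
```

In the model, dropping the first op without queueing the next leaves every later op stranded; queueing the next op on drop drains the queue as expected.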
Ah, great find! I'm not sure I see the significance of calling ... With that said, this sounds like a reasonable cause. Would you agree the current methodology of this PR should address this case? I will write a test.
The duplicate message case is interesting but easy to simulate. I think the "natural" way to do this is to have a client disconnect from the MDS and then reconnect. An easier hack would be a client config that sends duplicate replay requests.
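A hedged sketch of the duplicate-request case (again a toy model, not Ceph code; the names mirror the quoted functions but the types are invented): when a client resends a replay request with a reqid already in active_requests, request_start returns null, and the drop path must still queue the next replay op or the queue stalls on the duplicate.

```cpp
#include <cassert>
#include <set>

// Toy model: request_start() refuses a request whose reqid is already
// in active_requests, mirroring a client resending the same replay
// request (same reqid) twice.
struct Server {
  std::set<int> active_requests;  // reqids currently being processed
  int next_replay_queued = 0;     // how many times we queued the next op

  bool request_start(int reqid) {
    if (active_requests.count(reqid))
      return false;  // duplicate: caller sees a null MDRequestRef
    active_requests.insert(reqid);
    return true;
  }

  void handle_replay(int reqid) {
    if (!request_start(reqid)) {
      // Fix: the drop path still queues the next replay op.
      ++next_replay_queued;
      return;
    }
    // ... process the request, then queue the next op on completion ...
    ++next_replay_queued;
  }
};
```

Both the normal path and the duplicate-drop path advance the replay queue exactly once, so a resent request cannot wedge clientreplay.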
:)
I'm afraid it may not fully resolve our issue. We can only clean it up after a request has already started from ...
This is a mistake 🙈, indeed.
I'm not sure; it's just a risk, I think, because the client can only send ... BTW, if we ...
@batrick Could you please write a test for this PR? :)
The code in this PR will make sure a killed request queues the next replay op via ... So I think a test may be possible, but it's incredibly tricky: I believe it would require a replayed op to wait on a journal flush and then a session close to come in from the client.
I think this one is also tricky to test.
Yes, can't hurt to add that. Thanks. I plan to clean this PR up and forgo a test case for this. I think we've wrapped our heads around the cause. Thanks a lot for your input!
On this today...
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved |
Not all client replay requests are queued at once since [1]. We require the next request to be queued when completed (unsafely) or during cleanup. Not all code paths seem to handle this [2], so move it to a generic location, MDCache::request_cleanup. Even so, this doesn't handle all errors (so we must still be careful), as sometimes we must queue the next replay request before an MDRequest is constructed [3] during some error conditions.

Additionally, preserve the behavior of Server::journal_and_reply queueing the next replay op. Otherwise, we must wait for the request to be durable before moving on to the next one, unnecessarily.

For reproducing, two specific cases are highlighted (thanks to @Mer1997 on GitHub for locating these):

- The request is killed by a session close / eviction while a replayed request is queued and waiting for a journal flush (e.g. dirty inest locks).
- The request construction fails because the request is already in active_requests. This could happen, theoretically, if a client resends the same request (same reqid) twice.

The first case is most probable but very difficult to reproduce for testing purposes. The replayed op would need to wait on a journal flush (to be restarted by C_MDS_RetryRequest). Then, the request would need to be killed by a session close.

[1] ed6a18d
[2] https://github.com/ceph/ceph/blob/a6f1a1c6c09d74f5918c715b05789f34f2ea0e90/src/mds/Server.cc#L2253-L2262
[3] https://github.com/ceph/ceph/blob/a6f1a1c6c09d74f5918c715b05789f34f2ea0e90/src/mds/Server.cc#L2380

Fixes: https://tracker.ceph.com/issues/56577
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
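The commit's approach — one generic cleanup site that queues the next replay op, plus an early queue in journal_and_reply so we don't wait on durability — can be sketched with a toy model (not Ceph code; the per-request guard flag and all structure here are invented for illustration):

```cpp
#include <cassert>
#include <deque>

// Toy request: tracks whether the next replay op was already queued,
// so the early path and the cleanup path cannot both queue it.
struct Mdr {
  bool is_replay = true;
  bool queued_next_replay = false;
};

struct Mds {
  std::deque<int> replay_queue;
  int dispatched = 0;

  void queue_next_replay(Mdr &mdr) {
    if (!mdr.is_replay || mdr.queued_next_replay)
      return;  // guard: queue the next op at most once per request
    mdr.queued_next_replay = true;
    if (!replay_queue.empty()) {
      replay_queue.pop_front();
      ++dispatched;
    }
  }

  // Early-reply path: queue the next replay op before the journal
  // flush completes, so we need not wait for durability.
  void journal_and_reply(Mdr &mdr) { queue_next_replay(mdr); }

  // Generic cleanup: runs for completed *and* dropped requests, so
  // every code path advances the replay queue exactly once.
  void request_cleanup(Mdr &mdr) { queue_next_replay(mdr); }
};
```

A request that goes through both journal_and_reply and request_cleanup advances the queue once, while a dropped request (cleanup only) still advances it — the property the commit message describes.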