
mds: ensure next replay is queued on req drop #47121

Merged: batrick merged 1 commit into ceph:main from batrick:i56577 on Nov 2, 2023

Conversation

@batrick (Member) commented Jul 15, 2022

Not all client replay requests are queued at once since [1]. We require the next request to be queued when the current one is completed (unsafely) or during cleanup. Not all code paths handle this [2], so move it to a generic location, MDCache::request_cleanup. Even so, this doesn't cover all errors (so we must still be careful), as sometimes we must queue the next replay request before an MDRequest is constructed [3] during some error conditions.

Additionally, preserve the behavior of Server::journal_and_reply queueing the next replay op. Otherwise, we would unnecessarily wait for the request to be durable before moving on to the next one.

[1] ed6a18d
[2]

ceph/src/mds/Server.cc

Lines 2253 to 2262 in a6f1a1c

if (req->is_queued_for_replay() &&
(mdr->has_completed || reply->get_result() < 0)) {
if (reply->get_result() < 0) {
int r = reply->get_result();
derr << "reply_client_request: failed to replay " << *req
<< " error " << r << " (" << cpp_strerror(r) << ")" << dendl;
mds->clog->warn() << "failed to replay " << req->get_reqid() << " error " << r;
}
mds->queue_one_replay();
}

[3]
mds->queue_one_replay();

Fixes: https://tracker.ceph.com/issues/56577
Signed-off-by: Patrick Donnelly pdonnell@redhat.com


@batrick batrick added the cephfs Ceph File System label Jul 15, 2022
@batrick (Member Author) commented Jul 15, 2022

Still testing this locally (broken rhel dev machine is killing me). Also thinking about a way to test this.

@batrick batrick force-pushed the i56577 branch 2 times, most recently from 49b0911 to 314ba0f Compare July 28, 2022 01:03
@vshankar vshankar requested a review from a team August 5, 2022 13:12
@batrick (Member Author) commented Aug 8, 2022

Just leaving a note I've been unable to reproduce the original problem so there's no reason to believe this will fix the problem. I'll leave this open for now as a draft until I can think of a scenario causing this.

@github-actions

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@Mer1997 (Contributor) commented Mar 22, 2023

@batrick I've found two places that cause the MDS to get stuck in the clientreplay state:

  1. When the journal is flushed and the request comes back in via RetryRequest (the inest lock was dirty on the first acquire_lock), it enters dispatch_client_request directly instead of handle_client_request, which misses the opportunity to queue_one_replay and return if the session was killed. (This is relatively more common.)
void Server::dispatch_client_request(MDRequestRef& mdr)
{
  if (mdr->killed) {
    dout(10) << "request " << *mdr << " was killed" << dendl;
    return;
...
  2. When request_start fails in handle_client_request, it returns directly without queueing one replay (possibly due to duplicate messages); this is relatively rarer.
void Server::handle_client_request(const MClientRequest::const_ref &req)
{
...
  MDRequestRef mdr = mdcache->request_start(req);
  if (!mdr.get())
    return;
...

Both of these situations cause the MDS to get stuck in the clientreplay state, unable to continue.

@batrick (Member Author) commented Mar 22, 2023

@batrick I've found two places that caused MDS to get stuck in the clientreplay state:

1. When journal flushed and comes in RetryRequest (inest lock was dirty when first acquire_lock), it will directly enter dispatch_client_request instead of handle_client_request, which will miss the opportunity to queue_one_replay and return if the session was killed. (This is relatively more common)
void Server::dispatch_client_request(MDRequestRef& mdr)
{
  if (mdr->killed) {
    dout(10) << "request " << *mdr << " was killed" << dendl;
    return;
...

Ah, great find! I'm not sure I see the significance of calling dispatch_client_request instead of handle_client_request though. IIRC, the code assumed that replayed requests would go through either journal_and_reply or reply_client_request. So that's where all queuing would occur for the next replayed request. So, I don't see how handle_client_request would have queued the next op in the event of a session getting killed.

With that said, this sounds like a reasonable cause. Would you agree the current methodology of this PR should address this case? I will write a test.

2. When request_start fails in handle_client_request, it will directly return without queueing one replay (possibly due to duplicate messages), which is relatively more rare.
void Server::handle_client_request(const MClientRequest::const_ref &req)
{
...
  MDRequestRef mdr = mdcache->request_start(req);
  if (!mdr.get())
    return;
...

Both of these situations will cause MDS to get stuck in the clientreplay state and cannot continue.

The duplicate message case is interesting but easy to simulate. I think the "natural" way to do this is have a client disconnect from the MDS and then reconnect? An easier hack would be to have a client config that sends duplicate replay requests.

@Mer1997 (Contributor) commented Mar 27, 2023

Reply Inline:)

With that said, this sounds like a reasonable cause. Would you agree the current methodology of this PR should address this case? I will write a test.

I'm afraid it may not fully resolve our issue. We can only clean up after a request has already been started via request_start; if a message from the client fails before that point, we won't be able to complete the cleanup and queue_one_replay (as in the scenario mentioned in the second point).

@batrick I've found two places that caused MDS to get stuck in the clientreplay state:

1. When journal flushed and comes in RetryRequest (inest lock was dirty when first acquire_lock), it will directly enter dispatch_client_request instead of handle_client_request, which will miss the opportunity to queue_one_replay and return if the session was killed. (This is relatively more common)
void Server::dispatch_client_request(MDRequestRef& mdr)
{
  if (mdr->killed) {
    dout(10) << "request " << *mdr << " was killed" << dendl;
    return;
...

Ah, great find! I'm not sure I see the significance of calling dispatch_client_request instead of handle_client_request though. IIRC, the code assumed that replayed requests would go through either journal_and_reply or reply_client_request. So that's where all queuing would occur for the next replayed request. So, I don't see how handle_client_request would have queued the next op in the event of a session getting killed.

This was my mistake 🙈; indeed, MDCache::dispatch_request is the correct retry path after waiting for the inest lock to be unmarked dirty (after scatter_writebehind finishes). The MDS won't do anything because the request is in the waiting list (WAIT_STABLE).

2. When request_start fails in handle_client_request, it will directly return without queueing one replay (possibly due to duplicate messages), which is relatively more rare.
void Server::handle_client_request(const MClientRequest::const_ref &req)
{
...
  MDRequestRef mdr = mdcache->request_start(req);
  if (!mdr.get())
    return;
...

Both of these situations will cause MDS to get stuck in the clientreplay state and cannot continue.

The duplicate message case is interesting but easy to simulate. I think the "natural" way to do this is have a client disconnect from the MDS and then reconnect? An easier hack would be to have a client config that sends duplicate replay requests.

I'm not sure; I think it's just a theoretical risk, because the client can only send unsafe_requests and old_requests once, while the MDS is in the RECONNECT phase, and if the client disconnects it has to wait for the MDS to transition to ACTIVE.


BTW, if we queue_one_replay in MDCache::request_cleanup, we might queue more than one request if the session is killed (there is a for loop calling request_kill in Server::journal_close_session). This would break the old rule of processing requests one by one during the clientreplay phase, and I'm not sure if that's correct. (Maybe we should set_queued_next_replay_op not only in journal_and_reply?)

@github-actions github-actions bot added the stale label May 26, 2023
@github-actions github-actions bot closed this Jun 25, 2023
@batrick batrick reopened this Aug 8, 2023
@github-actions github-actions bot removed the stale label Aug 8, 2023
@Mer1997 (Contributor) commented Sep 26, 2023

@batrick Could you please write a test for this pr? :)

@ceph ceph deleted a comment from github-actions bot Oct 18, 2023
@ceph ceph deleted a comment from github-actions bot Oct 18, 2023
@batrick (Member Author) commented Oct 18, 2023

Reply Inline:)

With that said, this sounds like a reasonable cause. Would you agree the current methodology of this PR should address this case? I will write a test.

I'm afraid it may not fully resolve our issue. We can only clean it up after a request already started from request_start, and if a message from the client returns before it, we won't be able to complete the cleanup and queue_one_replay (as in the scenario mentioned in the second point).

The code in this PR will make sure a killed request queues the next replay op via request_cleanup. That's called when the request is killed (unless it is already committing, and Server::journal_and_reply will have already queued the next replay op).

So I think a test may be possible, but it's incredibly tricky, as I believe it would require a replayed op to wait on a journal flush and then a session close to come in from the client.

2. When request_start fails in handle_client_request, it will directly return without queueing one replay (possibly due to duplicate messages), which is relatively more rare.
void Server::handle_client_request(const MClientRequest::const_ref &req)
{
...
  MDRequestRef mdr = mdcache->request_start(req);
  if (!mdr.get())
    return;
...

Both of these situations will cause MDS to get stuck in the clientreplay state and cannot continue.

The duplicate message case is interesting but easy to simulate. I think the "natural" way to do this is have a client disconnect from the MDS and then reconnect? An easier hack would be to have a client config that sends duplicate replay requests.

I'm not sure, it's just a risk I think because the client can only send unsafe_requests and old_requests once when mds is in RECONNECT phase, and if the client disconnects, it has to wait for mds's status to convert to ACTIVE.

I think this one is also tricky to test.

BTW, if we queue_one_replay in MDCache::request_cleanup, we might queue more than one request if the session is killed (there is a for loop calling request_kill in Server::journal_close_session). This would break the old rule of processing requests one by one during the clientreplay phase, and I'm not sure if that's correct. (Maybe we should set_queued_next_replay_op not only in journal_and_reply?)

Yes, can't hurt to add that. Thanks.

I plan to clean this PR up and forgo a test case for it. I think we've wrapped our heads around the cause. Thanks a lot for the input!

@vshankar (Contributor)

On this today...

@vshankar (Contributor) left a review comment
Nicely done 👍

@github-actions

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

Not all client replay requests are queued at once since [1]. We require the next request to be queued when the current one is completed (unsafely) or during cleanup. Not all code paths handle this [2], so move it to a generic location, MDCache::request_cleanup. Even so, this doesn't cover all errors (so we must still be careful), as sometimes we must queue the next replay request before an MDRequest is constructed [3] during some error conditions.

Additionally, preserve the behavior of Server::journal_and_reply queueing the next replay op. Otherwise, we would unnecessarily wait for the request to be durable before moving on to the next one.

For reproducing, two specific cases are highlighted (thanks to @Mer1997 on
Github for locating these):

- The request is killed by a session close / eviction while a replayed request
  is queued and waiting for a journal flush (e.g. dirty inest locks).

- The request construction fails because the request is already in the
  active_requests. This could happen theoretically if a client resends the same
  request (same reqid) twice.

The first case is the most probable but very difficult to reproduce for testing purposes. The replayed op would need to wait on a journal flush (to be restarted by C_MDS_RetryRequest). Then, the request would need to be killed by a session close.

[1] ed6a18d
[2] https://github.com/ceph/ceph/blob/a6f1a1c6c09d74f5918c715b05789f34f2ea0e90/src/mds/Server.cc#L2253-L2262
[3] https://github.com/ceph/ceph/blob/a6f1a1c6c09d74f5918c715b05789f34f2ea0e90/src/mds/Server.cc#L2380

Fixes: https://tracker.ceph.com/issues/56577
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
@batrick (Member Author) commented Nov 2, 2023
