
mds: ensure next replay is queued on req drop #47121

Merged: batrick merged 1 commit into ceph:main from batrick:i56577 on Nov 2, 2023

Conversation

@batrick (Member) commented Jul 15, 2022

Not all client replay requests are queued at once since [1]. We require the next request to be queued when the current one is completed (unsafely) or during cleanup. Not all code paths handle this [2], so move it to a generic location, MDCache::request_cleanup. Even so, this doesn't cover all errors (so we must still be careful), as sometimes we must queue the next replay request before an MDRequest is constructed [3] during some error conditions.

Additionally, preserve the behavior of Server::journal_and_reply queueing the next replay op. Otherwise, we would unnecessarily wait for the request to be durable before moving on to the next one.

[1] ed6a18d
[2]

ceph/src/mds/Server.cc

Lines 2253 to 2262 in a6f1a1c

if (req->is_queued_for_replay() &&
(mdr->has_completed || reply->get_result() < 0)) {
if (reply->get_result() < 0) {
int r = reply->get_result();
derr << "reply_client_request: failed to replay " << *req
<< " error " << r << " (" << cpp_strerror(r) << ")" << dendl;
mds->clog->warn() << "failed to replay " << req->get_reqid() << " error " << r;
}
mds->queue_one_replay();
}

[3]
mds->queue_one_replay();

Fixes: https://tracker.ceph.com/issues/56577
Signed-off-by: Patrick Donnelly pdonnell@redhat.com


@batrick batrick added the cephfs Ceph File System label Jul 15, 2022
@batrick (Member Author) commented Jul 15, 2022

Still testing this locally (broken rhel dev machine is killing me). Also thinking about a way to test this.

@batrick batrick force-pushed the i56577 branch 2 times, most recently from 49b0911 to 314ba0f Compare July 28, 2022 01:03
@vshankar vshankar requested a review from a team August 5, 2022 13:12
@batrick (Member Author) commented Aug 8, 2022

Just leaving a note I've been unable to reproduce the original problem so there's no reason to believe this will fix the problem. I'll leave this open for now as a draft until I can think of a scenario causing this.

@github-actions

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@Mer1997 (Contributor) commented Mar 22, 2023

@batrick I've found two places that cause the MDS to get stuck in the clientreplay state:

  1. When the journal is flushed and the request comes back in via RetryRequest (the inest lock was dirty on the first acquire_lock), it enters dispatch_client_request directly instead of handle_client_request, which misses the opportunity to queue_one_replay and return if the session was killed. (This is relatively more common.)
void Server::dispatch_client_request(MDRequestRef& mdr)
{
  if (mdr->killed) {
    dout(10) << "request " << *mdr << " was killed" << dendl;
    return;
...
  2. When request_start fails in handle_client_request, it returns directly without queueing one replay (possibly due to duplicate messages); this is relatively rarer.
void Server::handle_client_request(const MClientRequest::const_ref &req)
{
...
  MDRequestRef mdr = mdcache->request_start(req);
  if (!mdr.get())
    return;
...

Both of these situations cause the MDS to get stuck in the clientreplay state, unable to continue.

@batrick (Member Author) commented Mar 22, 2023

@batrick I've found two places that caused MDS to get stuck in the clientreplay state:

1. When journal flushed and comes in RetryRequest (inest lock was dirty when first acquire_lock), it will directly enter dispatch_client_request instead of handle_client_request, which will miss the opportunity to queue_one_replay and return if the session was killed. (This is relatively more common)
void Server::dispatch_client_request(MDRequestRef& mdr)
{
  if (mdr->killed) {
    dout(10) << "request " << *mdr << " was killed" << dendl;
    return;
...

Ah, great find! I'm not sure I see the significance of calling dispatch_client_request instead of handle_client_request though. IIRC, the code assumed that replayed requests would go through either journal_and_reply or reply_client_request. So that's where all queuing would occur for the next replayed request. So, I don't see how handle_client_request would have queued the next op in the event of a session getting killed.

With that said, this sounds like a reasonable cause. Would you agree the current methodology of this PR should address this case? I will write a test.

2. When request_start fails in handle_client_request, it will directly return without queueing one replay (possibly due to duplicate messages), which is relatively more rare.
void Server::handle_client_request(const MClientRequest::const_ref &req)
{
...
  MDRequestRef mdr = mdcache->request_start(req);
  if (!mdr.get())
    return;
...

Both of these situations will cause MDS to get stuck in the clientreplay state and cannot continue.

The duplicate message case is interesting but easy to simulate. I think the "natural" way to do this is have a client disconnect from the MDS and then reconnect? An easier hack would be to have a client config that sends duplicate replay requests.

@Mer1997 (Contributor) commented Mar 27, 2023

Reply Inline:)

With that said, this sounds like a reasonable cause. Would you agree the current methodology of this PR should address this case? I will write a test.

I'm afraid it may not fully resolve our issue. We can only clean up after a request has already been started via request_start; if a message from the client fails before that point, we won't be able to complete the cleanup and queue_one_replay (as in the scenario mentioned in the second point).

@batrick I've found two places that caused MDS to get stuck in the clientreplay state:

1. When journal flushed and comes in RetryRequest (inest lock was dirty when first acquire_lock), it will directly enter dispatch_client_request instead of handle_client_request, which will miss the opportunity to queue_one_replay and return if the session was killed. (This is relatively more common)
void Server::dispatch_client_request(MDRequestRef& mdr)
{
  if (mdr->killed) {
    dout(10) << "request " << *mdr << " was killed" << dendl;
    return;
...

Ah, great find! I'm not sure I see the significance of calling dispatch_client_request instead of handle_client_request though. IIRC, the code assumed that replayed requests would go through either journal_and_reply or reply_client_request. So that's where all queuing would occur for the next replayed request. So, I don't see how handle_client_request would have queued the next op in the event of a session getting killed.

This was my mistake 🙈; indeed, MDCache::dispatch_request is the correct retry path after waiting for the inest lock to be unmarked dirty (after scatter_writebehind finishes). The MDS won't do anything because the request is in the waiting list (WAIT_STABLE).

2. When request_start fails in handle_client_request, it will directly return without queueing one replay (possibly due to duplicate messages), which is relatively more rare.
void Server::handle_client_request(const MClientRequest::const_ref &req)
{
...
  MDRequestRef mdr = mdcache->request_start(req);
  if (!mdr.get())
    return;
...

Both of these situations will cause MDS to get stuck in the clientreplay state and cannot continue.

The duplicate message case is interesting but easy to simulate. I think the "natural" way to do this is have a client disconnect from the MDS and then reconnect? An easier hack would be to have a client config that sends duplicate replay requests.

I'm not sure; I think it's just a theoretical risk, because the client can only send unsafe_requests and old_requests once, while the MDS is in the RECONNECT phase, and if the client disconnects it has to wait for the MDS to transition to ACTIVE.


BTW, if we queue_one_replay in MDCache::request_cleanup, we might queue more than one request if the session is killed (there is a for loop calling request_kill in Server::journal_close_session). This would break the old rule of processing requests one by one during the clientreplay phase, and I'm not sure if that's correct. (Maybe we should set_queued_next_replay_op not only in journal_and_reply?)

@github-actions github-actions bot added the stale label May 26, 2023
@github-actions github-actions bot closed this Jun 25, 2023
@batrick batrick reopened this Aug 8, 2023
@github-actions github-actions bot removed the stale label Aug 8, 2023
@Mer1997 (Contributor) commented Sep 26, 2023

@batrick Could you please write a test for this pr? :)

@ceph ceph deleted a comment from github-actions bot Oct 18, 2023
@ceph ceph deleted a comment from github-actions bot Oct 18, 2023
@batrick (Member Author) commented Oct 18, 2023

Reply Inline:)

With that said, this sounds like a reasonable cause. Would you agree the current methodology of this PR should address this case? I will write a test.

I'm afraid it may not fully resolve our issue. We can only clean it up after a request already started from request_start, and if a message from the client returns before it, we won't be able to complete the cleanup and queue_one_replay (as in the scenario mentioned in the second point).

The code in this PR will make sure a killed request queues the next replay op via request_cleanup. That's called when the request is killed (unless it is already committing, and Server::journal_and_reply will have already queued the next replay op).

So I think a test may be possible, but it's incredibly tricky, as I believe it would require a replayed op to wait on a journal flush and then a session close to come in from the client.

2. When request_start fails in handle_client_request, it will directly return without queueing one replay (possibly due to duplicate messages), which is relatively more rare.
void Server::handle_client_request(const MClientRequest::const_ref &req)
{
...
  MDRequestRef mdr = mdcache->request_start(req);
  if (!mdr.get())
    return;
...

Both of these situations will cause MDS to get stuck in the clientreplay state and cannot continue.

The duplicate message case is interesting but easy to simulate. I think the "natural" way to do this is have a client disconnect from the MDS and then reconnect? An easier hack would be to have a client config that sends duplicate replay requests.

I'm not sure, it's just a risk I think because the client can only send unsafe_requests and old_requests once when mds is in RECONNECT phase, and if the client disconnects, it has to wait for mds's status to convert to ACTIVE.

I think this one is also tricky to test.

BTW, if we queue_one_replay in MDCache::request_cleanup, we might queue more than one request if the session is killed (there is a for loop calling request_kill in Server::journal_close_session). This would break the old rule of processing requests one by one during the clientreplay phase, and I'm not sure if that's correct. (Maybe we should set_queued_next_replay_op not only in journal_and_reply?)

Yes, can't hurt to add that. Thanks.

I plan to clean this PR up and forgo a test case for it. I think we've wrapped our heads around the cause. Thanks a lot for the input!

@vshankar (Contributor)

On this today...

@vshankar (Contributor) left a review comment
Nicely done 👍

@github-actions

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

Not all client replay requests are queued at once since [1]. We require the next request to be queued when the current one is completed (unsafely) or during cleanup. Not all code paths handle this [2], so move it to a generic location, MDCache::request_cleanup. Even so, this doesn't cover all errors (so we must still be careful), as sometimes we must queue the next replay request before an MDRequest is constructed [3] during some error conditions.

Additionally, preserve the behavior of Server::journal_and_reply queueing the next replay op. Otherwise, we would unnecessarily wait for the request to be durable before moving on to the next one.

For reproducing, two specific cases are highlighted (thanks to @Mer1997 on
Github for locating these):

- The request is killed by a session close / eviction while a replayed request
  is queued and waiting for a journal flush (e.g. dirty inest locks).

- The request construction fails because the request is already in the
  active_requests. This could happen theoretically if a client resends the same
  request (same reqid) twice.

The first case is the most probable but very difficult to reproduce for testing purposes. The replayed op would need to wait on a journal flush (to be restarted by C_MDS_RetryRequest). Then, the request would need to be killed by a session close.

[1] ed6a18d
[2] https://github.com/ceph/ceph/blob/a6f1a1c6c09d74f5918c715b05789f34f2ea0e90/src/mds/Server.cc#L2253-L2262
[3] https://github.com/ceph/ceph/blob/a6f1a1c6c09d74f5918c715b05789f34f2ea0e90/src/mds/Server.cc#L2380

Fixes: https://tracker.ceph.com/issues/56577
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
@batrick (Member Author) commented Nov 2, 2023
