mds: Fix incorrect Importer MDS recovery during EImportStart journal replay#36227
mds: Fix incorrect Importer MDS recovery during EImportStart journal replay#36227sidharthanup wants to merge 1 commit intoceph:mainfrom
Conversation
src/mds/Server.cc
Outdated
| << " seconds during MDS startup"; | ||
|
|
||
| if (g_conf()->mds_session_blacklist_on_timeout) { | ||
| if (g_conf()->mds_session_blacklist_on_timeout && !session->is_importing()) { |
There was a problem hiding this comment.
Why not? In Server::prepare_force_open_sessions(), the count gets incremented (https://github.com/sidharthanup/ceph/blob/wip-fix-import-recovery/src/mds/Server.cc#L937) . Then this session is persisted in EImportStart. During replay when sessionmap.replay_open_sessions() is called this information is relayed and then Server::reconnect_tick is going to see this and kill it instead of blacklisting.
There was a problem hiding this comment.
import_count is always 0 after log replay.
There was a problem hiding this comment.
Ack. Thanks @ukernel. I have updated the code to set a flag on EImportStart replay to avoid blacklisting it. Yet to test it but don't you think this should work?
26444a4 to
13a4197
Compare
src/mds/Server.cc
Outdated
| << " seconds during MDS startup"; | ||
|
|
||
| if (g_conf()->mds_session_blacklist_on_timeout) { | ||
| if (g_conf()->mds_session_blacklist_on_timeout && !session->is_import_replayed()) { |
There was a problem hiding this comment.
for this implementation, any session opened by replay_open_sessions will not get blacklisteded. It's not good.
There was a problem hiding this comment.
It will not get blacklisted only when replay_open_sessions is called in EImportStart journal replay: https://github.com/ceph/ceph/pull/36227/files#diff-e18d1bb1d5f6388430b7705cf32eac6dR2991, due to import_replayed boolean being set to true. In this scenario either 2 things happen: the session was not actually opened by the client (before the MDS failed) and so it gets killed (instead of blacklisted) during recconnect, or if the client had opened the session before the MDS went down then during reconnect the connection is established again and the boolean is set back to false: https://github.com/ceph/ceph/pull/36227/files#diff-eecf840e09d7481f7b8f415789852c7dR1501.
Is there an issue that you see with this approach?
There was a problem hiding this comment.
needs to handle the case that client has actually opened session before MDS went down and the client failed to reconnect
There was a problem hiding this comment.
@ukernel Tried working on this today but I'm still not sure how to handle the above case (the session still gets killed, but not blacklisted, which is what I think you mean should happen). Problem is the journalling for the opening of connection from the MDS side (ESession) comes after EImportStart is journalled. Now during failover journal replay, the MDS at EImportStart replay does not have any information on whether the session was really opened by the client or not. That information is in the next ESession replay. Any thoughts on how to handle this? I may be missing something here.
There was a problem hiding this comment.
for any session opened by EImportStart, set a client_not_open flag. If ESession is countered later, clear the flag. If client does not reconnect to session with the flag, just kill the session (don't blacklist)
There was a problem hiding this comment.
@ukernel Thanks! I missed your comment before. Clearing the flag on ESession seems like a good fix. I have updated the code.
|
status on this? |
|
Will update by the weekend. Was slightly busy with the other pending work. |
60467f2 to
5c807f9
Compare
5c807f9 to
775e281
Compare
| session = mds->sessionmap.get_or_add_session(client_inst); | ||
| mds->sessionmap.set_state(session, Session::STATE_OPEN); | ||
| session->set_client_metadata(client_metadata); | ||
| session->set_client_not_open(false); |
There was a problem hiding this comment.
no code submits the ESession event
There was a problem hiding this comment.
Here It's just clearing the client_not_open flag set during EImportStart replay. Meaning the client had actually opened the connection and so we don't need this flag to be set. So during reconnect if the flag is set and the client failed to reconnect, then we kill the session and if it is not set and client failed to reconnect, then blacklist.
There was a problem hiding this comment.
check Server::handle_client_session(). if session is already being opened by Server::prepare_force_open_sessions. it does not submit ESession event
775e281 to
c860bc5
Compare
|
This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days. |
|
@sidharthanup status? |
c860bc5 to
b5bf0a6
Compare
b5bf0a6 to
240bfbd
Compare
|
jenkins test make check |
|
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved |
240bfbd to
371abc4
Compare
|
jenkins test make check |
7ff106a to
06e120e
Compare
…replay https://tracker.ceph.com/issues/46535 Signed-off-by: Sidharth Anupkrishnan <sanupkri@redhat.com>
06e120e to
cab9792
Compare
|
This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days. |
|
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved |
|
This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days. |
|
This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution! |
Signed-off-by: Sidharth Anupkrishnan sanupkri@redhat.com
Fixes: https://tracker.ceph.com/issues/46535
Checklist
Show available Jenkins commands
jenkins retest this pleasejenkins test classic perfjenkins test crimson perfjenkins test signedjenkins test make checkjenkins test make check arm64jenkins test submodulesjenkins test dashboardjenkins test dashboard backendjenkins test docsjenkins render docsjenkins test ceph-volume alljenkins test ceph-volume tox