Skip to content

mds: Fix incorrect Importer MDS recovery during EImportStart journal replay#36227

Closed
sidharthanup wants to merge 1 commit intoceph:mainfrom
sidharthanup:wip-fix-import-recovery
Closed

mds: Fix incorrect Importer MDS recovery during EImportStart journal replay#36227
sidharthanup wants to merge 1 commit intoceph:mainfrom
sidharthanup:wip-fix-import-recovery

Conversation

@sidharthanup
Copy link
Contributor

Signed-off-by: Sidharth Anupkrishnan sanupkri@redhat.com
Fixes: https://tracker.ceph.com/issues/46535

Checklist

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug

Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard backend
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox

@sidharthanup sidharthanup self-assigned this Jul 21, 2020
@sidharthanup sidharthanup added the cephfs Ceph File System label Jul 21, 2020
@sidharthanup sidharthanup requested review from batrick and ukernel July 21, 2020 13:16
<< " seconds during MDS startup";

if (g_conf()->mds_session_blacklist_on_timeout) {
if (g_conf()->mds_session_blacklist_on_timeout && !session->is_importing()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this works

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not? In Server::prepare_force_open_sessions(), the count gets incremented (https://github.com/sidharthanup/ceph/blob/wip-fix-import-recovery/src/mds/Server.cc#L937) . Then this session is persisted in EImportStart. During replay when sessionmap.replay_open_sessions() is called this information is relayed and then Server::reconnect_tick is going to see this and kill it instead of blacklisting.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

import_count is always 0 after log replay.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack. Thanks @ukernel. I have updated the code to set a flag on EImportStart replay to avoid blacklisting it. Yet to test it but don't you think this should work?

@sidharthanup sidharthanup force-pushed the wip-fix-import-recovery branch from 26444a4 to 13a4197 Compare July 28, 2020 12:14
<< " seconds during MDS startup";

if (g_conf()->mds_session_blacklist_on_timeout) {
if (g_conf()->mds_session_blacklist_on_timeout && !session->is_import_replayed()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for this implementation, any session opened by replay_open_sessions will not get blacklisteded. It's not good.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will not get blacklisted only when replay_open_sessions is called in EImportStart journal replay: https://github.com/ceph/ceph/pull/36227/files#diff-e18d1bb1d5f6388430b7705cf32eac6dR2991, due to import_replayed boolean being set to true. In this scenario either 2 things happen: the session was not actually opened by the client (before the MDS failed) and so it gets killed (instead of blacklisted) during recconnect, or if the client had opened the session before the MDS went down then during reconnect the connection is established again and the boolean is set back to false: https://github.com/ceph/ceph/pull/36227/files#diff-eecf840e09d7481f7b8f415789852c7dR1501.

Is there an issue that you see with this approach?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

needs to handle the case that client has actually opened session before MDS went down and the client failed to reconnect

Copy link
Contributor Author

@sidharthanup sidharthanup Aug 26, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ukernel Tried working on this today but I'm still not sure how to handle the above case (the session still gets killed, but not blacklisted, which is what I think you mean should happen). Problem is the journalling for the opening of connection from the MDS side (ESession) comes after EImportStart is journalled. Now during failover journal replay, the MDS at EImportStart replay does not have any information on whether the session was really opened by the client or not. That information is in the next ESession replay. Any thoughts on how to handle this? I may be missing something here.

Copy link
Contributor

@ukernel ukernel Aug 28, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for any session opened by EImportStart, set a client_not_open flag. If ESession is countered later, clear the flag. If client does not reconnect to session with the flag, just kill the session (don't blacklist)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ukernel Thanks! I missed your comment before. Clearing the flag on ESession seems like a good fix. I have updated the code.

@batrick
Copy link
Member

batrick commented Aug 18, 2020

status on this?

@sidharthanup
Copy link
Contributor Author

Will update by the weekend. Was slightly busy with the other pending work.

@sidharthanup sidharthanup force-pushed the wip-fix-import-recovery branch 2 times, most recently from 60467f2 to 5c807f9 Compare September 22, 2020 12:25
@sidharthanup
Copy link
Contributor Author

@ukernel @batrick PTAL.

@sidharthanup sidharthanup force-pushed the wip-fix-import-recovery branch from 5c807f9 to 775e281 Compare September 22, 2020 12:33
session = mds->sessionmap.get_or_add_session(client_inst);
mds->sessionmap.set_state(session, Session::STATE_OPEN);
session->set_client_metadata(client_metadata);
session->set_client_not_open(false);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no code submits the ESession event

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here It's just clearing the client_not_open flag set during EImportStart replay. Meaning the client had actually opened the connection and so we don't need this flag to be set. So during reconnect if the flag is set and the client failed to reconnect, then we kill the session and if it is not set and client failed to reconnect, then blacklist.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check Server::handle_client_session(). if session is already being opened by Server::prepare_force_open_sessions. it does not submit ESession event

@sidharthanup sidharthanup force-pushed the wip-fix-import-recovery branch from 775e281 to c860bc5 Compare September 22, 2020 16:44
@stale
Copy link

stale bot commented Dec 4, 2020

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@stale stale bot added the stale label Dec 4, 2020
@batrick
Copy link
Member

batrick commented Dec 4, 2020

@sidharthanup status?

@stale stale bot removed the stale label Dec 4, 2020
@sidharthanup sidharthanup force-pushed the wip-fix-import-recovery branch from c860bc5 to b5bf0a6 Compare January 27, 2021 08:54
@sidharthanup sidharthanup removed the DNM label Jan 27, 2021
@sidharthanup sidharthanup requested a review from a team January 27, 2021 08:56
@sidharthanup sidharthanup force-pushed the wip-fix-import-recovery branch from b5bf0a6 to 240bfbd Compare January 27, 2021 11:02
@sidharthanup
Copy link
Contributor Author

jenkins test make check

@github-actions
Copy link

github-actions bot commented Mar 5, 2021

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@sidharthanup
Copy link
Contributor Author

jenkins test make check

@sidharthanup sidharthanup force-pushed the wip-fix-import-recovery branch 2 times, most recently from 7ff106a to 06e120e Compare April 22, 2021 13:03
@sidharthanup sidharthanup force-pushed the wip-fix-import-recovery branch from 06e120e to cab9792 Compare April 22, 2021 13:54
@stale
Copy link

stale bot commented Jul 21, 2021

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@stale stale bot added the stale label Jul 21, 2021
@github-actions
Copy link

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@djgalloway djgalloway changed the base branch from master to main July 8, 2022 00:00
@github-actions github-actions bot removed the stale label Jul 15, 2022
@github-actions
Copy link

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@github-actions github-actions bot added the stale label Sep 13, 2022
@github-actions
Copy link

This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!

@github-actions github-actions bot closed this Oct 28, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants