Skip to content

quincy: mds: relax certain asserts in mdlog replay thread#56015

Merged
vshankar merged 1 commit intoceph:quincyfrom
vshankar:wip-64760-quincy
Nov 28, 2024
Merged

quincy: mds: relax certain asserts in mdlog replay thread#56015
vshankar merged 1 commit intoceph:quincyfrom
vshankar:wip-64760-quincy

Conversation

@vshankar
Copy link
Contributor

@vshankar vshankar commented Mar 7, 2024

backport tracker: https://tracker.ceph.com/issues/64760


backport of #55639
parent tracker: https://tracker.ceph.com/issues/57048

this backport was staged using ceph-backport.sh version 16.0.0.6848
find the latest version at https://github.com/ceph/ceph/blob/main/src/script/ceph-backport.sh

The calls to journaler->is_readable() and journaler->get_error()
in MDLog::_replay_thread() will drop Journaler::lock between
invocations, so, theoretically, its possible that the initial check:

  // loop
  int r = 0;
  while (1) {
    // wait for read?
    while (!journaler->is_readable() &&
       journaler->get_read_pos() < journaler->get_write_pos() &&
       !journaler->get_error()) {
      C_SaferCond readable_waiter;
      journaler->wait_for_readable(&readable_waiter);
      r = readable_waiter.wait();
    }
    if (journaler->get_error()) {
      r = journaler->get_error();
      dout(0) << "_replay journaler got error " << r << ", aborting" << dendl;

journaler->is_readable() returned true, thereby breaking out of
the (inner) while loop and by passing the journaler->get_error()
check, and by the time this hits the next set of checks:

    if (!journaler->is_readable() &&
    journaler->get_read_pos() == journaler->get_write_pos())
      break;

    ceph_assert(journaler->is_readable() || mds->is_daemon_stopping());

It's possible that the journal is unreadable due to some error that
happened during prefetch. In short, these checks are racy.

So, remove these racy assert check along with journaler->is_readable()
check when validating the journal end and rely on the next iteration
of reading the journal for error handling.

Fixes: http://tracker.ceph.com/issues/57048
Signed-off-by: Venky Shankar <vshankar@redhat.com>
(cherry picked from commit 90393de)
@vshankar vshankar added this to the quincy milestone Mar 7, 2024
@vshankar vshankar added the cephfs Ceph File System label Mar 7, 2024
@joscollin joscollin requested a review from a team June 20, 2024 09:56
@joscollin
Copy link
Member

This PR is under test in https://tracker.ceph.com/issues/66597.

@github-actions
Copy link

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@github-actions github-actions bot added the stale label Nov 26, 2024
@vshankar vshankar removed the stale label Nov 28, 2024
Copy link
Contributor Author

@vshankar vshankar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vshankar vshankar merged commit f2e97b2 into ceph:quincy Nov 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cephfs Ceph File System

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants