Skip to content

mds/MDLog: a large refactor#53086

Closed
leonid-s-usov wants to merge 12 commits intoceph:mainfrom
leonid-s-usov:shared-log-segment
Closed

mds/MDLog: a large refactor#53086
leonid-s-usov wants to merge 12 commits intoceph:mainfrom
leonid-s-usov:shared-log-segment

Conversation

@leonid-s-usov
Copy link
Contributor

@leonid-s-usov leonid-s-usov commented Aug 22, 2023

  • Make sure that any structure referring to a LogSegment is using a shared pointer to do that
  • Limit the API changes by enhancing the LogSegment class with an enable_shared_from_this base
    • Make LogSegment::LogSegment private and expose a create method to correctly initialize the base class
    • Introduce AutoSharedLogSegment helper class that recovers a shared pointer from a LogSegment*
  • Make the needed changes to MDRank and MDLog to make MDLog unit-testable
  • Large refactoring of the MDLog post the LogSegment changes

Fixes: https://tracker.ceph.com/issues/4744

Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows

@github-actions github-actions bot added the cephfs Ceph File System label Aug 22, 2023
@leonid-s-usov leonid-s-usov requested a review from a team August 22, 2023 17:37
@leonid-s-usov leonid-s-usov force-pushed the shared-log-segment branch 9 times, most recently from 430d867 to 57df0f1 Compare August 23, 2023 20:30
Copy link
Member

@batrick batrick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My main concern with this PR is that it doesn't actually address:

"These really ought to be ref-counted in some way to prevent early expiry."

from https://tracker.ceph.com/issues/4744

It's not well laid out what the concern is but I suspect it's something like to prevent tickets like:

https://tracker.ceph.com/issues/59119

You could potentially resolve that ticket with this PR by checking that the LogSegment reference count is 1 before trimming* the segment.

  • I believe Greg actually meant "early trimming" as that's the real concern: that the log segment will be removed from the physical journal.

cc @gregsfortytwo

class MDSRank;
class LogSegment;
class EMetaBlob;
using AutoSharedLogSegment = auto_shared_ptr<LogSegment>;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I'm mystified by the utility of this class. Why couldn't it be:

using foo = std::shared_ptr<LogSegment>

?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I mentioned in the commit message, the utility of this class is to automatically convert a pointer to an object that inherits from enable_shared_from_this into a shared pointer and also adds an implicit conversion from the shared pointer back to a plain pointer. This helps minimize refactoring.

I'll elaborate on how I believe this helps address the original ticket in a separate comment.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Normally we use boost::intrusive_ptr if other accessors have a raw pointer. That's somewhat inappropriate in a code base we fully control, the mds.

Your autoshared class is interesting though. I'll have to think about this more.


[[nodiscard]] static AutoSharedLogSegment create(uint64_t _seq, loff_t off = -1)
{
return std::shared_ptr<LogSegment>(new LogSegment(_seq, off));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

@leonid-s-usov leonid-s-usov Aug 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make_shared is inaccessible here because the constructor is private. See class Best() in the example from the documentation on shared_ptr

Additionally, make_shared comes at the cost of pinning the memory of the whole object when only weak pointers are left since the same memory block is used to hold the shared pointer control and the class data. This is mentioned in the second note on the documentation page you've referenced.
To clarify, I'm not sure this will be a blocker for us, and maybe we could even benefit from it, but I think it's an important point to keep in mind

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make_shared is inaccessible here because the constructor is private. See class Best() in the example from the documentation on shared_ptr

ok

Additionally, make_shared comes at the cost of pinning the memory of the whole object when only weak pointers are left since the same memory block is used to hold the shared pointer control and the class data. This is mentioned in the second note on the documentation page you've referenced. To clarify, I'm not sure this will be a blocker for us, and maybe we could even benefit from it, but I think it's an important point to keep in mind

I don't really think that will be significant but I also wonder if a clever implementation is permitted to realloc the memory block when T is destroyed. What an odd caveat I had the opposite understanding about.

@leonid-s-usov
Copy link
Contributor Author

My main concern with this PR is that it doesn't actually address:

"These really ought to be ref-counted in some way to prevent early expiry."

@batrick I do think that this PR addresses the original ticket. It does so by changing all containers to hold a shared pointer to the LogSegment instead of a plain pointer.

It wouldn't be sufficient to only update the storage semantics, we also needed to correctly create shared pointers to the same original object. To achieve this I decided to utilize the enable_shared_from_this interface. Basically, it embeds the weak pointer to the object in the object itself, so that whoever wants to take shared ownership of it can simply call shared_from_this() and hold on to the result.
I chose this option to minimize the changes to the code since most of the functions that receive a pointer to the LogSegment perform synchronous operations with the object and hence would not benefit from getting a shared pointer instead.

To further reduce the scope of refactoring I've introduced a helper class auto_shared_ptr which can implicitly convert an enabled plain pointer to a shared pointer and back.

You could potentially resolve that ticket with this PR by checking that the LogSegment reference count is 1 before trimming* the segment.

This point needs discussion, as I don't think this requirement is still applicable - though I'll be happy to learn if I'm wrong about this.

At the time the original ticket was created, mds trimming logic was different - and simpler. Back then trimming was done synchronously, but now trimming is staged through

  std::set<AutoSharedLogSegment> expired_segments;
  std::set<AutoSharedLogSegment> expiring_segments;

I haven't yet fully understood the logic, but it looks like there's already an asynchronous process that only trims segments that were reported as unused.

@batrick
Copy link
Member

batrick commented Aug 24, 2023

My main concern with this PR is that it doesn't actually address:
"These really ought to be ref-counted in some way to prevent early expiry."

@batrick I do think that this PR addresses the original ticket. It does so by changing all containers to hold a shared pointer to the LogSegment instead of a plain pointer.

You've refcounted LogSegment but you haven't used that to prevent early expiry. Specifically, to address that I think you need to change:

ceph/src/mds/MDLog.cc

Lines 842 to 851 in ce9771b

dout(20) << __func__ << ": expiring " << *ls2 << dendl;
expired_events -= ls2->num_events;
expired_segments.erase(ls2);
if (pre_segments_size > 0)
pre_segments_size--;
num_events -= ls2->num_events;
logger->inc(l_mdl_evtrm, ls2->num_events);
logger->inc(l_mdl_segtrm);
expire_pos = ls2->end;
delete ls2;

to not trim (i.e. delete) a segment unless its refcount is 1.

@leonid-s-usov
Copy link
Contributor Author

to not trim (i.e. delete) a segment unless its refcount is 1.

Well, I don't need to do anything for the log segment to be deleted only when the last strong reference to it is gone, that's the primary function of a shared pointer. In other words, if the delete operation is what we'd like to delay - that's already the case with the current implementation.

@leonid-s-usov leonid-s-usov changed the title mds: manage the lifetime of a LogSegment with shared pointers mds/MDLog: manage the lifetime of a LogSegment with shared pointers Sep 11, 2023
@leonid-s-usov leonid-s-usov marked this pull request as draft September 11, 2023 09:05
@leonid-s-usov
Copy link
Contributor Author

leonid-s-usov commented Sep 11, 2023

@batrick @vshankar I ended up changing quite a bit. Unfortunately, there is no unittest infrastructure for MDLog, but it's worth investing in it. Most if not all of my changes can be relatively quickly verified with unit tests, and MDLog does not have that many external dependencies compared to other core classes.

I've changed this PR to a draft so that I can upload the current version which compiles, and have you go over it and start giving overall comments. If we want to proceed with this, I will have another commit where I make MDLog constructible in a unit test environment and start adding unit tests to verify the invariants that I have most probably violated

For details about the changes please see the commit message

@gregsfortytwo
Copy link
Member

Specific questions to look at here:

  1. the locking changes around the submit lock, and taking the mds lock in the thread
  2. ambiguity around event counts (live versus expiring versus expired) versus health checks

src/mds/MDLog.cc Outdated
logger->set(l_mdl_wrpos, ls->end);
// acquire the mds lock until the end of the scope
// to update the LS offsets and major offsets
std::lock_guard l(mds->mds_lock);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still need to go through these changes in more detail but this will not fly. The mds_lock has too much thread contention already. Also, the submit thread is intended to operate completely outside the mds_lock. I do not think we should ever change that.

I believe there may be another way to protect the LogSegment members behind the mds_lock: add the LogSegment to C_MDL_Flushed and update the segment's end/offset in the completion (::finish).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

C_MDL_Flushed will take mds_lock, so not much difference in terms of contention. But I agree that we could benefit from performing the update on the finisher thread instead of the submit thread.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you said, the journaler thread already regularly acquires the mds_lock so it's advantageous to use it for this purpose instead of adding a new thread to the contention.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see a1c9d1a, I removed the need for locking.

allowing move construction and assignment.

Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
This will allow generic code easily identify whether a given
MDSContext instance will take the mds lock or it expects
to be called with the lock taken.

Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
damaged_unlocked is another method in the same class that has
the same semantics but a better name, so now
hande_write_error_unlocked will take the mds lock before
executing handle_write_error.

Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
* Make sure that any structure referring to a LogSegment is using a shared pointer to do that
* Limit the API changes by enhancing the LogSegment class with an `enable_shared_from_this` base
  - Make LogSegment::LogSegment private and expose a `create` method to correctly initialize the base class
  - Introduce AutoSharedLogSegment helper class that recovers a shared pointer from a LogSegment*

Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
Fixes: https://tracker.ceph.com/issues/4744
The main goal of this change is to keep expired segments as weak references,
allowing to spot "zombie" segments which should've been deleted by the time
we decide to trim them.
Since it's not a fatal error to have a dangling log segment given that it was
properly expired, the system will ignore zombies unless a dedicated new
debug flag is enabled:

- name: mds_debug_zombie_log_segments
  type: bool
  level: advanced
  default: false
  services:
  - mds
  fmt_desc: MDS will abort if an expired log segment is still
            strongly referenced (for developers only).

To achieve this it was necessary to change how segments are kept in the log.
Specifically, it was necessary to break up the `segments` map into two:
unexpired_segments and expired_segments, where the latter holds structs of
ExpiredSegmentInfo with a weak ref to the original log segment,
as well as some importany by-value copies for accounting and syncronization.

Since the change already required significant refactoring, it was used to deliver
a number of additional improvements:

* expiring segments don't need a separate container, they are now kept in `unexpired_segments`
  until they complete expiration. The expiration boundary is kept in a separate
  variable `last_expiring_segment_seq` which is sufficient.
* submit_mutex is now used exclusively to protect the `pending_events` when sending them
  to the submit thread. All other internal state is already protected by the mds_lock
* expiration waiters are now added as a single context associated with a segment sequence,
  which will be finished when no more unexpired segments exists with the sequence or below.
* expiration and trimming logic were optimized to make as much use of the O(log N) lookup
  in the std containers as possible.
* methods and fields were renamed to reduce ambiguity

Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
@leonid-s-usov
Copy link
Contributor Author

https://pulpito.ceph.com/leonidus-2024-06-25_18:06:42-fs-wip-lusov-mdlog-distro-default-smithi/
This run has one failure, caused by the zombie segment detection, new in this PR. It means a log segment is still kept somewhere as a strong reference despite having expired. To be investigated.

@leonid-s-usov
Copy link
Contributor Author

https://pulpito.ceph.com/leonidus-2024-06-25_18:06:42-fs-wip-lusov-mdlog-distro-default-smithi/ This run has one failure, caused by the zombie segment detection, new in this PR. It means a log segment is still kept somewhere as a strong reference despite having expired. To be investigated.

There are 2 more failures. xfstests-dev, which I am re-running now to see if this is easily reproduced, and a crash in the trimming. That one I'll address today

@leonid-s-usov
Copy link
Contributor Author

leonid-s-usov commented Jun 26, 2024

https://pulpito.ceph.com/leonidus-2024-06-25_18:06:42-fs-wip-lusov-mdlog-distro-default-smithi/ This run has one failure, caused by the zombie segment detection, new in this PR. It means a log segment is still kept somewhere as a strong reference despite having expired. To be investigated.

Correction:
there are two failures related to the zombie detection which is enabled by the new dedicated dev setting: mds_debug_zombie_log_segments

And there are xfs failures here:
https://pulpito.ceph.com/leonidus-2024-06-25_18:06:42-fs-wip-lusov-mdlog-distro-default-smithi/7771785/

@leonid-s-usov
Copy link
Contributor Author

I'm running 10 more instances of the xfs tests here: https://pulpito.ceph.com/leonidus-2024-06-26_05:12:19-fs-wip-lusov-mdlog-distro-default-smithi/

@leonid-s-usov
Copy link
Contributor Author

@vshankar known issues at this point:

  • the new zombie detection setting (see above) is not supported in upgrade workloads because the previous version doesn't recognize the setting. I believe there's a standard approach to dealing with this kind of problem when new settings are introduced
  • with the zombie detection enabled, some tests cause MDS crashes with FAILED ceph_assert(ignore_zombies). It could be a real leak, but it could also be some internal problem where a strong ref is kept somewhere in the MDLog itself.
  • Not yet sure if the XFS test failures are related to this PR

I haven't scheduled a full fs suite on this change because it is not even started with priority 100.

@github-actions
Copy link

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@github-actions
Copy link

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@github-actions github-actions bot added the stale label Sep 14, 2024
@github-actions
Copy link

This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!

@github-actions github-actions bot closed this Oct 14, 2024
@dparmar18 dparmar18 reopened this Oct 14, 2024
@github-actions github-actions bot removed the stale label Oct 14, 2024
@github-actions
Copy link

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@github-actions github-actions bot added the stale label Dec 13, 2024
@github-actions
Copy link

This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!

@github-actions github-actions bot closed this Jan 12, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants