mds/MDLog: a large refactor by leonid-s-usov · Pull Request #53086 · ceph/ceph

leonid-s-usov · 2023-08-22T17:36:26Z

Make sure that any structure referring to a LogSegment is using a shared pointer to do that
Limit the API changes by enhancing the LogSegment class with an enable_shared_from_this base
- Make LogSegment::LogSegment private and expose a create method to correctly initialize the base class
- Introduce AutoSharedLogSegment helper class that recovers a shared pointer from a LogSegment*
Make the needed changes to MDRank and MDLog to make MDLog unit-testable
Large refactoring of the MDLog post the LogSegment changes

Fixes: https://tracker.ceph.com/issues/4744

Show available Jenkins commands

jenkins retest this please
jenkins test classic perf
jenkins test crimson perf
jenkins test signed
jenkins test make check
jenkins test make check arm64
jenkins test submodules
jenkins test dashboard
jenkins test dashboard cephadm
jenkins test api
jenkins test docs
jenkins render docs
jenkins test ceph-volume all
jenkins test ceph-volume tox
jenkins test windows

batrick

My main concern with this PR is that it doesn't actually address:

"These really ought to be ref-counted in some way to prevent early expiry."

from https://tracker.ceph.com/issues/4744

It's not well laid out what the concern is but I suspect it's something like to prevent tickets like:

https://tracker.ceph.com/issues/59119

You could potentially resolve that ticket with this PR by checking that the LogSegment reference count is 1 before trimming* the segment.

I believe Greg actually meant "early trimming" as that's the real concern: that the log segment will be removed from the physical journal.

cc @gregsfortytwo

batrick · 2023-08-23T20:27:58Z

src/mds/LogEvent.h

 class MDSRank;
 class LogSegment;
 class EMetaBlob;
+using AutoSharedLogSegment = auto_shared_ptr<LogSegment>;


So I'm mystified by the utility of this class. Why couldn't it be:

using foo = std::shared_ptr<LogSegment>

?

As I mentioned in the commit message, the utility of this class is to automatically convert a pointer to an object that inherits from enable_shared_from_this into a shared pointer and also adds an implicit conversion from the shared pointer back to a plain pointer. This helps minimize refactoring.

I'll elaborate on how I believe this helps address the original ticket in a separate comment.

Normally we use boost::intrusive_ptr if other accessors have a raw pointer. That's somewhat inappropriate in a code base we fully control, the mds.

Your autoshared class is interesting though. I'll have to think about this more.

batrick · 2023-08-23T20:29:52Z

src/mds/LogSegment.h

+
+  [[nodiscard]] static AutoSharedLogSegment create(uint64_t _seq, loff_t off = -1)
+  {
+    return std::shared_ptr<LogSegment>(new LogSegment(_seq, off));


https://en.cppreference.com/w/cpp/memory/shared_ptr/make_shared

make_shared is inaccessible here because the constructor is private. See class Best() in the example from the documentation on shared_ptr

Additionally, make_shared comes at the cost of pinning the memory of the whole object when only weak pointers are left since the same memory block is used to hold the shared pointer control and the class data. This is mentioned in the second note on the documentation page you've referenced.
To clarify, I'm not sure this will be a blocker for us, and maybe we could even benefit from it, but I think it's an important point to keep in mind

make_shared is inaccessible here because the constructor is private. See class Best() in the example from the documentation on shared_ptr

ok

Additionally, make_shared comes at the cost of pinning the memory of the whole object when only weak pointers are left since the same memory block is used to hold the shared pointer control and the class data. This is mentioned in the second note on the documentation page you've referenced. To clarify, I'm not sure this will be a blocker for us, and maybe we could even benefit from it, but I think it's an important point to keep in mind

I don't really think that will be significant but I also wonder if a clever implementation is permitted to realloc the memory block when T is destroyed. What an odd caveat I had the opposite understanding about.

leonid-s-usov · 2023-08-23T21:00:51Z

My main concern with this PR is that it doesn't actually address:

"These really ought to be ref-counted in some way to prevent early expiry."

@batrick I do think that this PR addresses the original ticket. It does so by changing all containers to hold a shared pointer to the LogSegment instead of a plain pointer.

It wouldn't be sufficient to only update the storage semantics, we also needed to correctly create shared pointers to the same original object. To achieve this I decided to utilize the enable_shared_from_this interface. Basically, it embeds the weak pointer to the object in the object itself, so that whoever wants to take shared ownership of it can simply call shared_from_this() and hold on to the result.
I chose this option to minimize the changes to the code since most of the functions that receive a pointer to the LogSegment perform synchronous operations with the object and hence would not benefit from getting a shared pointer instead.

To further reduce the scope of refactoring I've introduced a helper class auto_shared_ptr which can implicitly convert an enabled plain pointer to a shared pointer and back.

You could potentially resolve that ticket with this PR by checking that the LogSegment reference count is 1 before trimming* the segment.

This point needs discussion, as I don't think this requirement is still applicable - though I'll be happy to learn if I'm wrong about this.

At the time the original ticket was created, mds trimming logic was different - and simpler. Back then trimming was done synchronously, but now trimming is staged through

  std::set<AutoSharedLogSegment> expired_segments;
  std::set<AutoSharedLogSegment> expiring_segments;

I haven't yet fully understood the logic, but it looks like there's already an asynchronous process that only trims segments that were reported as unused.

batrick · 2023-08-24T01:58:57Z

My main concern with this PR is that it doesn't actually address:
"These really ought to be ref-counted in some way to prevent early expiry."

@batrick I do think that this PR addresses the original ticket. It does so by changing all containers to hold a shared pointer to the LogSegment instead of a plain pointer.

You've refcounted LogSegment but you haven't used that to prevent early expiry. Specifically, to address that I think you need to change:

ceph/src/mds/MDLog.cc

Lines 842 to 851 in ce9771b

    
           dout(20) << __func__ << ": expiring " << *ls2 << dendl; 
        
           expired_events -= ls2->num_events; 
        
           expired_segments.erase(ls2); 
        
           if (pre_segments_size > 0) 
        
             pre_segments_size--; 
        
           num_events -= ls2->num_events; 
        
           logger->inc(l_mdl_evtrm, ls2->num_events); 
        
           logger->inc(l_mdl_segtrm); 
        
           expire_pos = ls2->end; 
        
           delete ls2;

to not trim (i.e. delete) a segment unless its refcount is 1.

src/mds/LogSegment.h

leonid-s-usov · 2023-08-24T08:07:57Z

to not trim (i.e. delete) a segment unless its refcount is 1.

Well, I don't need to do anything for the log segment to be deleted only when the last strong reference to it is gone, that's the primary function of a shared pointer. In other words, if the delete operation is what we'd like to delay - that's already the case with the current implementation.

leonid-s-usov · 2023-09-11T12:01:24Z

@batrick @vshankar I ended up changing quite a bit. Unfortunately, there is no unittest infrastructure for MDLog, but it's worth investing in it. Most if not all of my changes can be relatively quickly verified with unit tests, and MDLog does not have that many external dependencies compared to other core classes.

I've changed this PR to a draft so that I can upload the current version which compiles, and have you go over it and start giving overall comments. If we want to proceed with this, I will have another commit where I make MDLog constructible in a unit test environment and start adding unit tests to verify the invariants that I have most probably violated

For details about the changes please see the commit message

gregsfortytwo · 2023-09-12T15:38:47Z

Specific questions to look at here:

the locking changes around the submit lock, and taking the mds lock in the thread
ambiguity around event counts (live versus expiring versus expired) versus health checks

batrick · 2023-09-12T16:02:03Z

src/mds/MDLog.cc

-	logger->set(l_mdl_wrpos, ls->end);
+      // acquire the mds lock until the end of the scope
+      // to update the LS offsets and major offsets
+      std::lock_guard l(mds->mds_lock);


I still need to go through these changes in more detail but this will not fly. The mds_lock has too much thread contention already. Also, the submit thread is intended to operate completely outside the mds_lock. I do not think we should ever change that.

I believe there may be another way to protect the LogSegment members behind the mds_lock: add the LogSegment to C_MDL_Flushed and update the segment's end/offset in the completion (::finish).

C_MDL_Flushed will take mds_lock, so not much difference in terms of contention. But I agree that we could benefit from performing the update on the finisher thread instead of the submit thread.

As you said, the journaler thread already regularly acquires the mds_lock so it's advantageous to use it for this purpose instead of adding a new thread to the contention.

Please see a1c9d1a, I removed the need for locking.

allowing move construction and assignment. Signed-off-by: Leonid Usov <leonid.usov@ibm.com>

This will allow generic code easily identify whether a given MDSContext instance will take the mds lock or it expects to be called with the lock taken. Signed-off-by: Leonid Usov <leonid.usov@ibm.com>

damaged_unlocked is another method in the same class that has the same semantics but a better name, so now hande_write_error_unlocked will take the mds lock before executing handle_write_error. Signed-off-by: Leonid Usov <leonid.usov@ibm.com>

Signed-off-by: Leonid Usov <leonid.usov@ibm.com>

* Make sure that any structure referring to a LogSegment is using a shared pointer to do that * Limit the API changes by enhancing the LogSegment class with an `enable_shared_from_this` base - Make LogSegment::LogSegment private and expose a `create` method to correctly initialize the base class - Introduce AutoSharedLogSegment helper class that recovers a shared pointer from a LogSegment* Signed-off-by: Leonid Usov <leonid.usov@ibm.com> Fixes: https://tracker.ceph.com/issues/4744

The main goal of this change is to keep expired segments as weak references, allowing to spot "zombie" segments which should've been deleted by the time we decide to trim them. Since it's not a fatal error to have a dangling log segment given that it was properly expired, the system will ignore zombies unless a dedicated new debug flag is enabled: - name: mds_debug_zombie_log_segments type: bool level: advanced default: false services: - mds fmt_desc: MDS will abort if an expired log segment is still strongly referenced (for developers only). To achieve this it was necessary to change how segments are kept in the log. Specifically, it was necessary to break up the `segments` map into two: unexpired_segments and expired_segments, where the latter holds structs of ExpiredSegmentInfo with a weak ref to the original log segment, as well as some importany by-value copies for accounting and syncronization. Since the change already required significant refactoring, it was used to deliver a number of additional improvements: * expiring segments don't need a separate container, they are now kept in `unexpired_segments` until they complete expiration. The expiration boundary is kept in a separate variable `last_expiring_segment_seq` which is sufficient. * submit_mutex is now used exclusively to protect the `pending_events` when sending them to the submit thread. All other internal state is already protected by the mds_lock * expiration waiters are now added as a single context associated with a segment sequence, which will be finished when no more unexpired segments exists with the sequence or below. * expiration and trimming logic were optimized to make as much use of the O(log N) lookup in the std containers as possible. * methods and fields were renamed to reduce ambiguity Signed-off-by: Leonid Usov <leonid.usov@ibm.com>

Signed-off-by: Leonid Usov <leonid.usov@ibm.com>

leonid-s-usov · 2024-06-25T20:09:36Z

https://pulpito.ceph.com/leonidus-2024-06-25_18:06:42-fs-wip-lusov-mdlog-distro-default-smithi/
This run has one failure, caused by the zombie segment detection, new in this PR. It means a log segment is still kept somewhere as a strong reference despite having expired. To be investigated.

leonid-s-usov · 2024-06-26T05:13:49Z

https://pulpito.ceph.com/leonidus-2024-06-25_18:06:42-fs-wip-lusov-mdlog-distro-default-smithi/ This run has one failure, caused by the zombie segment detection, new in this PR. It means a log segment is still kept somewhere as a strong reference despite having expired. To be investigated.

There are 2 more failures. xfstests-dev, which I am re-running now to see if this is easily reproduced, and a crash in the trimming. That one I'll address today

leonid-s-usov · 2024-06-26T08:47:56Z

https://pulpito.ceph.com/leonidus-2024-06-25_18:06:42-fs-wip-lusov-mdlog-distro-default-smithi/ This run has one failure, caused by the zombie segment detection, new in this PR. It means a log segment is still kept somewhere as a strong reference despite having expired. To be investigated.

Correction:
there are two failures related to the zombie detection which is enabled by the new dedicated dev setting: mds_debug_zombie_log_segments

And there are xfs failures here:
https://pulpito.ceph.com/leonidus-2024-06-25_18:06:42-fs-wip-lusov-mdlog-distro-default-smithi/7771785/

leonid-s-usov · 2024-06-26T08:50:56Z

I'm running 10 more instances of the xfs tests here: https://pulpito.ceph.com/leonidus-2024-06-26_05:12:19-fs-wip-lusov-mdlog-distro-default-smithi/

leonid-s-usov · 2024-06-26T08:55:35Z

@vshankar known issues at this point:

the new zombie detection setting (see above) is not supported in upgrade workloads because the previous version doesn't recognize the setting. I believe there's a standard approach to dealing with this kind of problem when new settings are introduced
with the zombie detection enabled, some tests cause MDS crashes with FAILED ceph_assert(ignore_zombies). It could be a real leak, but it could also be some internal problem where a strong ref is kept somewhere in the MDLog itself.
Not yet sure if the XFS test failures are related to this PR

I haven't scheduled a full fs suite on this change because it is not even started with priority 100.

github-actions · 2024-07-16T04:35:22Z

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

github-actions · 2024-09-14T05:01:26Z

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

github-actions · 2024-10-14T06:02:07Z

This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!

github-actions · 2024-12-13T13:08:12Z

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

github-actions · 2025-01-12T16:01:30Z

This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!

leonid-s-usov requested review from batrick and vshankar August 22, 2023 17:36

github-actions bot added the cephfs Ceph File System label Aug 22, 2023

leonid-s-usov requested a review from a team August 22, 2023 17:37

leonid-s-usov force-pushed the shared-log-segment branch 9 times, most recently from 430d867 to 57df0f1 Compare August 23, 2023 20:30

github-actions bot added build/ops tests labels Aug 23, 2023

batrick requested changes Aug 23, 2023

View reviewed changes

batrick reviewed Aug 24, 2023

View reviewed changes

src/mds/LogSegment.h Show resolved Hide resolved

leonid-s-usov force-pushed the shared-log-segment branch from 57df0f1 to f50344f Compare August 24, 2023 18:37

leonid-s-usov changed the title ~~mds: manage the lifetime of a LogSegment with shared pointers~~ mds/MDLog: manage the lifetime of a LogSegment with shared pointers Sep 11, 2023

leonid-s-usov marked this pull request as draft September 11, 2023 09:05

leonid-s-usov force-pushed the shared-log-segment branch from f50344f to 03faff6 Compare September 11, 2023 11:50

github-actions bot added the common label Sep 11, 2023

leonid-s-usov force-pushed the shared-log-segment branch from 03faff6 to 47028d4 Compare September 11, 2023 11:55

leonid-s-usov force-pushed the shared-log-segment branch from 47028d4 to dc9d658 Compare September 11, 2023 16:27

batrick requested changes Sep 12, 2023

View reviewed changes

leonid-s-usov added 12 commits June 25, 2024 18:22

common: add missing methods to ClibBackTrace

4ba5c45

allowing move construction and assignment. Signed-off-by: Leonid Usov <leonid.usov@ibm.com>

mds/MDSContext: add a takes_lock() method

1a3d0be

This will allow generic code easily identify whether a given MDSContext instance will take the mds lock or it expects to be called with the lock taken. Signed-off-by: Leonid Usov <leonid.usov@ibm.com>

mds/MDSRank: introduce MDSRankBase for dependency injection

098f9f8

Signed-off-by: Leonid Usov <leonid.usov@ibm.com>

mds,osdc: make Journaler an interface

3803b7c

Signed-off-by: Leonid Usov <leonid.usov@ibm.com>

test/mds: a minimal in-memory Journaler implementation

88c9cfc

Signed-off-by: Leonid Usov <leonid.usov@ibm.com>

mds/journal: extract event serializers

b4a5d7d

Signed-off-by: Leonid Usov <leonid.usov@ibm.com>

mds/MDCache: a cache proxy for MDLog

6b6d363

Signed-off-by: Leonid Usov <leonid.usov@ibm.com>

test/mds: MDLog unittest

f78bf32

Signed-off-by: Leonid Usov <leonid.usov@ibm.com>

mds/MDLog: fix the boundary condition in await_expiring_segments

5dfe9b5

Signed-off-by: Leonid Usov <leonid.usov@ibm.com>

leonid-s-usov force-pushed the shared-log-segment branch from a26b130 to 5dfe9b5 Compare June 25, 2024 15:23

github-actions bot added the needs-rebase label Jul 16, 2024

github-actions bot added the stale label Sep 14, 2024

github-actions bot closed this Oct 14, 2024

dparmar18 reopened this Oct 14, 2024

github-actions bot removed the stale label Oct 14, 2024

github-actions bot added the stale label Dec 13, 2024

github-actions bot closed this Jan 12, 2025

Conversation

leonid-s-usov commented Aug 22, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

batrick left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

leonid-s-usov Aug 23, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

leonid-s-usov commented Aug 23, 2023

Uh oh!

batrick commented Aug 24, 2023

Uh oh!

Uh oh!

leonid-s-usov commented Aug 24, 2023

Uh oh!

leonid-s-usov commented Sep 11, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gregsfortytwo commented Sep 12, 2023

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

leonid-s-usov commented Jun 25, 2024

Uh oh!

leonid-s-usov commented Jun 26, 2024

Uh oh!

leonid-s-usov commented Jun 26, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

leonid-s-usov commented Jun 26, 2024

Uh oh!

leonid-s-usov commented Jun 26, 2024

Uh oh!

github-actions bot commented Jul 16, 2024

Uh oh!

github-actions bot commented Sep 14, 2024

Uh oh!

github-actions bot commented Oct 14, 2024

Uh oh!

github-actions bot commented Dec 13, 2024

Uh oh!

github-actions bot commented Jan 12, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

leonid-s-usov commented Aug 22, 2023 •

edited

Loading

leonid-s-usov Aug 23, 2023 •

edited

Loading

leonid-s-usov commented Sep 11, 2023 •

edited

Loading

leonid-s-usov commented Jun 26, 2024 •

edited

Loading