Skip to content

mds: add minor segment boundaries#48732

Merged
vshankar merged 30 commits intoceph:mainfrom
batrick:mds-segment-event
Aug 2, 2023
Merged

mds: add minor segment boundaries#48732
vshankar merged 30 commits intoceph:mainfrom
batrick:mds-segment-event

Conversation

@batrick
Copy link
Member

@batrick batrick commented Nov 4, 2022

https://tracker.ceph.com/issues/58154

Contribution Guidelines

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows

@batrick batrick added the cephfs Ceph File System label Nov 4, 2022
@batrick
Copy link
Member Author

batrick commented Nov 4, 2022

@vshankar @gregsfortytwo here's a draft prototype for avoiding ESubtreeMap for every LogSegment. Still some polishing / testing / documentation to do.

@batrick
Copy link
Member Author

batrick commented Nov 8, 2022

change+rebase: fixed a debug message

static void generate_test_instances(std::list<EMetaBlob*>& ls);
// soft stateadd
uint64_t last_subtree_map;
uint64_t event_seq;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This bit looks like a nice cleanup (+ enhancement). I wonder why EMetaBlob tracked event_seq in its own structure.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was a hack to see if a dir inode was already added to the EMetaBlob.

@vshankar
Copy link
Contributor

LGTM

@batrick
Copy link
Member Author

batrick commented Nov 30, 2022

rebase plus minor changes:

diff --git a/src/mds/MDLog.cc b/src/mds/MDLog.cc
index 60756591726..9f1be3ba3f5 100644
--- a/src/mds/MDLog.cc
+++ b/src/mds/MDLog.cc
@@ -438,8 +438,7 @@ void MDLog::_submit_thread()
       delete le;
     } else {
       if (data.fin) {
-       Context* fin =
-               dynamic_cast<MDSContext*>(data.fin);
+       Context* fin = dynamic_cast<Context*>(data.fin);
        ceph_assert(fin);
        C_MDL_Flushed *fin2 = new C_MDL_Flushed(this, fin);
        fin2->set_write_pos(journaler->get_write_pos());
@@ -1396,7 +1395,8 @@ void MDLog::_replay_thread()
         continue;
       } else {
         mds->damaged_unlocked();
-        ceph_abort();  // Should be unreachable because damaged() calls respawn()
+        ceph_abort();  // Should be unreachable because damaged() calls
+                    // respawn()
       }
     }
     le->set_start_off(pos);
  • Undid unrelated whitespace changtes.
  • Fixed dynamic_cast.

@batrick
Copy link
Member Author

batrick commented Dec 2, 2022

Added tracker ticket forr potential backport: https://tracker.ceph.com/issues/58154

@vshankar
Copy link
Contributor

vshankar commented Dec 8, 2022

@vshankar
Copy link
Contributor

vshankar commented Dec 8, 2022

https://tracker.ceph.com/projects/cephfs/wiki/Main#08-Dec-2022

Sorry - wrong branch/link. This is still under test.

@batrick batrick force-pushed the mds-segment-event branch 2 times, most recently from 92f0fe0 to 28d1a99 Compare February 1, 2023 00:57
@github-actions github-actions bot added the tests label Feb 1, 2023
@batrick
Copy link
Member Author

batrick commented Feb 1, 2023

@vshankar see qa: add numerous subtree test.

Will work on adding

https://tracker.ceph.com/issues/58550

to this PR next.

batrick added 21 commits August 1, 2023 11:16
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
The major problem here is that the MDLog::_start_entry method puts the
current event sequence number in the EMetaBlob of the event (if
present). Because of this, no other event can be submitted as this would
invalidate the event sequence. Instead, fixup the event sequence during
submission and simplify related logic that uses it during EMetaBlob
construction.

Secondarily, for the purposes of this commit series, _start_entry
introduced recursive locks when generating the ESubtreeMap within
MDLog::_segment_upkeep. So, this commit is a necessary cleanup.

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
This commit adds a new ESegment event type which can delineate
LogSegments. This event can be used as an alternative to the heavy
weight ESubtreeMap which can be very expensive to generate when the MDS
has a large subtree map.

Fixes: https://tracker.ceph.com/issues/58154
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Fixes: https://tracker.ceph.com/issues/58550
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
When the ESubtreeMap is very large (~5k+ subtrees), the MDS will
end up logging only a few events (as bad as 1) per segment as the
subtree map dominates the segment size.

This test simply creates an artificially large subtree and confirms that
other file system activity completes in a timely manner. This is now
taking advantage of the minor segments which allows for a normal set of
events per log segment (and fewer subtree maps). The test fails on the
current main HEAD.

Historical note: when I first observed this abberant behavior, the
vstart cluster was actually using mds_debug_subtrees = True (the default
for every vstart cluster). This caused the MDS to write out the subtree
map (for debugging reasons) with every event. When testing the MDS with
large subtrees (distributed ephemeral pinning), this caused the MDS to
slow to a trickle of operations per second. Despite this unintentional
misconfiguration, the problem still exists but the number of auth
subtrees must be large for a particlar rank to replicate the behavior.

On main HEAD, the creation of 10k files (workload stage) takes ~110
seconds. On this branch, it takes ~30 seconds.

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
When the MDS journal is wiped, EResetJournal is a major segment boundary
as it necessarily begins the journal.

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Prior to this set of commits, the MDS would write the ESubtreeMap to the
journal, trim everything up to that segment, then finally force the
trimming of that last segment (`MDLog::trim(0)`). This is awkward in the
new code which preserves a major segment boundary at the beginning of
the journal during trimming. Instead of writing a special case for this
situation, it seems more natural to just use a new "lid" or "cap" event
to mark the beginning of the journal when no subtree map can yet be
written but we need sequence numbers to tie in other MDS tables.

Like ESegment, ELid doesn't actually contain any state. It's just a
marker for the beginning the log after rank deactivation or rank
creation. It can appear in the middle of the log if the shutdown
sequence is interrupted while writing the event but the MDS will skip it
during replay in that case.

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
To prevent old MDS from joining a file system with the new log events.

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
For killpoint testing.

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
This change causes the program to exit gracefully when stdin is closed
rather than with a Python exception.

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
This relies on the new stdin-killer [1] teuthology helper that allows
interacting with the command's stdin.

[1] ceph/teuthology#1846

Fixes: 8bb77ed
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
This makes sourcing this for e.g. vstart_runner.py actually useful.

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
There's no technical reason to disallow this. The original intent was to
avoid deadlocks but this possibility is already present when interacting
with a teuthology RemoteProcess. Avoiding it only for local processes
does not make sense.

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
With [1], these tools are now installed in the teuthology virtualenv.
Update the path in the command arguments so these tools can be run via
sudo.

[1] ceph/teuthology#1846

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
These tools are now available in the $PATH so it's no longer necessary
to remove them.

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Now that the teuthology tools can be run in vstart_runner, there's no
reason to override this method.

Importantly, this enables the use of the new stdin-killer tool [1].

[1] ceph/teuthology#1846

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
So stdin-killer and other utilities are installed in the bin directory.
vstart_runner.py now relies on their presence.

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
@batrick
Copy link
Member Author

batrick commented Aug 1, 2023

Rebased to fix conflict created by merging #51995.

@batrick
Copy link
Member Author

batrick commented Aug 1, 2023

jenkins test make check

1 similar comment
@batrick
Copy link
Member Author

batrick commented Aug 2, 2023

jenkins test make check

@batrick
Copy link
Member Author

batrick commented Aug 2, 2023

jenkins test dashboard cephadm

@batrick
Copy link
Member Author

batrick commented Aug 2, 2023

@vshankar tests pass now

@vshankar
Copy link
Contributor

vshankar commented Aug 2, 2023

@vshankar tests pass now

Nice. Merging this soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants