mds: trim mdlog when segments exceed threshold and trim was idle#60381

Merged
batrick merged 3 commits into ceph:main from vshankar:wip-66948
Nov 13, 2024
Conversation

@vshankar
Contributor

Should have explained the intent in the commit message, but pushing this out for reviews. See https://tracker.ceph.com/issues/66948#note-9 for an explanation.

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)

@vshankar vshankar requested a review from a team October 17, 2024 15:15
@github-actions github-actions bot added cephfs Ceph File System common labels Oct 17, 2024
@vshankar
Contributor Author

@gregsfortytwo - from your question in tracker

This is bizarre, why is the mdlog leaving expired_segments just sitting in the log? Is that formula just wrong, or being used for two purposes that don't quite mesh or something?

I have to check that, but it has something to do with the minor segment boundary, which I haven't looked at closely. This change might not be required if there is a bug in MDLog::_trim_expired_segments() or if it's keeping segments around for longer than expected.

@vshankar
Contributor Author

From /a/vshankar-2024-07-08_07:21:13-fs-wip-vshankar-testing-20240705.150505-debug-testing-default-smithi/7791866 where the trim warning was seen, for ./remote/smithi089/log/ceph-mds.f.log.1.gz, there is a flurry of the below log messages for some time

2024-07-09T23:43:48.946+0000 7f2ed139b640 20 mds.2.log _trim_expired_segments: examining LogSegment(258092/0x1e72da941 events=5)
2024-07-09T23:43:48.946+0000 7f2ed139b640 10 mds.2.log _trim_expired_segments: maybe expiring LogSegment(258092/0x1e72da941 events=5)
2024-07-09T23:43:48.946+0000 7f2ed139b640 20 mds.2.log _trim_expired_segments: examining LogSegment(258097/0x1e789b9ba events=111)
2024-07-09T23:43:48.946+0000 7f2ed139b640 10 mds.2.log _trim_expired_segments: maybe expiring LogSegment(258097/0x1e789b9ba events=111)
2024-07-09T23:43:48.946+0000 7f2ed139b640 20 mds.2.log _trim_expired_segments: examining LogSegment(258208/0x1e7c4152e events=39)
2024-07-09T23:43:48.946+0000 7f2ed139b640 10 mds.2.log _trim_expired_segments: maybe expiring LogSegment(258208/0x1e7c4152e events=39)
2024-07-09T23:43:48.946+0000 7f2ed139b640 20 mds.2.log _trim_expired_segments: examining LogSegment(258247/0x1e8013480 events=39)
2024-07-09T23:43:48.946+0000 7f2ed139b640 10 mds.2.log _trim_expired_segments: maybe expiring LogSegment(258247/0x1e8013480 events=39)
...
...
...
2024-07-09T23:43:48.946+0000 7f2ed139b640 10 mds.2.log _trim_expired_segments: maybe expiring LogSegment(267055/0x21583f596 events=126)
2024-07-09T23:43:48.946+0000 7f2ed139b640 20 mds.2.log _trim_expired_segments: examining LogSegment(267181/0x215c1638c events=60)
2024-07-09T23:43:48.946+0000 7f2ed139b640 10 mds.2.log _trim_expired_segments: maybe expiring LogSegment(267181/0x215c1638c events=60)
2024-07-09T23:43:48.946+0000 7f2ed139b640 20 mds.2.log _trim_expired_segments: examining LogSegment(267241/0x21602de98 events=42)
2024-07-09T23:43:48.946+0000 7f2ed139b640 10 mds.2.log _trim_expired_segments: maybe expiring LogSegment(267241/0x21602de98 events=42)
2024-07-09T23:43:48.946+0000 7f2ed139b640 20 mds.2.log _trim_expired_segments: examining LogSegment(267283/0x216df1773 events=54)
2024-07-09T23:43:48.946+0000 7f2ed139b640 10 mds.2.log _trim_expired_segments waiting for expiry LogSegment(267283/0x216df1773 events=54)

_trim_expired_segments() removes log segments from expired_segments only when it sees a major segment. It's been a while since I've looked at the minor/major segment parts in mdlog (my last involvement was when that PR was under review) -- I'll have to revisit it.

@batrick
Member

batrick commented Oct 17, 2024

2024-07-09T23:36:06.567+0000 7f2ed5ba4640 20 mds.2.log _submit_entry EExport 0x100000022e5 to mds.0 [metablob 0x1, 8 dirs]
2024-07-09T23:36:06.567+0000 7f2ed5ba4640 10 mds.2.log _segment_upkeep: starting new segment, current LogSegment(267181/0x215c1638c events=60)
2024-07-09T23:36:06.567+0000 7f2ed5ba4640 20 mds.2.log _submit_entry ESegment(0)

The actual problem is that these minor segments are consistently small in event count but large in byte size due to the large EExport events. That's causing new segments to be created via:

} else if (ls->end/period != ls->offset/period || ls->num_events >= events_per_segment) {

I think the code needs to not care about the events since the last major segment but instead the number of minor segments since the last major segment:

if (events_since_last_major_segment > events_per_segment*major_segment_event_ratio) {

So if there have been ~8-16 minor segments since the last major segment, write a new one.
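The idea above can be sketched as a small standalone illustration. This is not the MDLog implementation; the struct, counter, and method names are made up for the sketch, and the threshold corresponds to the config option the PR eventually introduced (`mds_log_minor_segments_per_major_segment`, default 16, per the commit message later in this thread):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch of the proposed segment-upkeep decision (not Ceph's
// actual code): roll a major segment (ESubtreeMap) after a fixed number of
// minor segments, instead of after a raw event count.
struct SegmentUpkeep {
  uint64_t minor_segments_since_major = 0;
  uint64_t minor_segments_per_major = 16;  // mds_log_minor_segments_per_major_segment

  // Called whenever a new (minor) segment boundary is started.
  // Returns true when a major segment should be logged; logging the major
  // segment resets the counter.
  bool on_new_minor_segment() {
    ++minor_segments_since_major;
    if (minor_segments_since_major >= minor_segments_per_major) {
      minor_segments_since_major = 0;
      return true;
    }
    return false;
  }
};
```

With this shape, many tiny minor segments (e.g. produced by large EExport events) still trigger a major boundary promptly, so trimming is not held up waiting for an event-count threshold.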

@batrick
Member

batrick commented Oct 17, 2024

Note this is only found because we have export thrashing turned on.

@gregsfortytwo gregsfortytwo left a comment

I don't like this solution because it's basically introducing a post-hoc patch-up, and it's not synchronized with what the Beacon health check is doing, so it could still flash up a warning and then clean up. It's much better if we can follow some straightforward length rules that are checked for in the same way in all relevant parts of the code.

Does Patrick's formula resolve this on its own, or is more needed?

@vshankar
Contributor Author

vshankar commented Oct 18, 2024

Note this is only found because we have export thrashing turned on.

Downstream QE is able to reproduce this and I don't think their tests were thrashing MDSs.

EDIT: thrashing exports.

@vshankar
Contributor Author

2024-07-09T23:36:06.567+0000 7f2ed5ba4640 20 mds.2.log _submit_entry EExport 0x100000022e5 to mds.0 [metablob 0x1, 8 dirs]
2024-07-09T23:36:06.567+0000 7f2ed5ba4640 10 mds.2.log _segment_upkeep: starting new segment, current LogSegment(267181/0x215c1638c events=60)
2024-07-09T23:36:06.567+0000 7f2ed5ba4640 20 mds.2.log _submit_entry ESegment(0)

The actual problem is that these minor segments are consistently small in event count but large in byte size due to the large EExport events. That's causing new segments to be created via:

} else if (ls->end/period != ls->offset/period || ls->num_events >= events_per_segment) {

I think the code needs to not care about the events since the last major segment but instead the number of minor segments since the last major segment:

if (events_since_last_major_segment > events_per_segment*major_segment_event_ratio) {

So if there have been ~8-16 minor segments since the last major segment, write a new one.

This would create major segments (ESubtreeMap) more frequently. It's definitely better than what it was originally (which was to log an ESubtreeMap event) when:

  } else if (ls->end/period != ls->offset/period ||
             ls->num_events >= g_conf()->mds_log_events_per_segment) {
    dout(10) << "submit_entry also starting new segment: last = "
             << ls->seq  << "/" << ls->offset << ", event seq = " << event_seq << dendl;
    _start_new_segment();

Now, logging a major segment boundary after 8 (or whatever) number of minor segments would avoid long waits for segments to expire before they can eventually be trimmed, but it still looks like a relatively large change from logging a major segment after (1024 * 12) events. And the downstream reproducer didn't involve export thrashing (I will have to double check), so I'm still curious whether this is the only thing we need to work around the trim issue.

@batrick
Member

batrick commented Oct 18, 2024

2024-07-09T23:36:06.567+0000 7f2ed5ba4640 20 mds.2.log _submit_entry EExport 0x100000022e5 to mds.0 [metablob 0x1, 8 dirs]
2024-07-09T23:36:06.567+0000 7f2ed5ba4640 10 mds.2.log _segment_upkeep: starting new segment, current LogSegment(267181/0x215c1638c events=60)
2024-07-09T23:36:06.567+0000 7f2ed5ba4640 20 mds.2.log _submit_entry ESegment(0)

The actual problem is that these minor segments are consistently small in event count but large in byte size due to the large EExport events. That's causing new segments to be created via:

} else if (ls->end/period != ls->offset/period || ls->num_events >= events_per_segment) {

I think the code needs to not care about the events since the last major segment but instead the number of minor segments since the last major segment:

if (events_since_last_major_segment > events_per_segment*major_segment_event_ratio) {

So if there have been ~8-16 minor segments since the last major segment, write a new one.

This would create major segments (ESubtreeMap) more frequently. It's definitely better than what it was originally (which was to log an ESubtreeMap event) when:

  } else if (ls->end/period != ls->offset/period ||
             ls->num_events >= g_conf()->mds_log_events_per_segment) {
    dout(10) << "submit_entry also starting new segment: last = "
             << ls->seq  << "/" << ls->offset << ", event seq = " << event_seq << dendl;
    _start_new_segment();

Now, logging a major segment boundary after 8 (or whatever) number of minor segments would avoid long waits for segments to expire before they can eventually be trimmed, but it still looks like a relatively large change from logging a major segment after (1024 * 12) events. And the downstream reproducer didn't involve export thrashing (I will have to double check), so I'm still curious whether this is the only thing we need to work around the trim issue.

Perhaps the downstream issue is slightly different?

@vshankar
Contributor Author

@batrick - Regarding the downstream issue related to trim, there are hints of directories being exported, but it should not be as wild as the export thrash test. The node where the logs are located is a bit unresponsive at the moment - I will detail the numbers when I have them.

src/mds/MDLog.cc Outdated
Comment on lines +741 to +742
auto interval = std::chrono::duration<double>(last_trim - trim_start);
auto should_trim = is_oversegmented() && (interval.count() >= oversegmented_idle_interval);
Contributor

Should the timestamp difference expression be trim_start - last_trim instead? I think the expression last_trim - trim_start might turn out negative.

Contributor Author
That should only be valid for the first iteration.
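For illustration only (this is not the PR's code, just a self-contained chrono sketch using the variable names from the snippet under review): subtracting a later clock sample from an earlier one yields a negative duration, which is the sign issue the reviewer is pointing at.

```cpp
#include <cassert>
#include <chrono>

// Standalone sketch; `last_trim` / `trim_start` mirror the names in the
// reviewed snippet but none of this is MDLog code.
using Clock = std::chrono::steady_clock;

// Elapsed seconds with the reviewer's suggested operand order:
// later sample (trim_start) minus earlier sample (last_trim).
double elapsed_seconds(Clock::time_point last_trim, Clock::time_point trim_start) {
  return std::chrono::duration<double>(trim_start - last_trim).count();
}
```

If `last_trim` was recorded on a previous iteration and `trim_start` is sampled now, only `trim_start - last_trim` gives a positive idle interval; the reversed expression would keep the idle-interval check from ever firing.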

@vshankar
Contributor Author

@batrick - Regarding the downstream issue related to trim, there are hints of directories being exported, but it should not be as wild as the export thrash test. The node where the logs are located is a bit unresponsive at the moment - I will detail the numbers when I have them.

The downstream logs too have traces of directory export, so I'm inclined to believe that what @batrick mentioned in #60381 (comment) is what's happening in the cluster.

I also want to discuss whether the MDS really needs to expire up to a particular major segment. (Before introducing major/minor segments, the MDS would just check if a segment was in the expired_segments list before marking it a candidate for trimming, by setting the expire_pos in the journaler.)

@batrick
Member

batrick commented Oct 21, 2024

@batrick - Regarding the downstream issue related to trim, there are hints of directories being exported, but it should not be as wild as the export thrash test. The node where the logs are located is a bit unresponsive at the moment - I will detail the numbers when I have them.

The downstream logs too have traces of directory export, so I'm inclined to believe that what @batrick mentioned in #60381 (comment) is what's happening in the cluster.

I also want to discuss whether the MDS really needs to expire up to a particular major segment. (Before introducing major/minor segments, the MDS would just check if a segment was in the expired_segments list before marking it a candidate for trimming, by setting the expire_pos in the journaler.)

The point of major/minor segments is to enforce the constraint that the MDS always begins replay with a major segment. We cannot trim minor segments except up to the next major segment.

@vshankar
Contributor Author

@batrick - Regarding the downstream issue related to trim, there are hints of directories being exported, but it should not be as wild as the export thrash test. The node where the logs are located is a bit unresponsive at the moment - I will detail the numbers when I have them.

The downstream logs too have traces of directory export, so I'm inclined to believe that what @batrick mentioned in #60381 (comment) is what's happening in the cluster.
I also want to discuss whether the MDS really needs to expire up to a particular major segment. (Before introducing major/minor segments, the MDS would just check if a segment was in the expired_segments list before marking it a candidate for trimming, by setting the expire_pos in the journaler.)

The point of major/minor segments is to enforce the constraint that the MDS always begins replay with a major segment. We cannot trim minor segments except up to the next major segment.

ACK. I was worried about the case where we do journaler->set_expire_pos() for a segment and then crash before the expire pos gets persisted, but I see a write_head() is done when writing out the updated positions. So we are good. I'll make the changes related to major segments.

Question: Any value in additionally keeping the current changes? (as a failsafe probably)

@batrick batrick left a comment

otherwise LGTM

min: 1
services:
- mds
- name: mds_log_major_segment_event_ratio
Member

this change broke the docs

@batrick batrick removed their assignment Oct 22, 2024
@vshankar vshankar requested a review from a team as a code owner October 23, 2024 06:39
@vshankar
Contributor Author

jenkins test api

Signed-off-by: Venky Shankar <vshankar@redhat.com>
Credit goes to Patrick (@batrick) for identifying this.

When there are a huge number of subtree exports (such as in the export
thrashing test), the MDS logs EExport events. An EExport event is
relatively large in size, which causes the MDS to start new minor log
segments frequently. Moreover, the MDS logs a major segment (boundary)
only after a certain number of events have been logged. This causes a
large number of (minor) segments to build up and delays the trimming of
expired segments, since the journal expire position is updated on segment
boundaries.

To mitigate this issue, the MDS now starts a major segment after a
configured number of minor segments have been logged. This threshold
is configurable by adjusting `mds_log_minor_segments_per_major_segment`
MDS config (defaults to 16).

Fixes: https://tracker.ceph.com/issues/66948
Signed-off-by: Venky Shankar <vshankar@redhat.com>
Signed-off-by: Venky Shankar <vshankar@redhat.com>
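Assuming the option ships as described in the commit message above, it should be adjustable like any other MDS setting via the standard `ceph config` interface (a sketch, not taken from the PR; verify the option name on your release first):

```shell
# Raise the minor-segments-per-major-segment threshold from its default of 16.
# Option name taken from the commit message above.
ceph config set mds mds_log_minor_segments_per_major_segment 32

# Confirm the active value.
ceph config get mds mds_log_minor_segments_per_major_segment
```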
@batrick
Member

batrick commented Oct 23, 2024

Adding this to my next batch but don't necessarily wait for me.

@anthonyeleven anthonyeleven left a comment

Couple of docs requests

.. confval:: mds_log_events_per_segment

The frequency of major segments (noted by the journaling of the latest ``ESubtreeMap``) is controlled by:
The number of minor mds log segments since last major segment is controlled by:
Contributor

MDS log segments since the last major

type: uint
level: advanced
desc: number of minor segments per major segment.
long_desc: The number of minor mds log segments since last major segment after which a major segment is started/logged.
Contributor

MDS log segments since the last

@batrick
Member

batrick commented Oct 25, 2024

jenkins test api

@vshankar
Contributor Author

This PR is under test in https://tracker.ceph.com/issues/68744.

@batrick
Member

batrick commented Oct 31, 2024

This PR is under test in https://tracker.ceph.com/issues/68786.

@batrick
Member

batrick commented Nov 6, 2024

This PR is under test in https://tracker.ceph.com/issues/68859.

@batrick
Member

batrick commented Nov 13, 2024

@batrick batrick merged commit 9d2b3aa into ceph:main Nov 13, 2024
@batrick
Member

batrick commented Nov 13, 2024

followup qa fix: #60720
