mds: trim mdlog when segments exceed threshold and trim was idle#60381

Merged
batrick merged 3 commits into ceph:main from vshankar:wip-66948
Nov 13, 2024
Conversation

@vshankar
Contributor

Should have explained the intent in the commit message, but pushing this out for reviews. See https://tracker.ceph.com/issues/66948#note-9 for an explanation.

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)

@vshankar vshankar requested a review from a team October 17, 2024 15:15
@github-actions github-actions bot added cephfs Ceph File System common labels Oct 17, 2024
@vshankar
Contributor Author

@gregsfortytwo - from your question in tracker

This is bizarre, why is the mdlog leaving expired_segments just sitting in the log? Is that formula just wrong, or being used for two purposes that don't quite mesh or something?

I have to check that, but it has something to do with the minor segment boundary, which I haven't looked at closely. This change might not be required if there is a bug in MDLog::_trim_expired_segments() or if it's keeping segments around for longer than expected.

@vshankar
Contributor Author

From /a/vshankar-2024-07-08_07:21:13-fs-wip-vshankar-testing-20240705.150505-debug-testing-default-smithi/7791866 where the trim warning was seen, for ./remote/smithi089/log/ceph-mds.f.log.1.gz, there is a flurry of the below log messages for some time

2024-07-09T23:43:48.946+0000 7f2ed139b640 20 mds.2.log _trim_expired_segments: examining LogSegment(258092/0x1e72da941 events=5)
2024-07-09T23:43:48.946+0000 7f2ed139b640 10 mds.2.log _trim_expired_segments: maybe expiring LogSegment(258092/0x1e72da941 events=5)
2024-07-09T23:43:48.946+0000 7f2ed139b640 20 mds.2.log _trim_expired_segments: examining LogSegment(258097/0x1e789b9ba events=111)
2024-07-09T23:43:48.946+0000 7f2ed139b640 10 mds.2.log _trim_expired_segments: maybe expiring LogSegment(258097/0x1e789b9ba events=111)
2024-07-09T23:43:48.946+0000 7f2ed139b640 20 mds.2.log _trim_expired_segments: examining LogSegment(258208/0x1e7c4152e events=39)
2024-07-09T23:43:48.946+0000 7f2ed139b640 10 mds.2.log _trim_expired_segments: maybe expiring LogSegment(258208/0x1e7c4152e events=39)
2024-07-09T23:43:48.946+0000 7f2ed139b640 20 mds.2.log _trim_expired_segments: examining LogSegment(258247/0x1e8013480 events=39)
2024-07-09T23:43:48.946+0000 7f2ed139b640 10 mds.2.log _trim_expired_segments: maybe expiring LogSegment(258247/0x1e8013480 events=39)
...
...
...
2024-07-09T23:43:48.946+0000 7f2ed139b640 10 mds.2.log _trim_expired_segments: maybe expiring LogSegment(267055/0x21583f596 events=126)
2024-07-09T23:43:48.946+0000 7f2ed139b640 20 mds.2.log _trim_expired_segments: examining LogSegment(267181/0x215c1638c events=60)
2024-07-09T23:43:48.946+0000 7f2ed139b640 10 mds.2.log _trim_expired_segments: maybe expiring LogSegment(267181/0x215c1638c events=60)
2024-07-09T23:43:48.946+0000 7f2ed139b640 20 mds.2.log _trim_expired_segments: examining LogSegment(267241/0x21602de98 events=42)
2024-07-09T23:43:48.946+0000 7f2ed139b640 10 mds.2.log _trim_expired_segments: maybe expiring LogSegment(267241/0x21602de98 events=42)
2024-07-09T23:43:48.946+0000 7f2ed139b640 20 mds.2.log _trim_expired_segments: examining LogSegment(267283/0x216df1773 events=54)
2024-07-09T23:43:48.946+0000 7f2ed139b640 10 mds.2.log _trim_expired_segments waiting for expiry LogSegment(267283/0x216df1773 events=54)

_trim_expired_segments() removes log segments from expired_segments only when it sees a major segment. It's been a while since I've looked at the minor/major segment parts in mdlog (my last involvement was when that PR was under review) -- I'll have to revisit it.

@batrick
Member

batrick commented Oct 17, 2024

2024-07-09T23:36:06.567+0000 7f2ed5ba4640 20 mds.2.log _submit_entry EExport 0x100000022e5 to mds.0 [metablob 0x1, 8 dirs]
2024-07-09T23:36:06.567+0000 7f2ed5ba4640 10 mds.2.log _segment_upkeep: starting new segment, current LogSegment(267181/0x215c1638c events=60)
2024-07-09T23:36:06.567+0000 7f2ed5ba4640 20 mds.2.log _submit_entry ESegment(0)

The actual problem is that these minor segments are consistently small in event count but large in byte size due to the large EExport events. That's causing new segments to be created via:

} else if (ls->end/period != ls->offset/period || ls->num_events >= events_per_segment) {

I think the code needs to not care about the events since the last major segment but instead the number of minor segments since the last major segment:

if (events_since_last_major_segment > events_per_segment*major_segment_event_ratio) {

So if there have been ~8-16 minor segments since the last major segment, write a new one.
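The idea above can be sketched as a small standalone illustration. This is not the MDLog implementation; the struct, counter, and method names are made up for the sketch, and the threshold corresponds to the config option the PR eventually introduced (`mds_log_minor_segments_per_major_segment`, default 16, per the commit message later in this thread):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch of the proposed segment-upkeep decision (not Ceph's
// actual code): roll a major segment (ESubtreeMap) after a fixed number of
// minor segments, instead of after a raw event count.
struct SegmentUpkeep {
  uint64_t minor_segments_since_major = 0;
  uint64_t minor_segments_per_major = 16;  // mds_log_minor_segments_per_major_segment

  // Called whenever a new (minor) segment boundary is started.
  // Returns true when a major segment should be logged; logging the major
  // segment resets the counter.
  bool on_new_minor_segment() {
    ++minor_segments_since_major;
    if (minor_segments_since_major >= minor_segments_per_major) {
      minor_segments_since_major = 0;
      return true;
    }
    return false;
  }
};
```

With this shape, many tiny minor segments (e.g. produced by large EExport events) still trigger a major boundary promptly, so trimming is not held up waiting for an event-count threshold.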

@batrick
Member

batrick commented Oct 17, 2024

Note this is only found because we have export thrashing turned on.

@gregsfortytwo gregsfortytwo left a comment

I don't like this solution because it's basically introducing a post-hoc patch-up, and it's not synchronized with what the Beacon health check is doing, so it could still flash up a warning and then clean up. It's much better if we can follow some straightforward length rules that are checked for in the same way in all relevant parts of the code.

Does Patrick's formula resolve this on its own, or is more needed?

@vshankar
Contributor Author

vshankar commented Oct 18, 2024

Note this is only found because we have export thrashing turned on.

Downstream QE is able to reproduce this and I don't think their tests were thrashing MDSs.

EDIT: thrashing exports.

@vshankar
Contributor Author

2024-07-09T23:36:06.567+0000 7f2ed5ba4640 20 mds.2.log _submit_entry EExport 0x100000022e5 to mds.0 [metablob 0x1, 8 dirs]
2024-07-09T23:36:06.567+0000 7f2ed5ba4640 10 mds.2.log _segment_upkeep: starting new segment, current LogSegment(267181/0x215c1638c events=60)
2024-07-09T23:36:06.567+0000 7f2ed5ba4640 20 mds.2.log _submit_entry ESegment(0)

The actual problem is that these minor segments are consistently small in event count but large in byte size due to the large EExport events. That's causing new segments to be created via:

} else if (ls->end/period != ls->offset/period || ls->num_events >= events_per_segment) {

I think the code needs to not care about the events since the last major segment but instead the number of minor segments since the last major segment:

if (events_since_last_major_segment > events_per_segment*major_segment_event_ratio) {

So if there have been ~8-16 minor segments since the last major segment, write a new one.

This would create major segments (ESubtreeMap) more frequently. It's definitely better than what it was originally (which was to log an ESubtreeMap event) when:

  } else if (ls->end/period != ls->offset/period ||
             ls->num_events >= g_conf()->mds_log_events_per_segment) {
    dout(10) << "submit_entry also starting new segment: last = "
             << ls->seq  << "/" << ls->offset << ", event seq = " << event_seq << dendl;
    _start_new_segment();

Now, logging a major segment boundary after 8 (or whatever) number of minor segments would avoid long waits for segments to expire before they can eventually be trimmed, but it still looks like a relatively large change from logging a major segment after (1024 * 12) events. And the downstream reproducer didn't involve export thrashing (I will have to double check), so I'm still curious whether this is the only thing we need to work around the trim issue.

@batrick
Member

batrick commented Oct 18, 2024

2024-07-09T23:36:06.567+0000 7f2ed5ba4640 20 mds.2.log _submit_entry EExport 0x100000022e5 to mds.0 [metablob 0x1, 8 dirs]
2024-07-09T23:36:06.567+0000 7f2ed5ba4640 10 mds.2.log _segment_upkeep: starting new segment, current LogSegment(267181/0x215c1638c events=60)
2024-07-09T23:36:06.567+0000 7f2ed5ba4640 20 mds.2.log _submit_entry ESegment(0)

The actual problem is that these minor segments are consistently small in event count but large in byte size due to the large EExport events. That's causing new segments to be created via:

} else if (ls->end/period != ls->offset/period || ls->num_events >= events_per_segment) {

I think the code needs to not care about the events since the last major segment but instead the number of minor segments since the last major segment:

if (events_since_last_major_segment > events_per_segment*major_segment_event_ratio) {

So if there have been ~8-16 minor segments since the last major segment, write a new one.

This would create major segments (ESubtreeMap) more frequently. It's definitely better than what it was originally (which was to log an ESubtreeMap event) when:

  } else if (ls->end/period != ls->offset/period ||
             ls->num_events >= g_conf()->mds_log_events_per_segment) {
    dout(10) << "submit_entry also starting new segment: last = "
             << ls->seq  << "/" << ls->offset << ", event seq = " << event_seq << dendl;
    _start_new_segment();

Now, logging a major segment boundary after 8 (or whatever) number of minor segments would avoid long waits for segments to expire before they can eventually be trimmed, but it still looks like a relatively large change from logging a major segment after (1024 * 12) events. And the downstream reproducer didn't involve export thrashing (I will have to double check), so I'm still curious whether this is the only thing we need to work around the trim issue.

Perhaps the downstream issue is slightly different?

@vshankar
Contributor Author

@batrick - Regarding the downstream issue related to trim, there are hints of directories being exported, but it should not be as wild as the export thrash test. The node where the logs are located is a bit unresponsive at the moment - I will detail the numbers when I have them.

src/mds/MDLog.cc Outdated
Comment on lines +741 to +742
auto interval = std::chrono::duration<double>(last_trim - trim_start);
auto should_trim = is_oversegmented() && (interval.count() >= oversegmented_idle_interval);
Contributor

Should the timestamp difference expression be trim_start - last_trim instead? I think the expression last_trim - trim_start might turn out negative.

Contributor Author
That should only be valid for the first iteration.
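For illustration only (this is not the PR's code, just a self-contained chrono sketch using the variable names from the snippet under review): subtracting a later clock sample from an earlier one yields a negative duration, which is the sign issue the reviewer is pointing at.

```cpp
#include <cassert>
#include <chrono>

// Standalone sketch; `last_trim` / `trim_start` mirror the names in the
// reviewed snippet but none of this is MDLog code.
using Clock = std::chrono::steady_clock;

// Elapsed seconds with the reviewer's suggested operand order:
// later sample (trim_start) minus earlier sample (last_trim).
double elapsed_seconds(Clock::time_point last_trim, Clock::time_point trim_start) {
  return std::chrono::duration<double>(trim_start - last_trim).count();
}
```

If `last_trim` was recorded on a previous iteration and `trim_start` is sampled now, only `trim_start - last_trim` gives a positive idle interval; the reversed expression would keep the idle-interval check from ever firing.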

@vshankar
Contributor Author

@batrick - Regarding the downstream issue related to trim, there are hints of directories being exported, but it should not be as wild as the export thrash test. The node where the logs are located is a bit unresponsive at the moment - I will detail the numbers when I have them.

The downstream logs too have traces of directory export, so I'm inclined to believe that what @batrick mentioned in #60381 (comment) is what's happening in the cluster.

I also want to discuss whether the MDS really needs to expire up to a particular major segment. (Before introducing major/minor segments, the MDS would just check if a segment was in the expired_segments list before marking it a candidate for trimming, by setting the expire_pos in the journaler.)

@batrick
Member

batrick commented Oct 21, 2024

@batrick - Regarding the downstream issue related to trim, there are hints of directories being exported, but it should not be as wild as the export thrash test. The node where the logs are located is a bit unresponsive at the moment - I will detail the numbers when I have them.

The downstream logs too have traces of directory export, so I'm inclined to believe that what @batrick mentioned in #60381 (comment) is what's happening in the cluster.

I also want to discuss whether the MDS really needs to expire up to a particular major segment. (Before introducing major/minor segments, the MDS would just check if a segment was in the expired_segments list before marking it a candidate for trimming, by setting the expire_pos in the journaler.)

The point of major/minor segments is to enforce the constraint that the MDS always begins replay with a major segment. We cannot trim minor segments except up to the next major segment.

@vshankar
Contributor Author

@batrick - Regarding the downstream issue related to trim, there are hints of directories being exported, but it should not be as wild as the export thrash test. The node where the logs are located is a bit unresponsive at the moment - I will detail the numbers when I have them.

The downstream logs too have traces of directory export, so I'm inclined to believe that what @batrick mentioned in #60381 (comment) is what's happening in the cluster.
I also want to discuss whether the MDS really needs to expire up to a particular major segment. (Before introducing major/minor segments, the MDS would just check if a segment was in the expired_segments list before marking it a candidate for trimming, by setting the expire_pos in the journaler.)

The point of major/minor segments is to enforce the constraint that the MDS always begins replay with a major segment. We cannot trim minor segments except up to the next major segment.

ACK. I was worried about the case where we do journaler->set_expire_pos() for a segment and then crash before the expire pos gets persisted, but I see a write_head() is done when writing out the updated positions. So we are good. I'll make the changes related to major segments.

Question: Any value in additionally keeping the current changes? (as a failsafe probably)

@batrick batrick left a comment

otherwise LGTM

min: 1
services:
- mds
- name: mds_log_major_segment_event_ratio
Member

this change broke the docs

@batrick batrick removed their assignment Oct 22, 2024
@vshankar vshankar requested a review from a team as a code owner October 23, 2024 06:39
@vshankar
Contributor Author

jenkins test api

Signed-off-by: Venky Shankar <vshankar@redhat.com>
Credit goes to Patrick (@batrick) for identifying this.

When there are a huge number of subtree exports (such as in the export
thrashing test), the MDS logs EExport events. An EExport event is
relatively large in size, which causes the MDS to start new minor log
segments frequently. Moreover, the MDS logs a major segment (boundary)
only after a certain number of events have been logged. This causes a
large number of (minor) segments to build up and delays the trimming of
expired segments, since the journal expire position is updated on segment
boundaries.

To mitigate this issue, the MDS now starts a major segment after a
configured number of minor segments have been logged. This threshold
is configurable by adjusting `mds_log_minor_segments_per_major_segment`
MDS config (defaults to 16).

Fixes: https://tracker.ceph.com/issues/66948
Signed-off-by: Venky Shankar <vshankar@redhat.com>
Signed-off-by: Venky Shankar <vshankar@redhat.com>
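Assuming the option ships as described in the commit message above, it should be adjustable like any other MDS setting via the standard `ceph config` interface (a sketch, not taken from the PR; verify the option name on your release first):

```shell
# Raise the minor-segments-per-major-segment threshold from its default of 16.
# Option name taken from the commit message above.
ceph config set mds mds_log_minor_segments_per_major_segment 32

# Confirm the active value.
ceph config get mds mds_log_minor_segments_per_major_segment
```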
@batrick
Member

batrick commented Oct 23, 2024

Adding this to my next batch but don't necessarily wait for me.

@anthonyeleven anthonyeleven left a comment

Couple of docs requests

.. confval:: mds_log_events_per_segment

The frequency of major segments (noted by the journaling of the latest ``ESubtreeMap``) is controlled by:
The number of minor mds log segments since last major segment is controlled by:
Contributor

MDS log segments since the last major

type: uint
level: advanced
desc: number of minor segments per major segment.
long_desc: The number of minor mds log segments since last major segment after which a major segment is started/logged.
Contributor

MDS log segments since the last

@batrick
Member

batrick commented Oct 25, 2024

jenkins test api

@vshankar
Contributor Author

This PR is under test in https://tracker.ceph.com/issues/68744.

@batrick
Member

batrick commented Oct 31, 2024

This PR is under test in https://tracker.ceph.com/issues/68786.

@batrick
Member

batrick commented Nov 6, 2024

This PR is under test in https://tracker.ceph.com/issues/68859.

@batrick
Member

batrick commented Nov 13, 2024

@batrick batrick merged commit 9d2b3aa into ceph:main Nov 13, 2024
@batrick
Member

batrick commented Nov 13, 2024

followup qa fix: #60720
