mds: trim mdlog when segments exceed threshold and trim was idle#60381
Conversation
@gregsfortytwo - from your question in the tracker:
I have to check that, but it has something to do with the minor segment boundary, which I haven't looked at closely. This change might not be required if there is a bug in MDLog::_trim_expired_segments() or if it's keeping segments around for longer than expected.
From /a/vshankar-2024-07-08_07:21:13-fs-wip-vshankar-testing-20240705.150505-debug-testing-default-smithi/7791866, where the trim warning was seen, for ./remote/smithi089/log/ceph-mds.f.log.1.gz there is a flurry of the below log messages for some time. _trim_expired_segments() removes logsegs from expired_segments only when it sees a major segment. It's been a while since I looked at the minor/major segment parts in mdlog (my last involvement was when that PR was under review) -- I'll have to revisit it.
The actual problem is that these minor segments are consistently small by event count but too large in size due to the large events (Line 410 in 2966f22). I think the code needs to not care about the number of events since the last major segment but instead count the number of minor segments since the last major segment (Line 406 in 2966f22). So if there have been ~8-16 minor segments since the last major segment, write a new one.
Note this is only found because we have export thrashing turned on. |
gregsfortytwo
left a comment
I don't like this solution because it's basically introducing a post-hoc patch-up, and it's not synchronized with what the Beacon health check is doing, so it could still flash up a warning and then clean up. It's much better if we can follow some straightforward length rules that are checked for in the same way in all relevant parts of the code.
Does Patrick's formula resolve this on its own, or is more needed?
Downstream QE is able to reproduce this and I don't think their tests were thrashing MDSs. EDIT: thrashing exports.
This would create major segments. Now, logging a major segment boundary after 8 (or whatever number of) minor segments would avoid long wait periods when waiting for segments to expire and eventually trim, but it still looks like a relatively large increase from logging a major segment after (1024 * 12) events. And the downstream reproducer didn't involve export thrashing (I will have to double check), so I'm still curious whether this is the only thing we need to work around the trim issue.
Perhaps the downstream issue is slightly different?
@batrick - Regarding the downstream issue related to trim, there are hints of directories being exported, but it should not be as wild as the export thrash test. The node where the logs are located is a bit unresponsive atm - I will detail the numbers when I have them.
src/mds/MDLog.cc
Outdated
auto interval = std::chrono::duration<double>(last_trim - trim_start);
auto should_trim = is_oversegmented() && (interval.count() >= oversegmented_idle_interval);
Should the timestamp difference expression be `trim_start - last_trim` instead? I think the expression `last_trim - trim_start` might turn out negative.
That should only be valid for the first iteration.
The downstream logs also have traces of directory export, so I'm inclined to believe that what @batrick mentioned in #60381 (comment) is what's happening in the cluster. I also want to discuss whether the MDS really needs to expire up to a particular major segment. (Before introducing major/minor segments, the MDS would just check if a segment is in the expired_segments list before marking it a candidate for trimming, by setting the expire_pos in the journaler.)
The point of major/minor segments is to enforce the constraint that the MDS always begins replay with a major segment. We cannot trim minor segments except up to the next major segment. |
ACK, I was wondering about that. Question: is there any value in additionally keeping the current changes (as a failsafe, probably)?
min: 1
services:
- mds
- name: mds_log_major_segment_event_ratio
jenkins test api
Signed-off-by: Venky Shankar <vshankar@redhat.com>
Credit goes to Patrick (@batrick) for identifying this. When there are a huge number of subtree exports (such as those done in the export thrashing test), the MDS logs an EExport event for each. The EExport event is relatively large in size. This causes the MDS to log new minor log segments frequently. Moreover, the MDS logs a major segment (boundary) only after a certain number of events have been logged. This causes a large number of (minor) log events to build up and delays the trimming of expired segments, since the journal expire position is updated on segment boundaries.

To mitigate this issue, the MDS now starts a major segment after a configured number of minor segments have been logged. This threshold is configurable by adjusting the `mds_log_minor_segments_per_major_segment` MDS config (defaults to 16).

Fixes: https://tracker.ceph.com/issues/66948
Signed-off-by: Venky Shankar <vshankar@redhat.com>
Adding this to my next batch but don't necessarily wait for me.
anthonyeleven
left a comment
Couple of docs requests
.. confval:: mds_log_events_per_segment

The frequency of major segments (noted by the journaling of the latest ``ESubtreeMap``) is controlled by:
The number of minor mds log segments since last major segment is controlled by:
MDS log segments since the last major
type: uint
level: advanced
desc: number of minor segments per major segment.
long_desc: The number of minor mds log segments since last major segment after which a major segment is started/logged.
MDS log segments since the last
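As a usage sketch (assuming the option name in the docs above lands as-is; the value 8 is purely illustrative), the threshold could be inspected and adjusted at runtime through the standard config commands:

```shell
# Inspect the current value on the MDS daemons
ceph config get mds mds_log_minor_segments_per_major_segment

# Log major segments more often (8 here is an arbitrary example value)
ceph config set mds mds_log_minor_segments_per_major_segment 8
```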
jenkins test api
This PR is under test in https://tracker.ceph.com/issues/68744.
This PR is under test in https://tracker.ceph.com/issues/68786.
This PR is under test in https://tracker.ceph.com/issues/68859.
Follow-up QA fix: #60720
Should have explained the intent in the commit message, but pushing this out for reviews. See https://tracker.ceph.com/issues/66948#note-9 for an explanation.