Bug #66948
mon.a (mon.0) 326 : cluster [WRN] Health check failed: 1 MDSs behind on trimming (MDS_TRIM)" in cluster log
Status: Closed
Description
/a/vshankar-2024-07-08_07:21:13-fs-wip-vshankar-testing-20240705.150505-debug-testing-default-smithi/7791866
The PRs under test were unrelated, so this trim warning needs to be investigated.
Updated by Venky Shankar over 1 year ago
- Assignee set to Kotresh Hiremath Ravishankar
Updated by Venky Shankar over 1 year ago
- Priority changed from Normal to High
@Kotresh Hiremath Ravishankar - there is a BZ0 report downstream which has similar warning with squid. Were you able to dig anything interesting here?
Updated by Venky Shankar over 1 year ago
I think there is a bug in MDLog::trim() where the MDS can fail to trim segments for an extended period of time. Basically, this part of the code:
unsigned num_remaining_segments = (segments.size() - expired_segments.size() - expiring_segments.size());
dout(10) << __func__ << ": new_expiring_segments=" << new_expiring_segments
         << ", num_remaining_segments=" << num_remaining_segments
         << ", max_segments=" << max_segments << dendl;
if ((num_remaining_segments <= max_segments) &&
    (max_ev < 0 || (num_events - expiring_events - expired_events) <= (uint64_t)max_ev)) {
  dout(10) << __func__ << ": breaking out of trim loop - segments/events fell below ceiling"
           << " max_segments/max_ev" << dendl;
  break;
}
num_remaining_segments is being calculated as:
unsigned num_remaining_segments = (segments.size() - expired_segments.size() - expiring_segments.size());
and then compared against max_segments, which IMO could cause the MDS to skip trimming log segments.
Updated by Venky Shankar over 1 year ago
- Assignee changed from Kotresh Hiremath Ravishankar to Venky Shankar
Kotresh, I'm taking this one.
Updated by Venky Shankar over 1 year ago
Venky Shankar wrote in #note-4:
I think there is a bug in MDLog::trim() where the MDS can fail to trim segments for an extended period of time. Basically, this part of the code:
[...]
num_remaining_segments is being calculated as: [...]
and then compared against max_segments, which IMO could cause the MDS to skip trimming log segments.
Note that when expiring a log segment takes a considerable amount of time, num_remaining_segments could stay just below max_segments for that duration unless more segments get added (obviously) or segments expire and are removed from the segments map (in _trim_expired_segments()).
Updated by Venky Shankar over 1 year ago · Edited
Okay, so this is what's happening: MDLog has code that breaks out of the trim loop when num_remaining_segments falls below (or is equal to) max_segments, as mentioned in note-4. So it's possible that the number of segments is more than twice the value of max_segments, yet num_remaining_segments is still below max_segments due to the adjustment for expiring/expired segments. In Beacon.cc, the following check
// Detect MDS_HEALTH_TRIM condition
// Indicates MDS is not trimming promptly
{
  const auto log_max_segments = mds->mdlog->get_max_segments();
  const auto log_warn_factor = g_conf().get_val<double>("mds_log_warn_factor");
  if (mds->mdlog->get_num_segments() > (size_t)(log_max_segments * log_warn_factor)) {
    CachedStackStringStream css;
    *css << "Behind on trimming (" << mds->mdlog->get_num_segments()
         << "/" << log_max_segments << ")";
    MDSHealthMetric m(MDS_HEALTH_TRIM, HEALTH_WARN, css->strv());
    m.metadata["num_segments"] = stringify(mds->mdlog->get_num_segments());
    m.metadata["max_segments"] = stringify(log_max_segments);
    health.metrics.push_back(m);
  }
}
only checks whether the number of segments is way above (actually twice, by default) max_segments, which could cause a trim warning to be generated even though MDLog considers num_remaining_segments and breaks out of the trim loop.
Updated by Venky Shankar over 1 year ago
There is a time window where the MDS had a lot of log segments and the expired counters weren't increasing. num_remaining_segments was just below (or around) max_segments. The MDS noticed the increased number of log segments (basically, exceeding log_max_segments * log_warn_factor) and raised the trim warning. It would be interesting to see why the MDS took such a long time to expire a set of log segments.
Updated by Venky Shankar over 1 year ago
Okay, back to this after getting sidelined elsewhere. So, the mdlog can reach a state where trim would not continue trimming log segments, even though the number of segments exceeds mds_log_max_segments * mds_log_warn_factor. E.g.:
./remote/smithi089/log/ceph-mds.f.log.1.gz:2024-07-09T23:43:46.942+0000 7f2ed139b640 10 mds.2.log trim 310 / 128 segments, 44803 / -1 events, 0 (0) expiring, 182 (9191) expired
./remote/smithi089/log/ceph-mds.f.log.1.gz:2024-07-09T23:43:47.946+0000 7f2ed139b640 10 mds.2.log trim 310 / 128 segments, 44861 / -1 events, 0 (0) expiring, 182 (9191) expired
./remote/smithi089/log/ceph-mds.f.log.1.gz:2024-07-09T23:43:48.946+0000 7f2ed139b640 10 mds.2.log trim 310 / 128 segments, 44922 / -1 events, 0 (0) expiring, 182 (9191) expired
./remote/smithi089/log/ceph-mds.f.log.1.gz:2024-07-09T23:43:49.946+0000 7f2ed139b640 10 mds.2.log trim 310 / 128 segments, 44997 / -1 events, 0 (0) expiring, 182 (9191) expired
Here:
- number of segments: 310
- expiring segments: 0
- expired segments: 182
Therefore, as per note-4, num_remaining_segments = 128 (310 - 0 - 182), which satisfies the condition to break out of the segment trimming loop. Once this state is reached, no more log segments get trimmed (unless more segments get added, obviously); however, the beacon notices that the number of segments (310) exceeds the threshold (128 * 2) and therefore generates the MDS_TRIM health warning.
I think it would suffice to trim log segments if no segments were trimmed for some interval and the number of log segments exceeds the threshold (max_seg * factor).
Updated by Greg Farnum over 1 year ago
This is bizarre. Why is the mdlog leaving expired_segments just sitting in the log? Is that formula just wrong, or is it being used for two purposes that don't quite mesh, or something?
Updated by Venky Shankar over 1 year ago
- Status changed from Triaged to Fix Under Review
- Pull request ID set to 60381
Updated by Venky Shankar over 1 year ago
I pushed out a change to trim mdlog under certain conditions -- we can move the discussion over to the PR.
Updated by Venky Shankar over 1 year ago
- Related to QA Run #68744: wip-vshankar-testing-20241029.045257-debug added
Updated by Venky Shankar over 1 year ago
- Related to deleted (QA Run #68744: wip-vshankar-testing-20241029.045257-debug)
Updated by Patrick Donnelly over 1 year ago
- Status changed from Fix Under Review to Pending Backport
- Backport set to squid,reef
Updated by Upkeep Bot over 1 year ago
- Copied to Backport #68919: squid: mon.a (mon.0) 326 : cluster [WRN] Health check failed: 1 MDSs behind on trimming (MDS_TRIM)" in cluster log added
Updated by Upkeep Bot over 1 year ago
- Copied to Backport #68920: reef: mon.a (mon.0) 326 : cluster [WRN] Health check failed: 1 MDSs behind on trimming (MDS_TRIM)" in cluster log added
Updated by Upkeep Bot over 1 year ago
- Tags (freeform) set to backport_processed
Updated by Venky Shankar over 1 year ago
- Related to Bug #68913: qa: test_export_pin_many: AssertionError: 0 not greater than 0 added
Updated by Matthew Hassel 10 months ago
Hi! Any idea when this may be slated for squid backport? Cluster is version 19.2.1 and hitting an ever incrementing "Behind in MDS" stat.
Updated by Venky Shankar 10 months ago
Matthew Hassel wrote in #note-20:
Hi! Any idea when this may be slated for squid backport? Cluster is version 19.2.1 and hitting an ever incrementing "Behind in MDS" stat.
The next squid point release is under planning. Will get this for 19.2.3.
Updated by Venky Shankar 10 months ago
Venky Shankar wrote in #note-21:
Matthew Hassel wrote in #note-20:
Hi! Any idea when this may be slated for squid backport? Cluster is version 19.2.1 and hitting an ever incrementing "Behind in MDS" stat.
The next squid point release is under planning. Will get this for 19.2.3.
Oh, I just checked. The squid backport https://github.com/ceph/ceph/pull/60838 is already in squid branch. So, when 19.2.3 is cut, the fix will be available.
Updated by Upkeep Bot 9 months ago
- Status changed from Pending Backport to Resolved
- Upkeep Timestamp set to 2025-07-08T18:24:24+00:00
Updated by Darrell Enns 8 months ago
For those of us on 19.2.2 with a constantly increasing "behind on trim" count: Is there any workaround? Is it safe to just ignore the warning for now?
Updated by Upkeep Bot 8 months ago
- Merge Commit set to 9d2b3aaa96fdbb7f8e1b231df918ef4aa8aab52c
- Fixed In set to v19.3.0-6036-g9d2b3aaa96
- Upkeep Timestamp changed from 2025-07-08T18:24:24+00:00 to 2025-08-02T04:50:32+00:00
Updated by Upkeep Bot 5 months ago
- Released In set to v20.2.0~1630
- Upkeep Timestamp changed from 2025-08-02T04:50:32+00:00 to 2025-11-01T01:35:55+00:00