Feature #61866


MDSMonitor: require --yes-i-really-mean-it when failing an MDS with MDS_HEALTH_TRIM or MDS_HEALTH_CACHE_OVERSIZED health warnings

Added by Patrick Donnelly over 2 years ago. Updated 5 months ago.

Status: Resolved
Priority: Immediate
Assignee:
Category: Administration/Usability
Target version:
% Done: 0%
Source: Development
Backport: squid,reef,quincy
Reviewed:
Affected Versions:
Component(FS): MDSMonitor
Labels (FS):
Pull request ID:
Tags (freeform):
Fixed In: v19.3.0-1930-g18c7799cce
Released In: v20.2.0~2989
Upkeep Timestamp: 2025-11-01T01:22:44+00:00

Description

If an MDS is already having trouble keeping up with trimming its journal, or has an oversized cache, restarting it may only create new problems in the form of a very slow recovery. In particular, if the MDS falls far behind on trimming its journal, with 1M or more segments, replay can take hours or longer.

We already track these warnings in the MDSMonitor, so add a simple check to help operators and support folks avoid shooting themselves in the foot.
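The gating check can be sketched roughly as follows. This is a minimal standalone sketch, not the actual MDSMonitor implementation: the function name and error string are hypothetical, and only the two relevant `mds_metric_t` values are reproduced.

```cpp
// Hypothetical sketch of the proposed gating check; not the real
// MDSMonitor code. Enum values mirror a subset of mds_metric_t.
#include <set>
#include <string>

enum mds_metric_t {
  MDS_HEALTH_NULL = 0,
  MDS_HEALTH_TRIM = 1,
  MDS_HEALTH_CACHE_OVERSIZED = 11,
};

// Returns an error message if failing the MDS should be refused,
// or an empty string if the fail may proceed.
std::string check_mds_fail(const std::set<mds_metric_t>& health_warnings,
                           bool yes_i_really_mean_it) {
  bool risky = health_warnings.count(MDS_HEALTH_TRIM) > 0 ||
               health_warnings.count(MDS_HEALTH_CACHE_OVERSIZED) > 0;
  if (risky && !yes_i_really_mean_it) {
    return "MDS has MDS_HEALTH_TRIM or MDS_HEALTH_CACHE_OVERSIZED health "
           "warnings; failing it now may lead to very slow recovery. "
           "Pass --yes-i-really-mean-it to proceed anyway.";
  }
  return "";  // OK to fail the MDS
}
```

The key design point is that the check only gates on the two warnings known to cause slow recovery; all other health warnings leave `ceph mds fail` / `ceph fs fail` unaffected.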


Related issues (4): 0 open, 4 closed

Related to CephFS - Bug #65841: qa: dead job from `tasks.cephfs.test_admin.TestFSFail.test_with_health_warn_oversize_cache` (Resolved, Rishabh Dave)
Copied to CephFS - Backport #65927: reef: MDSMonitor: require --yes-i-really-mean-it when failing an MDS with MDS_HEALTH_TRIM or MDS_HEALTH_CACHE_OVERSIZED health warnings (Resolved, Rishabh Dave)
Copied to CephFS - Backport #65928: quincy: MDSMonitor: require --yes-i-really-mean-it when failing an MDS with MDS_HEALTH_TRIM or MDS_HEALTH_CACHE_OVERSIZED health warnings (Resolved, Rishabh Dave)
Copied to CephFS - Backport #66330: squid: MDSMonitor: require --yes-i-really-mean-it when failing an MDS with MDS_HEALTH_TRIM or MDS_HEALTH_CACHE_OVERSIZED health warnings (Resolved, Rishabh Dave)
Actions #1

Updated by Venky Shankar over 2 years ago

  • Category set to Administration/Usability
  • Assignee set to Manish Yathnalli
Actions #2

Updated by Venky Shankar over 2 years ago

  • Priority changed from Urgent to Immediate

Manish, please take this one on priority.

Actions #3

Updated by Manish Yathnalli over 2 years ago

  • Status changed from New to In Progress

I will take a look, Venky.

Actions #4

Updated by Venky Shankar about 2 years ago

  • Assignee changed from Manish Yathnalli to Venky Shankar
  • Backport changed from reef,quincy,pacific to reef,quincy
Actions #5

Updated by Venky Shankar about 2 years ago

  • Assignee changed from Venky Shankar to Rishabh Dave

Rishabh, please take this one.

Actions #6

Updated by Patrick Donnelly about 2 years ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 56066
Actions #7

Updated by Rishabh Dave almost 2 years ago

Patrick, should we include other health warnings too? I didn't include them in the PR because they weren't mentioned on this ticket. Since Venky too brought this up here, I think it's worth discussing and writing a fix for it.

Copying Venky's comment below -


What about other health warnings?

enum mds_metric_t {
  MDS_HEALTH_NULL = 0,
  MDS_HEALTH_TRIM,
  MDS_HEALTH_CLIENT_RECALL,
  MDS_HEALTH_CLIENT_LATE_RELEASE,
  MDS_HEALTH_CLIENT_RECALL_MANY,
  MDS_HEALTH_CLIENT_LATE_RELEASE_MANY,
  MDS_HEALTH_CLIENT_OLDEST_TID,
  MDS_HEALTH_CLIENT_OLDEST_TID_MANY,
  MDS_HEALTH_DAMAGE,
  MDS_HEALTH_READ_ONLY,
  MDS_HEALTH_SLOW_REQUEST,
  MDS_HEALTH_CACHE_OVERSIZED,
  MDS_HEALTH_SLOW_METADATA_IO,
  MDS_HEALTH_CLIENTS_LAGGY,
  MDS_HEALTH_CLIENTS_LAGGY_MANY,
  MDS_HEALTH_DUMMY, // not a real health warning, for testing
};

Especially MDS_HEALTH_SLOW_REQUEST, where the MDS could well be running close to its limits.
Actions #8

Updated by Patrick Donnelly almost 2 years ago

Rishabh Dave wrote in #note-7:

Patrick, should we include other health warnings too? I didn't include them in the PR because they weren't mentioned on this ticket. Since Venky too brought this up here, I think it's worth discussing and writing a fix for it.

Copying Venky's comment below -

[...]

So far as we know, the two main culprits for slow recovery are the ones included in your PR. That is our main concern with gating mds failover. I don't see a strong argument to include the others at this time.

Actions #9

Updated by Rishabh Dave almost 2 years ago

  • Status changed from Fix Under Review to Pending Backport
Actions #10

Updated by Venky Shankar almost 2 years ago

  • Related to Bug #65841: qa: dead job from `tasks.cephfs.test_admin.TestFSFail.test_with_health_warn_oversize_cache` added
Actions #11

Updated by Casey Bodley almost 2 years ago

  • Copied to Backport #65927: reef: MDSMonitor: require --yes-i-really-mean-it when failing an MDS with MDS_HEALTH_TRIM or MDS_HEALTH_CACHE_OVERSIZED health warnings added
Actions #12

Updated by Casey Bodley almost 2 years ago

  • Copied to Backport #65928: quincy: MDSMonitor: require --yes-i-really-mean-it when failing an MDS with MDS_HEALTH_TRIM or MDS_HEALTH_CACHE_OVERSIZED health warnings added
Actions #14

Updated by Rishabh Dave almost 2 years ago

  • Backport changed from reef,quincy to squid,reef,quincy
Actions #15

Updated by Rishabh Dave almost 2 years ago

  • Copied to Backport #66330: squid: MDSMonitor: require --yes-i-really-mean-it when failing an MDS with MDS_HEALTH_TRIM or MDS_HEALTH_CACHE_OVERSIZED health warnings added
Actions #16

Updated by Upkeep Bot 8 months ago

  • Status changed from Pending Backport to Resolved
  • Upkeep Timestamp set to 2025-07-09T14:06:39+00:00
Actions #17

Updated by Upkeep Bot 8 months ago

  • Merge Commit set to 18c7799cce39611d34624cec2cc3013d46887f23
  • Fixed In set to v19.3.0-1930-g18c7799cce
  • Upkeep Timestamp changed from 2025-07-09T14:06:39+00:00 to 2025-08-02T04:57:41+00:00
Actions #18

Updated by Upkeep Bot 5 months ago

  • Released In set to v20.2.0~2989
  • Upkeep Timestamp changed from 2025-08-02T04:57:41+00:00 to 2025-11-01T01:22:44+00:00