Bug #48673
High memory usage on standby replay MDS (Closed)
Description
Hi.
We have recently installed a Ceph cluster with about 27M objects. The filesystem seems to have 15M files.
The MDS is configured with a 20 GB mds_cache_memory_limit. If we look at the nodes, memory usage stays a bit above the limit on the active node (node 4), but not excessively so.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2165668 ceph 20 0 27.6g 26.1g 22088 S 12.3 13.9 2081:55 ceph-mds
However, we have problems with the standby-replay node (node 3), which has a much larger memory footprint.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2166195 ceph 20 0 40.7g 38.2g 21000 S 0.7 20.4 86:31.18 ceph-mds
This level has remained constant for days. We have received warnings from the cluster; they have reset a couple of times, even though the memory footprint has not changed.
[WARN] MDS_CACHE_OVERSIZED: 1 MDSs report oversized cache
mdsnode3(mds.0): MDS cache is too large (30GB/20GB); 0 inodes in use by clients, 0 stray files
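For completeness, the cache limit itself was set centrally with something like the following (reproduced from memory, so take it as a sketch; the value is the byte equivalent of 20 GB):
ceph config set mds mds_cache_memory_limit 21474836480
ceph config get mds mds_cache_memory_limit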
The nodes also run a couple of OSDs, and we don't want them to be affected while we are away over the Christmas holidays, so I thought I would open a ticket here and see if we can get any suggestions on preventive measures.
If you want any extra information, please ask.
Best regards
Daniel
Updated by Patrick Donnelly over 5 years ago
Daniel Persson wrote:
Hi.
We have recently installed a Ceph cluster with about 27M objects. The filesystem seems to have 15M files.
The MDS is configured with a 20 GB mds_cache_memory_limit. If we look at the nodes, memory usage stays a bit above the limit on the active node (node 4), but not excessively so.
[...]
However, we have problems with the standby-replay node (node 3), which has a much larger memory footprint.
[...]
This level has remained constant for days. We have received warnings from the cluster; they have reset a couple of times, even though the memory footprint has not changed.
[...]
The nodes also run a couple of OSDs, and we don't want them to be affected while we are away over the Christmas holidays, so I thought I would open a ticket here and see if we can get any suggestions on preventive measures.
If you want any extra information, please ask.
Please share `ceph versions` and `ceph fs dump`.
I believe we've recently fixed some issues with standby-replay daemons using too much memory. Those fixes would have been backported. Please try upgrading to the latest version of nautilus or octopus to see if that helps.
Updated by Daniel Persson over 5 years ago
Patrick Donnelly wrote:
Please share `ceph versions` and `ceph fs dump`.
I believe we've recently fixed some issues with standby-replay daemons using too much memory. Those fixes would have been backported. Please try upgrading to the latest version of nautilus or octopus to see if that helps.
Hi Patrick.
Thank you for the quick reply. I thought the "affected version" field was where we were supposed to supply the version we are running. I've also looked at the changelogs for 15.2.6 and 15.2.7 and did not see anything mentioning memory fixes.
To be clear, the 15 OSDs on 15.2.6 have data. The other 14 are on slower hardware that we have connected but that doesn't carry any data at the moment.
Best regards
Daniel
{
"mon": {
"ceph version 15.2.5 (2c93eff00150f0cc5f106a559557a58d3d7b6f1f) octopus (stable)": 3
},
"mgr": {
"ceph version 15.2.5 (2c93eff00150f0cc5f106a559557a58d3d7b6f1f) octopus (stable)": 3
},
"osd": {
"ceph version 15.2.5 (2c93eff00150f0cc5f106a559557a58d3d7b6f1f) octopus (stable)": 14,
"ceph version 15.2.6 (cb8c61a60551b72614257d632a574d420064c17a) octopus (stable)": 15
},
"mds": {
"ceph version 15.2.6 (cb8c61a60551b72614257d632a574d420064c17a) octopus (stable)": 3
},
"rgw": {
"ceph version 15.2.5 (2c93eff00150f0cc5f106a559557a58d3d7b6f1f) octopus (stable)": 3
},
"overall": {
"ceph version 15.2.5 (2c93eff00150f0cc5f106a559557a58d3d7b6f1f) octopus (stable)": 23,
"ceph version 15.2.6 (cb8c61a60551b72614257d632a574d420064c17a) octopus (stable)": 18
}
}
dumped fsmap epoch 20853
e20853
enable_multiple, ever_enabled_multiple: 0,0
compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
legacy client fscid: 1
Filesystem 'cephfs' (1)
fs_name cephfs
epoch 20853
flags 32
created 2020-11-02T13:38:07.192474+0100
modified 2020-12-21T15:10:24.295566+0100
tableserver 0
root 0
session_timeout 60
session_autoclose 300
max_file_size 1099511627776
min_compat_client 0 (unknown)
last_failure 0
last_failure_osd_epoch 0
compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
max_mds 1
in 0
up {0=148406}
failed
damaged
stopped 1
data_pools [2]
metadata_pool 3
inline_data disabled
balancer
standby_count_wanted 1
[mds.node4{0:148406} state up:active seq 39179 addr [v2:-----:6820/3504326627,v1:-----:6821/3504326627]]
[mds.node3{0:134983} state up:standby-replay seq 260130 addr [v2:-----:6820/2104241553,v1:-----:6821/2104241553]]
Standby daemons:
[mds.node2{-1:134965} state up:standby seq 2 addr [v2:-----:6820/340836297,v1:-----:6821/340836297]]
- IP addresses replaced with -----
Updated by Patrick Donnelly over 5 years ago
Thanks for the information. There were a few fixes in v15.2.8 relating to memory consumption for the MDS which may be related to this. Please try upgrading to that version and report back.
Updated by Patrick Donnelly about 5 years ago
- Status changed from New to Need More Info
Updated by Daniel Persson about 5 years ago
Patrick Donnelly wrote:
Thanks for the information. There were a few fixes in v15.2.8 relating to memory consumption for the MDS which may be related to this. Please try upgrading to that version and report back.
Hi Patrick.
We have now updated the cluster and all the clients, and it's now running on v15.2.8.
Rank  State           Daemon  Activity       Dentries  Inodes
0     active          node3   Reqs: 20.4 /s  6.9 M     6.9 M
0-s   standby-replay  node4   Evts: 0 /s     12.4 M    12.4 M
node3
352481 ceph 20 0 25.3g 24.8g 20508 S 5.0 13.2 657:22.25 ceph-mds
node4
3812398 ceph 20 0 41.8g 41.4g 20988 S 0.7 22.1 23:28.38 ceph-mds
We still see that the standby MDS holds a lot more entries and also more memory than requested.
=== Full health status ===
[WARN] MDS_CACHE_OVERSIZED: 1 MDSs report oversized cache
mds.node4(mds.0): MDS cache is too large (30GB/20GB); 0 inodes in use by clients, 0 stray files
Please tell us if there is any other information we could provide for your work.
Best regards
Daniel
Updated by Tom Myny about 5 years ago
Hello,
We have noticed the same behavior on Ceph v15.2.3 and v15.2.8.
Note, this is not the case with all filesystems.
RANK  STATE           MDS               ACTIVITY      DNS    INOS
0     active          web.ceph1.ahytos  Reqs: 139 /s  7283k  7273k
0-s   standby-replay  web.ceph2.hjydph  Evts: 123 /s  24.0M  24.0M
Updated by Julian Einwag almost 5 years ago
Hi,
We are experiencing the same behavior, but with Ceph 14.2.18. Memory usage of the standby-replay MDS keeps growing and growing. I can easily reproduce this issue by simply running find over the whole filesystem.
Updated by Patrick Donnelly almost 5 years ago
Daniel Persson wrote:
Patrick Donnelly wrote:
Thanks for the information. There were a few fixes in v15.2.8 relating to memory consumption for the MDS which may be related to this. Please try upgrading to that version and report back.
Hi Patrick.
We have now updated the cluster and all the clients, and it's now running on v15.2.8.
[...]
node3
[...]node4
[...]We still see that the standby MDS holds a lot more entries and also more memory than requested.
[...]
Please tell us if there is any other information we could provide for your work.
Please try this command to see if that helps improve things:
ceph config set mds mds_cache_trim_threshold 256K
or even
ceph config set mds mds_cache_trim_threshold 512K
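If useful, you can check that the override took effect, and later remove it, with something like:
ceph config get mds mds_cache_trim_threshold
ceph config rm mds mds_cache_trim_threshold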
Updated by Daniel Persson almost 5 years ago
Hi Patrick.
I've tried running the cluster with both settings for 24 hours each. It became slightly worse, but that might be because it coincided with some backup routines.
Rank State Daemon Activity Dentries Inodes
0 active node3 Reqs: 15.2 /s 6.8 M 6.8 M
0-s standby-replay node4 Evts: 0 /s 15.2 M 15.2 M
I have not seen the Evts counter go above 0 /s, which seems a bit strange. It should be at least a couple per second if there is a lot of activity on the active node.
I've previously tried to follow a SUSE guide for increasing trimming by 10%, but it only seems to affect the active node and not the standby-replay one.
https://www.suse.com/support/kb/doc/?id=000019740
Best regards
Daniel
Patrick Donnelly wrote:
Daniel Persson wrote:
Patrick Donnelly wrote:
Thanks for the information. There were a few fixes in v15.2.8 relating to memory consumption for the MDS which may be related to this. Please try upgrading to that version and report back.
Hi Patrick.
We have now updated the cluster and all the clients, and it's now running on v15.2.8.
[...]
node3
[...]node4
[...]We still see that the standby MDS holds a lot more entries and also more memory than requested.
[...]
Please tell us if there is any other information we could provide for your work.
Please try this command to see if that helps improve things:
ceph config set mds mds_cache_trim_threshold 256K
or even
ceph config set mds mds_cache_trim_threshold 512K
Updated by Howie C over 4 years ago
We are seeing the same issue on pacific 16.2.5 as well. Not a big issue but very annoying.
homes - 3 clients
=====
RANK STATE MDS ACTIVITY DNS INOS DIRS CAPS
0 active homes.ceph1m01.iakegt Reqs: 1780 /s 2800k 2800k 359k 72.1k
1 active homes.ceph1m02.khomui Reqs: 0 /s 862k 860k 114k 84.0k
0-s standby-replay homes.ceph1m03.waoiry Evts: 2902 /s 14.0M 14.0M 1582k 0
1-s standby-replay homes.ceph1m01.rwitvl Evts: 0 /s 862k 860k 113k 0
POOL TYPE USED AVAIL
cephfs.homes.meta metadata 18.3G 52.8T
cephfs.homes.data data 2807G 52.8T
MDS version: ceph version 16.2.5 (0883bdea7337b95e4b611c768c0279868462204a) pacific (stable)
Updated by Patrick Donnelly over 4 years ago
- Related to Bug #50048: mds: standby-replay only trims cache when it reaches the end of the replay log added
Updated by Patrick Donnelly over 4 years ago
- Status changed from Need More Info to In Progress
- Assignee set to Patrick Donnelly
- Target version set to v17.0.0
I've been able to reproduce this. Will try to track down the cause...
Updated by Yongseok Oh over 4 years ago
Patrick Donnelly wrote:
I've been able to reproduce this. Will try to track down the cause...
The same situation happens with standby-replay daemons in our cluster. It seems that dentries are rarely trimmed because the dentry's linkage inode is not set to nullptr. Please refer to this line: https://github.com/ceph/ceph/blob/master/src/mds/MDCache.cc#L6688
MDCache::standby_trim_segment() tries to trim inodes and dentries and then moves them to the last position of the LRU list. But the dentry's linkage inode is still valid; CDir::unlink_inode() may not be called between the standby_trim_segment() and trim_lru() calls.
Could you briefly describe when/where the dentry's linkage inode is invalidated during journal replay?
It can be observed that trimming is done successfully when the commit is reverted (https://github.com/ceph/ceph/pull/40963); however, that incurs recovery failures.
Updated by Mykola Golub about 4 years ago
Patrick, do you have any comments on the last comment from Yongseok Oh? Our customer also observes uncontrolled memory growth for an MDS in standby-replay state, and we believe the root cause is what Yongseok Oh described.
Updated by Venky Shankar about 4 years ago
Yongseok/Mykola - Patrick is on PTO - I'll try to make progress on this issue.
Yongseok, you mention https://github.com/ceph/ceph/pull/40963, which skips trimming inodes for standby-replay - AFAIU, that's required to avoid a failure during journal replay when an inode gets trimmed but still has a corresponding journal entry. So we would run into issues if we let a standby-replay daemon trim inodes from its cache. However, the unbounded memory usage is not favorable either.
I'll try to see if we can come up with an alternate solution.
Updated by Venky Shankar over 3 years ago
- Priority changed from Normal to High
- Target version set to v18.0.0
- Backport set to pacific,quincy
- Severity changed from 3 - minor to 2 - major
We seem to be running into this pretty frequently and easily with a standby-replay configuration.
Updated by Patrick Donnelly over 3 years ago
- Status changed from In Progress to Fix Under Review
- Backport changed from pacific,quincy to quincy,pacific
- Pull request ID set to 48483
Updated by Patrick Donnelly over 3 years ago
- Related to Bug #40213: mds: cannot switch mds state from standby-replay to active added
Updated by Patrick Donnelly over 3 years ago
- Related to Bug #50246: mds: failure replaying journal (EMetaBlob) added
Updated by Joshua Hoblitt over 2 years ago
I believe that I have observed this issue while trying to reproduce a different MDS problem. It manifests as the standby MDS cache continuously growing well beyond the configured limit. Commanding the MDS to drop its cache does nothing. However, it appears that briefly disabling allow_standby_replay does flush the caches. E.g.:
~ $ ceph fs status auxtel
auxtel - 1 clients
======
RANK STATE MDS ACTIVITY DNS INOS DIRS CAPS
0 active auxtel-c Reqs: 0 /s 360k 355k 17.6k 858
1 active auxtel-b Reqs: 0 /s 1494k 1494k 5159 100
2 active auxtel-d Reqs: 0 /s 1051k 1050k 2665 295
0-s standby-replay auxtel-e Evts: 0 /s 139k 129k 15.5k 0
1-s standby-replay auxtel-a Evts: 0 /s 3463k 3462k 5155 0
2-s standby-replay auxtel-f Evts: 0 /s 850k 836k 981 0
POOL TYPE USED AVAIL
auxtel-metadata metadata 10.5G 5927G
auxtel-data0 data 1968G 5927G
MDS version: ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)
ceph> fs set auxtel allow_standby_replay false
ceph> fs status auxtel
auxtel - 1 clients
======
RANK STATE MDS ACTIVITY DNS INOS DIRS CAPS
0 active auxtel-c Reqs: 0 /s 360k 355k 17.6k 858
1 active auxtel-b Reqs: 0 /s 1494k 1494k 5159 1336
2 active auxtel-d Reqs: 0 /s 1051k 1050k 2665 295
POOL TYPE USED AVAIL
auxtel-metadata metadata 10.4G 5925G
auxtel-data0 data 1968G 5925G
STANDBY MDS
auxtel-e
auxtel-f
auxtel-a
MDS version: ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)
ceph> fs set auxtel allow_standby_replay true
ceph> fs status auxtel
auxtel - 1 clients
======
RANK STATE MDS ACTIVITY DNS INOS DIRS CAPS
0 active auxtel-c Reqs: 0 /s 360k 355k 17.6k 858
1 active auxtel-b Reqs: 0 /s 1494k 1494k 5159 1336
2 active auxtel-d Reqs: 0 /s 1051k 1050k 2665 295
0-s standby-replay auxtel-e Evts: 0 /s 0 0 0 0
1-s standby-replay auxtel-f Evts: 0 /s 0 0 0 0
2-s standby-replay auxtel-a Evts: 0 /s 0 0 0 0
POOL TYPE USED AVAIL
auxtel-metadata metadata 10.4G 5925G
auxtel-data0 data 1968G 5925G
MDS version: ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)
~ $ k top pods -l app.kubernetes.io/part-of=auxtel
NAME CPU(cores) MEMORY(bytes)
rook-ceph-mds-auxtel-a-dfdfb685f-sdpmd 16m 26Mi
rook-ceph-mds-auxtel-b-7c6875b594-xfmhq 18m 7948Mi
rook-ceph-mds-auxtel-c-5799f48f45-c25ml 19m 3075Mi
rook-ceph-mds-auxtel-d-864f8987cb-77z5f 20m 7246Mi
rook-ceph-mds-auxtel-e-6989dd8b7f-gh8g7 11m 11Mi
rook-ceph-mds-auxtel-f-76cd5f5886-68psl 11m 11Mi
rook-ceph-nfs-auxtel-a-cfcd4cb65-t7pmc 2m 217Mi
Updated by Konstantin Shalygin over 2 years ago
- Target version changed from v18.0.0 to v19.0.0
- Backport changed from quincy,pacific to pacific quincy reef
Updated by Joshua Hoblitt over 2 years ago
This issue triggered again this morning for the first time in 2 weeks. What's noteworthy is that the active MDS seems to be leaking memory as well. Note the size of mds auxtel-d, which is active:
~ $ ceph health detail
HEALTH_WARN 1 MDSs report oversized cache; 2 MDSs report slow requests
[WRN] MDS_CACHE_OVERSIZED: 1 MDSs report oversized cache
mds.auxtel-f(mds.1): MDS cache is too large (7GB/4GB); 0 inodes in use by clients, 0 stray files
[WRN] MDS_SLOW_REQUEST: 2 MDSs report slow requests
mds.auxtel-c(mds.0): 3 slow requests are blocked > 30 secs
mds.auxtel-d(mds.2): 15482432 slow requests are blocked > 30 secs
~ $ k top pods -l app.kubernetes.io/part-of=auxtel
NAME CPU(cores) MEMORY(bytes)
rook-ceph-mds-auxtel-a-7757c969bc-d48nn 17m 10027Mi
rook-ceph-mds-auxtel-b-cc44b44b9-pn2lp 13m 1349Mi
rook-ceph-mds-auxtel-c-84f59bc477-b4zxg 21m 1352Mi
rook-ceph-mds-auxtel-d-556fbdffdd-lkmfw 1002m 49894Mi
rook-ceph-mds-auxtel-e-5bcfb5cbd-pzvh9 15m 228Mi
rook-ceph-mds-auxtel-f-67444d9d4b-7bwqk 19m 11612Mi
rook-ceph-nfs-auxtel-a-bcf8f7f67-p6cc9 1m 742Mi
ceph> fs status auxtel
auxtel - 1 clients
======
RANK STATE MDS ACTIVITY DNS INOS DIRS CAPS
0 active auxtel-c Reqs: 0 /s 334k 334k 1740 268
1 active auxtel-a Reqs: 0 /s 1174k 1174k 5594 50
2 active auxtel-d Reqs: 0 /s 674 591 165 426
1-s standby-replay auxtel-f Evts: 0 /s 3114k 3114k 7285 0
2-s standby-replay auxtel-e Evts: 0 /s 2515 439 149 0
0-s standby-replay auxtel-b Evts: 0 /s 337k 334k 813 0
POOL TYPE USED AVAIL
auxtel-metadata metadata 23.2G 4863G
auxtel-data0 data 1958G 4863G
MDS version: ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)
Updated by Joshua Hoblitt over 2 years ago
I've confirmed that `fs set auxtel allow_standby_replay false` does free the leaked memory in the standby MDS but doesn't fix the issue with the active MDS... so it seems probable that I'm seeing two different MDS memory leak issues at the same time.
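For anyone else hitting this before the fix lands, the workaround sequence I use is roughly (same commands as in my earlier comment, shown here as regular ceph CLI invocations):
ceph fs set auxtel allow_standby_replay false
# wait for the former standby-replay daemons to drop back to standby and release memory
ceph fs set auxtel allow_standby_replay true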
Updated by Venky Shankar over 2 years ago
- Backport changed from pacific quincy reef to quincy,reef
Updated by Venky Shankar over 2 years ago
- Status changed from Fix Under Review to Pending Backport
Updated by Upkeep Bot over 2 years ago
- Copied to Backport #63675: quincy: High memory usage on standby replay MDS added
Updated by Upkeep Bot over 2 years ago
- Copied to Backport #63676: reef: High memory usage on standby replay MDS added
Updated by Upkeep Bot 9 months ago
- Status changed from Pending Backport to Resolved
- Upkeep Timestamp set to 2025-07-09T17:11:36+00:00
Updated by Upkeep Bot 8 months ago
- Merge Commit set to 58e7e132147332f5e57ef17e0b17019828e65bb0
- Fixed In set to v18.0.0-7555-g58e7e13214
- Released In set to v19.2.0~1160
- Upkeep Timestamp changed from 2025-07-09T17:11:36+00:00 to 2025-08-02T05:05:27+00:00