Bug #67642


ceph-mgr memory leak (RESTful module, maybe others?)

Added by Nitzan Mordechai over 1 year ago. Updated 5 months ago.

Status:
Resolved
Priority:
Urgent
Category:
restful module
Target version:
v20.0.0
% Done:

100%

Source:
Community (user)
Backport:
quincy,reef,squid
Regression:
No
Severity:
3 - minor
Reviewed:
ceph-qa-suite:
Pull request ID:
54984
Tags (freeform):
backport_processed
Fixed In:
v19.3.0-4348-gb9d6a00756
Released In:
v20.2.0~2199
Upkeep Timestamp:
2025-11-01T01:31:06+00:00

Description

There are two separate reports on the mailing list of memory leaks in the mgr module:

[ceph-users] Memory leak in MGR after upgrading to pacific

After upgrading from Octopus (15.2.17) to Pacific (16.2.12) two days 
ago, I'm noticing that the MGR daemons keep failing over to standby and 
then back every 24hrs.   Watching the output of 'ceph orch ps' I can see 
that the memory consumption of the mgr is steadily growing until it 
becomes unresponsive.

When the mgr becomes unresponsive, tasks such as RESTful calls start to 
fail, and the standby eventually takes over after ~20 minutes. I've 
included a log of memory consumption (in 10 minute intervals) at the end 
of this message. While the cluster recovers during this issue, the loss 
of usage data during the outage, and the fact that it's occurring at all, 
is problematic. Any assistance would be appreciated.

Note, this is a cluster that has been upgraded from an original jewel 
based ceph using filestore, through bluestore conversion, container 
conversion, and now to Pacific.    The data below shows memory use with 
three mgr modules enabled:  cephadm, restful, iostat.   By disabling 
iostat, I can reduce the rate of memory consumption increasing to about 
200MB/hr.

[ceph-users] MGR Memory Leak in Restful

We've hit a memory leak in the Manager Restful interface, in versions 
17.2.5 & 17.2.6. On our main production cluster the active MGR grew to 
about 60G until the oom_reaper killed it, causing a successful failover 
and restart of the failed one. We can then see that the problem is 
recurring, actually on all 3 of our clusters.

We've traced this to when we enabled full Ceph monitoring by Zabbix last 
week. The leak is about 20GB per day, and seems to be proportional to 
the number of PGs. For some time we just had the default settings, and 
no memory leak, but had not got around to finding why many of the Zabbix 
items were showing as Access Denied. We traced this to the MGR's MON 
CAPS which were "mon 'profile mgr'".

The MON logs showed recurring:

log_channel(audit) log [DBG] : from='mgr.284576436 192.168.xxx.xxx:0/2356365' entity='mgr.host1' cmd=[{"format": "json", "prefix": "pg dump"}]:  access denied

Changing the MGR CAPS to "mon 'allow *'" and restarting the MGR 
immediately allowed that to work, and all the follow-on REST calls worked.

log_channel(audit) log [DBG] : from='mgr.283590200 192.168.xxx.xxx:0/1779' entity='mgr.host1' cmd=[{"format": "json", "prefix": "pg dump"}]: dispatch

However it has also caused the memory leak to start.

We've reverted the CAPS and are back to how we were.
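The CAPS change the reporter describes can be reproduced with `ceph auth caps`. This is a sketch based on the report above: the entity name `mgr.host1` is taken from the reporter's log, and the `osd`/`mds` caps shown are the usual mgr defaults (an assumption, since the report only quotes the `mon` cap):

```shell
# Inspect the current caps for the mgr key (entity name is illustrative)
ceph auth get mgr.host1

# Broaden the mon cap so "pg dump" is dispatched instead of denied
# (this is the change that triggered the leak in the report)
ceph auth caps mgr.host1 mon 'allow *' osd 'allow *' mds 'allow *'

# Revert to the default profile afterwards, as the reporter did
ceph auth caps mgr.host1 mon 'profile mgr' osd 'allow *' mds 'allow *'
```

A mgr restart is needed after changing caps for them to take effect, as noted in the report.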


Files

0001-mgr-restful-trim-reslts-finished-and-failed-lists-to.patch (1.69 KB) 0001-mgr-restful-trim-reslts-finished-and-failed-lists-to.patch Nitzan Mordechai, 09/21/2023 11:37 AM
0001-mgr-restful-trim-reslts-finished-and-failed-lists-to.patch (2.06 KB) 0001-mgr-restful-trim-reslts-finished-and-failed-lists-to.patch Nitzan Mordechai, 09/26/2023 11:25 AM
massif.out.3376365.gz (96.8 KB) massif.out.3376365.gz mgr handling rest calls Chris Palmer, 10/17/2023 04:11 PM
20231227-150450.jpg (61.7 KB) 20231227-150450.jpg node exporter shows memory xiaobao wen, 12/27/2023 07:05 AM
mgr_rgw_log.tar.gz (962 KB) mgr_rgw_log.tar.gz log for mgr and rgw xiaobao wen, 12/27/2023 07:42 AM
ceph-mgr-oomcrash-16-2-15.txt (34.5 KB) ceph-mgr-oomcrash-16-2-15.txt A. Saber Shenouda, 04/14/2024 02:42 PM
clipboard-202408091013-r4sro.png (188 KB) clipboard-202408091013-r4sro.png Raimund Sacherer, 08/09/2024 08:13 AM

Related issues 5 (0 open, 5 closed)

Related to mgr - Bug #68803: Substantial memory leak in RESTFUL "pg dump" (Duplicate)

Copied from mgr - Bug #59580: ceph-mgr memory leak (RESTful module, maybe others?) (Resolved, Nitzan Mordechai)

Copied to mgr - Backport #67643: quincy: ceph-mgr memory leak (RESTful module, maybe others?) (Resolved, Nitzan Mordechai)
Copied to mgr - Backport #67644: reef: ceph-mgr memory leak (RESTful module, maybe others?) (Resolved, Nitzan Mordechai)
Copied to mgr - Backport #67646: squid: ceph-mgr memory leak (RESTful module, maybe others?) (Resolved, Nitzan Mordechai)
Actions #1

Updated by Nitzan Mordechai over 1 year ago

  • Copied from Bug #59580: ceph-mgr memory leak (RESTful module, maybe others?) added
Actions #2

Updated by Nitzan Mordechai over 1 year ago

  • Subject changed from ceph-mgr memory leak (RESTful module, maybe others?) splitted to ceph-mgr memory leak (RESTful module, maybe others?)
  • Status changed from New to Pending Backport
  • Backport changed from pacific quincy reef to quincy,reef,squid
  • Pull request ID set to 54984
  • Tags (freeform) deleted (backport_processed)

This is the split-off fix, which handles trimming of the requests queue.
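The trimming approach can be sketched as follows: cap the finished and failed request lists so completed entries are evicted oldest-first instead of accumulating forever. This is a minimal illustration of the technique, not the actual restful module code; the names (`RequestLog`, `finished`, `failed`, `max_entries`) are hypothetical:

```python
from collections import deque


class RequestLog:
    """Bounded log of completed REST requests.

    A deque with maxlen silently drops the oldest entry when a new
    one is appended past the cap, so memory use stays constant no
    matter how many requests are processed.
    """

    def __init__(self, max_entries=100):
        self.finished = deque(maxlen=max_entries)
        self.failed = deque(maxlen=max_entries)

    def record(self, request, ok):
        # Route to the matching bounded list; eviction is automatic.
        (self.finished if ok else self.failed).append(request)
```

The unbounded version of these lists is what made mgr memory grow proportionally to request volume (and, per the second report, to the number of PGs dumped per request).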

Actions #3

Updated by Upkeep Bot over 1 year ago

  • Copied to Backport #67643: quincy: ceph-mgr memory leak (RESTful module, maybe others?) added
Actions #4

Updated by Upkeep Bot over 1 year ago

  • Copied to Backport #67644: reef: ceph-mgr memory leak (RESTful module, maybe others?) added
Actions #5

Updated by Upkeep Bot over 1 year ago

  • Copied to Backport #67646: squid: ceph-mgr memory leak (RESTful module, maybe others?) added
Actions #6

Updated by Upkeep Bot over 1 year ago

  • Tags (freeform) set to backport_processed
Actions #7

Updated by Konstantin Shalygin over 1 year ago

  • Target version changed from v19.0.0 to v20.0.0
  • % Done changed from 0 to 70
Actions #8

Updated by Konstantin Shalygin over 1 year ago

  • File deleted (Screen2.png)
Actions #9

Updated by Konstantin Shalygin over 1 year ago

  • File deleted (Screen3.png)
Actions #10

Updated by Konstantin Shalygin over 1 year ago

  • File deleted (Screen1.png)
Actions #11

Updated by Konstantin Shalygin about 1 year ago

  • Status changed from Pending Backport to Resolved
  • % Done changed from 70 to 100
  • Crash signature (v1) updated (diff)
Actions #12

Updated by Nitzan Mordechai about 1 year ago

  • Related to Bug #68803: Substantial memory leak in RESTFUL "pg dump" added
Actions #13

Updated by Upkeep Bot 8 months ago

  • Merge Commit set to b9d6a0075659fd4810cfaa008dcd10f63ed98593
  • Fixed In set to v19.3.0-4348-gb9d6a007565
  • Upkeep Timestamp set to 2025-07-10T14:42:48+00:00
Actions #14

Updated by Upkeep Bot 8 months ago

  • Fixed In changed from v19.3.0-4348-gb9d6a007565 to v19.3.0-4348-gb9d6a00756
  • Upkeep Timestamp changed from 2025-07-10T14:42:48+00:00 to 2025-07-14T20:11:22+00:00
Actions #15

Updated by Upkeep Bot 5 months ago

  • Released In set to v20.2.0~2199
  • Upkeep Timestamp changed from 2025-07-14T20:11:22+00:00 to 2025-11-01T01:31:06+00:00