Bug #68989
ceph-mgr memory leak in prometheus module
Description
We were operating a cluster with our own build based on version 16.2.10 and did not experience any OOM issues with ceph-mgr.
However, after upgrading to a version built on 16.2.15, we began to encounter OOM problems. For reference, the Prometheus module is enabled.
Typically, these issues occurred within two weeks after starting ceph-mgr.
To investigate, we profiled the process with heaptrack and discovered a memory leak in get_daemon_health_metrics() (https://github.com/ceph/ceph/pull/48843).
We reverted the commits related to get_daemon_health_metrics and rebuilt the version.
It has now been 45 days since we started running ceph-mgr, and it remains stable.
Updated by Konstantin Shalygin over 1 year ago
- Priority changed from Normal to High
- Source set to Community (user)
Updated by Konstantin Shalygin over 1 year ago
- Follows Bug #59580: ceph-mgr memory leak (RESTful module, maybe others?) added
Updated by Konstantin Shalygin over 1 year ago
- Follows Bug #58094: mgr/dashboard: expose slow ops per osd added
Updated by Konstantin Shalygin over 1 year ago
- Assignee set to Pere Díaz Bou
- Target version set to v20.0.0
- Backport set to squid reef quincy
- Affected Versions v17.2.8, v18.2.5, v19.2.1 added
- Affected Versions deleted (v16.2.15)
Please take a look
Updated by Konstantin Shalygin over 1 year ago
- Crash signature (v1) updated (diff)
Updated by Konstantin Shalygin over 1 year ago
@darchon, judging by the commit author's profile on the tracker, he has been inactive for some time. I don't think anyone will come to figure this out. Could you please make a PR to the pacific branch in order to share your research with the community? Then we can trigger CI to produce repositories with packages that anyone can use.
Updated by Dan van der Ster over 1 year ago
Just saw a similar, very fast OOM on an mgr after upgrading from 16.2.7 to 16.2.15 (mgr memory over 300 GB in a few seconds).
@Konstantin Shalygin in your comment https://tracker.ceph.com/issues/59580#note-51 you also appear to have the oom triggered in get_daemon_health_metrics:
Jul 24 17:44:43 example.com ceph-mgr[2399268]: 13: PyDict_New()
Jul 24 17:44:43 example.com ceph-mgr[2399268]: 14: (PyFormatter::open_object_section(std::basic_string_view<char, std::char_traits<char> >)+0x1c) [0x564aa43c4b8c]
Jul 24 17:44:43 example.com ceph-mgr[2399268]: 15: (ActivePyModules::get_daemon_health_metrics()+0x15d) [0x564aa42e5d9d]
Also, in this ticket you wrote that quincy/reef are also affected. Did any user report these sudden OOMs in the mgr after pacific? I saw many reports that this doesn't happen in quincy++.
Updated by Konstantin Shalygin over 1 year ago
Dan van der Ster wrote in #note-8:
Also, in this ticket you wrote that quincy/reef are also affected. Did any user report these sudden OOMs in the mgr after pacific? I saw many reports that this doesn't happen in quincy++.
I have, to the best of my ability and capabilities, brought the problem to this ticket. According to my observations, the problem begins after about 40 racks in the CRUSH map (it doesn't depend much on the number of OSDs). Also, by empirical means we found out that, most likely, with certain cluster sizes the existing options.cc defaults (mostly undocumented) for the following parameters are insufficient.
How to debug: watch the fail_fail fields:
root@host:/# while true ; do date ; ceph daemon /var/run/ceph/ceph-mgr.*.asok perf dump | grep fail_fail ; sleep 1s ; done
We increased these options:
mgr  dev       mgr_osd_bytes               1073741824  (512 MB -> 1024 MB)
mgr  dev       mgr_osd_messages            16384       (8196 -> 16384)
mgr  advanced  ms_dispatch_throttle_bytes  536870912   (100 MB -> 512 MB)
Also, without the following change, one of our ceph-mgr daemons can't produce metrics at all:
mgr advanced mgr/prometheus/scrape_interval 30.000000 (15sec -> 30sec)
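The `grep fail_fail` loop above matches Ceph's throttle perf counters: each `throttle-*` section of `perf dump` carries a `get_or_fail_fail` counter that increments when the throttle rejects a request. As a sketch only (the helper name `find_throttle_failures` is mine, and the sample values below are made up; the section/counter names follow the usual throttle counter layout), the same check can be done programmatically on the `perf dump` JSON:

```python
import json

def find_throttle_failures(perf_dump: dict) -> dict:
    """Return {section: get_or_fail_fail} for every throttle-* section
    of a `ceph daemon ... perf dump` blob whose fail counter is non-zero."""
    failures = {}
    for section, counters in perf_dump.items():
        if not section.startswith("throttle-"):
            continue
        fails = counters.get("get_or_fail_fail", 0)
        if fails:
            failures[section] = fails
    return failures

# Illustrative sample shaped like Ceph throttle counters (values invented):
sample = json.loads("""
{
  "throttle-msgr_dispatch_throttler-mgr": {
    "val": 104857600, "max": 104857600,
    "get_or_fail_success": 123456, "get_or_fail_fail": 42
  },
  "throttle-mgr_osd_bytes": {
    "val": 0, "max": 536870912,
    "get_or_fail_success": 999, "get_or_fail_fail": 0
  }
}
""")

print(find_throttle_failures(sample))
# {'throttle-msgr_dispatch_throttler-mgr': 42}
```

A non-empty result means the corresponding throttle is saturating, which is the signal that motivated raising the limits above.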
The cluster is:
cluster:
id: 99876049-8e15-4fc4-a7a7-5cb209b3c945
health: HEALTH_WARN
noout flag(s) set
Some pool(s) have the nodeep-scrub flag(s) set
services:
mon: 3 daemons, quorum mon1,mon3,mon2 (age 26h)
mgr: mon1(active, since 27h), standbys: mon2, mon3
osd: 1592 osds: 1592 up (since 17h), 1592 in (since 9d); 6 remapped pgs
flags noout
data:
pools: 7 pools, 46112 pgs
objects: 459.54M objects, 1.6 PiB
usage: 4.8 PiB used, 3.4 PiB / 8.1 PiB avail
pgs: 242079/1378634139 objects misplaced (0.018%)
46036 active+clean
68 active+clean+scrubbing
4 active+remapped+backfill_wait
2 active+remapped+backfilling
2 active+clean+snaptrim
io:
client: 4.0 GiB/s rd, 5.0 GiB/s wr, 109.95k op/s rd, 149.41k op/s wr
recovery: 75 MiB/s, 20 objects/s
root@mon1:/# ceph osd df tree | grep host -c
75
root@mon1:/# ceph osd df tree | grep rack -c
42
P.S.: I can't say anything about Quincy++; we are planning further migration to Pacific, and perhaps some other code can be used in 2026.
Updated by Igor Fedotov over 1 year ago
@darchon @Konstantin Shalygin - curious if you can do a custom build with https://github.com/ceph/ceph/pull/59371 backport embedded? Wouldn't that do the trick for you?
Updated by Igor Fedotov over 1 year ago
It looks like get_daemon_health_metrics() isn't the root cause here - it just creates pretty large output in your env, and mgr starts to suffer from that more evidently...
Updated by Konstantin Shalygin over 1 year ago
Igor Fedotov wrote in #note-11:
@darchon @Konstantin Shalygin - curious if you can do a custom build with https://github.com/ceph/ceph/pull/59371 backport embedded? Wouldn't that do the trick for you?
Looks like this patch is for the restful module, not for prometheus. We use only the prometheus module; the other modules are turned off:
{
"always_on_modules": [
"balancer",
"crash",
"devicehealth",
"orchestrator",
"pg_autoscaler",
"progress",
"rbd_support",
"status",
"telemetry",
"volumes"
],
"enabled_modules": [
"prometheus"
]
}
Updated by Konstantin Shalygin over 1 year ago · Edited
Igor Fedotov wrote in #note-12:
It looks like get_daemon_health_metrics() isn't a root cause here - it just creates pretty large output in your env and mgr starts to suffer from that more evidently...
Perhaps, but it seems that the actual data size in the cohorts is impossible to estimate using the current `perf dump` tooling. At the moment, 9 GB to 18 GB of memory for ceph-mgr is considered normal for this installation. It is not entirely obvious why there is such a spread; sometimes adding a new host can actually decrease the memory usage. At present, the behavior is not fully understood. With a patch adding the ability to disable the scrape of the new metrics (`get_all_daemon_health_metrics()`), it would be possible to estimate the impact, including the data size, at a before/after level.
Per-second samples:
14005.656250 Mb /usr/bin/ceph-mgr -f --cluster ceph --id mon1 --setuser ceph --setgroup ceph
14082.406250 Mb /usr/bin/ceph-mgr -f --cluster ceph --id mon1 --setuser ceph --setgroup ceph
14159.656250 Mb /usr/bin/ceph-mgr -f --cluster ceph --id mon1 --setuser ceph --setgroup ceph
14796.906250 Mb /usr/bin/ceph-mgr -f --cluster ceph --id mon1 --setuser ceph --setgroup ceph
15657.156250 Mb /usr/bin/ceph-mgr -f --cluster ceph --id mon1 --setuser ceph --setgroup ceph
16507.656250 Mb /usr/bin/ceph-mgr -f --cluster ceph --id mon1 --setuser ceph --setgroup ceph
17417.156250 Mb /usr/bin/ceph-mgr -f --cluster ceph --id mon1 --setuser ceph --setgroup ceph
17502.156250 Mb /usr/bin/ceph-mgr -f --cluster ceph --id mon1 --setuser ceph --setgroup ceph
15519.906250 Mb /usr/bin/ceph-mgr -f --cluster ceph --id mon1 --setuser ceph --setgroup ceph
13449.656250 Mb /usr/bin/ceph-mgr -f --cluster ceph --id mon1 --setuser ceph --setgroup ceph
13578.156250 Mb /usr/bin/ceph-mgr -f --cluster ceph --id mon1 --setuser ceph --setgroup ceph
13446.656250 Mb /usr/bin/ceph-mgr -f --cluster ceph --id mon1 --setuser ceph --setgroup ceph
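Samples like the above can be collected without external tooling by reading VmRSS from /proc. A minimal, Linux-only sketch (the `rss_mb` and `sample` helper names are hypothetical, not part of any Ceph tooling; in production you would point it at the ceph-mgr PID, e.g. from `pidof ceph-mgr`):

```python
import os
import time

def rss_mb(pid: int) -> float:
    """Resident set size of a process in MB, read from /proc (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1]) / 1024.0  # kB -> MB
    return 0.0

def sample(pid: int, count: int = 5, interval: float = 1.0) -> None:
    """Print one RSS sample per interval, like the per-second log above."""
    for _ in range(count):
        print(f"{rss_mb(pid):.6f} Mb pid={pid}")
        time.sleep(interval)

# Demo against the current process; substitute the ceph-mgr PID in practice.
print(f"{rss_mb(os.getpid()):.6f} Mb")
```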
Updated by Yaarit Hatuka over 1 year ago
- Related to Bug #61005: crash: tcmalloc::NewSpan(unsigned long, unsigned long) added
Updated by Nitzan Mordechai about 1 year ago
@Konstantin Shalygin can you try to recreate the OOM (or at least the memory increase) under valgrind massif and attach the report? It will be easier to figure out what is spending most of the memory.
Updated by Konstantin Shalygin about 1 year ago
Nitzan Mordechai wrote in #note-16:
@Konstantin Shalygin can you try to recreate the OOM (or at least the memory increase) under valgrind massif and attach the report? It will be easier to figure out what is spending most of the memory.
Can you share a manual on how to do this?
Updated by Nitzan Mordechai about 1 year ago
To analyze memory usage with massif, you can start the mgr process with the following command:
valgrind --tool=massif bin/ceph-mgr -i <mgr-id>
Replace <mgr-id> with the appropriate manager ID (e.g., x).
1. Start the mgr process using the command above.
2. Perform your usual operations with the manager running (with the mgr modules on).
3. Monitor memory usage during this time. Look for any significant or unusual increases in memory allocation.
4. Stop the manager process once you've observed the behavior or reached the point where enough memory has been allocated.
5. After stopping the manager, a file named massif.out.<pid> will be generated in the current working directory.
Please attach the massif.out.<pid> file so we can analyze it further.
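For reference, `ms_print massif.out.<pid>` is the standard way to render the report. For quick triage, the peak heap size can also be pulled straight from the snapshot records, which are plain `key=value` lines. A small sketch (the `massif_peak_heap` helper is hypothetical, and the sample file contents below are invented but follow the massif.out snapshot layout):

```python
def massif_peak_heap(text: str) -> int:
    """Return the largest mem_heap_B value across all snapshots
    in the contents of a massif.out file."""
    peak = 0
    for line in text.splitlines():
        if line.startswith("mem_heap_B="):
            peak = max(peak, int(line.split("=", 1)[1]))
    return peak

# Invented sample in the massif.out snapshot format:
sample = """\
desc: --massif-out-file=massif.out
cmd: ceph-mgr -i mon1
time_unit: i
#-----------
snapshot=0
#-----------
time=0
mem_heap_B=1048576
mem_heap_extra_B=0
mem_stacks_B=0
heap_tree=empty
#-----------
snapshot=1
#-----------
time=100
mem_heap_B=9437184
mem_heap_extra_B=0
mem_stacks_B=0
heap_tree=peak
"""

print(massif_peak_heap(sample))  # 9437184
```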
Updated by Nitzan Mordechai about 1 year ago
@Konstantin Shalygin sorry, I missed a tag in my previous comment, so you probably didn't get the notification.
Updated by Konstantin Shalygin about 1 year ago · Edited
- File massif.out.1345803 massif.out.1345803 added
- File ceph-mgr.mon1.log ceph-mgr.mon1.log added
Nitzan Mordechai wrote in #note-18:
Please attach the massif.out.<pid> file so we can analyze it further.
I tried as you suggested, but ceph-mgr failed during warm-up:
root@mon1:massif# ulimit -n 10240
root@mon1:massif# valgrind --tool=massif /usr/bin/ceph-mgr --cluster ceph --id mon1 --setuser ceph --setgroup ceph
==1345803== Massif, a heap profiler
==1345803== Copyright (C) 2003-2017, and GNU GPL'd, by Nicholas Nethercote
==1345803== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==1345803== Command: /usr/bin/ceph-mgr --cluster ceph --id mon1 --setuser ceph --setgroup ceph
==1345803==
==1345803==
==1345803== could not unlink /tmp/vgdb-pipe-from-vgdb-to-1345803-by-root-on-???
==1345803== could not unlink /tmp/vgdb-pipe-to-vgdb-from-1345803-by-root-on-???
==1345803== could not unlink /tmp/vgdb-pipe-shared-mem-vgdb-1345803-by-root-on-???
root@mon1:massif#
==1345860== execve(0x165ded50(/proc/self/exe), 0x165dec90, 0x1fff0003c8) failed, errno 2
==1345860== EXEC FAILED: I can't recover from execve() failing, so I'm dying.
==1345860== Add more stringent tests in PRE(sys_execve), or work out how to recover.
Seems like something is limited by valgrind itself?
Updated by Konstantin Shalygin about 1 year ago
- Backport changed from squid reef quincy to squid reef
Updated by Nitzan Mordechai 7 months ago
- Status changed from New to In Progress
Updated by Nitzan Mordechai 7 months ago
- Status changed from In Progress to Fix Under Review
- Pull request ID set to 65245
Updated by Konstantin Shalygin 7 months ago
- Target version changed from v20.0.0 to v21.0.0
- Backport changed from squid reef to tentacle squid reef
- Affected Versions v19.2.4 added
- Affected Versions deleted (
v18.2.5, v19.2.1)
Updated by Nitzan Mordechai 7 months ago
- Related to Bug #67710: mgr/ActivePyModules: crash while call ActivePyModules::get_daemon_health_metrics() added
Updated by Nitzan Mordechai 4 months ago
- Backport changed from tentacle squid reef to tentacle, squid
Updated by Nitzan Mordechai 4 months ago
- Status changed from Fix Under Review to Pending Backport
Updated by Upkeep Bot 4 months ago
- Merge Commit set to 8df7e65cde35078c06dcf3386077626146676672
- Fixed In set to v20.3.0-4421-g8df7e65cde
- Upkeep Timestamp set to 2025-12-02T13:33:09+00:00
Updated by Upkeep Bot 4 months ago
- Copied to Backport #74056: tentacle: ceph-mgr memory leak in prometheus module added
Updated by Upkeep Bot 4 months ago
- Copied to Backport #74057: squid: ceph-mgr memory leak in prometheus module added
Updated by Konstantin Shalygin 4 months ago
Backport changed from tentacle squid reef to tentacle, squid
@Nitzan Mordechai why was reef removed from the backports? It seems the last Reef release, 18.2.8, is being prepared for release; it would be nice to see this patch there as well.
Updated by Nitzan Mordechai 3 months ago
@darchon did you create this tracker from an existing BZ? I'm trying to find it, and I can't locate it to update the downstream process.