Bug #68989
ceph-mgr memory leak in prometheus module
Description
We were operating a cluster with our own build based on version 16.2.10 and did not experience any OOM issues with ceph-mgr.
However, after upgrading to a version built on 16.2.15, we began to encounter OOM problems. For reference, the Prometheus module is enabled.
Typically, these issues occurred within two weeks after starting ceph-mgr.
To investigate, we profiled the process with heaptrack and discovered a memory leak in get_daemon_health_metrics() (https://github.com/ceph/ceph/pull/48843).
We reverted the commits related to get_daemon_health_metrics and rebuilt the version.
It has now been 45 days since we started running ceph-mgr, and it remains stable.
Updated by Konstantin Shalygin over 1 year ago
- Priority changed from Normal to High
- Source set to Community (user)
Updated by Konstantin Shalygin over 1 year ago
- Follows Bug #59580: ceph-mgr memory leak (RESTful module, maybe others?) added
Updated by Konstantin Shalygin over 1 year ago
- Follows Bug #58094: mgr/dashboard: expose slow ops per osd added
Updated by Konstantin Shalygin over 1 year ago
- Assignee set to Pere Díaz Bou
- Target version set to v20.0.0
- Backport set to squid reef quincy
- Affected Versions v17.2.8, v18.2.5, v19.2.1 added
- Affected Versions deleted (v16.2.15)
Please take a look
Updated by Konstantin Shalygin over 1 year ago
- Crash signature (v1) updated (diff)
Updated by Konstantin Shalygin over 1 year ago
@darchon, judging by the commit author's profile on the tracker, he has been inactive for some time. I don't think anyone will come to figure this out. Could you please make a PR to the pacific branch in order to share your research with the community? Then we can trigger CI to produce repositories with packages that anyone can use.
Updated by Dan van der Ster over 1 year ago
Just saw a similar, very fast OOM on an mgr after upgrading from 16.2.7 to 16.2.15 (mgr memory over 300 GB in a few seconds).
@Konstantin Shalygin in your comment https://tracker.ceph.com/issues/59580#note-51 you also appear to have the oom triggered in get_daemon_health_metrics:
Jul 24 17:44:43 example.com ceph-mgr[2399268]: 13: PyDict_New()
Jul 24 17:44:43 example.com ceph-mgr[2399268]: 14: (PyFormatter::open_object_section(std::basic_string_view<char, std::char_traits<char> >)+0x1c) [0x564aa43c4b8c]
Jul 24 17:44:43 example.com ceph-mgr[2399268]: 15: (ActivePyModules::get_daemon_health_metrics()+0x15d) [0x564aa42e5d9d]
Also, in this ticket you wrote that quincy/reef are also affected. Did any user report these sudden OOMs in the mgr after pacific? I saw many reports that this doesn't happen in quincy++.
Updated by Konstantin Shalygin over 1 year ago
Dan van der Ster wrote in #note-8:
Also, in this ticket you wrote that quincy/reef are also affected. Did any user report these sudden OOMs in the mgr after pacific? I saw many reports that this doesn't happen in quincy++.
I have, to the best of my ability and capabilities, brought the problem to this ticket. According to my observations, the problem begins after about 40 racks in the CRUSH map (it doesn't depend much on the number of OSDs). Also, by empirical means we found out that, most likely, with certain cluster sizes the existing options.cc defaults (mostly undocumented) for the following parameters are insufficient.
How to debug: watch the fail_fail fields:
root@host:/# while true ; do date ; ceph daemon /var/run/ceph/ceph-mgr.*.asok perf dump | grep fail_fail ; sleep 1s ; done
We increased these options:
mgr  dev       mgr_osd_bytes               1073741824  (512 MB -> 1024 MB)
mgr  dev       mgr_osd_messages            16384       (8196 -> 16384)
mgr  advanced  ms_dispatch_throttle_bytes  536870912   (100 MB -> 512 MB)
Also, without the following change, one of our ceph-mgr daemons can't produce metrics at all:
mgr advanced mgr/prometheus/scrape_interval 30.000000 (15sec -> 30sec)
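The `grep fail_fail` loop above matches Ceph's throttle perf counters: each `throttle-*` section of `perf dump` carries a `get_or_fail_fail` counter that increments when the throttle rejects a request. As a sketch only (the helper name `find_throttle_failures` is mine, and the sample values below are made up; the section/counter names follow the usual throttle counter layout), the same check can be done programmatically on the `perf dump` JSON:

```python
import json

def find_throttle_failures(perf_dump: dict) -> dict:
    """Return {section: get_or_fail_fail} for every throttle-* section
    of a `ceph daemon ... perf dump` blob whose fail counter is non-zero."""
    failures = {}
    for section, counters in perf_dump.items():
        if not section.startswith("throttle-"):
            continue
        fails = counters.get("get_or_fail_fail", 0)
        if fails:
            failures[section] = fails
    return failures

# Illustrative sample shaped like Ceph throttle counters (values invented):
sample = json.loads("""
{
  "throttle-msgr_dispatch_throttler-mgr": {
    "val": 104857600, "max": 104857600,
    "get_or_fail_success": 123456, "get_or_fail_fail": 42
  },
  "throttle-mgr_osd_bytes": {
    "val": 0, "max": 536870912,
    "get_or_fail_success": 999, "get_or_fail_fail": 0
  }
}
""")

print(find_throttle_failures(sample))
# {'throttle-msgr_dispatch_throttler-mgr': 42}
```

A non-empty result means the corresponding throttle is saturating, which is the signal that motivated raising the limits above.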
The cluster is:
cluster:
id: 99876049-8e15-4fc4-a7a7-5cb209b3c945
health: HEALTH_WARN
noout flag(s) set
Some pool(s) have the nodeep-scrub flag(s) set
services:
mon: 3 daemons, quorum mon1,mon3,mon2 (age 26h)
mgr: mon1(active, since 27h), standbys: mon2, mon3
osd: 1592 osds: 1592 up (since 17h), 1592 in (since 9d); 6 remapped pgs
flags noout
data:
pools: 7 pools, 46112 pgs
objects: 459.54M objects, 1.6 PiB
usage: 4.8 PiB used, 3.4 PiB / 8.1 PiB avail
pgs: 242079/1378634139 objects misplaced (0.018%)
46036 active+clean
68 active+clean+scrubbing
4 active+remapped+backfill_wait
2 active+remapped+backfilling
2 active+clean+snaptrim
io:
client: 4.0 GiB/s rd, 5.0 GiB/s wr, 109.95k op/s rd, 149.41k op/s wr
recovery: 75 MiB/s, 20 objects/s
root@mon1:/# ceph osd df tree | grep host -c
75
root@mon1:/# ceph osd df tree | grep rack -c
42
P.S.: I can't say anything about Quincy++; we are planning further migration to Pacific, and perhaps some other code can be used in 2026.
Updated by Igor Fedotov over 1 year ago
@darchon @Konstantin Shalygin - curious if you can do a custom build with https://github.com/ceph/ceph/pull/59371 backport embedded? Wouldn't that do the trick for you?
Updated by Igor Fedotov over 1 year ago
It looks like get_daemon_health_metrics() isn't the root cause here - it just creates pretty large output in your env, and mgr starts to suffer from that more evidently...
Updated by Konstantin Shalygin over 1 year ago
Igor Fedotov wrote in #note-11:
@darchon @Konstantin Shalygin - curious if you can do a custom build with https://github.com/ceph/ceph/pull/59371 backport embedded? Wouldn't that do the trick for you?
Looks like this patch is for the restful module, not for prometheus. We use only the prometheus module; the other modules are turned off:
{
"always_on_modules": [
"balancer",
"crash",
"devicehealth",
"orchestrator",
"pg_autoscaler",
"progress",
"rbd_support",
"status",
"telemetry",
"volumes"
],
"enabled_modules": [
"prometheus"
]
}
Updated by Konstantin Shalygin over 1 year ago · Edited
Igor Fedotov wrote in #note-12:
It looks like get_daemon_health_metrics() isn't a root cause here - it just creates pretty large output in your env and mgr starts to suffer from that more evidently...
Perhaps, but it seems that the actual data size in the cohorts is impossible to estimate using the current `perf dump` tooling. At the moment, 9 GB to 18 GB of memory for ceph-mgr is considered normal for this installation. It is not entirely obvious why there is such a spread; sometimes adding a new host can actually decrease the memory usage. At present, the behavior is not fully understood. With a patch adding the ability to disable the scrape of the new metrics (`get_all_daemon_health_metrics()`), it would be possible to estimate the impact, including the data size, at a before/after level.
Per-second samples:
14005.656250 Mb /usr/bin/ceph-mgr -f --cluster ceph --id mon1 --setuser ceph --setgroup ceph
14082.406250 Mb /usr/bin/ceph-mgr -f --cluster ceph --id mon1 --setuser ceph --setgroup ceph
14159.656250 Mb /usr/bin/ceph-mgr -f --cluster ceph --id mon1 --setuser ceph --setgroup ceph
14796.906250 Mb /usr/bin/ceph-mgr -f --cluster ceph --id mon1 --setuser ceph --setgroup ceph
15657.156250 Mb /usr/bin/ceph-mgr -f --cluster ceph --id mon1 --setuser ceph --setgroup ceph
16507.656250 Mb /usr/bin/ceph-mgr -f --cluster ceph --id mon1 --setuser ceph --setgroup ceph
17417.156250 Mb /usr/bin/ceph-mgr -f --cluster ceph --id mon1 --setuser ceph --setgroup ceph
17502.156250 Mb /usr/bin/ceph-mgr -f --cluster ceph --id mon1 --setuser ceph --setgroup ceph
15519.906250 Mb /usr/bin/ceph-mgr -f --cluster ceph --id mon1 --setuser ceph --setgroup ceph
13449.656250 Mb /usr/bin/ceph-mgr -f --cluster ceph --id mon1 --setuser ceph --setgroup ceph
13578.156250 Mb /usr/bin/ceph-mgr -f --cluster ceph --id mon1 --setuser ceph --setgroup ceph
13446.656250 Mb /usr/bin/ceph-mgr -f --cluster ceph --id mon1 --setuser ceph --setgroup ceph
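Samples like the above can be collected without external tooling by reading VmRSS from /proc. A minimal, Linux-only sketch (the `rss_mb` and `sample` helper names are hypothetical, not part of any Ceph tooling; in production you would point it at the ceph-mgr PID, e.g. from `pidof ceph-mgr`):

```python
import os
import time

def rss_mb(pid: int) -> float:
    """Resident set size of a process in MB, read from /proc (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1]) / 1024.0  # kB -> MB
    return 0.0

def sample(pid: int, count: int = 5, interval: float = 1.0) -> None:
    """Print one RSS sample per interval, like the per-second log above."""
    for _ in range(count):
        print(f"{rss_mb(pid):.6f} Mb pid={pid}")
        time.sleep(interval)

# Demo against the current process; substitute the ceph-mgr PID in practice.
print(f"{rss_mb(os.getpid()):.6f} Mb")
```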
Updated by Yaarit Hatuka over 1 year ago
- Related to Bug #61005: crash: tcmalloc::NewSpan(unsigned long, unsigned long) added
Updated by Nitzan Mordechai about 1 year ago
@Konstantin Shalygin can you try to recreate the OOM (or at least the memory increase) under valgrind massif and attach the report? It will be easier to figure out what is spending most of the memory.
Updated by Konstantin Shalygin about 1 year ago
Nitzan Mordechai wrote in #note-16:
@Konstantin Shalygin can you try to recreate the OOM (or at least the memory increase) under valgrind massif and attach the report? It will be easier to figure out what is spending most of the memory.
Can you share a manual on how to do this?
Updated by Nitzan Mordechai about 1 year ago
To analyze memory usage with massif, you can start the mgr process with the following command:
valgrind --tool=massif bin/ceph-mgr -i <mgr-id>
Replace <mgr-id> with the appropriate manager ID (e.g., x).
1. Start the mgr process using the command above.
2. Perform your usual operations with the manager running (with the mgr modules on).
3. Monitor memory usage during this time. Look for any significant or unusual increases in memory allocation.
4. Stop the manager process once you've observed the behavior or reached the point where enough memory has been allocated.
5. After stopping the manager, a file named massif.out.<pid> will be generated in the current working directory.
Please attach the massif.out.<pid> file so we can analyze it further.
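For reference, `ms_print massif.out.<pid>` is the standard way to render the report. For quick triage, the peak heap size can also be pulled straight from the snapshot records, which are plain `key=value` lines. A small sketch (the `massif_peak_heap` helper is hypothetical, and the sample file contents below are invented but follow the massif.out snapshot layout):

```python
def massif_peak_heap(text: str) -> int:
    """Return the largest mem_heap_B value across all snapshots
    in the contents of a massif.out file."""
    peak = 0
    for line in text.splitlines():
        if line.startswith("mem_heap_B="):
            peak = max(peak, int(line.split("=", 1)[1]))
    return peak

# Invented sample in the massif.out snapshot format:
sample = """\
desc: --massif-out-file=massif.out
cmd: ceph-mgr -i mon1
time_unit: i
#-----------
snapshot=0
#-----------
time=0
mem_heap_B=1048576
mem_heap_extra_B=0
mem_stacks_B=0
heap_tree=empty
#-----------
snapshot=1
#-----------
time=100
mem_heap_B=9437184
mem_heap_extra_B=0
mem_stacks_B=0
heap_tree=peak
"""

print(massif_peak_heap(sample))  # 9437184
```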
Updated by Nitzan Mordechai about 1 year ago
@Konstantin Shalygin sorry, I missed a tag in my previous comment, so you probably didn't get the notification.
Updated by Konstantin Shalygin about 1 year ago · Edited
- File massif.out.1345803 massif.out.1345803 added
- File ceph-mgr.mon1.log ceph-mgr.mon1.log added
Nitzan Mordechai wrote in #note-18:
Please attach the massif.out.<pid> file so we can analyze it further.
I tried as you suggested, but ceph-mgr failed during warm-up:
root@mon1:massif# ulimit -n 10240
root@mon1:massif# valgrind --tool=massif /usr/bin/ceph-mgr --cluster ceph --id mon1 --setuser ceph --setgroup ceph
==1345803== Massif, a heap profiler
==1345803== Copyright (C) 2003-2017, and GNU GPL'd, by Nicholas Nethercote
==1345803== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==1345803== Command: /usr/bin/ceph-mgr --cluster ceph --id mon1 --setuser ceph --setgroup ceph
==1345803==
==1345803==
==1345803== could not unlink /tmp/vgdb-pipe-from-vgdb-to-1345803-by-root-on-???
==1345803== could not unlink /tmp/vgdb-pipe-to-vgdb-from-1345803-by-root-on-???
==1345803== could not unlink /tmp/vgdb-pipe-shared-mem-vgdb-1345803-by-root-on-???
root@mon1:massif#
==1345860== execve(0x165ded50(/proc/self/exe), 0x165dec90, 0x1fff0003c8) failed, errno 2
==1345860== EXEC FAILED: I can't recover from execve() failing, so I'm dying.
==1345860== Add more stringent tests in PRE(sys_execve), or work out how to recover.
Seems like something is limited by valgrind itself?
Updated by Konstantin Shalygin about 1 year ago
- Backport changed from squid reef quincy to squid reef
Updated by Nitzan Mordechai 7 months ago
- Status changed from New to In Progress
Updated by Nitzan Mordechai 7 months ago
- Status changed from In Progress to Fix Under Review
- Pull request ID set to 65245
Updated by Konstantin Shalygin 7 months ago
- Target version changed from v20.0.0 to v21.0.0
- Backport changed from squid reef to tentacle squid reef
- Affected Versions v19.2.4 added
- Affected Versions deleted (
v18.2.5, v19.2.1)
Updated by Nitzan Mordechai 7 months ago
- Related to Bug #67710: mgr/ActivePyModules: crash while call ActivePyModules::get_daemon_health_metrics() added
Updated by Nitzan Mordechai 4 months ago
- Backport changed from tentacle squid reef to tentacle, squid
Updated by Nitzan Mordechai 4 months ago
- Status changed from Fix Under Review to Pending Backport
Updated by Upkeep Bot 4 months ago
- Merge Commit set to 8df7e65cde35078c06dcf3386077626146676672
- Fixed In set to v20.3.0-4421-g8df7e65cde
- Upkeep Timestamp set to 2025-12-02T13:33:09+00:00
Updated by Upkeep Bot 4 months ago
- Copied to Backport #74056: tentacle: ceph-mgr memory leak in prometheus module added
Updated by Upkeep Bot 4 months ago
- Copied to Backport #74057: squid: ceph-mgr memory leak in prometheus module added
Updated by Konstantin Shalygin 4 months ago
Backport changed from tentacle squid reef to tentacle, squid
@Nitzan Mordechai why was reef removed from the backports? It seems the last Reef release, 18.2.8, is being prepared for release; it would be nice to see this patch there as well.
Updated by Nitzan Mordechai 3 months ago
@darchon did you create this tracker from an existing BZ? I'm trying to find it, and I can't locate it to update the downstream process.