Bug #68989

open

ceph-mgr memory leak in prometheus module

Added by Beom-Seok Park over 1 year ago. Updated 3 months ago.

Status:
Pending Backport
Priority:
High
Category:
prometheus module
Target version:
% Done:

0%

Source:
Community (user)
Backport:
tentacle, squid
Regression:
No
Severity:
3 - minor
Reviewed:
ceph-qa-suite:
Pull request ID:
Tags (freeform):
backport_processed
Fixed In:
v20.3.0-4421-g8df7e65cde
Released In:
Upkeep Timestamp:
2025-12-02T13:33:09+00:00

Description

We were operating a cluster with our own build based on version 16.2.10 and did not experience any OOM issues with ceph-mgr.
However, after upgrading to a version built on 16.2.15, we began to encounter OOM problems. For reference, the Prometheus module is enabled.
Typically, these issues occurred within two weeks after starting ceph-mgr.
To investigate, we profiled the process with heaptrack and discovered a memory leak in get_daemon_health_metrics (https://github.com/ceph/ceph/pull/48843).
We reverted the commits related to get_daemon_health_metrics and rebuilt the version.
It has now been 45 days since we started running ceph-mgr, and it remains stable.
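For readers who want to reproduce the profiling step, a sketch of a heaptrack session against a running ceph-mgr; the attach flag and the output file extension (.gz or .zst) depend on the installed heaptrack version, so treat this as illustrative:

```shell
# Illustrative heaptrack session against a running ceph-mgr.
# The attach option (-p/--pid) and the output extension vary by
# heaptrack version; heaptrack prints the actual filename on exit.
pid=$(pidof ceph-mgr)
heaptrack --pid "$pid"
# ...let it run until memory growth is visible, then stop heaptrack...
# Summarize the largest allocation sites from the recorded profile:
heaptrack --analyze "heaptrack.ceph-mgr.$pid.zst" | head -n 40
```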


Files

massif.out.1345803 (63.3 KB) Konstantin Shalygin, 01/10/2025 12:57 PM
ceph-mgr.mon1.log (17.5 KB) Konstantin Shalygin, 01/10/2025 12:57 PM

Related issues 6 (3 open, 3 closed)

Related to mgr - Bug #61005: crash: tcmalloc::NewSpan(unsigned long, unsigned long) (New)
Related to mgr - Bug #67710: mgr/ActivePyModules: crash while call ActivePyModules::get_daemon_health_metrics() (Duplicate)
Follows mgr - Bug #59580: ceph-mgr memory leak (RESTful module, maybe others?) (Resolved, Nitzan Mordechai)
Follows Dashboard - Bug #58094: mgr/dashboard: expose slow ops per osd (Resolved, Pere Díaz Bou)
Copied to mgr - Backport #74056: tentacle: ceph-mgr memory leak in prometheus module (In Progress, Nitzan Mordechai)
Copied to mgr - Backport #74057: squid: ceph-mgr memory leak in prometheus module (In Progress, Nitzan Mordechai)
Actions #1

Updated by Konstantin Shalygin over 1 year ago

  • Priority changed from Normal to High
  • Source set to Community (user)
Actions #2

Updated by Konstantin Shalygin over 1 year ago

  • Follows Bug #59580: ceph-mgr memory leak (RESTful module, maybe others?) added
Actions #3

Updated by Konstantin Shalygin over 1 year ago

  • Follows Bug #58094: mgr/dashboard: expose slow ops per osd added
Actions #4

Updated by Konstantin Shalygin over 1 year ago

  • Assignee set to Pere Díaz Bou
  • Target version set to v20.0.0
  • Backport set to squid reef quincy
  • Affected Versions v17.2.8, v18.2.5, v19.2.1 added
  • Affected Versions deleted (v16.2.15)

Please take a look

Actions #5

Updated by Konstantin Shalygin over 1 year ago

  • Affected Versions v16.2.15 added
Actions #6

Updated by Konstantin Shalygin over 1 year ago

  • Crash signature (v1) updated (diff)
Actions #7

Updated by Konstantin Shalygin over 1 year ago

@darchon, judging by the commit author's profile on the tracker, he has been inactive for some time, and I don't think anyone else will come to figure this out. Could you please open a PR against the pacific branch to share your findings with the community? Then we can trigger CI to produce repositories with packages that anyone can install.

Actions #8

Updated by Dan van der Ster over 1 year ago

Just saw a similar, very fast OOM on an mgr after upgrading from 16.2.7 to 16.2.15 (mgr memory over 300 GB in a few seconds).

@Konstantin Shalygin, in your comment https://tracker.ceph.com/issues/59580#note-51 you also appear to have the OOM triggered in get_daemon_health_metrics:

Jul 24 17:44:43 example.com ceph-mgr[2399268]:  13: PyDict_New()
Jul 24 17:44:43 example.com ceph-mgr[2399268]:  14: (PyFormatter::open_object_section(std::basic_string_view<char, std::char_traits<char> >)+0x1c) [0x564aa43c4b8c]
Jul 24 17:44:43 example.com ceph-mgr[2399268]:  15: (ActivePyModules::get_daemon_health_metrics()+0x15d) [0x564aa42e5d9d]

Also, in this ticket you wrote that quincy/reef are also affected. Did any user report these sudden OOMs in the mgr after pacific? I have seen many reports that this doesn't happen on quincy and later.

Actions #9

Updated by Dan van der Ster over 1 year ago

  • Assignee deleted (Pere Díaz Bou)
Actions #10

Updated by Konstantin Shalygin over 1 year ago

Dan van der Ster wrote in #note-8:

Also, in this ticket you wrote that quincy/reef are also affected. Did any user report these sudden OOMs in the mgr after pacific? I have seen many reports that this doesn't happen on quincy and later.

I have, to the best of my ability, brought what I found to this ticket. According to my observations, the problem begins after about 40 racks in the CRUSH map (it doesn't depend much on the number of OSDs). We also found out empirically that, at certain cluster sizes, the existing options.cc defaults (mostly undocumented) for the following parameters are most likely insufficient.

How to debug: look at the fail_fail fields:

root@host:/# while true ; do date ; ceph daemon /var/run/ceph/ceph-mgr.*.asok perf dump | grep fail_fail ; sleep 1s ; done

We increased these options:

  mgr           dev       mgr_osd_bytes                                   1073741824 (512Mb -> 1024Mb)
  mgr           dev       mgr_osd_messages                                16384 (8196 -> 16384)
  mgr           advanced  ms_dispatch_throttle_bytes                      536870912 (100Mb -> 512Mb)

And without the following change, one of our ceph-mgr daemons can't produce metrics at all:

  mgr           advanced  mgr/prometheus/scrape_interval                  30.000000 (15sec -> 30sec)
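For reference, the increases above can be applied at runtime with `ceph config set`; a sketch assuming the option names listed in this report (verify them against your release with `ceph config help <name>` before applying):

```shell
# Sketch: apply the throttle increases described above at runtime.
# Option names are taken from this report; confirm each one exists in
# your release (e.g. `ceph config help mgr_osd_bytes`) before applying.
ceph config set mgr mgr_osd_bytes 1073741824               # 512 MiB -> 1 GiB
ceph config set mgr mgr_osd_messages 16384                 # 8192 -> 16384
ceph config set mgr ms_dispatch_throttle_bytes 536870912   # 100 MiB -> 512 MiB
ceph config set mgr mgr/prometheus/scrape_interval 30      # 15 s -> 30 s
```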


The cluster is:

  cluster:
    id:     99876049-8e15-4fc4-a7a7-5cb209b3c945
    health: HEALTH_WARN
            noout flag(s) set
            Some pool(s) have the nodeep-scrub flag(s) set

  services:
    mon: 3 daemons, quorum mon1,mon3,mon2 (age 26h)
    mgr: mon1(active, since 27h), standbys: mon2, mon3
    osd: 1592 osds: 1592 up (since 17h), 1592 in (since 9d); 6 remapped pgs
         flags noout

  data:
    pools:   7 pools, 46112 pgs
    objects: 459.54M objects, 1.6 PiB
    usage:   4.8 PiB used, 3.4 PiB / 8.1 PiB avail
    pgs:     242079/1378634139 objects misplaced (0.018%)
             46036 active+clean
             68    active+clean+scrubbing
             4     active+remapped+backfill_wait
             2     active+remapped+backfilling
             2     active+clean+snaptrim

  io:
    client:   4.0 GiB/s rd, 5.0 GiB/s wr, 109.95k op/s rd, 149.41k op/s wr
    recovery: 75 MiB/s, 20 objects/s


root@mon1:/# ceph osd df tree | grep host -c
75
root@mon1:/# ceph osd df tree | grep rack -c
42

P.S.: I can't say anything about Quincy++; we are planning further migration to Pacific, and perhaps some newer code can be adopted in 2026.

Actions #11

Updated by Igor Fedotov over 1 year ago

@darchon @Konstantin Shalygin - curious if you can do a custom build with https://github.com/ceph/ceph/pull/59371 backport embedded? Wouldn't that do the trick for you?

Actions #12

Updated by Igor Fedotov over 1 year ago

It looks like get_daemon_health_metrics() isn't the root cause here - it just creates pretty large output in your environment, and the mgr starts to suffer from that more evidently...

Actions #13

Updated by Konstantin Shalygin over 1 year ago

Igor Fedotov wrote in #note-11:

@darchon @Konstantin Shalygin - curious if you can do a custom build with https://github.com/ceph/ceph/pull/59371 backport embedded? Wouldn't that do the trick for you?

Looks like this patch is for the restful module, not for prometheus. We use only the prometheus module; the other modules are turned off:

{
    "always_on_modules": [
        "balancer",
        "crash",
        "devicehealth",
        "orchestrator",
        "pg_autoscaler",
        "progress",
        "rbd_support",
        "status",
        "telemetry",
        "volumes" 
    ],
    "enabled_modules": [
        "prometheus" 
    ]
}
Actions #14

Updated by Konstantin Shalygin over 1 year ago · Edited

Igor Fedotov wrote in #note-12:

It looks like get_daemon_health_metrics() isn't a root cause here - it just creates pretty large output in your env and mgr starts to suffer from that more evidently...

Perhaps, but it seems that the actual data size of these cohorts is impossible to estimate with the current `perf dump` tooling. At the moment, anywhere from 9 GB to 18 GB of memory for ceph-mgr is considered normal for this installation. It is not entirely obvious why there is such a spread; sometimes adding a new host can actually decrease memory usage. At present, the behavior is not fully understood. If there were a patch with the ability to disable the scrape of the new metrics (`get_all_daemon_health_metrics()`), it would be possible to estimate the impact, including the data size, at a before/after level.


Per-second memory samples:

 14005.656250 Mb /usr/bin/ceph-mgr -f --cluster ceph --id mon1 --setuser ceph --setgroup ceph
 14082.406250 Mb /usr/bin/ceph-mgr -f --cluster ceph --id mon1 --setuser ceph --setgroup ceph
 14159.656250 Mb /usr/bin/ceph-mgr -f --cluster ceph --id mon1 --setuser ceph --setgroup ceph
 14796.906250 Mb /usr/bin/ceph-mgr -f --cluster ceph --id mon1 --setuser ceph --setgroup ceph
 15657.156250 Mb /usr/bin/ceph-mgr -f --cluster ceph --id mon1 --setuser ceph --setgroup ceph
 16507.656250 Mb /usr/bin/ceph-mgr -f --cluster ceph --id mon1 --setuser ceph --setgroup ceph
 17417.156250 Mb /usr/bin/ceph-mgr -f --cluster ceph --id mon1 --setuser ceph --setgroup ceph
 17502.156250 Mb /usr/bin/ceph-mgr -f --cluster ceph --id mon1 --setuser ceph --setgroup ceph
 15519.906250 Mb /usr/bin/ceph-mgr -f --cluster ceph --id mon1 --setuser ceph --setgroup ceph
 13449.656250 Mb /usr/bin/ceph-mgr -f --cluster ceph --id mon1 --setuser ceph --setgroup ceph
 13578.156250 Mb /usr/bin/ceph-mgr -f --cluster ceph --id mon1 --setuser ceph --setgroup ceph
 13446.656250 Mb /usr/bin/ceph-mgr -f --cluster ceph --id mon1 --setuser ceph --setgroup ceph

Actions #15

Updated by Yaarit Hatuka over 1 year ago

  • Related to Bug #61005: crash: tcmalloc::NewSpan(unsigned long, unsigned long) added
Actions #16

Updated by Nitzan Mordechai about 1 year ago

@Konstantin Shalygin can you try to recreate the OOM (or at least the memory increase) under valgrind massif and attach the report? It will be easier to figure out what is consuming most of the memory.

Actions #17

Updated by Konstantin Shalygin about 1 year ago

Nitzan Mordechai wrote in #note-16:

@Konstantin Shalygin can you try to recreate the OOM (or at least the memory increase) under valgrind massif and attach the report? It will be easier to figure out what is consuming most of the memory.

Can you share a manual on how to do this?

Actions #18

Updated by Nitzan Mordechai about 1 year ago

To analyze memory usage with massif, you can start the mgr process with the following command:

valgrind --tool=massif bin/ceph-mgr -i <mgr-id>

Replace <mgr-id> with the appropriate manager ID (e.g., x).

  • Start the mgr process using the command above.
  • Perform your usual operations with the manager running (with the mgr modules on).
  • Monitor memory usage during this time; look for any significant or unusual increases in memory allocation.
  • Stop the manager process once you've observed the behavior or enough memory has been allocated.
  • After stopping the manager, a file named massif.out.<pid> will be generated in the current working directory.

Please attach the massif.out.<pid> file so we can analyze it further.
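As a side note, once a massif.out.<pid> file exists, the peak heap size can be extracted even without ms_print; a minimal sketch (massif_peak_heap is a hypothetical helper, based on massif's documented mem_heap_B snapshot fields):

```python
# Minimal sketch: extract the peak heap size from a massif output file.
# massif_peak_heap is a hypothetical helper; each massif snapshot
# carries a mem_heap_B=<bytes> line (see the Valgrind Massif manual).
def massif_peak_heap(lines):
    """Return the largest mem_heap_B value (in bytes) across snapshots."""
    peak = 0
    for line in lines:
        if line.startswith("mem_heap_B="):
            peak = max(peak, int(line.split("=", 1)[1]))
    return peak

if __name__ == "__main__":
    # Filename taken from the attachment on this ticket.
    with open("massif.out.1345803") as f:
        print(f"peak heap: {massif_peak_heap(f) / 2**20:.1f} MiB")
```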

Actions #19

Updated by Nitzan Mordechai about 1 year ago

@Konstantin Shalygin sorry, I missed a tag in my previous comment, so you probably didn't get the notification.

Actions #20

Updated by Konstantin Shalygin about 1 year ago · Edited

Nitzan Mordechai wrote in #note-18:

Please attach the massif.out.<pid> file so we can analyze it further.

I tried as you suggested, but ceph-mgr failed on warm-up:

root@mon1:massif# ulimit -n 10240
root@mon1:massif# valgrind --tool=massif /usr/bin/ceph-mgr --cluster ceph --id mon1 --setuser ceph --setgroup ceph
==1345803== Massif, a heap profiler
==1345803== Copyright (C) 2003-2017, and GNU GPL'd, by Nicholas Nethercote
==1345803== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==1345803== Command: /usr/bin/ceph-mgr --cluster ceph --id mon1 --setuser ceph --setgroup ceph
==1345803==
==1345803==
==1345803== could not unlink /tmp/vgdb-pipe-from-vgdb-to-1345803-by-root-on-???
==1345803== could not unlink /tmp/vgdb-pipe-to-vgdb-from-1345803-by-root-on-???
==1345803== could not unlink /tmp/vgdb-pipe-shared-mem-vgdb-1345803-by-root-on-???
root@mon1:massif# ==1345860== execve(0x165ded50(/proc/self/exe), 0x165dec90, 0x1fff0003c8) failed, errno 2
==1345860== EXEC FAILED: I can't recover from execve() failing, so I'm dying.
==1345860== Add more stringent tests in PRE(sys_execve), or work out how to recover.

Seems like something is limited by valgrind itself?

Actions #21

Updated by Konstantin Shalygin about 1 year ago

  • Backport changed from squid reef quincy to squid reef
Actions #22

Updated by Nitzan Mordechai 8 months ago

  • Assignee set to Nitzan Mordechai
Actions #23

Updated by Nitzan Mordechai 7 months ago

  • Status changed from New to In Progress
Actions #24

Updated by Nitzan Mordechai 7 months ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 65245
Actions #25

Updated by Konstantin Shalygin 7 months ago

  • Target version changed from v20.0.0 to v21.0.0
  • Backport changed from squid reef to tentacle squid reef
  • Affected Versions v19.2.4 added
  • Affected Versions deleted (v18.2.5, v19.2.1)
Actions #26

Updated by Konstantin Shalygin 7 months ago

  • Affected Versions v18.2.7 added
Actions #27

Updated by Nitzan Mordechai 7 months ago

  • Related to Bug #67710: mgr/ActivePyModules: crash while call ActivePyModules::get_daemon_health_metrics() added
Actions #28

Updated by Nitzan Mordechai 4 months ago

  • Backport changed from tentacle squid reef to tentacle, squid
Actions #29

Updated by Nitzan Mordechai 4 months ago

  • Status changed from Fix Under Review to Pending Backport
Actions #30

Updated by Upkeep Bot 4 months ago

  • Merge Commit set to 8df7e65cde35078c06dcf3386077626146676672
  • Fixed In set to v20.3.0-4421-g8df7e65cde
  • Upkeep Timestamp set to 2025-12-02T13:33:09+00:00
Actions #31

Updated by Upkeep Bot 4 months ago

  • Copied to Backport #74056: tentacle: ceph-mgr memory leak in prometheus module added
Actions #32

Updated by Upkeep Bot 4 months ago

  • Copied to Backport #74057: squid: ceph-mgr memory leak in prometheus module added
Actions #33

Updated by Upkeep Bot 4 months ago

  • Tags (freeform) set to backport_processed
Actions #34

Updated by Konstantin Shalygin 4 months ago

Backport changed from tentacle squid reef to tentacle, squid

@Nitzan Mordechai why was reef removed from the backports? It seems the last Reef release, 18.2.8, is being prepared; it would be nice to see this patch there as well.

Actions #35

Updated by Nitzan Mordechai 3 months ago

@darchon did you create this tracker from an existing BZ? I'm trying to find it, and I can't locate it to update the downstream process.
