Bug #59580


ceph-mgr memory leak (RESTful module, maybe others?)

Added by Greg Farnum almost 3 years ago. Updated 8 months ago.

Status:
Resolved
Priority:
Urgent
Category:
restful module
Target version:
% Done:

100%

Source:
Community (user)
Backport:
pacific quincy reef
Regression:
No
Severity:
3 - minor
Reviewed:
ceph-qa-suite:
Pull request ID:
Tags (freeform):
backport_processed
Fixed In:
v19.0.0-597-g13e02409a78
Released In:
v19.2.0~956
Upkeep Timestamp:
2025-07-12T13:15:34+00:00

Description

There are two separate reports on the mailing list of memory leaks in the mgr module:

[ceph-users] Memory leak in MGR after upgrading to pacific

After upgrading from Octopus (15.2.17) to Pacific (16.2.12) two days 
ago, I'm noticing that the MGR daemons keep failing over to standby and 
then back every 24hrs.   Watching the output of 'ceph orch ps' I can see 
that the memory consumption of the mgr is steadily growing until it 
becomes unresponsive.

When the mgr becomes unresponsive, tasks such as RESTful calls start to 
fail, and the standby eventually takes over after ~20 minutes. I've 
included a log of memory consumption (in 10-minute intervals) at the end 
of this message. While the cluster recovers during this issue, the loss 
of usage data during the outage, and the fact that it's occurring at all, 
is problematic. Any assistance would be appreciated.

Note, this is a cluster that has been upgraded from an original jewel 
based ceph using filestore, through bluestore conversion, container 
conversion, and now to Pacific.    The data below shows memory use with 
three mgr modules enabled:  cephadm, restful, iostat.   By disabling 
iostat, I can reduce the rate of memory consumption increasing to about 
200MB/hr.

[ceph-users] MGR Memory Leak in Restful

We've hit a memory leak in the Manager Restful interface, in versions 
17.2.5 & 17.2.6. On our main production cluster the active MGR grew to 
about 60G until the oom_reaper killed it, causing a successful failover 
and restart of the failed one. We can then see that the problem is 
recurring, actually on all 3 of our clusters.

We've traced this to when we enabled full Ceph monitoring by Zabbix last 
week. The leak is about 20GB per day, and seems to be proportional to 
the number of PGs. For some time we just had the default settings, and 
no memory leak, but had not got around to finding why many of the Zabbix 
items were showing as Access Denied. We traced this to the MGR's MON 
CAPS which were "mon 'profile mgr'".

The MON logs showed recurring:

log_channel(audit) log [DBG] : from='mgr.284576436 192.168.xxx.xxx:0/2356365' entity='mgr.host1' cmd=[{"format": "json", "prefix": "pg dump"}]:  access denied

Changing the MGR CAPS to "mon 'allow *'" and restarting the MGR 
immediately allowed that to work, and all the follow-on REST calls worked.

log_channel(audit) log [DBG] : from='mgr.283590200 192.168.xxx.xxx:0/1779' entity='mgr.host1' cmd=[{"format": "json", "prefix": "pg dump"}]: dispatch

However it has also caused the memory leak to start.

We've reverted the CAPS and are back to how we were.


Files

0001-mgr-restful-trim-reslts-finished-and-failed-lists-to.patch (1.69 KB) - Nitzan Mordechai, 09/21/2023 11:37 AM
0001-mgr-restful-trim-reslts-finished-and-failed-lists-to.patch (2.06 KB) - Nitzan Mordechai, 09/26/2023 11:25 AM
massif.out.3376365.gz (96.8 KB) - mgr handling rest calls - Chris Palmer, 10/17/2023 04:11 PM
20231227-150450.jpg (61.7 KB) - node exporter shows memory - xiaobao wen, 12/27/2023 07:05 AM
mgr_rgw_log.tar.gz (962 KB) - log for mgr and rgw - xiaobao wen, 12/27/2023 07:42 AM
ceph-mgr-oomcrash-16-2-15.txt (34.5 KB) - A. Saber Shenouda, 04/14/2024 02:42 PM
Screen2.png (93.9 KB) - Konstantin Shalygin, 07/26/2024 11:01 AM
Screen3.png (185 KB) - Konstantin Shalygin, 07/26/2024 11:01 AM
Screen1.png (175 KB) - Konstantin Shalygin, 07/26/2024 11:01 AM
clipboard-202408091013-r4sro.png (188 KB) - Raimund Sacherer, 08/09/2024 08:13 AM

Related issues 5 (1 open, 4 closed)

Precedes mgr - Bug #68989: ceph-mgr memory leak in prometheus module (Pending Backport) - Nitzan Mordechai
Copied to mgr - Backport #63977: reef: memory leak (RESTful module, maybe others?) (Resolved) - Konstantin Shalygin
Copied to mgr - Backport #63978: pacific: memory leak (RESTful module, maybe others?) (Resolved) - Konstantin Shalygin
Copied to mgr - Backport #63979: quincy: memory leak (RESTful module, maybe others?) (Resolved) - Nitzan Mordechai
Copied to mgr - Bug #67642: ceph-mgr memory leak (RESTful module, maybe others?) (Resolved) - Nitzan Mordechai

Actions #1

Updated by Simon Fowler almost 3 years ago

I can confirm both aspects of this on our 17.2.5 cluster with the RESTful module enabled: with the '[mon] allow profile mgr' caps Zabbix was unable to access many items, and with '[mon] allow *' the Zabbix access issues were resolved. We're also seeing a memory leak with this configuration, of a similar scale - something on the order of 20GB per day.

We were seeing growth up to ~70-80GB after five to seven days, at which point the RESTful api calls started failing or timing out. The mgr processes weren't being killed or failing over on their own, though (these nodes have 512GB RAM, so they weren't anywhere near an OOM situation), and as far as I can tell nothing else was impacted. Manually failing the active mgr seems to free up the memory without restarting the actual mgr process - we're working around the leak at the moment by failing the active mgr every 24 hours, which is clunky but hasn't been causing any issues for us.
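For anyone needing the same stopgap, the daily failover described above can be scheduled from cron. This is an illustrative sketch, not part of the report: the schedule and path are examples, and `ceph mgr fail` without a daemon name fails the currently active mgr only on recent releases, so verify against your version (older releases need the active mgr's name as an argument).

```
# Illustrative /etc/cron.d entry: fail the active mgr once a day so a
# standby takes over and the leaked memory in the old daemon is freed.
# On older releases, pass the active mgr's name: ceph mgr fail <name>
0 3 * * * root /usr/bin/ceph mgr fail
```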

Current versions are:

{
    "mon": {
        "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)": 5
    },
    "mgr": {
        "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)": 4
    },
    "osd": {
        "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)": 96
    },
    "mds": {
        "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)": 3
    },
    "rgw": {
        "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)": 4
    },
    "overall": {
        "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)": 112
    }
}

Manager module list:

balancer              on (always on)
crash                 on (always on)
devicehealth          on (always on)
orchestrator          on (always on)
pg_autoscaler         on (always on)
progress              on (always on)
rbd_support           on (always on)
status                on (always on)
telemetry             on (always on)
volumes               on (always on)
cephadm               on            
dashboard             on            
iostat                on            
restful               on            
alerts                -             
diskprediction_local  -             
influx                -             
insights              -             
k8sevents             -             
localpool             -             
mds_autoscaler        -             
mirroring             -             
nfs                   -             
osd_perf_query        -             
osd_support           -             
prometheus            -             
rook                  -             
selftest              -             
snap_schedule         -             
stats                 -             
telegraf              -             
test_orchestrator     -             
zabbix                -

Actions #3

Updated by Radoslaw Zarzynski over 2 years ago

  • Assignee set to Nitzan Mordechai

Hello Nitzan, would you mind taking a look?

Actions #4

Updated by Nitzan Mordechai over 2 years ago

  • Status changed from New to In Progress
Actions #5

Updated by Nitzan Mordechai over 2 years ago

The mgr RESTful module keeps all finished requests. In some cases (like a pg dump on a cluster with hundreds of PGs) the stored output can grow without any limit.
I'm still trying to find a reason why we keep the finished results; if anyone has a good reason to keep them after some time, please let me know.

Actions #6

Updated by Nizamudeen A over 2 years ago

I don't know if the RESTful module is still used. The dashboard has its own REST API inside the dashboard module, which would probably eliminate the need for a separate RESTful module. I remember some discussion about deprecating or removing the RESTful module; I'm not sure if that is still the plan.

Actions #7

Updated by Nitzan Mordechai over 2 years ago

Nizamudeen A, thanks for the input. What about Zabbix using that RESTful API?

Actions #8

Updated by Nitzan Mordechai over 2 years ago

The problem is that we keep the result of each request. When the result is small, that's fine, but for pg dumps and other commands that can potentially return a lot of data, the result array can grow until it finally causes an OOM:

    def finish(self, tag):
        with self.lock:
            for index in range(len(self.running)):
                if self.running[index].tag == tag:
                    if self.running[index].r == 0:
                        self.finished.append(self.running.pop(index))
                    else:
                        self.failed.append(self.running.pop(index))
                    return True

            # No such tag found
            return False

There is no limit on self.failed or self.finished.
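A minimal sketch of the kind of bound the eventual fix introduces. This is illustrative, not the actual patch: `BoundedRequestLog` and its `max_finished` knob are invented names, while the real patch trims the module's finished/failed lists to a configurable limit (5,000 in the version discussed below).

```python
from collections import deque, namedtuple

# Illustrative stand-in for CommandsRequest; in the real module the
# stored output (e.g. a full "pg dump") can be very large.
Request = namedtuple("Request", ["tag", "r", "output"])

class BoundedRequestLog:
    """Keep only the most recent completed requests so memory stays bounded."""

    def __init__(self, max_finished=100):
        # deque(maxlen=...) silently drops the oldest entry on append,
        # so neither list can grow without limit.
        self.finished = deque(maxlen=max_finished)
        self.failed = deque(maxlen=max_finished)

    def finish(self, request):
        if request.r == 0:
            self.finished.append(request)
        else:
            self.failed.append(request)

log = BoundedRequestLog(max_finished=3)
for i in range(10):
    log.finish(Request(tag=i, r=0, output="..."))
print(len(log.finished))  # prints 3: only the newest results are retained
```

The same effect can be had by trimming a plain list after each append; the point is only that completed results must be dropped eventually instead of accumulating forever.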

Actions #9

Updated by Nitzan Mordechai over 2 years ago

  • Assignee changed from Nitzan Mordechai to Juan Miguel Olmo Martínez
Actions #10

Updated by Michal Cila over 2 years ago

Hi all,

we are experiencing a memory leak in the MGR as well. However, we have the restful module turned off and CAPS set to 'allow profile mgr'. The strange thing is that at some point memory starts growing rapidly on multiple Ceph clusters within one datacenter.

Version 17.2.6

Actions #11

Updated by Chris Palmer over 2 years ago

Any chance of progressing this?

Regardless of whether there is actually any reason for keeping the results (??), continually adding to a memory structure without bounds is not a great idea!

Thanks, Chris

Actions #12

Updated by David Orman over 2 years ago

We're seeing this anywhere from 1-3 times a day to once a week on all clusters running .13 and .14. We externally poll the metrics endpoint frequently, in case it's related. We also see this on clusters where we do not utilize the RESTful module.

Actions #13

Updated by Rok Jaklic over 2 years ago

Greg Farnum wrote:

There are two separate reports on the mailing list of memory leaks in the mgr module:

[ceph-users] Memory leak in MGR after upgrading to pacific
[...]

[ceph-users] MGR Memory Leak in Restful
[...]

It has happened to us three times in the last month.

Likely something changed between 16.2.10 and 16.2.13 that causes this to happen. We have two reports that after updating from .10 to .13, ceph-mgr got OOM-killed several times.

[root@ctplmon1 bin]# ceph mgr services
{
    "prometheus": "http://xxx:9119/" 
}
{
    "always_on_modules": [
        "balancer",
        "crash",
        "devicehealth",
        "orchestrator",
        "pg_autoscaler",
        "progress",
        "rbd_support",
        "status",
        "telemetry",
        "volumes" 
    ],
    "enabled_modules": [
        "iostat",
        "prometheus",
        "restful" 
    ] ...
}
Actions #14

Updated by Nitzan Mordechai over 2 years ago

David Orman wrote:

We're seeing this anywhere from 1-3 times a day to once a week on all clusters running .13 and .14. We externally poll the metrics endpoint frequently, in case it's related. We also see this on clusters where we do not utilize the RESTful module.

That sounds like a different issue if the restful module was not utilized on those mgrs. Can you run the mgr with massif profiling and send us the output?

Actions #15

Updated by Nitzan Mordechai over 2 years ago

@Simon Fowler @Rok Jaklic @Chris Palmer I'm attaching a patch, can you please verify it?

Actions #16

Updated by Chris Palmer over 2 years ago

I tried applying this patch to 17.2.6. After applying the CAPS change and restarting the active MGR, these types of entries were logged in the MGR log:

2023-09-21T14:21:21.272+0100 7f22f3fba700  0 [restful ERROR root] Traceback (most recent call last):
  File "/lib/python3/dist-packages/pecan/core.py", line 683, in __call__
    self.invoke_controller(controller, args, kwargs, state)
  File "/lib/python3/dist-packages/pecan/core.py", line 574, in invoke_controller
    result = controller(*args, **kwargs)
  File "/usr/share/ceph/mgr/restful/decorators.py", line 37, in decorated
    return f(*args, **kwargs)
  File "/usr/share/ceph/mgr/restful/api/request.py", line 88, in post
    return context.instance.submit_request([[request.json]], **kwargs)
  File "/usr/share/ceph/mgr/restful/module.py", line 605, in submit_request
    request = CommandsRequest(_request)
  File "/usr/share/ceph/mgr/restful/module.py", line 62, in __init__
    max_finished = cast(int, self.get_localized_module_option('max_finished', 100))
AttributeError: 'CommandsRequest' object has no attribute 'get_localized_module_option'

2023-09-21T14:21:21.280+0100 7f22f3fba700  0 [restful ERROR werkzeug] Error on request:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/werkzeug/serving.py", line 323, in run_wsgi
    execute(self.server.app)
  File "/usr/lib/python3/dist-packages/werkzeug/serving.py", line 312, in execute
    application_iter = app(environ, start_response)
  File "/usr/lib/python3/dist-packages/pecan/middleware/recursive.py", line 56, in __call__
    return self.application(environ, start_response)
  File "/usr/lib/python3/dist-packages/pecan/core.py", line 840, in __call__
    return super(Pecan, self).__call__(environ, start_response)
  File "/usr/lib/python3/dist-packages/pecan/core.py", line 683, in __call__
    self.invoke_controller(controller, args, kwargs, state)
  File "/usr/lib/python3/dist-packages/pecan/core.py", line 574, in invoke_controller
    result = controller(*args, **kwargs)
  File "/usr/share/ceph/mgr/restful/decorators.py", line 37, in decorated
    return f(*args, **kwargs)
  File "/usr/share/ceph/mgr/restful/api/request.py", line 88, in post
    return context.instance.submit_request([[request.json]], **kwargs)
  File "/usr/share/ceph/mgr/restful/module.py", line 605, in submit_request
    request = CommandsRequest(_request)
  File "/usr/share/ceph/mgr/restful/module.py", line 62, in __init__
    max_finished = cast(int, self.get_localized_module_option('max_finished', 100))
AttributeError: 'CommandsRequest' object has no attribute 'get_localized_module_option'
2023-09-21T14:25:24.095+0100 7f22f3fba700  0 [restful ERROR root] Traceback (most recent call last):
  File "/lib/python3/dist-packages/pecan/core.py", line 683, in __call__
    self.invoke_controller(controller, args, kwargs, state)
  File "/lib/python3/dist-packages/pecan/core.py", line 574, in invoke_controller
    result = controller(*args, **kwargs)
  File "/usr/share/ceph/mgr/restful/decorators.py", line 37, in decorated
    return f(*args, **kwargs)
  File "/usr/share/ceph/mgr/restful/api/request.py", line 88, in post
    return context.instance.submit_request([[request.json]], **kwargs)
  File "/usr/share/ceph/mgr/restful/module.py", line 605, in submit_request
    return request
  File "/usr/share/ceph/mgr/restful/module.py", line 62, in __init__
    self.lock = threading.RLock()
AttributeError: 'CommandsRequest' object has no attribute 'get_localized_module_option'
Actions #17

Updated by Nitzan Mordechai over 2 years ago

@Chris Palmer, I needed to redo the fix; now it should trim the requests and not let them exceed 5,000. Please give it a try.

Actions #18

Updated by Chris Palmer over 2 years ago

Thanks for continuing to look at this. Has this patch been tested against Quincy (17.2.6)? It still has a reference to get_localized_module_option which will probably cause the same errors as above...

Actions #19

Updated by Nitzan Mordechai over 2 years ago

Chris Palmer wrote:

Thanks for continuing to look at this. Has this patch been tested against Quincy (17.2.6)? It still has a reference to get_localized_module_option which will probably cause the same errors as above...

Yes, it still has get_localized_module_option, but now it's not under the server; it's under the init of the module itself.

Actions #20

Updated by Chris Palmer over 2 years ago

It doesn't look very promising, I'm afraid. I just tried the patch on a small 17.2.6 cluster with 745 PGs, Zabbix monitoring enabled, and "mon 'allow *'" in the active manager caps. As before, the monitoring worked. However, the mgr RSS started at 393M and 12 hours later was at 2.2G and growing, so around 4GB per 24 hours. This is consistent with the original report, where a cluster with about 5 times the number of PGs leaked around 20GB per 24 hours.

Actions #21

Updated by Chris Palmer over 2 years ago

I just had a 17.2.6 MGR on another cluster go OOM. It was on a bigger cluster, with Zabbix integration, but without the 'allow *' CAPS, so most of the calls were getting denied. This MGR had been the active one for some months. It all suggests a leak in the REST interface, but it is probably not restricted to the pg dump function.

Actions #22

Updated by Nitzan Mordechai over 2 years ago

Chris Palmer wrote:

I just had a 17.2.6 MGR on another cluster go OOM. It was on a bigger cluster, with Zabbix integration, but without the 'allow *' CAPS, so most of the calls were getting denied. This MGR had been the active one for some months. It all suggests a leak in the REST interface, but it is probably not restricted to the pg dump function.

Chris, I'm running tests in my environment and I can see the leaks from the restful module without the patch; with the patch the memory doesn't increase.
I'm starting to wonder if we also have a Zabbix module memory leak. Can you try the patch after disabling the Zabbix module?

Actions #23

Updated by Chris Palmer over 2 years ago

Ah, slight confusion. We are not using the Ceph Zabbix module. That module runs inside the manager and pushes data to Zabbix, but is not very comprehensive. Instead we use a more recent Zabbix template that runs inside a Zabbix agent on another host and simply polls the active Ceph MGR using the REST API. So it is entirely consistent that the leak is in the REST API itself; Zabbix is simply calling it more often than it might normally be. When I enable that additional CAP it performs a lot more calls, which turns a small problem into a larger one.

I suspect that your patch fixes a secondary memory buildup, but that the primary leak hasn't been found yet.

Actions #24

Updated by Nitzan Mordechai over 2 years ago

Chris Palmer wrote:

Ah, slight confusion. We are not using the Ceph Zabbix module. That module runs inside the manager and pushes data to Zabbix, but is not very comprehensive. Instead we use a more recent Zabbix template that runs inside a Zabbix agent on another host and simply polls the active Ceph MGR using the REST API. So it is entirely consistent that the leak is in the REST API itself; Zabbix is simply calling it more often than it might normally be. When I enable that additional CAP it performs a lot more calls, which turns a small problem into a larger one.

I suspect that your patch fixes a secondary memory buildup, but that the primary leak hasn't been found yet.

Chris, I have made several attempts to recreate the issue with the patch, but I still couldn't get memory to increase as quickly as you are seeing.
Can you run the mgr under valgrind massif and provide us the output?

Actions #25

Updated by Chris Palmer over 2 years ago

I'll give it a go. I have never used massif before though. Do you have any instructions for how to run the mgr up under it, including any command line options I should use? (It is running non-containerized under debian11).

Actions #26

Updated by Nitzan Mordechai over 2 years ago

I'm using:

valgrind --tool=massif bin/ceph-mgr -f --cluster ceph -i x &

valgrind massif is pretty basic; no extra arguments are needed.

Actions #27

Updated by Chris Palmer over 2 years ago

I ran an active manager under massif for a few minutes, forcing repeated Zabbix polling. The memory usage of the mgr could be seen increasing as I was doing this. The massif output is attached. There was little else happening on the cluster at the time, in particular no other mgr activity. Let me know if that is sufficient...

Actions #28

Updated by Nitzan Mordechai over 2 years ago

I'm examining your massif file and trying to come up with a fix. I still was not able to make the memory leak quickly enough. Any chance you are using mgr_ttl_cache_expire_seconds?

Actions #29

Updated by Chris Palmer over 2 years ago

Thanks. There are no mgr-related tuning parameters set, and specifically mgr_ttl_cache_expire_seconds has not been set. Let me know if I can do anything else.

Actions #30

Updated by Adrien Georget over 2 years ago

Any news about this issue?
We have 2 clusters (16.2.14) heavily affected by this memory leak.
Is there a way to limit memory usage for MGR?

Same thing happened with the restful module disabled

    "enabled_modules": [
        "cephadm",
        "dashboard",
        "iostat",
        "nfs",
        "prometheus" 
    ]

Actions #31

Updated by Nitzan Mordechai over 2 years ago

Adrien Georget wrote:

Any news about this issue?
We have 2 clusters (16.2.14) heavily affected by this memory leak.
Is there a way to limit memory usage for MGR?

Same thing happened with the restful module disabled
[...]

I'm still working on this issue; I couldn't make the mgr leak memory at the rates above.
Can you also send massif output with the restful module disabled in the mgr?

Actions #32

Updated by Andrea Bolzonella over 2 years ago

Adrien Georget wrote:

Is there a way to limit memory usage for MGR?

As a temporary solution, I have limited the memory usage in the systemd unit. The MGR will be killed before it consumes all the memory and compromises the entire node.
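A sketch of that kind of limit as a systemd drop-in. The unit name and the 4G value are illustrative, not from the report; pick a ceiling that suits your node.

```
# Illustrative drop-in, e.g. created with: systemctl edit ceph-mgr@myhost.service
# MemoryMax makes the kernel kill the mgr once it crosses the ceiling,
# long before it can exhaust the whole node; the unit's Restart= policy
# then brings it back up.
[Service]
MemoryMax=4G
```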

Actions #33

Updated by Nitzan Mordechai over 2 years ago

Adrien, can you please attach a massif output file from a mgr that is leaking memory (one without restful enabled)?

Actions #34

Updated by Zakhar Kirpichenko over 2 years ago

Our 16.2.14 cluster is affected as well. Modules enabled:

"always_on_modules": [
"balancer",
"crash",
"devicehealth",
"orchestrator",
"pg_autoscaler",
"progress",
"rbd_support",
"status",
"telemetry",
"volumes"
],
"enabled_modules": [
"cephadm",
"dashboard",
"iostat",
"prometheus",
"restful"
],

No special mgr settings, all defaults.

Actions #35

Updated by Nitzan Mordechai over 2 years ago

  • Status changed from In Progress to Fix Under Review
  • Assignee changed from Juan Miguel Olmo Martínez to Nitzan Mordechai
  • Pull request ID set to 54634
Actions #36

Updated by Konstantin Shalygin over 2 years ago

  • Target version set to v19.0.0
  • Source set to Community (user)
  • Backport set to pacific Quincy reef
  • Affected Versions v16.2.14, v17.2.7, v18.2.1 added
Actions #37

Updated by Andrea Bolzonella over 2 years ago

Hello folks

After upgrading to version 16.2.13, we encountered MGR OOM issues. It's worth noting that the restful module was not enabled at the time.
The MGR would experience OOM issues about once a week. The memory usage would start increasing for no apparent reason, and in less than an hour it would take up all 300GiB.
However, when we upgraded to Quincy, the issue disappeared.

Actions #38

Updated by A. Saber Shenouda over 2 years ago

Hi Team,

We are affected as well. Since we upgraded to 16.2.14 on two different clusters, ceph-mgr OOMs and gets killed. It's random, happening 3 to 5 times per month. We did not have this issue on 16.2.11.

Actions #39

Updated by xiaobao wen about 2 years ago

Hello.

We suspect we have encountered a similar problem.
About 370GB of memory was consumed in five minutes. The OOM killer was not triggered, and the server was finally restarted to resolve the problem.
After checking the logs: the mgr stopped printing logs when the memory decrease started (2023-12-26T02:23), and rgw stopped printing logs when memory was exhausted (2023-12-26T02:28). The OSDs are limited to 16GB of memory, so we suspect that the mgr has a memory leak.

[root@bd-hdd03-node01 deeproute]# ceph version
ceph version 16.2.13 (5378749ba6be3a0868b51803968ee9cde4833a3e) pacific (stable)

[root@bd-hdd03-node01 deeproute]# ceph mgr module ls
{
    "always_on_modules": [
        "balancer",
        "crash",
        "devicehealth",
        "orchestrator",
        "pg_autoscaler",
        "progress",
        "rbd_support",
        "status",
        "telemetry",
        "volumes" 
    ],
    "enabled_modules": [
        "iostat",
        "nfs",
        "prometheus",
        "restful" 
    ],

Actions #40

Updated by Konstantin Shalygin about 2 years ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport changed from pacific Quincy reef to pacific quincy reef
Actions #41

Updated by Upkeep Bot about 2 years ago

  • Copied to Backport #63977: reef: memory leak (RESTful module, maybe others?) added
Actions #42

Updated by Upkeep Bot about 2 years ago

  • Copied to Backport #63978: pacific: memory leak (RESTful module, maybe others?) added
Actions #43

Updated by Upkeep Bot about 2 years ago

  • Copied to Backport #63979: quincy: memory leak (RESTful module, maybe others?) added
Actions #45

Updated by A. Saber Shenouda almost 2 years ago

Hi,

It seems that the ceph-mgr OOM issue happened again on 16.2.15. We had a ceph-mgr OOM this morning.

I have attached the logs.

Actions #46

Updated by Nitzan Mordechai almost 2 years ago

Waiting for https://github.com/ceph/ceph/pull/54984 to merge and be backported.

Actions #47

Updated by Adrien Georget over 1 year ago

Hi,
Same observation: 16.2.15 did not fix this issue.
ceph-mgr still crashes with OOM.

Jul 19 03:46:14 ceph-mgr[2711105]: tcmalloc: large alloc 1233903616 bytes == 0x55c9a211c000 @  0x7f5d4c8df760 0x7f5d4c900a62 0x7f5d55ea7988 0x7f5d55ed74d5 0x55c8d30128bb 0x55c8d3012ba0 0x55c8d2f33d9d 0x7f5d55f55057 0x7f5d55f55f48 0x7f5d55f32f08 0x>
Jul 19 03:46:31 ceph-mgr[2711105]: tcmalloc: large alloc 1542381568 bytes == 0x55c933098000 @  0x7f5d4c8df760 0x7f5d4c900a62 0x7f5d55ea7988 0x7f5d55ed74d5 0x55c8d30128bb 0x55c8d3012ba0 0x55c8d2f33d9d 0x7f5d55f55057 0x7f5d55f55f48 0x7f5d55f32f08 0x>
Jul 19 03:46:40 ceph-mgr[2711105]: tcmalloc: large alloc 1927979008 bytes == 0x55c9eb9da000 @  0x7f5d4c8df760 0x7f5d4c900a62 0x7f5d55ea7988 0x7f5d55ed74d5 0x55c8d30128bb 0x55c8d3012ba0 0x55c8d2f33d9d 0x7f5d55f55057 0x7f5d55f55f48 0x7f5d55f32f08 0x>
Jul 19 03:47:04 ceph-mgr[2711105]: tcmalloc: large alloc 2409979904 bytes == 0x55ca5e884000 @  0x7f5d4c8df760 0x7f5d4c900a62 0x7f5d55ea7988 0x7f5d55ed74d5 0x55c8d30128bb 0x55c8d3012ba0 0x55c8d2f33d9d 0x7f5d55f55057 0x7f5d55f55f48 0x7f5d55f32f08 0x>
Jul 19 03:47:44 ceph-mgr[2711105]: tcmalloc: large alloc 3012476928 bytes == 0x55c9a211c000 @  0x7f5d4c8df760 0x7f5d4c900a62 0x7f5d55ea7988 0x7f5d55ed74d5 0x55c8d30128bb 0x55c8d3012ba0 0x55c8d2f33d9d 0x7f5d55f55057 0x7f5d55f55f48 0x7f5d55f32f08 0x>
Jul 19 03:50:41 ceph-mgr[2711105]: tcmalloc: large alloc 3765600256 bytes == 0x55caee2da000 @  0x7f5d4c8df760 0x7f5d4c900a62 0x7f5d55ea7988 0x7f5d55ed74d5 0x55c8d30128bb 0x55c8d3012ba0 0x55c8d2f33d9d 0x7f5d55f55057 0x7f5d55f55f48 0x7f5d55f32f08 0x>
Jul 19 03:51:02 systemd[1]: ceph-mgr@adm20.service: Main process exited, code=killed, status=9/KILL
Jul 19 03:51:02 systemd[1]: ceph-mgr@adm20.service: Failed with result 'signal'.
Jul 19 03:51:12 systemd[1]: ceph-mgr@adm20.service: Service RestartSec=10s expired, scheduling restart.
Jul 19 03:51:12 systemd[1]: ceph-mgr@adm20.service: Scheduled restart job, restart counter is at 1.
Jul 19 03:51:12 systemd[1]: Stopped Ceph cluster manager daemon.
Jul 19 03:51:12 systemd[1]: Started Ceph cluster manager daemon.
Actions #48

Updated by A. Saber Shenouda over 1 year ago

Hi,

Is there a workaround? It happens once or twice every month on 16.2.14 and 16.2.15.

Actions #49

Updated by Konstantin Shalygin over 1 year ago

  • Subject changed from memory leak (RESTful module, maybe others?) to ceph-mgr memory leak (RESTful module, maybe others?)
  • Crash signature (v1) updated (diff)
Actions #50

Updated by Upkeep Bot over 1 year ago

  • Tags (freeform) set to backport_processed
Actions #51

Updated by Konstantin Shalygin over 1 year ago · Edited

  • File Screenshot 2024-07-26 at 13.50.39.png added
  • File Screenshot 2024-07-26 at 13.51.12.png added
  • File Screenshot 2024-07-26 at 13.52.49.png added

xiaobao wen wrote in #note-39:

About 370GB of memory was consumed in five minutes
[...]

We hit the same problem on 16.2.15 that xiaobao described. Memory was gone in seconds. I think there is some other problem here (not a leak, but a sudden exhaustion of memory); maybe the developers can tell, or maybe we need to move this problem to another ticket.

{
    "always_on_modules": [
        "balancer",
        "crash",
        "devicehealth",
        "orchestrator",
        "pg_autoscaler",
        "progress",
        "rbd_support",
        "status",
        "telemetry",
        "volumes" 
    ],
    "enabled_modules": [
        "prometheus" 
    ]
}

Jul 24 17:43:31 example.com ceph-mgr[2399268]: 10.10.1.1 - - [24/Jul/2024:17:43:31] "GET /metrics HTTP/1.1" 200 12356053 "" "Prometheus/2.45.3" 
Jul 24 17:43:46 example.com ceph-mgr[2399268]: 10.10.1.1 - - [24/Jul/2024:17:43:46] "GET /metrics HTTP/1.1" 200 12356049 "" "Prometheus/2.45.3" 
Jul 24 17:44:01 example.com ceph-mgr[2399268]: 10.10.1.1 - - [24/Jul/2024:17:44:01] "GET /metrics HTTP/1.1" 200 12356054 "" "Prometheus/2.45.3" 
Jul 24 17:44:16 example.com ceph-mgr[2399268]: 10.10.1.1 - - [24/Jul/2024:17:44:16] "GET /metrics HTTP/1.1" 200 12356052 "" "Prometheus/2.45.3" 
Jul 24 17:44:43 example.com ceph-mgr[2399268]: src/page_heap_allocator.h:74] FATAL ERROR: Out of memory trying to allocate internal tcmalloc data (bytes, object-size) 131072 48
Jul 24 17:44:43 example.com ceph-mgr[2399268]: *** Caught signal (Aborted) **
Jul 24 17:44:43 example.com ceph-mgr[2399268]:  in thread 7ff4a1a20700 thread_name:prometheus
Jul 24 17:44:43 example.com ceph-mgr[2399268]:  ceph version 16.2.15 (618f440892089921c3e944a991122ddc44e60516) pacific (stable)
Jul 24 17:44:43 example.com ceph-mgr[2399268]:  1: /lib64/libpthread.so.0(+0x12d20) [0x7ff4cbff7d20]
Jul 24 17:44:43 example.com ceph-mgr[2399268]:  2: gsignal()
Jul 24 17:44:43 example.com ceph-mgr[2399268]:  3: abort()
Jul 24 17:44:43 example.com ceph-mgr[2399268]:  4: /lib64/libtcmalloc.so.4(+0x18499) [0x7ff4ccb73499]
Jul 24 17:44:43 example.com ceph-mgr[2399268]:  5: (tcmalloc::NewSpan(unsigned long, unsigned long)+0x109) [0x7ff4ccb84eb9]
Jul 24 17:44:43 example.com ceph-mgr[2399268]:  6: (tcmalloc::PageHeap::Carve(tcmalloc::Span*, unsigned long)+0x5e) [0x7ff4ccb83cde]
Jul 24 17:44:43 example.com ceph-mgr[2399268]:  7: (tcmalloc::PageHeap::New(unsigned long)+0x13) [0x7ff4ccb847d3]
Jul 24 17:44:43 example.com ceph-mgr[2399268]:  8: (tcmalloc::CentralFreeList::Populate()+0x59) [0x7ff4ccb826d9]
Jul 24 17:44:43 example.com ceph-mgr[2399268]:  9: (tcmalloc::CentralFreeList::FetchFromOneSpansSafe(int, void**, void**)+0x38) [0x7ff4ccb828c8]
Jul 24 17:44:43 example.com ceph-mgr[2399268]:  10: (tcmalloc::CentralFreeList::RemoveRange(void**, void**, int)+0x81) [0x7ff4ccb82971]
Jul 24 17:44:43 example.com ceph-mgr[2399268]:  11: (tcmalloc::ThreadCache::FetchFromCentralCache(unsigned int, int, void* (*)(unsigned long))+0x73) [0x7ff4ccb863e3]
Jul 24 17:44:43 example.com ceph-mgr[2399268]:  12: /lib64/libpython3.6m.so.1.0(+0xedafa) [0x7ff4d613cafa]
Jul 24 17:44:43 example.com ceph-mgr[2399268]:  13: PyDict_New()
Jul 24 17:44:43 example.com ceph-mgr[2399268]:  14: (PyFormatter::open_object_section(std::basic_string_view<char, std::char_traits<char> >)+0x1c) [0x564aa43c4b8c]
Jul 24 17:44:43 example.com ceph-mgr[2399268]:  15: (ActivePyModules::get_daemon_health_metrics()+0x15d) [0x564aa42e5d9d]
Jul 24 17:44:43 example.com ceph-mgr[2399268]:  16: /lib64/libpython3.6m.so.1.0(+0x19bfb7) [0x7ff4d61eafb7]
Jul 24 17:44:43 example.com ceph-mgr[2399268]:  17: _PyEval_EvalFrameDefault()
Jul 24 17:44:43 example.com ceph-mgr[2399268]:  18: /lib64/libpython3.6m.so.1.0(+0x178dc8) [0x7ff4d61c7dc8]
Jul 24 17:44:43 example.com ceph-mgr[2399268]:  19: /lib64/libpython3.6m.so.1.0(+0x19c257) [0x7ff4d61eb257]
Jul 24 17:44:43 example.com ceph-mgr[2399268]:  20: _PyEval_EvalFrameDefault()
Jul 24 17:44:43 example.com ceph-mgr[2399268]:  21: /lib64/libpython3.6m.so.1.0(+0x178dc8) [0x7ff4d61c7dc8]
Jul 24 17:44:43 example.com ceph-mgr[2399268]:  22: /lib64/libpython3.6m.so.1.0(+0x19c257) [0x7ff4d61eb257]
Jul 24 17:44:43 example.com ceph-mgr[2399268]:  23: _PyEval_EvalFrameDefault()
Jul 24 17:44:43 example.com ceph-mgr[2399268]:  24: /lib64/libpython3.6m.so.1.0(+0xf9a74) [0x7ff4d6148a74]
Jul 24 17:44:43 example.com ceph-mgr[2399268]:  25: /lib64/libpython3.6m.so.1.0(+0x19b02f) [0x7ff4d61ea02f]
Jul 24 17:44:43 example.com ceph-mgr[2399268]:  26: PyObject_Call()
Jul 24 17:44:43 example.com ceph-mgr[2399268]:  27: _PyEval_EvalFrameDefault()
Jul 24 17:44:43 example.com ceph-mgr[2399268]:  28: /lib64/libpython3.6m.so.1.0(+0xfa3e6) [0x7ff4d61493e6]
Jul 24 17:44:43 example.com ceph-mgr[2399268]:  29: /lib64/libpython3.6m.so.1.0(+0x178fb0) [0x7ff4d61c7fb0]
Jul 24 17:44:43 example.com ceph-mgr[2399268]:  30: /lib64/libpython3.6m.so.1.0(+0x19c257) [0x7ff4d61eb257]
Jul 24 17:44:43 example.com ceph-mgr[2399268]:  31: _PyEval_EvalFrameDefault()
Jul 24 17:44:43 example.com ceph-mgr[2399268]:  32: _PyFunction_FastCallDict()
Jul 24 17:44:43 example.com ceph-mgr[2399268]:  33: _PyObject_FastCallDict()
Jul 24 17:44:43 example.com ceph-mgr[2399268]:  34: /lib64/libpython3.6m.so.1.0(+0x10dbe0) [0x7ff4d615cbe0]
Jul 24 17:44:43 example.com ceph-mgr[2399268]:  35: PyObject_Call()
Jul 24 17:44:43 example.com ceph-mgr[2399268]:  36: _PyEval_EvalFrameDefault()
Jul 24 17:44:43 example.com ceph-mgr[2399268]:  37: /lib64/libpython3.6m.so.1.0(+0x178dc8) [0x7ff4d61c7dc8]
Jul 24 17:44:43 example.com ceph-mgr[2399268]:  38: /lib64/libpython3.6m.so.1.0(+0x19c257) [0x7ff4d61eb257]
Jul 24 17:44:43 example.com ceph-mgr[2399268]:  39: _PyEval_EvalFrameDefault()
Jul 24 17:44:43 example.com ceph-mgr[2399268]:  40: /lib64/libpython3.6m.so.1.0(+0x178dc8) [0x7ff4d61c7dc8]
Jul 24 17:44:43 example.com ceph-mgr[2399268]:  41: /lib64/libpython3.6m.so.1.0(+0x19c257) [0x7ff4d61eb257]
Jul 24 17:44:43 example.com ceph-mgr[2399268]:  42: _PyEval_EvalFrameDefault()
Jul 24 17:44:43 example.com ceph-mgr[2399268]:  43: _PyFunction_FastCallDict()
Jul 24 17:44:43 example.com ceph-mgr[2399268]:  44: _PyObject_FastCallDict()
Jul 24 17:44:43 example.com ceph-mgr[2399268]:  45: /lib64/libpython3.6m.so.1.0(+0x10dbe0) [0x7ff4d615cbe0]
Jul 24 17:44:43 example.com ceph-mgr[2399268]:  46: PyObject_Call()
Jul 24 17:44:43 example.com ceph-mgr[2399268]:  47: /lib64/libpython3.6m.so.1.0(+0x20d3b2) [0x7ff4d625c3b2]
Jul 24 17:44:43 example.com ceph-mgr[2399268]:  48: /lib64/libpython3.6m.so.1.0(+0x1b3514) [0x7ff4d6202514]
Jul 24 17:44:43 example.com ceph-mgr[2399268]:  49: /lib64/libpthread.so.0(+0x81ca) [0x7ff4cbfed1ca]
Jul 24 17:44:43 example.com ceph-mgr[2399268]:  50: clone()
Jul 24 17:44:43 example.com ceph-mgr[2399268]:  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
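The trace shows `ActivePyModules::get_daemon_health_metrics()` calling `PyFormatter::open_object_section()` / `PyDict_New()` at the moment tcmalloc can no longer allocate. As a hedged, pure-Python illustration (not the actual mgr code; `metrics_callback` and `bytes_grown` are made-up names), a handler that retains a reference to each call's result shows exactly this kind of unbounded growth, which is what a refcount leak in a C extension looks like from the allocator's point of view:

```python
import gc
import tracemalloc

_retained = []  # simulates state that a buggy handler never releases

def metrics_callback():
    # Build per-daemon metric dicts, as a health-metrics handler would.
    result = [{"daemon": f"osd.{i}", "type": "SLOW_OPS", "value": 0}
              for i in range(500)]
    _retained.append(result)  # the leak: every call's result stays alive
    return len(result)

def bytes_grown(fn, calls=5):
    """Bytes of traced allocation growth across repeated calls to fn."""
    gc.collect()
    tracemalloc.start()
    fn()
    baseline, _ = tracemalloc.get_traced_memory()
    for _ in range(calls):
        fn()
    current, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return current - baseline

print(bytes_grown(metrics_callback) > 0)  # a leaking callback keeps growing
```

With a correct handler the retained size plateaus after the first call; here it grows linearly with the number of calls, matching the reported ~200MB/hr climb.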


Added stack_sig value from ceph-crash

Actions #52

Updated by Konstantin Shalygin over 1 year ago

  • File deleted (Screenshot 2024-07-26 at 13.51.12.png)
Actions #53

Updated by Konstantin Shalygin over 1 year ago

  • File deleted (Screenshot 2024-07-26 at 13.50.39.png)
Actions #54

Updated by Konstantin Shalygin over 1 year ago

  • File deleted (Screenshot 2024-07-26 at 13.52.49.png)
Actions #56

Updated by Raimund Sacherer over 1 year ago

Hello,

We have seen this on a `16.2.7-126.el8cp` cluster as well.

Over the span of a year the MGR consumed more and more memory, growing slowly but continuously; it is now using 162G of RSS and has 175G of VSZ allocated.

We have the restful plugin enabled.

Best Regards
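For anyone tracking growth like this, a minimal way to log a process's resident memory over time (a sketch assuming Linux; `rss_kib` and `sample` are hypothetical helper names, not Ceph tooling) is to read `VmRSS` from `/proc`, much like the 10-minute interval log in the original report:

```python
import time

def rss_kib(pid="self"):
    """Resident set size in KiB, read from /proc/<pid>/status (Linux)."""
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])  # value is reported in kB
    raise RuntimeError("VmRSS not found")

def sample(pid="self", interval=600, count=3):
    """Print RSS every `interval` seconds (600s = the 10-minute cadence)."""
    for _ in range(count):
        print(f"{time.strftime('%H:%M:%S')} VmRSS={rss_kib(pid)} kB")
        time.sleep(interval)

print(rss_kib() > 0)  # the current process always has nonzero RSS
```

Pointing `pid` at the active ceph-mgr and graphing the samples makes the steady climb, and any plateau after a mitigation, immediately visible.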

Actions #57

Updated by Nitzan Mordechai over 1 year ago

@rsachere@redhat.com I need to backport https://github.com/ceph/ceph/pull/54984 once it has been tested.

@Laura Flores can we add it to one of the tests soon?

Actions #58

Updated by Laura Flores over 1 year ago

Nitzan Mordechai wrote in #note-57:

@rsachere@redhat.com I need to backport https://github.com/ceph/ceph/pull/54984 once it has been tested.

@Laura Flores can we add it to one of the tests soon?

@Nitzan Mordechai sure! I added the core label to it; that part was missing and is needed to make sure it gets placed in one of Yuri's batches for core.

Actions #59

Updated by Nitzan Mordechai over 1 year ago

  • Copied to Bug #67642: ceph-mgr memory leak (RESTful module, maybe others?) added
Actions #60

Updated by Beom-Seok Park over 1 year ago

We were operating a cluster with our own build based on version 16.2.10 and did not experience any OOM issues with ceph-mgr.
However, after upgrading to a version built on 16.2.15, we began to encounter OOM problems. For reference, the Prometheus module is enabled.
Typically, these issues occurred within two weeks of starting ceph-mgr.
To investigate, we profiled with heaptrack and discovered a memory leak in get_daemon_health_metrics (https://github.com/ceph/ceph/pull/48843).
We reverted the commits related to get_daemon_health_metrics and rebuilt.
It has now been 45 days since we started running ceph-mgr, and it remains stable.
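heaptrack attributes growth to individual allocation sites; the analogous pure-Python check (a sketch of the technique, not what was run against ceph-mgr; `suspect` and `cache` are made-up names) is a `tracemalloc` snapshot diff, which points at the exact line retaining memory:

```python
import tracemalloc

cache = []  # stands in for whatever state the leaking code retains

def suspect():
    # Each call allocates ~80 KB and keeps it alive via `cache`.
    cache.append([0] * 10_000)

tracemalloc.start()
before = tracemalloc.take_snapshot()
for _ in range(10):
    suspect()
after = tracemalloc.take_snapshot()

# Diff is sorted by absolute size change; the top entry names the leaking line.
top = after.compare_to(before, "lineno")[0]
print(top.size_diff > 0)
tracemalloc.stop()
```

A site whose `size_diff` keeps growing across successive snapshots is the candidate to revert or fix, which is how the `get_daemon_health_metrics` leak above was isolated with heaptrack.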

Actions #61

Updated by Konstantin Shalygin over 1 year ago

Beom-Seok Park wrote in #note-60:

We were operating a cluster with our own build based on version 16.2.10 and did not experience any OOM issues with ceph-mgr.
However, after upgrading to a version built on 16.2.15, we began to encounter OOM problems. For reference, the Prometheus module is enabled.
Typically, these issues occurred within two weeks of starting ceph-mgr.
To investigate, we profiled with heaptrack and discovered a memory leak in get_daemon_health_metrics (https://github.com/ceph/ceph/pull/48843).
We reverted the commits related to get_daemon_health_metrics and rebuilt.
It has now been 45 days since we started running ceph-mgr, and it remains stable.

Can you create a dedicated issue for your investigation, please? This seems exactly like our reports.

Actions #62

Updated by Beom-Seok Park over 1 year ago

Konstantin Shalygin wrote in #note-61:

Can you create a dedicated issue for your investigation, please? This seems exactly like our reports.

I have created a new issue.
https://tracker.ceph.com/issues/68989

Actions #63

Updated by Konstantin Shalygin over 1 year ago

  • Precedes Bug #68989: ceph-mgr memory leak in prometheus module added
Actions #64

Updated by Konstantin Shalygin over 1 year ago

  • Crash signature (v1) updated (diff)
Actions #65

Updated by Konstantin Shalygin over 1 year ago

  • Status changed from Pending Backport to Resolved
  • % Done changed from 0 to 100
Actions #66

Updated by Upkeep Bot 8 months ago

  • Merge Commit set to 13e02409a78c6ebccea7d46c8a5bff7bec61b260
  • Fixed In set to v19.0.0-597-g13e02409a78
  • Released In set to v19.2.0~956
  • Upkeep Timestamp set to 2025-07-12T13:15:34+00:00