
mgr: fix PyObject* refcounting in TTLCache and cleanup logic#65245

Merged
NitzanMordhai merged 2 commits into ceph:main from
NitzanMordhai:wip-nitzan-mgr-api-refcount-memory-leak-fixes
Dec 2, 2025

Conversation

@NitzanMordhai
Contributor

@NitzanMordhai NitzanMordhai commented Aug 26, 2025

Fix incorrect reference counting and memory retention behavior in TTLCache
when storing PyObject* values.
Previously, TTLCache::insert did not increment the reference count,
and erase / clear did not correctly decref the values, leading
to use-after-free or leaks depending on usage.

Changes:

  • Move Py_INCREF from cacheable_get_python() to TTLCache::insert()
  • Add TTLCache::clear() method for proper memory cleanup
  • Ensure TTLCache::get() returns a new reference
  • Fix misuse of std::move on c_str() in PyJSONFormatter

These changes prevent both memory leaks and use-after-free errors when
mgr modules use cached Python objects.

Fixes: https://tracker.ceph.com/issues/68989
Signed-off-by: Nitzan Mordechai nmordech@redhat.com

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)

Member

@epuertat epuertat left a comment


Looks mostly good, Nitzan! Just left a few comments and questions. Hope they're useful.

import threading
import time
import enum
import gc
Member


Instead of bringing Python's explicit gc into play (which can be a risky guest), we could use Python's weakref to avoid increasing the ref counting on the Python side. For example, that healthcheck data structure could be a weakref.WeakValueDictionary object.
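A minimal sketch of that idea, assuming the health check events are instances of a class (the `HealthCheckEvent` class here is a hypothetical stand-in; note that values stored in a `weakref.WeakValueDictionary` must be weak-referenceable, which plain dicts and ints are not):

```python
import weakref

# Hypothetical stand-in for the healthcheck event type.
class HealthCheckEvent:
    def __init__(self, name):
        self.name = name

cache = weakref.WeakValueDictionary()
event = HealthCheckEvent("OSD_DOWN")
cache["OSD_DOWN"] = event

assert "OSD_DOWN" in cache       # alive while a strong reference exists
del event                        # drop the only strong reference
assert "OSD_DOWN" not in cache   # CPython refcounting evicts the entry immediately
```

The cache then never keeps an object alive on its own, so no explicit gc call is needed to bound memory.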

I thought that the functools cache helpers already used weak refs, but I was wrong. It's the non-standard cachetools library that uses them (for key values, as part of function memoization).

Contributor Author


I added an explicit gc.collect() to maintain a low memory footprint in long-lived mgr sessions. That said, I’m open to exploring weakref-based caches or LRU strategies in a follow-up.

Member


Invoking gc.collect() in Python triggers a complete scan of all tracked in-memory data, including deeply nested dicts and large collections. If we're not keeping complex data structures with circular references (where weakrefs are usually the best approach), there's no benefit in invoking gc.collect(): ref-counting is enough to deallocate resources. Is this based on real benchmarking?
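This point can be illustrated directly: without reference cycles, CPython's refcounting frees objects as soon as the last reference goes away and the cyclic collector finds nothing to do; only actual cycles need gc.collect(). The `Node` class below is purely illustrative:

```python
import gc

class Node:
    def __init__(self):
        self.ref = None

gc.collect()                     # start from a clean state

# No cycle: refcounting frees the instance as soon as the name is dropped.
n = Node()
del n
no_cycle_collected = gc.collect()

# Cycle: refcounts never reach zero, so only the cyclic collector reclaims it.
a, b = Node(), Node()
a.ref, b.ref = b, a
del a, b
cycle_collected = gc.collect()

assert no_cycle_collected == 0   # plain refcounting already cleaned up
assert cycle_collected > 0       # the two-node cycle needed gc.collect()
```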

Contributor Author


I can remove the gc.collect() now, after changing the refcounts in the cache itself. According to the massif output, the mgr API no longer leaks that memory.

# Sort by last_seen and remove oldest entries
sorted_checks = sorted(self.healthcheck.values(), key=lambda x: x.last_seen)
for check in sorted_checks[:len(self.healthcheck) - self.max_entries]:
    self.healthcheck.pop(check.name, None)
Member


Why not use the functools LRU cache (with weakrefs)? With that we don't need a TTL (weakrefs will be voided on the mgr C++ side), and that cache explicitly handles the size limit.

Contributor Author


@epuertat sorry, I missed that one; can you please explain what you mean by that?

This is for historical health monitoring: the mgr C++ side gives us the current state, but we want to maintain a history of health checks over time. We need to prevent the history from growing beyond max_entries, and currently we also expire old entries after a set time.

I think we could get rid of the TTL and only maintain max_entries limit - is that what you mean? Or are you suggesting a different approach for the historical tracking itself?

Member


Yeah, I was suggesting that we simplify that struct. What about a collections.deque? It's basically a fixed-size FIFO queue. I assume the events are inserted in order (maybe I'm wrong here), so if that's true, then we don't need an expiration time, just regular FIFO eviction.
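A sketch of the deque idea, assuming ordered insertion (the event names are illustrative):

```python
from collections import deque

# Fixed-size FIFO: with maxlen set, appending to a full deque silently
# evicts the oldest entry from the opposite end.
history = deque(maxlen=3)
for event in ["OSD_DOWN", "PG_DEGRADED", "MON_CLOCK_SKEW", "OSD_NEARFULL"]:
    history.append(event)

# "OSD_DOWN" was evicted when the fourth event arrived.
assert list(history) == ["PG_DEGRADED", "MON_CLOCK_SKEW", "OSD_NEARFULL"]
```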

Contributor Author


Got it!

TTLCache(uint16_t ttl_ = 0, uint16_t size = UINT16_MAX, float spread = 0.25)
: TTLCacheBase<Key, PyObject*>(ttl_, size, spread) {}
~TTLCache(){};
~TTLCache(){ clear(); };
Member


Should we implement this method for the base Cache instead?

Contributor Author


@epuertat I tried to move it to the base cache. The problem is that Cache is templated on <Key, Value>, not <Key, PyObject*>; that means we would either need to duplicate it for <Key, PyObject*> or add a check for Value == PyObject*, and that starts to be more complicated than implementing it in TTLCache<Key, PyObject*>.

Member


Oh, I thought that the base class was templated to support arbitrary key/value types.

@NitzanMordhai NitzanMordhai changed the title mgr/TTLCache: fix PyObject* lifetime management and cleanup logic mgr: fix PyObject* refcounting in TTLCache and cleanup logic Aug 28, 2025
@NitzanMordhai NitzanMordhai force-pushed the wip-nitzan-mgr-api-refcount-memory-leak-fixes branch from 0757719 to 52b83eb on August 29, 2025 08:17
@NitzanMordhai
Contributor Author

@epuertat Thanks a lot for reviewing! I fixed and repushed, please take a look.

@NitzanMordhai NitzanMordhai requested a review from epuertat August 29, 2025 08:19
These limits are configurable via the following runtime options:

``mgr/prometheus/healthcheck_history_max_entries`` - the maximum number of health check events to retain in memory (default: 1000).
``mgr/prometheus/heal thcheck_history_stale_ttl`` - time-to-live (in seconds) for inactive health checks before they are pruned (default: 3600).
Contributor


Suggested change
``mgr/prometheus/heal thcheck_history_stale_ttl`` - time-to-live (in seconds) for inactive health checks before they are pruned (default: 3600).
``mgr/prometheus/healthcheck_history_stale_ttl`` - time-to-live (in seconds) for inactive health checks before they are pruned (default: 3600).

@NitzanMordhai NitzanMordhai force-pushed the wip-nitzan-mgr-api-refcount-memory-leak-fixes branch from 52b83eb to 55fbc4d on August 29, 2025 09:21
@NitzanMordhai NitzanMordhai force-pushed the wip-nitzan-mgr-api-refcount-memory-leak-fixes branch 4 times, most recently from 7daac31 to 679b2fb on September 1, 2025 12:27
Member

@epuertat epuertat left a comment


Everything else looks good!

Comment on lines +206 to 226
with self.lock:
changes_made = False
names = set(self.healthcheck) | set(current_checks)

for name in names:
present = name in current_checks
check = self.healthcheck.get(name)
if check is None:
if present:
info = current_checks[name]
self.healthcheck[name] = HealthCheckEvent(
name=name,
severity=info.get('severity'),
first_seen=now,
last_seen=now,
count=1,
active=True
)
changes_made = True

continue
Member


Does this work?? Sorry for the misleading suggestion: I just realized that the self.healthcheck cache is used for random access (by check name), so the deque won't work, since it's index-based (does it?? 🤯 ).

We would need a mixture of map/dict with a queue, something like:

from collections import OrderedDict

class LRUCacheDict(OrderedDict):
    def __init__(self, maxsize, *args, **kwargs):
        self.maxsize = maxsize
        super().__init__(*args, **kwargs)

    def __setitem__(self, key, value):
        if key in self:
            del self[key]  # refresh position
        elif len(self) >= self.maxsize:
            self.popitem(last=False)  # drop oldest
        super().__setitem__(key, value)
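A quick self-contained check of the eviction behavior (the class is repeated here so the snippet runs standalone; the key names are illustrative):

```python
from collections import OrderedDict

class LRUCacheDict(OrderedDict):
    """Dict with random access by key plus eviction of the oldest entry at maxsize."""
    def __init__(self, maxsize, *args, **kwargs):
        self.maxsize = maxsize
        super().__init__(*args, **kwargs)

    def __setitem__(self, key, value):
        if key in self:
            del self[key]              # refresh position
        elif len(self) >= self.maxsize:
            self.popitem(last=False)   # drop oldest
        super().__setitem__(key, value)

checks = LRUCacheDict(maxsize=2)
checks["OSD_DOWN"] = 1
checks["PG_DEGRADED"] = 2
checks["OSD_DOWN"] = 3    # re-insert refreshes its position to newest
checks["MON_DOWN"] = 4    # full, so the oldest entry (PG_DEGRADED) is evicted
assert list(checks) == ["OSD_DOWN", "MON_DOWN"]
```

So random access by check name still works, while the size bound is enforced on insert.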

Contributor Author


@epuertat ok, I'll redo the LRUCacheDict!

@NitzanMordhai NitzanMordhai force-pushed the wip-nitzan-mgr-api-refcount-memory-leak-fixes branch 2 times, most recently from bad51b3 to 68a6bc8 on September 1, 2025 13:54
@SrinivasaBharath
Contributor

@NitzanMordhai - Please resolve check issues, and feel free to proceed with merging the PR.

@NitzanMordhai NitzanMordhai force-pushed the wip-nitzan-mgr-api-refcount-memory-leak-fixes branch from 0362541 to e2bb649 on December 1, 2025 13:12
@NitzanMordhai
Contributor Author

jenkins test make check arm64

@NitzanMordhai
Contributor Author

jenkins test make check

This patch introduces several improvements to the Prometheus module:

 - Introduces `HealthHistory._prune()` to drop stale and inactive health checks.
  Limits the in-memory healthcheck dict to a configurable max_entries (default 1000).
  TTL for stale entries is configurable via `healthcheck_history_stale_ttl` (default 3600s).

 - Refactors HealthHistory.check() to use a unified iteration over known and current checks,
  improving concurrency and minimizing redundant updates.

 - Use cherrypy.tools.gzip instead of manual gzip.compress() for cleaner
  HTTP compression with proper header handling and client negotiation.

 - Introduces new module options:
    - `healthcheck_history_max_entries`

 - Add proper error handling for CherryPy engine startup failures
 - Remove os._exit monkey patch in favor of proper exception handling
 - Remove manual Content-Type header setting (CherryPy handles automatically)

Fixes: https://tracker.ceph.com/issues/68989
Signed-off-by: Nitzan Mordechai <nmordech@redhat.com>
Fix incorrect reference counting and memory retention behavior in TTLCache
when storing PyObject* values.
Previously, TTLCache::insert did not increment the reference count,
and `erase` / `clear` did not correctly decref the values, leading
to use-after-free or leaks depending on usage.

Changes:
- Move Py_INCREF from cacheable_get_python() to TTLCache::insert()
- Add `TTLCache::clear()` method for proper memory cleanup
- Ensure TTLCache::get() returns a new reference
- Fix misuse of std::move on c_str() in PyJSONFormatter

These changes prevent both memory leaks and use-after-free errors when
mgr modules use cached Python objects.

Fixes: https://tracker.ceph.com/issues/68989
Signed-off-by: Nitzan Mordechai <nmordech@redhat.com>
@NitzanMordhai NitzanMordhai force-pushed the wip-nitzan-mgr-api-refcount-memory-leak-fixes branch from e2bb649 to 7fadf6a on December 2, 2025 08:52
@NitzanMordhai NitzanMordhai merged commit 8df7e65 into ceph:main Dec 2, 2025
13 checks passed
@NitzanMordhai NitzanMordhai deleted the wip-nitzan-mgr-api-refcount-memory-leak-fixes branch December 2, 2025 13:31
@ljflores
Member

ljflores commented Dec 8, 2025

Hey @NitzanMordhai can you TAL at these two new issues?

https://tracker.ceph.com/issues/74148
https://tracker.ceph.com/issues/74149

Somehow they were missed in the QA review. For sure the second issue is related, and I'm pretty sure the first is as well.

NitzanMordhai added a commit to NitzanMordhai/ceph that referenced this pull request Dec 9, 2025
The HealthHistory.check() method acquires the lock and then calls
HealthHistory.save(), which also tries to acquire the same lock.
With a regular Lock(), the same thread blocks trying to re-acquire it (deadlock).
Switch to RLock to allow nested acquisition by the same thread.
PR ceph#65245 added the locks.

Fixes: https://tracker.ceph.com/issues/74148
Signed-off-by: Nitzan Mordechai <nmordech@ibm.com>
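The deadlock described in that commit and its fix can be sketched as follows (a minimal stand-in for the real HealthHistory, not the actual mgr code):

```python
import threading

class HealthHistory:
    """Minimal sketch of the re-entrant locking pattern."""
    def __init__(self):
        # RLock allows the same thread to acquire the lock recursively;
        # with a plain Lock(), check() would block forever when it calls save().
        self.lock = threading.RLock()
        self.saved = 0

    def save(self):
        with self.lock:       # nested acquisition by the same thread
            self.saved += 1

    def check(self):
        with self.lock:
            self.save()       # re-enters the lock; fine with RLock

h = HealthHistory()
h.check()
assert h.saved == 1
```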
NitzanMordhai added a commit to NitzanMordhai/ceph that referenced this pull request Dec 11, 2025
PR ceph#65245 dropped the header set for the standby module;
we should still have it.

Fixes: https://tracker.ceph.com/issues/74149
Signed-off-by: Nitzan Mordechai <nmordech@ibm.com>
