Skip to content

rgw/posix: add destructor to BucketCache to fix memory leaks#66850

Merged
mattbenjamin merged 1 commit intoceph:mainfrom
tchaikov:wip-rgw-posix-driver-silence-asan-warning
Feb 24, 2026
Merged

rgw/posix: add destructor to BucketCache to fix memory leaks#66850
mattbenjamin merged 1 commit intoceph:mainfrom
tchaikov:wip-rgw-posix-driver-silence-asan-warning

Conversation

@tchaikov
Copy link
Contributor

@tchaikov tchaikov commented Jan 9, 2026

The RGW POSIX driver's BucketCache was not cleaning up cached BucketCacheEntry objects at destruction, causing LeakSanitizer to report approximately 31MB of leaks in unittest_rgw_posix_driver.

Analysis confirmed the cache is properly bounded at runtime:

  • Cache limited to max_buckets (default 100) entries
  • LRU evicts and deletes entries when cache is full
  • Objects are recycled via Factory::recycle() when possible
  • No unbounded accumulation occurs during normal operation

However, at test shutdown, entries remaining in the cache were not being freed, causing LSan failures.

Fix by adding a destructor to BucketCache that:

  1. Uses the AVL cache's drain() method to remove all entries
  2. Calls lru.unref() on each entry to decrement refcount
  3. Triggers deletion when refcount reaches zero

The drain() method (cohort_lru.h:481) safely removes all entries from the AVL cache partitions and calls the supplied lambda on each, allowing proper cleanup without accessing private members.

This eliminates the memory leaks while maintaining the cache's performance characteristics during normal operation.

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands

You must only issue one Jenkins command per-comment. Jenkins does not understand
comments with more than one command.

@tchaikov tchaikov requested a review from a team as a code owner January 9, 2026 04:00
@github-actions github-actions bot added the rgw label Jan 9, 2026
@tchaikov tchaikov added the tests label Jan 9, 2026
@tchaikov tchaikov force-pushed the wip-rgw-posix-driver-silence-asan-warning branch from ff2f40f to ddfc13b Compare January 9, 2026 04:03
@tchaikov tchaikov mentioned this pull request Jan 9, 2026
14 tasks
@tchaikov
Copy link
Contributor Author

tchaikov commented Jan 9, 2026

jenkins test make check

@tchaikov
Copy link
Contributor Author

tchaikov commented Jan 9, 2026

jenkins test make check arm64

@tchaikov tchaikov force-pushed the wip-rgw-posix-driver-silence-asan-warning branch 2 times, most recently from 869d64a to 01023e9 Compare January 12, 2026 07:00
@tchaikov tchaikov requested a review from a team as a code owner January 12, 2026 07:00
@github-actions github-actions bot added the rbd label Jan 12, 2026
@tchaikov
Copy link
Contributor Author

this does not work. looking..

@tchaikov tchaikov marked this pull request as draft January 12, 2026 07:59
@tchaikov tchaikov removed request for a team January 12, 2026 07:59
Copy link
Contributor

@idryomov idryomov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like the changes to src/librbd/cache/pwl aren't related and have already been submitted as #66876.

@tchaikov tchaikov force-pushed the wip-rgw-posix-driver-silence-asan-warning branch from 01023e9 to 0290996 Compare January 12, 2026 12:47
@tchaikov tchaikov marked this pull request as ready for review January 12, 2026 12:48
The RGW POSIX driver's BucketCache had two memory leak issues:

1. BucketCache destructor did not clean up cached entries
2. list_bucket() had missing lru.unref() calls on early return paths

Issue 1: Missing destructor cleanup
------------------------------------
BucketCache did not have a destructor to clean up BucketCacheEntry
objects at destruction, causing LeakSanitizer to report ~22-31MB of
leaks in unittest_rgw_posix_driver and unittest_posix_bucket_cache.

The cache is properly bounded at runtime (max 100 entries with LRU
eviction), but entries remaining at test shutdown were not freed.

Issue 2: Missing unref calls in list_bucket()
----------------------------------------------
The list_bucket() function had three early return paths that failed
to call lru.unref() before returning:

1. Line 475: When marker not found (MDB_NOTFOUND)
2. Line 483: When bucket is empty (MDB_NOTFOUND)
3. Line 491: When iteration stops early (!again)

This left entries with refcount=2 instead of the expected refcount=1
(sentinel state), preventing proper cleanup in the destructor.

Root cause analysis
-------------------
Through debugging, we discovered:
- Entries created with FLAG_INITIAL start with refcount=2
- list_bucket() should call lru.unref() to decrement to refcount=1
- Missing unref calls left entries with refcount=2
- Destructor's single unref only decremented to refcount=1, not 0
- Entries with refcount=1 were moved to LRU instead of deleted

The fix
-------
1. Add BucketCache destructor that:
   - Stops inotify thread first (prevents heap-use-after-free)
   - Drains AVL cache and calls lru.unref() on each entry

2. Add lru.unref() calls to all three early return paths in list_bucket()

This ensures all entries reach refcount=0 and are properly deleted,
eliminating all memory leaks while maintaining cache performance.

Signed-off-by: Kefu Chai <k.chai@proxmox.com>
@tchaikov tchaikov force-pushed the wip-rgw-posix-driver-silence-asan-warning branch from 0290996 to 41208b7 Compare January 12, 2026 12:52
@tchaikov
Copy link
Contributor Author

It looks like the changes to src/librbd/cache/pwl aren't related and have already been submitted as #66876.

thanks for pointing this out. indeed. they were reverted in the latest reversion.

@cbodley cbodley requested review from dang and mattbenjamin January 12, 2026 15:35
@cbodley cbodley removed the rbd label Jan 12, 2026
@tchaikov
Copy link
Contributor Author

tchaikov commented Feb 9, 2026

@dang @mattbenjamin hi Daniel and Matt, could you please help review this change?

@mattbenjamin
Copy link
Contributor

hi @tchaikov ; this partially overlaps with #66413, but I think you've covered some different issues in addition to adding the destructor, so I would be good with merging this first.

I will exercise this and expect to approve.

@mattbenjamin
Copy link
Contributor

jenkins test make check arm64

Copy link
Contributor

@mattbenjamin mattbenjamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this change is correct.

this has passed s3-tests under ASAN,
in combination with #66413 (which fixes at least one invalid memory access), test by @mkogan1

thank you for finding and fixing this!

@mattbenjamin mattbenjamin merged commit 9bc8361 into ceph:main Feb 24, 2026
13 checks passed
@tchaikov tchaikov deleted the wip-rgw-posix-driver-silence-asan-warning branch February 25, 2026 14:50
@tchaikov
Copy link
Contributor Author

@mattbenjamin hey Matt, thanks for your review and approval!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants