rgw/posix: add destructor to BucketCache to fix memory leaks #66850
Merged
mattbenjamin merged 1 commit into ceph:main on Feb 24, 2026
Force-pushed: ff2f40f to ddfc13b
jenkins test make check
jenkins test make check arm64
Force-pushed: 869d64a to 01023e9
this does not work. looking..
Force-pushed: 01023e9 to 0290996
The RGW POSIX driver's BucketCache had two memory leak issues:

1. BucketCache destructor did not clean up cached entries
2. list_bucket() had missing lru.unref() calls on early return paths

Issue 1: Missing destructor cleanup
------------------------------------

BucketCache did not have a destructor to clean up BucketCacheEntry objects at destruction, causing LeakSanitizer to report ~22-31MB of leaks in unittest_rgw_posix_driver and unittest_posix_bucket_cache.

The cache is properly bounded at runtime (max 100 entries with LRU eviction), but entries remaining at test shutdown were not freed.

Issue 2: Missing unref calls in list_bucket()
----------------------------------------------

The list_bucket() function had three early return paths that failed to call lru.unref() before returning:

1. Line 475: When marker not found (MDB_NOTFOUND)
2. Line 483: When bucket is empty (MDB_NOTFOUND)
3. Line 491: When iteration stops early (!again)

This left entries with refcount=2 instead of the expected refcount=1 (sentinel state), preventing proper cleanup in the destructor.

Root cause analysis
-------------------

Through debugging, we discovered:

- Entries created with FLAG_INITIAL start with refcount=2
- list_bucket() should call lru.unref() to decrement to refcount=1
- Missing unref calls left entries with refcount=2
- Destructor's single unref only decremented to refcount=1, not 0
- Entries with refcount=1 were moved to LRU instead of deleted

The fix
-------

1. Add BucketCache destructor that:
   - Stops inotify thread first (prevents heap-use-after-free)
   - Drains AVL cache and calls lru.unref() on each entry
2. Add lru.unref() calls to all three early return paths in list_bucket()

This ensures all entries reach refcount=0 and are properly deleted, eliminating all memory leaks while maintaining cache performance.

Signed-off-by: Kefu Chai <k.chai@proxmox.com>
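
To make the refcount arithmetic in the root cause analysis concrete, here is a minimal toy model. It is only a sketch: ToyEntry, toy_unref(), Path, and both list_bucket_* functions are hypothetical stand-ins, not the real BucketCacheEntry, lru.unref(), or list_bucket(). It assumes only what the commit message states: entries start at refcount=2, list_bucket() should drop one reference on every path, and the destructor drops one more.

```cpp
#include <cassert>

// Count of toy entries currently allocated (stands in for what LSan tracks).
inline int& live_entries() { static int n = 0; return n; }

// Hypothetical stand-in for a cache entry; NOT the real BucketCacheEntry.
struct ToyEntry {
  unsigned refcnt = 2;  // models an entry created with FLAG_INITIAL
  ToyEntry() { ++live_entries(); }
  ~ToyEntry() { --live_entries(); }
};

// Hypothetical stand-in for lru.unref(): frees the entry at refcount 0.
// Returns true if the entry was deleted.
inline bool toy_unref(ToyEntry* e) {
  if (--e->refcnt == 0) { delete e; return true; }
  return false;
}

// The three early-return paths named in the commit message, plus normal exit.
enum class Path { MarkerNotFound, EmptyBucket, StoppedEarly, Complete };

// Buggy shape: early returns skip the unref, leaving refcnt == 2.
inline void list_bucket_buggy(ToyEntry* e, Path p) {
  if (p == Path::MarkerNotFound) return;
  if (p == Path::EmptyBucket) return;
  if (p == Path::StoppedEarly) return;
  toy_unref(e);
}

// Fixed shape: every exit path drops one reference, leaving refcnt == 1.
inline void list_bucket_fixed(ToyEntry* e, Path p) {
  if (p == Path::MarkerNotFound) { toy_unref(e); return; }
  if (p == Path::EmptyBucket) { toy_unref(e); return; }
  if (p == Path::StoppedEarly) { toy_unref(e); return; }
  toy_unref(e);
}

// Models teardown: the destructor performs a single unref per entry.
// Returns how many entries are still allocated afterwards.
inline int leaked_after_teardown(bool fixed, Path p) {
  ToyEntry* e = new ToyEntry;
  if (fixed) list_bucket_fixed(e, p); else list_bucket_buggy(e, p);
  const bool freed = toy_unref(e);   // the destructor's single unref
  const int leaked = live_entries();
  if (!freed) delete e;              // tidy up so the demo itself is clean
  return leaked;
}
```

In this model, leaked_after_teardown(false, Path::MarkerNotFound) reports one surviving entry, while the fixed variant reports zero on all four paths.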
Force-pushed: 0290996 to 41208b7
thanks for pointing this out. indeed. they were reverted in the latest revision.
@dang @mattbenjamin hi Daniel and Matt, could you please help review this change?
jenkins test make check arm64
mattbenjamin approved these changes Feb 24, 2026
@mattbenjamin hey Matt, thanks for your review and approval!
The RGW POSIX driver's BucketCache was not cleaning up cached BucketCacheEntry objects at destruction, causing LeakSanitizer to report approximately 31MB of leaks in unittest_rgw_posix_driver.
Analysis confirmed the cache is properly bounded at runtime (max 100 entries with LRU eviction).
However, at test shutdown, entries remaining in the cache were not being freed, causing LSan failures.
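Fix by adding a destructor to BucketCache that:
- stops the inotify thread first (prevents heap-use-after-free)
- drains the AVL cache and calls lru.unref() on each entry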
The drain() method (cohort_lru.h:481) safely removes all entries from the AVL cache partitions and calls the supplied lambda on each, allowing proper cleanup without accessing private members.
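
The drain-with-callback pattern described above can be sketched as follows. This is a toy model under stated assumptions: ToyObj and ToyPartitionedCache are hypothetical, std::map replaces the AVL trees, there is no locking, and the drain() signature here is illustrative rather than the one in cohort_lru.h.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Hypothetical cached object; stands in for a cache entry.
struct ToyObj { std::string name; };

// Hypothetical partitioned cache (assumes unique names per insert).
class ToyPartitionedCache {
  std::vector<std::map<std::string, ToyObj*>> parts_;
public:
  explicit ToyPartitionedCache(std::size_t n) : parts_(n) {}

  void insert(ToyObj* o) {
    const std::size_t ix = std::hash<std::string>{}(o->name) % parts_.size();
    parts_[ix][o->name] = o;
  }

  std::size_t size() const {
    std::size_t n = 0;
    for (const auto& p : parts_) n += p.size();
    return n;
  }

  // Unlink every entry from every partition and pass it to on_each.
  // The caller decides what cleanup means without touching parts_.
  template <typename F>
  void drain(F&& on_each) {
    for (auto& p : parts_) {
      while (!p.empty()) {
        auto it = p.begin();
        ToyObj* o = it->second;
        p.erase(it);   // unlink first, so the callback may free the object
        on_each(o);
      }
    }
  }
};

// Fills a cache, drains it, and returns how many objects the callback freed
// (or 0 if anything was left behind).
inline std::size_t drain_demo() {
  ToyPartitionedCache cache(3);
  for (const char* n : {"a", "b", "c", "d"}) cache.insert(new ToyObj{n});
  std::size_t freed = 0;
  cache.drain([&freed](ToyObj* o) { delete o; ++freed; });
  return cache.size() == 0 ? freed : 0;
}
```

Unlinking before invoking the callback is the design point: the callback is free to release the object, and the container never holds a dangling pointer.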
This eliminates the memory leaks while maintaining the cache's performance characteristics during normal operation.
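
The teardown ordering matters as much as the cleanup itself: a background thread that still touches entries must be stopped before they are freed. A minimal sketch of that ordering, assuming a hypothetical ToyCache whose watcher thread stands in for the inotify thread (none of these names come from the Ceph source):

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>
#include <vector>

// Number of toy entries currently allocated.
std::atomic<int> g_live{0};

// Hypothetical cache with a background watcher; NOT the real BucketCache.
class ToyCache {
  std::vector<int*> entries_;
  std::atomic<bool> stop_{false};
  std::atomic<long> reads_{0};
  std::thread watcher_;  // stands in for the inotify thread
public:
  ToyCache() {
    for (int i = 0; i < 4; ++i) {
      entries_.push_back(new int(i));
      ++g_live;
    }
    // Start the watcher only after the entries exist.
    watcher_ = std::thread([this] {
      while (!stop_.load()) {
        for (int* e : entries_) reads_ += *e;  // dereferences entries
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
      }
    });
  }

  ~ToyCache() {
    // Step 1: stop and join the watcher so nothing dereferences entries.
    stop_.store(true);
    watcher_.join();
    // Step 2: only now free the entries. Reversing the two steps would
    // let the watcher read freed memory (heap-use-after-free).
    for (int* e : entries_) {
      delete e;
      --g_live;
    }
    entries_.clear();
  }
};

// Builds and destroys a cache; returns the number of entries still allocated.
inline int teardown_demo() {
  {
    ToyCache cache;
    std::this_thread::sleep_for(std::chrono::milliseconds(5));
  }
  return g_live.load();
}
```

On toolchains where threads are not linked by default, this sketch may need to be built with -pthread.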