librbd/pwl: fix memory leaks in discard operations#66876
Conversation
|
This is a pretty serious memory leak, so we would want to backport. I have opened https://tracker.ceph.com/issues/74972, please add the corresponding |
Thanks for opening the tracker issue. I've added |
e2873e3 to
9cd9b68
Compare
Fix memory leak in librbd persistent write log (PWL) cache discard
operations by properly completing request objects.
ASan reported the following leaks in unittest_librbd:
Direct leak of 240 byte(s) in 1 object(s) allocated from:
#0 operator new(unsigned long)
#1 librbd::cache::pwl::AbstractWriteLog<librbd::MockImageCtx>::discard(...)
/ceph/src/librbd/cache/pwl/AbstractWriteLog.cc:935:5
#2 TestMockCacheReplicatedWriteLog_discard_Test::TestBody()
/ceph/src/test/librbd/cache/pwl/test_mock_ReplicatedWriteLog.cc:534:7
Plus multiple indirect leaks totaling 2,076 bytes through the
shared_ptr reference chain.
Root cause:
C_DiscardRequest objects were never deleted because their complete()
method was never called. The on_write_persist callback released the
BlockGuard cell but didn't call complete() to trigger self-deletion.
Write requests use WriteLogOperationSet which takes the request as
its on_finish callback, ensuring complete() is eventually called.
Discard requests don't use WriteLogOperationSet and must explicitly
call complete() in their on_write_persist callback.
Solution:
Call discard_req->complete(r) in the on_write_persist callback and
move cell release into finish_req() -- mirroring how C_WriteRequest
handles it. The complete() -> finish() -> finish_req() chain ensures
the cell is released after the user request is completed, preserving
the same ordering as write requests.
Test results:
- Before: 2,316 bytes leaked in 15 allocations
- After: 0 bytes leaked
- unittest_librbd discard tests pass successfully with ASan
Fixes: https://tracker.ceph.com/issues/74972
Signed-off-by: Kefu Chai <k.chai@proxmox.com>
9cd9b68 to
fcda31a
Compare
|
changelog:
|
|
This is an automated message by src/script/redmine-upkeep.py. I found one or more
The referenced tickets are: Those tickets do not reference this merged Pull Request. If this Pull Request merge resolves any of those tickets, please update the "Pull Request ID" field on each ticket. A future run of this script will appropriately update them. Update Log: https://github.com/ceph/ceph/actions/runs/22280412118 |
Fix memory leaks in librbd persistent write log (PWL) cache discard operations by properly completing and deleting request objects.
ASan reported the following leaks in unittest_librbd:
Plus multiple indirect leaks totaling 2,076 bytes:
- DiscardLogOperation (224 bytes x 2)
- SyncPoint (184 bytes)
- SyncPointLogEntry (168 bytes)
- DiscardLogEntry (152 bytes x 2)
- Various string and vector allocations
SUMMARY: AddressSanitizer: 2316 byte(s) leaked in 15 allocation(s).
Root cause analysis:
Missing request completion: C_DiscardRequest objects were never deleted because their complete() method was never called. The on_write_persist callback in setup_log_operations() called complete_user_request() to invoke the user's callback, but didn't call complete() on the request itself to trigger self-deletion via Context::complete().
Write requests use WriteLogOperationSet which takes the request as its on_finish callback, ensuring complete() is eventually called. Discard requests don't use WriteLogOperationSet and must explicitly call complete() in their on_write_persist callback.
Sync point not reset: m_current_sync_point was never explicitly reset during shutdown, keeping the sync point alive even after m_log_entries was cleared. This prevented proper cleanup of the sync point chain.
Indirect leaks: C_DiscardRequest holds shared_ptr, which holds shared_ptr, which holds shared_ptr to log entries. When the request leaked, all referenced objects also leaked through the shared_ptr reference chain.
Solution:
Add discard_req->complete(r) in the on_write_persist callback (Request.cc:445) to properly delete the request object after completing the user callback and releasing the cell. This follows the same pattern as WriteLogOperationSet for write requests.
Add m_current_sync_point.reset() in shut_down() (AbstractWriteLog.cc:682) after m_log_entries.clear() to explicitly release the sync point and break the reference chain.
These fixes ensure proper cleanup of all allocated objects in both production and test environments.
Test results:
Contribution Guidelines
To sign and title your commits, please refer to Submitting Patches to Ceph.
If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.
When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an
xbetween the brackets:[x]. Spaces and capitalization matter when checking off items this way.Checklist
Show available Jenkins commands
jenkins test classic perfJenkins Job | Jenkins Job Definitionjenkins test crimson perfJenkins Job | Jenkins Job Definitionjenkins test signedJenkins Job | Jenkins Job Definitionjenkins test make checkJenkins Job | Jenkins Job Definitionjenkins test make check arm64Jenkins Job | Jenkins Job Definitionjenkins test submodulesJenkins Job | Jenkins Job Definitionjenkins test dashboardJenkins Job | Jenkins Job Definitionjenkins test dashboard cephadmJenkins Job | Jenkins Job Definitionjenkins test apiJenkins Job | Jenkins Job Definitionjenkins test docsReadTheDocs | Github Workflow Definitionjenkins test ceph-volume allJenkins Jobs | Jenkins Jobs Definitionjenkins test windowsJenkins Job | Jenkins Job Definitionjenkins test rook e2eJenkins Job | Jenkins Job DefinitionYou must only issue one Jenkins command per-comment. Jenkins does not understand
comments with more than one command.