Skip to content

librbd/pwl: fix memory leaks in discard operations#66876

Merged
idryomov merged 1 commit intoceph:mainfrom
tchaikov:wip-librbd-pwl-fix-leaks
Feb 22, 2026
Merged

librbd/pwl: fix memory leaks in discard operations#66876
idryomov merged 1 commit intoceph:mainfrom
tchaikov:wip-librbd-pwl-fix-leaks

Conversation

@tchaikov
Copy link
Contributor

Fix memory leaks in librbd persistent write log (PWL) cache discard operations by properly completing and deleting request objects.

ASan reported the following leaks in unittest_librbd:

  Direct leak of 240 byte(s) in 1 object(s) allocated from:
    #0 operator new(unsigned long)
    #1 librbd::cache::pwl::AbstractWriteLog<librbd::MockImageCtx>::discard(...)
       /ceph/src/librbd/cache/pwl/AbstractWriteLog.cc:935:5
    #2 TestMockCacheReplicatedWriteLog_discard_Test::TestBody()
       /ceph/src/test/librbd/cache/pwl/test_mock_ReplicatedWriteLog.cc:534:7

  Direct leak of 240 byte(s) in 1 object(s) allocated from:
    #0 operator new(unsigned long)
    #1 librbd::cache::pwl::AbstractWriteLog<librbd::MockImageCtx>::discard(...)
       /ceph/src/librbd/cache/pwl/AbstractWriteLog.cc:935:5
    #2 TestMockCacheSSDWriteLog_discard_Test::TestBody()
       /ceph/src/test/librbd/cache/pwl/test_mock_SSDWriteLog.cc:533:7

Plus multiple indirect leaks totaling 2,076 bytes:
- DiscardLogOperation (224 bytes x 2)
- SyncPoint (184 bytes)
- SyncPointLogEntry (168 bytes)
- DiscardLogEntry (152 bytes x 2)
- Various string and vector allocations

SUMMARY: AddressSanitizer: 2316 byte(s) leaked in 15 allocation(s).

Root cause analysis:

  1. Missing request completion: C_DiscardRequest objects were never deleted because their complete() method was never called. The on_write_persist callback in setup_log_operations() called complete_user_request() to invoke the user's callback, but didn't call complete() on the request itself to trigger self-deletion via Context::complete().

    Write requests use WriteLogOperationSet which takes the request as its on_finish callback, ensuring complete() is eventually called. Discard requests don't use WriteLogOperationSet and must explicitly call complete() in their on_write_persist callback.

  2. Sync point not reset: m_current_sync_point was never explicitly reset during shutdown, keeping the sync point alive even after m_log_entries was cleared. This prevented proper cleanup of the sync point chain.

  3. Indirect leaks: C_DiscardRequest holds shared_ptr, which holds shared_ptr, which holds shared_ptr to log entries. When the request leaked, all referenced objects also leaked through the shared_ptr reference chain.

Solution:

  1. Add discard_req->complete(r) in the on_write_persist callback (Request.cc:445) to properly delete the request object after completing the user callback and releasing the cell. This follows the same pattern as WriteLogOperationSet for write requests.

  2. Add m_current_sync_point.reset() in shut_down() (AbstractWriteLog.cc:682) after m_log_entries.clear() to explicitly release the sync point and break the reference chain.

These fixes ensure proper cleanup of all allocated objects in both production and test environments.

Test results:

  • Before: 2,316 bytes leaked in 15 allocations
  • After: 0 bytes leaked
  • unittest_librbd discard tests pass successfully with ASan

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands

You must only issue one Jenkins command per-comment. Jenkins does not understand
comments with more than one command.

@idryomov
Copy link
Contributor

This is a pretty serious memory leak, so we would want to backport. I have opened https://tracker.ceph.com/issues/74972, please add the corresponding Fixes tag to the commit message.

@tchaikov
Copy link
Contributor Author

This is a pretty serious memory leak, so we would want to backport. I have opened https://tracker.ceph.com/issues/74972, please add the corresponding Fixes tag to the commit message.

Thanks for opening the tracker issue. I've added Fixes: https://tracker.ceph.com/issues/74972 to the commit message and updated it to address your review feedback.

@tchaikov tchaikov force-pushed the wip-librbd-pwl-fix-leaks branch from e2873e3 to 9cd9b68 Compare February 17, 2026 12:08
Fix memory leak in librbd persistent write log (PWL) cache discard
operations by properly completing request objects.

ASan reported the following leaks in unittest_librbd:

  Direct leak of 240 byte(s) in 1 object(s) allocated from:
    #0 operator new(unsigned long)
    #1 librbd::cache::pwl::AbstractWriteLog<librbd::MockImageCtx>::discard(...)
       /ceph/src/librbd/cache/pwl/AbstractWriteLog.cc:935:5
    #2 TestMockCacheReplicatedWriteLog_discard_Test::TestBody()
       /ceph/src/test/librbd/cache/pwl/test_mock_ReplicatedWriteLog.cc:534:7

  Plus multiple indirect leaks totaling 2,076 bytes through the
  shared_ptr reference chain.

Root cause:

C_DiscardRequest objects were never deleted because their complete()
method was never called. The on_write_persist callback released the
BlockGuard cell but didn't call complete() to trigger self-deletion.

Write requests use WriteLogOperationSet which takes the request as
its on_finish callback, ensuring complete() is eventually called.
Discard requests don't use WriteLogOperationSet and must explicitly
call complete() in their on_write_persist callback.

Solution:

Call discard_req->complete(r) in the on_write_persist callback and
move cell release into finish_req() -- mirroring how C_WriteRequest
handles it. The complete() -> finish() -> finish_req() chain ensures
the cell is released after the user request is completed, preserving
the same ordering as write requests.

Test results:
- Before: 2,316 bytes leaked in 15 allocations
- After: 0 bytes leaked
- unittest_librbd discard tests pass successfully with ASan

Fixes: https://tracker.ceph.com/issues/74972
Signed-off-by: Kefu Chai <k.chai@proxmox.com>
@tchaikov tchaikov force-pushed the wip-librbd-pwl-fix-leaks branch from 9cd9b68 to fcda31a Compare February 21, 2026 07:25
@tchaikov
Copy link
Contributor Author

changelog:

  • incorporated @idryomov 's suggestion
  • rebased against main branch

@tchaikov tchaikov requested a review from idryomov February 21, 2026 11:07
@idryomov
Copy link
Contributor

@idryomov idryomov merged commit ce92c04 into ceph:main Feb 22, 2026
12 of 13 checks passed
@github-actions
Copy link

This is an automated message by src/script/redmine-upkeep.py.

I found one or more Fixes: tags in the commit messages in

git log ce92c047dafa673adc54925eae496b63c6d71b76^..ce92c047dafa673adc54925eae496b63c6d71b76

The referenced tickets are:

Those tickets do not reference this merged Pull Request. If this Pull Request merge resolves any of those tickets, please update the "Pull Request ID" field on each ticket. A future run of this script will appropriately update them.

Update Log: https://github.com/ceph/ceph/actions/runs/22280412118

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants