rgw/notification: Prevent reserved_size leak by decrementing overhead on commit/abort.#67169

Merged
kchheda3 merged 2 commits into ceph:main from kchheda3:wip-fix-notification-queue-full
Feb 26, 2026

Conversation

@kchheda3
Contributor

@kchheda3 kchheda3 commented Feb 2, 2026

Fixes: https://tracker.ceph.com/issues/74713

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
… on commit/abort.

Signed-off-by: kchheda3 <kchheda3@bloomberg.net>
@kchheda3 kchheda3 self-assigned this Feb 2, 2026
@kchheda3 kchheda3 marked this pull request as ready for review February 2, 2026 21:06
@kchheda3 kchheda3 requested a review from a team as a code owner February 2, 2026 21:06
@kchheda3 kchheda3 requested a review from yuvalif February 4, 2026 15:07
@yuvalif
Contributor

yuvalif commented Feb 8, 2026

@kchheda3 thanks for the fix.
not sure about the approach with the 2nd commit...
maybe we can detect the version of https://github.com/ceph/ceph/blob/main/src/cls/2pc_queue/cls_2pc_queue_types.h#L63 and decide whether we can trust the reserved_size or not? and do the full calculation accordingly?

@kchheda3
Contributor Author

kchheda3 commented Feb 9, 2026

@kchheda3 thanks for the fix. not sure about the approach with the 2nd commit... maybe we can detect the version of https://github.com/ceph/ceph/blob/main/src/cls/2pc_queue/cls_2pc_queue_types.h#L63 and decide whether we can trust the reserved_size or not? and do the full calculation accordingly?

@yuvalif depending on the version we decide whether to trust reserved_size?
at this point trusting reserved_size is not possible, because in the real world that value is incorrect.
also, as mentioned, the recalculation shouldn't add much overhead, since at any point in time we will have no more than 1k entries (that's the max reservation we currently support). looping over 1k entries should not cause major latency issues?

@yuvalif
Contributor

yuvalif commented Feb 9, 2026

@kchheda3 thanks for the fix. not sure about the approach with the 2nd commit... maybe we can detect the version of https://github.com/ceph/ceph/blob/main/src/cls/2pc_queue/cls_2pc_queue_types.h#L63 and decide whether we can trust the reserved_size or not? and do the full calculation accordingly?

@yuvalif depending on version we decide on trusting reserved_size ? at this point trusting reserved_size is not possible because in real world that value is incorrect. also as mentioned the calculation shouldn't be that much overhead as at any point of time we going to have not more than 1k entries ( thats the max reservation we currently support). looping 1k entries should not have major latency issues?

was thinking that we can bump the version of the urgent_data as part of 00ad83d adding a flag indicating that the queue was created with the fix in it - this would indicate that we can trust the reserved size in that queue.
if the queue does not have the flag, we ignore the value and recalculate for each reservation

@kchheda3
Contributor Author

@kchheda3 thanks for the fix. not sure about the approach with the 2nd commit... maybe we can detect the version of https://github.com/ceph/ceph/blob/main/src/cls/2pc_queue/cls_2pc_queue_types.h#L63 and decide whether we can trust the reserved_size or not? and do the full calculation accordingly?

@yuvalif depending on version we decide on trusting reserved_size ? at this point trusting reserved_size is not possible because in real world that value is incorrect. also as mentioned the calculation shouldn't be that much overhead as at any point of time we going to have not more than 1k entries ( thats the max reservation we currently support). looping 1k entries should not have major latency issues?

was thinking that we can bump the version of the urgent_data as part of 00ad83d adding a flag indicating that the queue was created with the fix in it - this would indicate that we can trust the reserved size in that queue. if the queue does not have the flag, we ignore the value and recalculate for each reservation

yeah, i did think about it originally, but really, how do you know that someone has started using brand-new notifications so that they can rely on the flag?
the other option was to add a NEW bool flag, defaulted to true, deciding whether to re-calculate.
so the first time someone upgrades or starts with new notifications, it would be true: the value would be calculated once, the reserved_size value updated, and from then on we could use that calculated value. but that would require first locking the flag, then calculating, then updating the osd, so other rgws do not race to calculate.
this was getting more and more messy,
so i decided to always calculate, as the size is not that big (1k entries max).

@cbodley
Contributor

cbodley commented Feb 10, 2026

so first time someone upgrades or starts with new notification, it will be true and that will calculate the value first time and update the reserve_size value and then going forward we can use that calculated value, but that would require to first lock the flag and then calculate and then update the osd, so other rgw do not race to calculate.

isn't this all happening in an atomic cls call? i'm not sure where locking is necessary. if radosgws issue racing reservations on an un-upgraded queue, the first call will recalculate the size and update the flag atomically - the second call will see that flag and increment as usual

@yuvalif
Contributor

yuvalif commented Feb 10, 2026

so first time someone upgrades or starts with new notification, it will be true and that will calculate the value first time and update the reserve_size value and then going forward we can use that calculated value, but that would require to first lock the flag and then calculate and then update the osd, so other rgw do not race to calculate.

isn't this all happening in an atomic cls call? i'm not sure where locking is necessary. if radosgws issue racing reservations on an un-upgraded queue, the first call will recalculate the size and update the flag atomically - the second call will see that flag and increment as usual

agree that there should not be a race here. i think that the issue would be that in a mixed version cluster, old cls code writing to a new queue (or to an existing queue on which recalculation was performed) would still leak.
however, assuming the upgrade process is limited in time the amount of leak would be bounded

… errors

The urgent_data.reserved_size field was accumulating incorrect values over time due to a mismatch between what was added during reserve() and what was subtracted during commit()/abort(). This caused reserved_size to grow unbounded, eventually hitting the queue capacity limit and returning ENOSPC errors even when the queue had plenty of actual space.

Solution:
Add a one-time self-healing capability, where the reservation value is recalculated during reserve() and the counter is updated with the correct value.

Signed-off-by: Krunal Chheda <kchheda3@bloomberg.net>
@kchheda3 kchheda3 force-pushed the wip-fix-notification-queue-full branch from 8a6008b to 7f4eaee Compare February 10, 2026 21:25
@kchheda3
Contributor Author

kchheda3 commented Feb 10, 2026

so first time someone upgrades or starts with new notification, it will be true and that will calculate the value first time and update the reserve_size value and then going forward we can use that calculated value, but that would require to first lock the flag and then calculate and then update the osd, so other rgw do not race to calculate.

isn't this all happening in an atomic cls call? i'm not sure where locking is necessary. if radosgws issue racing reservations on an un-upgraded queue, the first call will recalculate the size and update the flag atomically - the second call will see that flag and increment as usual

agree that there should not be a race here. i think that the issue would be that in a mixed version cluster, old cls code writing to a new queue (or to an existing queue on which recalculation was performed) would still leak. however, assuming the upgrade process is limited in time the amount of leak would be bounded

actually i have now updated the commit to re-calculate the value only once and persist the correct value in the queue.
this addresses the upgrade, new-cluster, and mixed-mode scenarios.
since this code change is in cls_queue, it runs on the OSD. so even in mixed mode, as long as the OSD handling the calculation is upgraded, it will re-calculate and store the correct value; subsequent reads will then see the correct value and we do not have to worry about the leak.

i originally thought it was rgw code, but then realized it's cls code and writes will be atomic and serialized. leveraging that, i do the calculation the first time and subsequent callers automatically get the correct value, as @cbodley pointed out.

@yuvalif
Contributor

yuvalif commented Feb 11, 2026

jenkins test make check

@yuvalif
Contributor

yuvalif commented Feb 11, 2026

just to make sure that we covered all cases:

  • new queue with new osd: osd will detect v3 on queue creation and trust the counter
  • old queue with new osd: will detect v2, recalculate and write back as v3
  • old queue with old osd (in a mixed cluster during upgrade): reservations are still leaking during upgrade, this will be fixed when old osd is upgraded and recalculates
  • new queue with old osd: reservations are still leaking during upgrade; when the old osd writes back, it is done as v2. this will be fixed when the old osd is upgraded and recalculates

@yuvalif
Contributor

yuvalif commented Feb 18, 2026

jenkins test make check

@kchheda3
Contributor Author

jenkins test make check

@anrao19
Contributor

anrao19 commented Feb 26, 2026

Execution complete, tracker approved by @ivancich. Tracker detail: https://tracker.ceph.com/issues/75124
@kchheda3, if no further testing is needed, the PR can be merged.

@anrao19
Contributor

anrao19 commented Feb 26, 2026

jenkins test make check

@kchheda3 kchheda3 merged commit 5e89aff into ceph:main Feb 26, 2026
16 checks passed
@kchheda3 kchheda3 deleted the wip-fix-notification-queue-full branch February 26, 2026 14:50