rgw/notification: Prevent reserved_size leak by decrementing overhead on commit/abort.#67169

Merged
kchheda3 merged 2 commits into ceph:main from kchheda3:wip-fix-notification-queue-full
Feb 26, 2026

Conversation

@kchheda3
Contributor

@kchheda3 kchheda3 commented Feb 2, 2026

Fixes: https://tracker.ceph.com/issues/74713

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
… on commit/abort.

Signed-off-by: kchheda3 <kchheda3@bloomberg.net>
@kchheda3 kchheda3 self-assigned this Feb 2, 2026
@kchheda3 kchheda3 marked this pull request as ready for review February 2, 2026 21:06
@kchheda3 kchheda3 requested a review from a team as a code owner February 2, 2026 21:06
@kchheda3 kchheda3 requested a review from yuvalif February 4, 2026 15:07
@yuvalif
Contributor

yuvalif commented Feb 8, 2026

@kchheda3 thanks for the fix.
not sure about the approach with the 2nd commit...
maybe we can detect the version of https://github.com/ceph/ceph/blob/main/src/cls/2pc_queue/cls_2pc_queue_types.h#L63 and decide whether we can trust the reserved_size or not? and do the full calculation accordingly?

@kchheda3
Contributor Author

kchheda3 commented Feb 9, 2026

@kchheda3 thanks for the fix. not sure about the approach with the 2nd commit... maybe we can detect the version of https://github.com/ceph/ceph/blob/main/src/cls/2pc_queue/cls_2pc_queue_types.h#L63 and decide whether we can trust the reserved_size or not? and do the full calculation accordingly?

@yuvalif depending on the version we decide whether to trust reserved_size?
at this point trusting reserved_size is not possible, because in the real world that value is incorrect.
also, as mentioned, the recalculation shouldn't add much overhead, since at any point in time we will have no more than 1k entries (that's the max reservation we currently support). looping over 1k entries should not cause major latency issues?

@yuvalif
Contributor

yuvalif commented Feb 9, 2026

@kchheda3 thanks for the fix. not sure about the approach with the 2nd commit... maybe we can detect the version of https://github.com/ceph/ceph/blob/main/src/cls/2pc_queue/cls_2pc_queue_types.h#L63 and decide whether we can trust the reserved_size or not? and do the full calculation accordingly?

@yuvalif depending on version we decide on trusting reserved_size ? at this point trusting reserved_size is not possible because in real world that value is incorrect. also as mentioned the calculation shouldn't be that much overhead as at any point of time we going to have not more than 1k entries ( thats the max reservation we currently support). looping 1k entries should not have major latency issues?

was thinking that we can bump the version of the urgent_data as part of 00ad83d adding a flag indicating that the queue was created with the fix in it - this would indicate that we can trust the reserved size in that queue.
if the queue does not have the flag, we ignore the value and recalculate for each reservation

@kchheda3
Contributor Author

@kchheda3 thanks for the fix. not sure about the approach with the 2nd commit... maybe we can detect the version of https://github.com/ceph/ceph/blob/main/src/cls/2pc_queue/cls_2pc_queue_types.h#L63 and decide whether we can trust the reserved_size or not? and do the full calculation accordingly?

@yuvalif depending on version we decide on trusting reserved_size ? at this point trusting reserved_size is not possible because in real world that value is incorrect. also as mentioned the calculation shouldn't be that much overhead as at any point of time we going to have not more than 1k entries ( thats the max reservation we currently support). looping 1k entries should not have major latency issues?

was thinking that we can bump the version of the urgent_data as part of 00ad83d adding a flag indicating that the queue was created with the fix in it - this would indicate that we can trust the reserved size in that queue. if the queue does not have the flag, we ignore the value and recalculate for each reservation

yeah, i did think about it originally, but really, how do you know that someone has started using brand-new notifications so that they can rely on the flag?
the other option was to add a NEW bool flag, defaulted to true, deciding whether to re-calculate.
so the first time someone upgrades or starts with new notifications, it would be true: the value would be calculated once, the reserved_size value updated, and from then on we could use that calculated value. but that would require first locking the flag, then calculating, then updating the osd, so other rgws do not race to calculate.
this was getting more and more messy,
so i decided to always calculate, as the size is not that big (1k entries max).

@cbodley
Contributor

cbodley commented Feb 10, 2026

so first time someone upgrades or starts with new notification, it will be true and that will calculate the value first time and update the reserve_size value and then going forward we can use that calculated value, but that would require to first lock the flag and then calculate and then update the osd, so other rgw do not race to calculate.

isn't this all happening in an atomic cls call? i'm not sure where locking is necessary. if radosgws issue racing reservations on an un-upgraded queue, the first call will recalculate the size and update the flag atomically - the second call will see that flag and increment as usual

@yuvalif
Contributor

yuvalif commented Feb 10, 2026

so first time someone upgrades or starts with new notification, it will be true and that will calculate the value first time and update the reserve_size value and then going forward we can use that calculated value, but that would require to first lock the flag and then calculate and then update the osd, so other rgw do not race to calculate.

isn't this all happening in an atomic cls call? i'm not sure where locking is necessary. if radosgws issue racing reservations on an un-upgraded queue, the first call will recalculate the size and update the flag atomically - the second call will see that flag and increment as usual

agree that there should not be a race here. i think that the issue would be that in a mixed version cluster, old cls code writing to a new queue (or to an existing queue on which recalculation was performed) would still leak.
however, assuming the upgrade process is limited in time the amount of leak would be bounded

… errors

The urgent_data.reserved_size field was accumulating incorrect values over time due to a mismatch between what was added during reserve() and what was subtracted during commit()/abort(). This caused reserved_size to grow unbounded, eventually hitting the queue capacity limit and returning ENOSPC errors even when the queue had plenty of actual space.

Solution:
Add a one-time self-healing capability, where the reservation value is recalculated during reserve() and the counter is updated with the correct value.

Signed-off-by: Krunal Chheda <kchheda3@bloomberg.net>
@kchheda3 kchheda3 force-pushed the wip-fix-notification-queue-full branch from 8a6008b to 7f4eaee Compare February 10, 2026 21:25
@kchheda3
Contributor Author

kchheda3 commented Feb 10, 2026

so first time someone upgrades or starts with new notification, it will be true and that will calculate the value first time and update the reserve_size value and then going forward we can use that calculated value, but that would require to first lock the flag and then calculate and then update the osd, so other rgw do not race to calculate.

isn't this all happening in an atomic cls call? i'm not sure where locking is necessary. if radosgws issue racing reservations on an un-upgraded queue, the first call will recalculate the size and update the flag atomically - the second call will see that flag and increment as usual

agree that there should not be a race here. i think that the issue would be that in a mixed version cluster, old cls code writing to a new queue (or to an existing queue on which recalculation was performed) would still leak. however, assuming the upgrade process is limited in time the amount of leak would be bounded

actually i have now updated the commit to re-calculate the value only once and persist the correct value in the queue.
this addresses the upgrade, new-cluster, and mixed-mode scenarios.
since this code change is in cls_queue, it runs on the OSD. so even in mixed mode, as long as the OSD handling the calculation is upgraded, it will re-calculate and store the correct value; subsequent reads will then see the correct value and we do not have to worry about the leak.

i originally thought it was rgw code, but then realized it's cls code and writes will be atomic and serialized. leveraging that, i do the calculation the first time and subsequent callers automatically get the correct value, as @cbodley pointed out.

@yuvalif
Contributor

yuvalif commented Feb 11, 2026

jenkins test make check

@yuvalif
Contributor

yuvalif commented Feb 11, 2026

just to make sure that we covered all cases:

  • new queue with new osd: osd will detect v3 on queue creation and trust the counter
  • old queue with new osd: will detect v2, recalculate and write back as v3
  • old queue with old osd (in a mixed cluster during upgrade): reservations are still leaking during upgrade, this will be fixed when old osd is upgraded and recalculates
  • new queue with old osd: reservations are still leaking during upgrade; when the old osd writes back, it is done as v2. this will be fixed when the old osd is upgraded and recalculates

@yuvalif
Contributor

yuvalif commented Feb 18, 2026

jenkins test make check

@kchheda3
Contributor Author

jenkins test make check

@anrao19
Contributor

anrao19 commented Feb 26, 2026

Execution complete, tracker approved by @ivancich. Tracker detail: https://tracker.ceph.com/issues/75124
@kchheda3, if no further testing is needed, the PR can be merged.

@anrao19
Contributor

anrao19 commented Feb 26, 2026

jenkins test make check

@kchheda3 kchheda3 merged commit 5e89aff into ceph:main Feb 26, 2026
16 checks passed
@kchheda3 kchheda3 deleted the wip-fix-notification-queue-full branch February 26, 2026 14:50