Bug #64519

open

OSD/MON: No snapshot metadata keys trimming

Added by Matan Breizman about 2 years ago. Updated 8 days ago.

Status:
Pending Backport
Priority:
Normal
Assignee:
Category:
Snapshots
Target version:
% Done:

0%

Source:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Tags (freeform):
backport_processed
Fixed In:
v19.3.0-3837-g53a790fc0a
Released In:
v20.2.0~2352
Upkeep Timestamp:
2025-11-01T00:58:06+00:00

Description


Files

20260224-mon-store-dump-keys.zip (4.6 MB) Eugen Block, 02/27/2026 07:46 AM

Related issues 3 (2 open, 1 closed)

Related to RADOS - Bug #62983: OSD/MON: purged snap keys are not merged (Resolved) - Matan Breizman

Copied to RADOS - Backport #70026: reef: OSD/MON: No snapshot metadata keys trimming (Deferred) - Matan Breizman
Copied to RADOS - Backport #70027: squid: OSD/MON: No snapshot metadata keys trimming (Deferred) - Matan Breizman
Actions #1

Updated by Matan Breizman about 2 years ago

  • Description updated (diff)
Actions #2

Updated by Joshua Baergen about 2 years ago

This reminded me of the notes in https://pad.ceph.com/p/removing_removed_snaps/timeslider#4651 that talk about why the set of deleted snapshots needs to stick around for a while. But I'm assuming that "a while" probably doesn't need to mean permanently...

Actions #3

Updated by Matan Breizman almost 2 years ago

  • Related to Bug #62983: OSD/MON: purged snap keys are not merged added
Actions #4

Updated by Matan Breizman almost 2 years ago

  • Status changed from New to In Progress

https://tracker.ceph.com/issues/62983 should help avoid the gaps in the purged snap id intervals. As a result, all the purged_snap ids will be merged into a single entry.

For clusters already impacted by this issue, https://github.com/ceph/ceph/pull/53545 may help with removing the "ghost" snapids that cause the gaps, allowing all the entries to be merged.

Actions #5

Updated by Matan Breizman almost 2 years ago

  • Assignee set to Matan Breizman
  • Pull request ID set to 53545

Adding 53545 as a candidate for fixing this issue, this will require additional documentation on how to use the tool - so I'll keep the tracker open.

Actions #6

Updated by Eugen Block almost 2 years ago

I know I'm a bit early asking this, but I helped raise this issue and Mykola picked it up on the devel mailing list. I talked to one of our customers who is affected by this (more than 40 million purged_snap entries) and they would be interested in testing this feature on their secondary site (they mirror rbd images). But they're currently in the planning process to remove the second site; I have no ETA though. I expect this fix (and the respective tool to trim affected mon stores) no earlier than Squid. There's no telling if and when they will be able to upgrade to Squid; they recently upgraded to Quincy though. So would there be a chance to backport this to Reef and Quincy as well, depending on which release they'll be on when this is considered ready?
And a couple more questions regarding the purge tool:
  1. Will it be possible to trim the keys online (without cluster downtime)?
  2. How "safe" will it be? What could go wrong and would there be some rollback mechanism?
Actions #7

Updated by Radoslaw Zarzynski almost 2 years ago

Looks pretty backportable but let's wait for Matan's word.

Actions #8

Updated by Matan Breizman almost 2 years ago

  • Backport set to quincy, reef, squid

Eugen Block wrote in #note-6:

I know I'm a bit early asking this, but I helped raise this issue and Mykola picked it up on the devel mailing list. I talked to one of our customers who is affected by this (more than 40 million purged_snap entries) and they would be interested in testing this feature on their secondary site (they mirror rbd images). But they're currently in the planning process to remove the second site; I have no ETA though. I expect this fix (and the respective tool to trim affected mon stores) no earlier than Squid. There's no telling if and when they will be able to upgrade to Squid; they recently upgraded to Quincy though. So would there be a chance to backport this to Reef and Quincy as well, depending on which release they'll be on when this is considered ready?

Hey Eugen,
There should be no issues with backporting this back to Quincy, as this PR offers a new, separate command.
The relevant usage of the command will be with the default option:

     * Default: All the snapids in the given range which are not
     *   marked as purged in the Monitor will be removed. Mostly useful
     *   for cases in which the snapid is leaked on the client side.
     *   See: https://tracker.ceph.com/issues/64646

And a couple more questions regarding the purge tool:
  1. Will it be possible to trim the keys online (without cluster downtime)?
  2. How "safe" will it be? What could go wrong and would there be some rollback mechanism?
  1. Yes, it's possible. The (online) command doesn't require shutting down the OSDs or MONs.
  2. The command was also added to our testing workloads and seems to work well.
    The tricky part is the unknown unknowns. I do not expect anything to go wrong, as the command will only interact with ghost snapids. Moreover, the command can also be used gradually (short snapid removal intervals) to verify nothing goes wrong while using it.
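The gradual approach suggested above can be sketched as a simple batching loop. This is a hypothetical sketch: the pool name, batch size, and range are illustrative, and the actual `ceph osd pool force-remove-snap` call is commented out so the loop only prints the batches it would process.

```shell
#!/bin/sh
# Hedged sketch: split a large snapid range into batches so the
# removal command can be run gradually and the cluster checked
# between steps. POOL, BATCH, START, and END are illustrative.
POOL=mypool
BATCH=250000
START=1
END=1000000

start=$START
while [ "$start" -le "$END" ]; do
  end=$((start + BATCH - 1))
  [ "$end" -gt "$END" ] && end=$END
  echo "batch: $start $end"
  # ceph osd pool force-remove-snap "$POOL" "$start" "$end"
  start=$((end + 1))
done
```

With the values above this prints four batches (1-250000 through 750001-1000000); uncommenting the `ceph` line would run the removal per batch, leaving time to verify cluster health between steps.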
Actions #9

Updated by Radoslaw Zarzynski almost 2 years ago

The PR is in QA.

Actions #10

Updated by Eugen Block almost 2 years ago

Thanks, Matan! It sounds very promising. I talked to the customer and they are willing to test this cleanup procedure on their secondary site. Apparently, this will be backported to Quincy so that will make it easier. I'm still not entirely sure if I understand all the required steps or if simply running ceph osd pool force-remove-snap unique_pool_0 will suffice. But maybe we can discuss that in Slack or something.

Actions #11

Updated by Matan Breizman almost 2 years ago

Eugen Block wrote in #note-10:

Thanks, Matan! It sounds very promising. I talked to the customer and they are willing to test this cleanup procedure on their secondary site. Apparently, this will be backported to Quincy so that will make it easier. I'm still not entirely sure if I understand all the required steps or if simply running ceph osd pool force-remove-snap unique_pool_0 will suffice. But maybe we can discuss that in Slack or something.

I'll provide a detailed explanation once the PR has passed QA. Broadly speaking, only running the command should be sufficient.

Actions #12

Updated by Radoslaw Zarzynski almost 2 years ago

note from scrub: bump up.

Actions #13

Updated by Matan Breizman almost 2 years ago

  • Tracker changed from Bug to Feature
  • Priority changed from Normal to Low
  • Regression deleted (No)
  • Severity deleted (3 - minor)
Actions #14

Updated by Matan Breizman almost 2 years ago · Edited

  • Pull request ID changed from 53545 to 57549

Separating the previous PR into PR#57548 and PR#57549.

Actions #15

Updated by Matan Breizman almost 2 years ago

  • Pull request ID changed from 57549 to 57548
Actions #16

Updated by Matan Breizman almost 2 years ago

  • Description updated (diff)
  • Status changed from In Progress to Fix Under Review
Actions #17

Updated by Eugen Block over 1 year ago

I might be too early asking this, but I upgraded one of my test clusters to 17.2.8, eager to test this new tool. This cluster has 1068805 "purged_snap" entries, yesterday I ran ceph osd pool force-remove-snap spiegel1 1 1000000 to purge the first batch. The OSDs are still snaptrimming (I only have 4 OSDs in that lab, the pool has 8 PGs). But I wanted to get an early impression of the results and ran a "dump-keys" after the snaptrimming had finished almost 80% of the snaptrim_queue. But the number of purged_snap entries remains at 1068805. Is that expected? I had hoped that this would also reduce the number of entries, leading to a smaller mon store. Am I just being impatient or did I misunderstand the purpose of this tool?

Actions #18

Updated by Konstantin Shalygin about 1 year ago

  • Backport changed from quincy, reef, squid to reef, squid
Actions #19

Updated by Konstantin Shalygin about 1 year ago

  • Tracker changed from Feature to Bug
  • Status changed from Fix Under Review to Pending Backport
  • Target version set to v20.0.0
  • Regression set to No
  • Severity set to 3 - minor
Actions #20

Updated by Upkeep Bot about 1 year ago

  • Copied to Backport #70026: reef: OSD/MON: No snapshot metadata keys trimming added
Actions #21

Updated by Upkeep Bot about 1 year ago

  • Copied to Backport #70027: squid: OSD/MON: No snapshot metadata keys trimming added
Actions #22

Updated by Upkeep Bot about 1 year ago

  • Tags (freeform) set to backport_processed
Actions #23

Updated by Matan Breizman about 1 year ago

  • Backport deleted (reef, squid)

Note: I'm not sure we want to backport this until it is verified to fix this issue as well.
As the PR states:

Fixes: https://tracker.ceph.com/issues/66122

Possibly Fixes: https://tracker.ceph.com/issues/64519

I might be too early asking this, but I upgraded one of my test clusters to 17.2.8, eager to test this new tool. This cluster has 1068805 "purged_snap" entries, yesterday I ran ceph osd pool force-remove-snap spiegel1 1 1000000 to purge the first batch. The OSDs are still snaptrimming (I only have 4 OSDs in that lab, the pool has 8 PGs). But I wanted to get an early impression of the results and ran a "dump-keys" after the snaptrimming had finished almost 80% of the snaptrim_queue. But the number of purged_snap entries remains at 1068805. Is that expected? I had hoped that this would also reduce the number of entries, leading to a smaller mon store. Am I just being impatient or did I misunderstand the purpose of this tool?

Thanks for sharing the information above!
Is this data still relevant? Are you able to still work with this cluster?

Actions #24

Updated by Eugen Block about 1 year ago

Yes, the cluster is still usable, but it's really just a single-node test cluster; there isn't any real load or any applications except for rbd mirror and rgw sync to a different single-node test cluster. I didn't recommend that our customer use this tool yet, until we know whether it works as designed. Our main goal was to trim the mon store, but that doesn't seem to happen here.

Actions #25

Updated by Upkeep Bot 9 months ago

  • Merge Commit set to 53a790fc0a36b4980ba2f8522293d2c8a4a5c62c
  • Fixed In set to v19.3.0-3837-g53a790fc0a3
  • Upkeep Timestamp set to 2025-07-09T14:05:16+00:00
Actions #26

Updated by Upkeep Bot 8 months ago

  • Fixed In changed from v19.3.0-3837-g53a790fc0a3 to v19.3.0-3837-g53a790fc0a
  • Upkeep Timestamp changed from 2025-07-09T14:05:16+00:00 to 2025-07-14T17:41:20+00:00
Actions #27

Updated by Eugen Block 6 months ago

This issue might have another heavy impact. The customer created new OSDs, and each one takes several minutes to boot (SSD only). Each OSD process spikes to around 140 GB of RAM usage while booting, killing the host. Without debug logs enabled there's nothing to see during that boot time, but we enabled debug logs for one OSD and saw a huge number of purged_snaps entries. Unfortunately, the customer wasn't able to capture that log for me to confirm; I'm currently waiting for new logs. But it appears that these untrimmed snapshots impact more than just the mon store. Using the orchestrator to create multiple OSDs at once is impossible right now. As soon as I have logs to back up my suspicion, I'll attach them to this tracker.

Actions #28

Updated by Eugen Block 5 months ago

I was able to reproduce this in my lab environment; it's a single-node cluster with more than one million purged_snaps (1068905 entries in the mon store). With debug level 10 (debug_osd = 10) there's one line standing out in the log: it contains all the snap ranges. Here's a short excerpt:

2025-10-21T16:28:59.590+0000 7f23b3c5b640 10 snap_mapper.record_purged_snaps purged_snaps {505={20=[1~1]},513={20=[5~1]},516={20=[4~1]},517={20=[3~1]},884={21=[2~1]},888={21=[4~1]},898={21=[7~3]}

This single line, split on commas, results in more than 1.6 million lines.

The customer cluster where we debugged this had more than 42 million purged_snaps two years ago (we haven't checked in a while). They are currently expanding the cluster, but adding one OSD at a time to prevent OOM killers is not a great procedure.
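Counting the purged_snap entries referred to above can be done on a dump-keys listing. The sketch below is illustrative only: the sample input merely mimics the key naming, and the exact output format of a real dump may differ. On a real cluster the listing would come from something like `ceph-monstore-tool /var/lib/ceph/mon/<id> dump-keys > keys.txt`, run against a stopped mon (or a copy of its store).

```shell
#!/bin/sh
# Hedged sketch: count purged_snap keys in a dump-keys listing.
# The here-doc below is fabricated sample data standing in for a
# real dump; only the purged_snap_ naming is the point.
cat > keys.txt <<'EOF'
osd_snap purged_snap_20_0000000000000001
osd_snap purged_snap_20_0000000000000005
osd_snap purged_epoch_0000000000000015
EOF
grep -c 'purged_snap_' keys.txt
# prints: 2
```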

Actions #29

Updated by Upkeep Bot 5 months ago

  • Released In set to v20.2.0~2352
  • Upkeep Timestamp changed from 2025-07-14T17:41:20+00:00 to 2025-11-01T00:58:06+00:00
Actions #30

Updated by Matan Breizman 27 days ago · Edited

Eugen Block wrote in #note-17:

I might be too early asking this, but I upgraded one of my test clusters to 17.2.8, eager to test this new tool. This cluster has 1068805 "purged_snap" entries, yesterday I ran ceph osd pool force-remove-snap spiegel1 1 1000000 to purge the first batch. The OSDs are still snaptrimming (I only have 4 OSDs in that lab, the pool has 8 PGs). But I wanted to get an early impression of the results and ran a "dump-keys" after the snaptrimming had finished almost 80% of the snaptrim_queue. But the number of purged_snap entries remains at 1068805. Is that expected? I had hoped that this would also reduce the number of entries, leading to a smaller mon store. Am I just being impatient or did I misunderstand the purpose of this tool?

Hey Eugen, thank you for the details.
  • I would like to understand why https://tracker.ceph.com/issues/62983 didn't help in the above scenario.
    Would it be possible to share the "dump-keys" output? (from any of the impacted clusters)

Thank you!

Actions #31

Updated by Eugen Block 27 days ago

Matan Breizman wrote in #note-30:

  • I would like to understand why https://tracker.ceph.com/issues/62983 didn't help in the above scenario.
    Would it be possible to share the "dump-keys" output? (from any of the impacted clusters)

Just to be clear, I tested the tool on 17.2.8, it didn't seem to help with the mon store. In the meantime, I upgraded that (single node) cluster to Reef 18.2.7 (to be on the same version as the customer), but apparently, tracker 62983 was released in Squid 19.2.0. And I haven't tested that tool in Reef either. Should I upgrade to latest Squid and rerun the tool?
I will provide the dump-keys output from before and after retrying on Squid.

  • Secondly, PR https://github.com/ceph/ceph/pull/55841 (backported to Q,R,S) should have prevented the issue for newer clusters.
    Are you aware of clusters hitting this issue after the above merge (v17.2.8 for Q)?
    Alternatively, did the number of purged snaps continue to accumulate after upgrading the cluster?

We currently don't have customers who create that many snapshots, so I have no answer to that. And more importantly, this customer rebuilt that cluster into a stretch cluster (not the "official" stretch mode though), so snapshots are not created anymore, hence no more growth. But they kept their secondary site as a pre-prod environment, which would allow us to test there with less risk.

So in the next step, I will upload a dump-keys from my test cluster on 18.2.7 before upgrading to Squid. Then I'll rerun the command force-remove-snap again and see if that changes anything regarding the mon store. If necessary, I can ask the customer for a dump-key from their secondary site as well.

Actions #32

Updated by Matan Breizman 27 days ago

  • Priority changed from Low to Normal

Eugen Block wrote in #note-31:

Just to be clear, I tested the tool on 17.2.8, it didn't seem to help with the mon store. In the meantime, I upgraded that (single node) cluster to Reef 18.2.7 (to be on the same version as the customer), but apparently, tracker 62983 was released in Squid 19.2.0. And I haven't tested that tool in Reef either. Should I upgrade to latest Squid and rerun the tool?
I will provide the dump-keys output from before and after retrying on Squid.

Allow me to clarify to avoid confusion:

As long as the test cluster is impacted and the tool is available, any version will do (without upgrading).
I would like to examine the key output; please mention whether the tool was used when sharing the output.
We can discuss the next steps afterwards.

Actions #33

Updated by Matan Breizman 25 days ago

  • Assignee changed from Matan Breizman to Naveen Naidu

Hey Naveen, as discussed offline, can you please look into this?
Thanks!

Actions #34

Updated by Eugen Block 24 days ago

To get a better impression, I re-ran the tool. Currently I'm waiting for the snap trims to finish; there were more than 2.5 million snap_trimq_len per PG, and after three days there are 2.4 million left, so that's gonna take a while. But I will share the dump-keys from before the last run.

Actions #36

Updated by Eugen Block 8 days ago

After 16 days of snaptrimming, the combined snap_trimq_len is still > 9 million:

for i in $(ceph pg ls | grep -E "^37\." | cut -d " " -f 1); do ceph pg $i query | jq -r '.snap_trimq_len'; done
314288
175756
865047
259249
221006
281201
0
219341
346177
345651
202414
558398
0
296885
291407
348982
304399
338039
204312
281165
220856
605059
194487
197399
339641
195008
335498
0
194372
544950
218017
520124

So this is gonna take another week or two; I can't tell for sure. I've been playing around with configs to speed up the trimming process since there's no client IO on this test cluster. But it quickly eats up all RAM and leads to OOM killers. So now I have a while loop running: it sets nosnaptrim after 90 seconds of snaptrimming, sleeps for 60 seconds until the load settles, and then unsets nosnaptrim again. This could be really tough for a production cluster with a lot more snapshots than I had in my lab, although I expect real hardware to deal way better with snaptrims than my VM. I'll report back once my test cluster is done with snaptrims.
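The throttle loop described above could be sketched roughly as follows. The `nosnaptrim` OSD flag is real, but treat this as an untested sketch: the timings are just the ones mentioned in the comment and should be tuned per cluster. It is defined as a function and not invoked here, since the loop runs indefinitely until interrupted.

```shell
#!/bin/sh
# Hedged sketch: alternate short snaptrim windows with pauses so
# OSD memory can settle between rounds. Defaults (90s trim, 60s
# pause) come from the comment above; illustrative only.
throttle_snaptrim() {
  trim_secs=${1:-90}    # how long to let snaptrim run
  pause_secs=${2:-60}   # how long to pause so load settles
  while :; do
    ceph osd unset nosnaptrim   # let snaptrim proceed
    sleep "$trim_secs"
    ceph osd set nosnaptrim     # stop trimming, let RAM settle
    sleep "$pause_secs"
  done
}
```

Running `throttle_snaptrim 90 60` in a tmux/screen session (and stopping it with Ctrl-C once `snap_trimq_len` reaches 0) would reproduce the manual procedure described above.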
