Bug #64519

open

OSD/MON: No snapshot metadata keys trimming

Added by Matan Breizman about 2 years ago. Updated 8 days ago.

Status:
Pending Backport
Priority:
Normal
Assignee:
Category:
Snapshots
Target version:
% Done:

0%

Source:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Tags (freeform):
backport_processed
Fixed In:
v19.3.0-3837-g53a790fc0a
Released In:
v20.2.0~2352
Upkeep Timestamp:
2025-11-01T00:58:06+00:00

Description


Files

20260224-mon-store-dump-keys.zip (4.6 MB) Eugen Block, 02/27/2026 07:46 AM

Related issues 3 (2 open, 1 closed)

Related to RADOS - Bug #62983: OSD/MON: purged snap keys are not merged (Resolved) - Matan Breizman

Copied to RADOS - Backport #70026: reef: OSD/MON: No snapshot metadata keys trimming (Deferred) - Matan Breizman
Copied to RADOS - Backport #70027: squid: OSD/MON: No snapshot metadata keys trimming (Deferred) - Matan Breizman
Actions #1

Updated by Matan Breizman about 2 years ago

  • Description updated (diff)
Actions #2

Updated by Joshua Baergen about 2 years ago

This reminded me of the notes in https://pad.ceph.com/p/removing_removed_snaps/timeslider#4651 that talk about why the set of deleted snapshots needs to stick around for a while. But I'm assuming that "a while" probably doesn't need to mean permanently...

Actions #3

Updated by Matan Breizman almost 2 years ago

  • Related to Bug #62983: OSD/MON: purged snap keys are not merged added
Actions #4

Updated by Matan Breizman almost 2 years ago

  • Status changed from New to In Progress

https://tracker.ceph.com/issues/62983 should help avoid the gaps in the purged snap id intervals. As a result, all the purged_snap ids will be merged into a single entry.

For clusters already impacted by this issue, https://github.com/ceph/ceph/pull/53545 may help with removing the "ghost" snapids that cause the gaps, allowing all the entries to be merged.

Actions #5

Updated by Matan Breizman almost 2 years ago

  • Assignee set to Matan Breizman
  • Pull request ID set to 53545

Adding 53545 as a candidate for fixing this issue, this will require additional documentation on how to use the tool - so I'll keep the tracker open.

Actions #6

Updated by Eugen Block almost 2 years ago

I know I'm a bit early asking this, but I helped raise this issue and Mykola picked it up on the devel mailing list. I talked to one of our customers who is affected by this (more than 40 million purged_snap entries) and they would be interested in testing this feature on their secondary site (they mirror rbd images). But they're currently in the planning process to remove the second site; I have no ETA though. I expect this fix (and the respective tool to trim affected mon stores) no earlier than Squid. There's no telling if and when they will be able to upgrade to Squid; they recently upgraded to Quincy though. So would there be a chance to backport this to Reef and Quincy as well, depending on which release they'll be on when this is considered ready?
And a couple more questions regarding the purge tool:
  1. Will it be possible to trim the keys online (without cluster downtime)?
  2. How "safe" will it be? What could go wrong and would there be some rollback mechanism?
Actions #7

Updated by Radoslaw Zarzynski almost 2 years ago

Looks pretty backportable but let's wait for Matan's word.

Actions #8

Updated by Matan Breizman almost 2 years ago

  • Backport set to quincy, reef, squid

Eugen Block wrote in #note-6:

I know I'm a bit early asking this, but I helped raise this issue and Mykola picked it up on the devel mailing list. I talked to one of our customers who is affected by this (more than 40 million purged_snap entries) and they would be interested in testing this feature on their secondary site (they mirror rbd images). But they're currently in the planning process to remove the second site; I have no ETA though. I expect this fix (and the respective tool to trim affected mon stores) no earlier than Squid. There's no telling if and when they will be able to upgrade to Squid; they recently upgraded to Quincy though. So would there be a chance to backport this to Reef and Quincy as well, depending on which release they'll be on when this is considered ready?

Hey Eugen,
There should be no issues with backporting this back to Quincy, as this PR offers a new, separate command.
The relevant usage of the command will be with the default option:

     * Default: All the snapids in the given range which are not
     *   marked as purged in the Monitor will be removed. Mostly useful
     *   for cases in which the snapid is leaked on the client side.
     *   See: https://tracker.ceph.com/issues/64646

And a couple more questions regarding the purge tool:
  1. Will it be possible to trim the keys online (without cluster downtime)?
  2. How "safe" will it be? What could go wrong and would there be some rollback mechanism?
  1. Yes, it's possible. The (online) command doesn't require shutting down the OSDs or MONs.
  2. The command was also added to our testing workloads and seems to work well.
    The tricky part is the unknown unknowns. I do not expect anything to go wrong, as the command will only interact with ghost snapids. Moreover, the command can also be used gradually (short snapid removal intervals) to verify nothing goes wrong while using it.
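The gradual approach suggested above can be sketched as a simple batching loop. This is a hypothetical sketch: the pool name, batch size, and range are illustrative, and the actual `ceph osd pool force-remove-snap` call is commented out so the loop only prints the batches it would process.

```shell
#!/bin/sh
# Hedged sketch: split a large snapid range into batches so the
# removal command can be run gradually and the cluster checked
# between steps. POOL, BATCH, START, and END are illustrative.
POOL=mypool
BATCH=250000
START=1
END=1000000

start=$START
while [ "$start" -le "$END" ]; do
  end=$((start + BATCH - 1))
  [ "$end" -gt "$END" ] && end=$END
  echo "batch: $start $end"
  # ceph osd pool force-remove-snap "$POOL" "$start" "$end"
  start=$((end + 1))
done
```

With the values above this prints four batches (1-250000 through 750001-1000000); uncommenting the `ceph` line would run the removal per batch, leaving time to verify cluster health between steps.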
Actions #9

Updated by Radoslaw Zarzynski almost 2 years ago

The PR is in QA.

Actions #10

Updated by Eugen Block almost 2 years ago

Thanks, Matan! It sounds very promising. I talked to the customer and they are willing to test this cleanup procedure on their secondary site. Apparently, this will be backported to Quincy so that will make it easier. I'm still not entirely sure if I understand all the required steps or if simply running ceph osd pool force-remove-snap unique_pool_0 will suffice. But maybe we can discuss that in Slack or something.

Actions #11

Updated by Matan Breizman almost 2 years ago

Eugen Block wrote in #note-10:

Thanks, Matan! It sounds very promising. I talked to the customer and they are willing to test this cleanup procedure on their secondary site. Apparently, this will be backported to Quincy so that will make it easier. I'm still not entirely sure if I understand all the required steps or if simply running ceph osd pool force-remove-snap unique_pool_0 will suffice. But maybe we can discuss that in Slack or something.

I'll provide a detailed explanation once the PR has passed QA. Broadly speaking, only running the command should be sufficient.

Actions #12

Updated by Radoslaw Zarzynski almost 2 years ago

note from scrub: bump up.

Actions #13

Updated by Matan Breizman almost 2 years ago

  • Tracker changed from Bug to Feature
  • Priority changed from Normal to Low
  • Regression deleted (No)
  • Severity deleted (3 - minor)
Actions #14

Updated by Matan Breizman almost 2 years ago · Edited

  • Pull request ID changed from 53545 to 57549

Separating the previous PR into PR#57548 and PR#57549.

Actions #15

Updated by Matan Breizman almost 2 years ago

  • Pull request ID changed from 57549 to 57548
Actions #16

Updated by Matan Breizman almost 2 years ago

  • Description updated (diff)
  • Status changed from In Progress to Fix Under Review
Actions #17

Updated by Eugen Block over 1 year ago

I might be too early asking this, but I upgraded one of my test clusters to 17.2.8, eager to test this new tool. This cluster has 1068805 "purged_snap" entries, yesterday I ran ceph osd pool force-remove-snap spiegel1 1 1000000 to purge the first batch. The OSDs are still snaptrimming (I only have 4 OSDs in that lab, the pool has 8 PGs). But I wanted to get an early impression of the results and ran a "dump-keys" after the snaptrimming had finished almost 80% of the snaptrim_queue. But the number of purged_snap entries remains at 1068805. Is that expected? I had hoped that this would also reduce the number of entries, leading to a smaller mon store. Am I just being impatient or did I misunderstand the purpose of this tool?

Actions #18

Updated by Konstantin Shalygin about 1 year ago

  • Backport changed from quincy, reef, squid to reef, squid
Actions #19

Updated by Konstantin Shalygin about 1 year ago

  • Tracker changed from Feature to Bug
  • Status changed from Fix Under Review to Pending Backport
  • Target version set to v20.0.0
  • Regression set to No
  • Severity set to 3 - minor
Actions #20

Updated by Upkeep Bot about 1 year ago

  • Copied to Backport #70026: reef: OSD/MON: No snapshot metadata keys trimming added
Actions #21

Updated by Upkeep Bot about 1 year ago

  • Copied to Backport #70027: squid: OSD/MON: No snapshot metadata keys trimming added
Actions #22

Updated by Upkeep Bot about 1 year ago

  • Tags (freeform) set to backport_processed
Actions #23

Updated by Matan Breizman about 1 year ago

  • Backport deleted (reef, squid)

Note: I'm not sure we want to backport this until it is verified to fix this issue as well.
As the PR states:

Fixes: https://tracker.ceph.com/issues/66122

Possibly Fixes: https://tracker.ceph.com/issues/64519

I might be too early asking this, but I upgraded one of my test clusters to 17.2.8, eager to test this new tool. This cluster has 1068805 "purged_snap" entries, yesterday I ran ceph osd pool force-remove-snap spiegel1 1 1000000 to purge the first batch. The OSDs are still snaptrimming (I only have 4 OSDs in that lab, the pool has 8 PGs). But I wanted to get an early impression of the results and ran a "dump-keys" after the snaptrimming had finished almost 80% of the snaptrim_queue. But the number of purged_snap entries remains at 1068805. Is that expected? I had hoped that this would also reduce the number of entries, leading to a smaller mon store. Am I just being impatient or did I misunderstand the purpose of this tool?

Thanks for sharing the information above!
Is this data still relevant? Are you able to still work with this cluster?

Actions #24

Updated by Eugen Block about 1 year ago

Yes, the cluster is still usable, but it's really just a single-node test cluster; there isn't any real load or any applications except for rbd mirror and rgw sync to a different single-node test cluster. I didn't recommend that our customer use this tool yet, until we know whether it works as designed. Our main goal was to trim the mon store, but that doesn't seem to happen here.

Actions #25

Updated by Upkeep Bot 9 months ago

  • Merge Commit set to 53a790fc0a36b4980ba2f8522293d2c8a4a5c62c
  • Fixed In set to v19.3.0-3837-g53a790fc0a3
  • Upkeep Timestamp set to 2025-07-09T14:05:16+00:00
Actions #26

Updated by Upkeep Bot 8 months ago

  • Fixed In changed from v19.3.0-3837-g53a790fc0a3 to v19.3.0-3837-g53a790fc0a
  • Upkeep Timestamp changed from 2025-07-09T14:05:16+00:00 to 2025-07-14T17:41:20+00:00
Actions #27

Updated by Eugen Block 6 months ago

This issue might have another heavy impact. The customer created new OSDs, and each one takes several minutes to boot (SSD only). Each OSD process spikes to around 140 GB of RAM usage while booting, killing the host. Without debug logs enabled there's nothing to see during that boot time, but we enabled debug logs for one OSD and saw a huge number of purged_snaps entries. Unfortunately, the customer wasn't able to capture that log for me to confirm; I'm currently waiting for new logs. But it appears that these untrimmed snapshots impact more than just the mon store. Using the orchestrator to create multiple OSDs at once is impossible right now. As soon as I have logs to back up my suspicion, I'll attach them to this tracker.

Actions #28

Updated by Eugen Block 5 months ago

I was able to reproduce this in my lab environment; it's a single-node cluster with more than one million purged_snaps (1068905 entries in the mon store). With debug level 10 (debug_osd = 10) there's one line standing out in the log: it contains all the snap ranges. Here's a short excerpt:

2025-10-21T16:28:59.590+0000 7f23b3c5b640 10 snap_mapper.record_purged_snaps purged_snaps {505={20=[1~1]},513={20=[5~1]},516={20=[4~1]},517={20=[3~1]},884={21=[2~1]},888={21=[4~1]},898={21=[7~3]}

This single line, split on commas, results in more than 1.6 million lines.

The customer cluster where we debugged this had more than 42 million purged_snaps two years ago (we haven't checked in a while). They are currently expanding the cluster, but adding one OSD at a time to prevent OOM killers is not a great procedure.
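Counting the purged_snap entries referred to above can be done on a dump-keys listing. The sketch below is illustrative only: the sample input merely mimics the key naming, and the exact output format of a real dump may differ. On a real cluster the listing would come from something like `ceph-monstore-tool /var/lib/ceph/mon/<id> dump-keys > keys.txt`, run against a stopped mon (or a copy of its store).

```shell
#!/bin/sh
# Hedged sketch: count purged_snap keys in a dump-keys listing.
# The here-doc below is fabricated sample data standing in for a
# real dump; only the purged_snap_ naming is the point.
cat > keys.txt <<'EOF'
osd_snap purged_snap_20_0000000000000001
osd_snap purged_snap_20_0000000000000005
osd_snap purged_epoch_0000000000000015
EOF
grep -c 'purged_snap_' keys.txt
# prints: 2
```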

Actions #29

Updated by Upkeep Bot 5 months ago

  • Released In set to v20.2.0~2352
  • Upkeep Timestamp changed from 2025-07-14T17:41:20+00:00 to 2025-11-01T00:58:06+00:00
Actions #30

Updated by Matan Breizman 27 days ago · Edited

Eugen Block wrote in #note-17:

I might be too early asking this, but I upgraded one of my test clusters to 17.2.8, eager to test this new tool. This cluster has 1068805 "purged_snap" entries, yesterday I ran ceph osd pool force-remove-snap spiegel1 1 1000000 to purge the first batch. The OSDs are still snaptrimming (I only have 4 OSDs in that lab, the pool has 8 PGs). But I wanted to get an early impression of the results and ran a "dump-keys" after the snaptrimming had finished almost 80% of the snaptrim_queue. But the number of purged_snap entries remains at 1068805. Is that expected? I had hoped that this would also reduce the number of entries, leading to a smaller mon store. Am I just being impatient or did I misunderstand the purpose of this tool?

Hey Eugen, thank you for the details.
  • I would like to understand why https://tracker.ceph.com/issues/62983 didn't help in the above scenario.
    Would it be possible to share the "dump-keys" output? (from any of the impacted clusters)

Thank you!

Actions #31

Updated by Eugen Block 27 days ago

Matan Breizman wrote in #note-30:

  • I would like to understand why https://tracker.ceph.com/issues/62983 didn't help in the above scenario.
    Would it be possible to share the "dump-keys" output? (from any of the impacted clusters)

Just to be clear, I tested the tool on 17.2.8, it didn't seem to help with the mon store. In the meantime, I upgraded that (single node) cluster to Reef 18.2.7 (to be on the same version as the customer), but apparently, tracker 62983 was released in Squid 19.2.0. And I haven't tested that tool in Reef either. Should I upgrade to latest Squid and rerun the tool?
I will provide the dump-keys output from before and after retrying on Squid.

  • Secondly, PR https://github.com/ceph/ceph/pull/55841 (backported to Q,R,S) should have prevented the issue for newer clusters.
    Are you aware of clusters hitting this issue after the above merge (v17.2.8 for Q)?
    Alternatively, did the number of purged snaps continue to accumulate after upgrading the cluster?

We currently don't have customers who create that many snapshots, so I have no answer to that. And more importantly, this customer rebuilt that cluster into a stretch cluster (not the "official" stretch mode though), so snapshots are not created anymore, hence no more growth. But they kept their secondary site as a pre-prod environment, which would allow us to test there with less risk.

So in the next step, I will upload a dump-keys from my test cluster on 18.2.7 before upgrading to Squid. Then I'll rerun the command force-remove-snap again and see if that changes anything regarding the mon store. If necessary, I can ask the customer for a dump-key from their secondary site as well.

Actions #32

Updated by Matan Breizman 27 days ago

  • Priority changed from Low to Normal

Eugen Block wrote in #note-31:

Just to be clear, I tested the tool on 17.2.8, it didn't seem to help with the mon store. In the meantime, I upgraded that (single node) cluster to Reef 18.2.7 (to be on the same version as the customer), but apparently, tracker 62983 was released in Squid 19.2.0. And I haven't tested that tool in Reef either. Should I upgrade to latest Squid and rerun the tool?
I will provide the dump-keys output from before and after retrying on Squid.

Allow me to clarify to avoid confusion:

As long as the test cluster is impacted and the tool is available, any version will do (without upgrading).
I would like to examine the key output; please mention whether the tool was used when sharing the output.
We can discuss the next steps afterwards.

Actions #33

Updated by Matan Breizman 25 days ago

  • Assignee changed from Matan Breizman to Naveen Naidu

Hey Naveen, as discussed offline, can you please look into this?
Thanks!

Actions #34

Updated by Eugen Block 24 days ago

To get a better impression, I re-ran the tool. Currently I'm waiting for the snap trims to finish; there were more than 2.5 million snap_trimq_len per PG, and after three days there are 2.4 million left, so that's gonna take a while. But I will share the dump-keys from before the last run.

Actions #36

Updated by Eugen Block 8 days ago

After 16 days of snaptrimming, the combined snap_trimq_len is still > 9 million:

for i in $(ceph pg ls | grep -E "^37\." | cut -d " " -f 1); do ceph pg $i query | jq -r '.snap_trimq_len'; done
314288
175756
865047
259249
221006
281201
0
219341
346177
345651
202414
558398
0
296885
291407
348982
304399
338039
204312
281165
220856
605059
194487
197399
339641
195008
335498
0
194372
544950
218017
520124

So this is gonna take another week or two; I can't tell for sure. I've been playing around with configs to speed up the trimming process since there's no client IO on this test cluster. But it quickly eats up all RAM and leads to OOM killers. So now I have a while loop running: it sets nosnaptrim after 90 seconds of snaptrimming, sleeps for 60 seconds until the load settles, and then unsets nosnaptrim again. This could be really tough for a production cluster with a lot more snapshots than I had in my lab, although I expect real hardware to deal way better with snaptrims than my VM. I'll report back once my test cluster is done with snaptrims.
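The throttle loop described above could be sketched roughly as follows. The `nosnaptrim` OSD flag is real, but treat this as an untested sketch: the timings are just the ones mentioned in the comment and should be tuned per cluster. It is defined as a function and not invoked here, since the loop runs indefinitely until interrupted.

```shell
#!/bin/sh
# Hedged sketch: alternate short snaptrim windows with pauses so
# OSD memory can settle between rounds. Defaults (90s trim, 60s
# pause) come from the comment above; illustrative only.
throttle_snaptrim() {
  trim_secs=${1:-90}    # how long to let snaptrim run
  pause_secs=${2:-60}   # how long to pause so load settles
  while :; do
    ceph osd unset nosnaptrim   # let snaptrim proceed
    sleep "$trim_secs"
    ceph osd set nosnaptrim     # stop trimming, let RAM settle
    sleep "$pause_secs"
  done
}
```

Running `throttle_snaptrim 90 60` in a tmux/screen session (and stopping it with Ctrl-C once `snap_trimq_len` reaches 0) would reproduce the manual procedure described above.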
