
osd: Remove leaked clone objects (SnapMapper malformed key) #52971

Closed

Matan-B wants to merge 4 commits into ceph:main from Matan-B:wip-matanb-redelete-snap

Conversation

Matan-B (Contributor) commented Aug 14, 2023

This PR aims to solve a leak of clone objects on clusters affected by: https://tracker.ceph.com/issues/56147

  • Step 1: (for each affected osd)
    Using a new asock command (fix_malformed_snapmapper_keys) that converts the malformed SnapMapper keys to the correct structure.
    To cover clone objects that belong to an EC pool, all possible shard prefixes are inserted (128 keys in total).

  • Step 2: (mon command)
    Separated into osd: Add force-reremove-snap mon command #53235
    Since the correct key exists, ceph osd pool force_reremove_snap <pool> <lower bound> <upper bound> mon command can be used to remove any non-existing (already removed) snapshots.

  • Step 3: (for each affected osd)
    Clean up the extra inserted keys in step 1 after re-deleting the snapshots by using (cleanup_snapmapper_possible_keys)


The leaked clone objects associated with the re-deleted snapshot will be removed once the key is fixed.

2023-08-08T17:43:24.928+0000 7fcb7e0f4700 10 snap_mapper.convert_malformed
2023-08-08T17:43:24.928+0000 7fcb7e0f4700 20 snap_mapper.convert_malformed old key: SNA_2_0000000000000001_
2023-08-08T17:43:24.928+0000 7fcb7e0f4700 20 snap_mapper.convert_malformed converted key: SNA_2_0000000000000001_0000000000000002.ECF34CE6.1.objectone..                                                                           
2023-08-08T17:43:24.929+0000 7fcb7e0f4700 10 snap_mapper.convert_malformed converted 1 keys
2023-08-08T17:43:24.929+0000 7fcb7e0f4700  1 snap_mapper.convert_malformed converted 1 keys in 0.000145347s
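The conversion shown in the log above can be sketched as follows. This is a minimal illustration under a deliberately simplified key layout: `hex16`, `malformed_key`, `candidate_fixed_keys`, and the object-suffix format are all hypothetical stand-ins, not Ceph's actual SnapMapper encoding.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Encode a snap id as 16 uppercase hex digits, as seen in the log keys.
std::string hex16(uint64_t v) {
  char buf[17];
  std::snprintf(buf, sizeof(buf), "%016llX", (unsigned long long)v);
  return std::string(buf);
}

// Malformed key as in the log: "SNA_<pool>_<snap-hex>_" with the
// per-object suffix missing, so it is not unique per clone object.
std::string malformed_key(uint64_t pool, uint64_t snap) {
  return "SNA_" + std::to_string(pool) + "_" + hex16(snap) + "_";
}

// Step 1 idea: the shard of an EC clone object is unknown at this point,
// so emit one candidate fixed key per possible shard value
// (128 keys in total, matching the figure in the PR description).
std::vector<std::string> candidate_fixed_keys(uint64_t pool, uint64_t snap,
                                              const std::string& obj_suffix) {
  std::vector<std::string> keys;
  for (int shard = 0; shard < 128; ++shard)
    keys.push_back(malformed_key(pool, snap) + std::to_string(shard) + "." + obj_suffix);
  return keys;
}
```

For pool 2 and snap id 1, every candidate shares the prefix `SNA_2_0000000000000001_`, which is why the extra keys from Step 1 must be cleaned up again in Step 3.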

TO DO:

  • self-managed snaps
  • EC pool support
  • Remove additional keys inserted after snapshot is removed
  • Release osd_lock on SnapMapper traverse
  • tests

Fixes: https://tracker.ceph.com/issues/62596


Matan-B (Author) commented Aug 14, 2023

@ronen-fr Thank you for the comments. Since this PR is in an early WIP stage, I prefer to first verify that this approach is indeed valid and reliable (for backporting) before addressing them.
For instance, the get_snap_seq loop will be removed since this is relevant for pool-snaps only.

Do you find any design flaws with this approach?
Note: In order to support ec pools I will add all possible shard prefix keys - the shard id is 8 bits wide, so it looks feasible to add them and remove them afterwards.

ronen-fr (Contributor) replied:

Do you find any design flaws with this approach?

Working on that...

athanatos (Contributor) commented Aug 14, 2023

I don't want to add a librados api for this purpose. librados api commands need to be maintained long term and shouldn't be created for one-off fixes. Instead, I'd suggest a mon command. See OSDMonitor::prepare_command_impl, search for "pool rmsnap". We probably want an optional argument for "pool rmsnap" that forces it to be placed back into the OSDMap. (edit: "pool rmsnap" is specific to pool snaps, so we'll need a new additional command) You should be able to add a command to OSDMonitor::prepare_command/preprocess_command like "pool rmsnap" but for an unmanaged snap. See the prepare_pool_op/preprocess_pool_op handlers for POOL_OP_DELETE_SNAP for the monitor side of the librados self-managed snapshot removal logic.

As an aside, we do probably want there to be a script which re-removes snaps in smallish batches (probably specified as an argument to the above command) which are confirmed to be trimmed before moving onto the next one. As such, the command should either take a snapshot to remove or a set (comma delimited?). That avoids needing to re-trim all of the previously removed snapshots on a cluster at once, and allows specifically re-removing as little as a single snap.

Matan-B force-pushed the wip-matanb-redelete-snap branch from 6ddf32b to ecf7a37 on August 15, 2023 08:47
Matan-B removed the "cephfs - Ceph File System" label Aug 17, 2023
Matan-B force-pushed the wip-matanb-redelete-snap branch 2 times, most recently from f0135a3 to cd8ad69 on August 21, 2023 12:39
Matan-B (Author) commented Aug 21, 2023

I don't want to add a librados api for this purpose. librados api commands need to be maintained long term and shouldn't be created for one-off fixes. Instead, I'd suggest a mon command.
See OSDMonitor::prepare_command_impl, search for "pool rmsnap". We probably want an optional argument for "pool rmsnap" that forces it to be placed back into the OSDMap. (edit: "pool rmsnap" is specific to pool snaps, so we'll need a new additional command) You should be able to add a command to OSDMonitor::prepare_command/preprocess_command like "pool rmsnap" but for an unmanaged snap.

Moved to a new mon command.

See prepare_pool_op/preprocess_pool_op handlers for POOL_OP_DELETE_SNAP for the monitor side of the librados self-managed snapshot removal logic.

Pool snaps are ok since we have the existing snaps at hand. However, for self-managed ones I couldn't find an elegant way to obtain the existing/purged snaps.
PGMapDigest::purged_snaps is possible to get hold of although it only holds the purged snap ids without the corresponding pool.
pg_info_t also holds complete purged snaps list but I couldn't obtain it from the monitor side. Moreover, traversing through all of the pgs to collect the purged snaps might be too expensive.
I think that the best option to get the existing snap ids is to let the user specify them as a parameter.
This part will be handled by a (future) script that will also remove the snapshots in batches.

Not relevant anymore, see new comment.

As an aside, we do probably want there to be a script which re-removes snaps in smallish batches (probably specified as an argument to the above command) which are confirmed to be trimmed before moving onto the next one. As such, the command should either take a snapshot to remove or a set (comma delimited?). That avoids needing to re-trim all of the previously removed snapshots on a cluster at once, and allows specifically re-removing as little as a single snap.

Added a lower and upper snap id bounds.
Note: We avoid removing snapshots that have not yet been taken, even if the passed upper snap id bound exceeds the latest taken snap id.


Example usage for rbd:

$ rbd snap ls <pool>/<image>
SNAPID  NAME   SIZE   PROTECTED  TIMESTAMP
     5  snp14  4 KiB             Mon Aug 21 11:47:07 2023
     6  snp15  4 KiB             Mon Aug 21 11:47:09 2023

$ ceph osd pool rmsnap_again <pool> 0 10 5,6

removing snap 1 again from pool 2
removing snap 2 again from pool 2
removing snap 3 again from pool 2
removing snap 4 again from pool 2
snap 5 was specified and won't be removed
snap 6 was specified and won't be removed

ceph osd pool rmsnap_again <pool> <lower_bound> <higher_bound> <existing snap ids>
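The selection behavior in the example above can be sketched as follows, assuming the pool's latest taken snap id (snap_seq) is 6. The function and parameter names are illustrative only, not Ceph's actual implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Re-remove every snap id in [lower, upper], skipping ids the user listed
// as existing, and clamping the upper bound to the latest taken snapshot
// (snap_seq) so that not-yet-taken snapshots are never removed.
std::vector<uint64_t> snaps_to_reremove(uint64_t lower, uint64_t upper,
                                        uint64_t snap_seq,
                                        const std::set<uint64_t>& existing) {
  std::vector<uint64_t> out;
  upper = std::min(upper, snap_seq);  // never touch not-yet-taken snapshots
  for (uint64_t id = std::max<uint64_t>(lower, 1); id <= upper; ++id) {
    if (existing.count(id))
      continue;  // "snap N was specified and won't be removed"
    out.push_back(id);
  }
  return out;
}
```

With lower=0, upper=10, snap_seq=6 and existing ids {5, 6}, this selects snaps 1 through 4, matching the command output shown above.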

Matan-B force-pushed the wip-matanb-redelete-snap branch from cd8ad69 to f7d9e23 on August 23, 2023 17:39
Matan-B (Author) commented Aug 23, 2023

@athanatos, I updated the monitor command to simply iterate up to the latest snap_seq (maintained for both snapshot types) and to redelete only the purged snapshots, which are obtained from the monitor db.
The current command requires only the pool name and lower/upper snap id bounds to redelete.

Note: In earlier discussions I assumed that the purged snaps kv entries are not complete. However, that seems to be wrong as they are not being trimmed in any way.

@@ -3093,6 +3093,22 @@ will start to track new ops received afterwards.";
pg_recovery_stats.reset();
}

athanatos (Contributor) commented Aug 24, 2023
I don't understand how this command is supposed to work. Add brief, precise comments to the header declarations of convert_malformed_key and convert_malformed in the format:

/**
 * method_name
 *
 * Explanation of what method_name *does*, not how it works.  A reader should be able
 * to use this method correctly by reading this comment without needing to read the
 * implementation.
 */

If you need to elaborate on something in the implementation, do it in the definition body.

From what I can tell from the implementation, it looks like the goal here is to remove the malformed key and write back (for an ec pool) all 128 possible legacy keys. What's the plan for ultimately removing the extras? The OSD has enough information to map the hobject_t to a PG. From there, if the OSD only has a single shard of the PG (which it almost always will), you can infer the shard. You could pass a lambda in to do that mapping. You could simply skip any objects in PGs that happen to have multiple shards present (lambda returns NOSHARD or something).
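The callback shape suggested here could look roughly like the sketch below. The types (`ObjId`, `PgId`, `shard_t`) are simplified stand-ins for Ceph's `hobject_t`/`spg_t`, and `make_resolver` is a hypothetical helper, not the project's actual API.

```cpp
#include <functional>
#include <map>
#include <set>

using ObjId = int;      // stand-in for hobject_t
using PgId = int;       // stand-in for spg_t
using shard_t = int;
constexpr shard_t NOSHARD = -1;

// Resolver maps an object to the single shard this OSD holds for the
// object's PG, or NOSHARD when the shard cannot be inferred (unknown PG,
// or multiple shards of the PG present locally - which the review
// suggests simply skipping).
using ShardResolver = std::function<shard_t(const ObjId&)>;

ShardResolver make_resolver(std::map<ObjId, PgId> obj_to_pg,
                            std::map<PgId, std::set<shard_t>> local_shards) {
  return [obj_to_pg, local_shards](const ObjId& obj) -> shard_t {
    auto pg = obj_to_pg.find(obj);
    if (pg == obj_to_pg.end())
      return NOSHARD;
    auto sh = local_shards.find(pg->second);
    // Infer the shard only when exactly one shard of the PG is local.
    if (sh == local_shards.end() || sh->second.size() != 1)
      return NOSHARD;
    return *sh->second.begin();
  };
}
```

Such a lambda could be passed into the conversion routine so only one fixed key is written per object, instead of all 128 candidates.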


Matan-B force-pushed the wip-matanb-redelete-snap branch 6 times, most recently from e6b7cd9 to 009fbfe on August 29, 2023 11:57
Matan-B (Author) commented Aug 29, 2023

From what I can tell from the implementation, it looks like the goal here is to remove the malformed key and write back (for an ec pool) all 128 possible legacy keys. What's the plan for ultimately removing the extras?

6172c30

The OSD has enough information to map the hobject_t to a PG. From there, if the OSD only has a single shard of the PG (which it almost always will), you can infer the shard. You could pass a lambda in to do that mapping. You could simply skip any objects in PGs that happen to have multiple shards present (lambda returns NOSHARD or something).

I'm not sure I understand how; convert_malformed is quite limited since it's static. In a normal scenario the SnapMapper instance is constructed with the shard_id from the pg. Do you mean getting all the pgs (_get_pgs) to be used in the conversion?

I'm worried about holding the osd lock during the scan as well. While you have bounded the number of keys to remove, you're going to keep re-reading the SnapMapper up to that point. Is it the case that the malformed keys always sort at the start of the region? If so, that's an invariant you should explain in the implementation definition (not the header).

Malformed keys aren't sorted since they share the same prefix with the correct ones.
I added a commit to release the lock every transaction_size keys scanned 2979e2b


Other comments were addressed.

Matan-B requested a review from athanatos on August 29, 2023 14:32
/**
  * convert_malformed
  *
  * Scans the SnapMapper for malformed keys created by a bug
  * during upgrade from N (and earlier) to O (up to 16.2.11).
  * Each detected key is replaced in place with a valid key.
  * See: https://tracker.ceph.com/issues/62596
  *
  * TODO: use the OSD to get information to map the hobject_t to a PG
  *       instead of adding all possible EC shard prefixes.
  */

Converts each
<MAPPING_PREFIX><poolid>_<snapid>_ ("malformed key")
to
<LEGACY_MAPPING_PREFIX><snapid>_<shard_id>_<hobject_t::to_str()>

by running:
`ceph daemon osd.<id> fix_malformed_snapmapper_keys`
on each (affected) osd in the acting set.

Fixes: Part 1/3 - https://tracker.ceph.com/issues/62596

Signed-off-by: Matan Breizman <mbreizma@redhat.com>
Matan-B force-pushed the wip-matanb-redelete-snap branch from 009fbfe to adf25db on August 30, 2023 11:55
Matan-B changed the title from "[WIP] osd: Remove leaked clone objects (SnapMapper malformed key)" to "osd: Remove leaked clone objects (SnapMapper malformed key)" Aug 30, 2023
Matan-B force-pushed the wip-matanb-redelete-snap branch from adf25db to 2979e2b on August 30, 2023 13:26
/**
  * remove_possible_keys
  *
  * SnapMapper::convert_malformed() inserts all the possible keys.
  * Remove the extra keys inserted once `force_reremove_snap` has been used.
  *
  * TODO: bound to transaction size
  */

Fixes: Part 3/3 - https://tracker.ceph.com/issues/62596

Signed-off-by: Matan Breizman <mbreizma@redhat.com>
Signed-off-by: Matan Breizman <mbreizma@redhat.com>
Signed-off-by: Matan Breizman <mbreizma@redhat.com>
Matan-B force-pushed the wip-matanb-redelete-snap branch from 2979e2b to ba6c967 on August 31, 2023 08:39
Matan-B (Author) commented Aug 31, 2023

Step 2/3 was separated into #53235 since it can stand as a standalone step that is not necessarily tied to the issue addressed in this PR.

Matan-B (Author) commented Aug 31, 2023

@athanatos, the malformed keys are not unique and are therefore useless: the last malformed key overwrites the previous ones. Marking as DNM until verified.

Matan-B added the DNM label Aug 31, 2023
Matan-B (Author) commented Sep 6, 2023

Closing; see the tracker for the resolution of this issue.
https://tracker.ceph.com/issues/62596

Matan-B closed this Sep 6, 2023