
osd: Remove leaked clone objects (SnapMapper malformed key) #52971

Closed

Matan-B wants to merge 4 commits into ceph:main from Matan-B:wip-matanb-redelete-snap

Conversation

Matan-B (Contributor) commented Aug 14, 2023

This PR aims to solve a leak of clone objects on clusters affected by: https://tracker.ceph.com/issues/56147

  • Step 1: (for each affected osd)
    Using a new asock command (fix_malformed_snapmapper_keys) that converts the malformed SnapMapper keys to the correct structure.
    To cover clone objects that belong to an EC pool, all possible shard prefixes are inserted (128 keys in total).

  • Step 2: (mon command)
    Separated into osd: Add force-reremove-snap mon command #53235
    Since the correct key exists, ceph osd pool force_reremove_snap <pool> <lower bound> <upper bound> mon command can be used to remove any non-existing (already removed) snapshots.

  • Step 3: (for each affected osd)
    Clean up the extra inserted keys in step 1 after re-deleting the snapshots by using (cleanup_snapmapper_possible_keys)


The leaked clone objects associated with the re-deleted snapshot will be removed once the key is fixed.

2023-08-08T17:43:24.928+0000 7fcb7e0f4700 10 snap_mapper.convert_malformed
2023-08-08T17:43:24.928+0000 7fcb7e0f4700 20 snap_mapper.convert_malformed old key: SNA_2_0000000000000001_
2023-08-08T17:43:24.928+0000 7fcb7e0f4700 20 snap_mapper.convert_malformed converted key: SNA_2_0000000000000001_0000000000000002.ECF34CE6.1.objectone..                                                                           
2023-08-08T17:43:24.929+0000 7fcb7e0f4700 10 snap_mapper.convert_malformed converted 1 keys
2023-08-08T17:43:24.929+0000 7fcb7e0f4700  1 snap_mapper.convert_malformed converted 1 keys in 0.000145347s
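The conversion shown in the log above can be sketched as follows. This is a minimal illustration under a deliberately simplified key layout: `hex16`, `malformed_key`, `candidate_fixed_keys`, and the object-suffix format are all hypothetical stand-ins, not Ceph's actual SnapMapper encoding.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Encode a snap id as 16 uppercase hex digits, as seen in the log keys.
std::string hex16(uint64_t v) {
  char buf[17];
  std::snprintf(buf, sizeof(buf), "%016llX", (unsigned long long)v);
  return std::string(buf);
}

// Malformed key as in the log: "SNA_<pool>_<snap-hex>_" with the
// per-object suffix missing, so it is not unique per clone object.
std::string malformed_key(uint64_t pool, uint64_t snap) {
  return "SNA_" + std::to_string(pool) + "_" + hex16(snap) + "_";
}

// Step 1 idea: the shard of an EC clone object is unknown at this point,
// so emit one candidate fixed key per possible shard value
// (128 keys in total, matching the figure in the PR description).
std::vector<std::string> candidate_fixed_keys(uint64_t pool, uint64_t snap,
                                              const std::string& obj_suffix) {
  std::vector<std::string> keys;
  for (int shard = 0; shard < 128; ++shard)
    keys.push_back(malformed_key(pool, snap) + std::to_string(shard) + "." + obj_suffix);
  return keys;
}
```

For pool 2 and snap id 1, every candidate shares the prefix `SNA_2_0000000000000001_`, which is why the extra keys from Step 1 must be cleaned up again in Step 3.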

TO DO:

  • self-managed snaps
  • EC pool support
  • Remove additional keys inserted after snapshot is removed
  • Release osd_lock on SnapMapper traverse
  • tests

Fixes: https://tracker.ceph.com/issues/62596


Matan-B (Author) commented Aug 14, 2023

@ronen-fr Thank you for the comments. Since this PR is in an early WIP stage, I prefer to first verify that this approach is indeed valid and reliable (for backporting) before addressing them.
For instance, the get_snap_seq loop will be removed since this is relevant for pool-snaps only.

Do you find any design flaws with this approach?
Note: In order to support ec pools I will add all possible shard prefix keys - the shard id is 8 bits wide, so it looks feasible to add them and remove them afterwards.

ronen-fr (Contributor) replied:

Do you find any design flaws with this approach?

Working on that...

athanatos (Contributor) commented Aug 14, 2023

I don't want to add a librados api for this purpose. librados api commands need to be maintained long term and shouldn't be created for one-off fixes. Instead, I'd suggest a mon command. See OSDMonitor::prepare_command_impl, search for "pool rmsnap". We probably want an optional argument for "pool rmsnap" that forces it to be placed back into the OSDMap. (edit: "pool rmsnap" is specific to pool snaps, so we'll need a new additional command) You should be able to add a command to OSDMonitor::prepare_command/preprocess_command like "pool rmsnap" but for an unmanaged snap. See the prepare_pool_op/preprocess_pool_op handlers for POOL_OP_DELETE_SNAP for the monitor side of the librados self-managed snapshot removal logic.

As an aside, we do probably want there to be a script which re-removes snaps in smallish batches (probably specified as an argument to the above command) which are confirmed to be trimmed before moving onto the next one. As such, the command should either take a snapshot to remove or a set (comma delimited?). That avoids needing to re-trim all of the previously removed snapshots on a cluster at once, and allows specifically re-removing as little as a single snap.

Matan-B force-pushed the wip-matanb-redelete-snap branch from 6ddf32b to ecf7a37 on August 15, 2023 08:47
Matan-B removed the "cephfs - Ceph File System" label Aug 17, 2023
Matan-B force-pushed the wip-matanb-redelete-snap branch 2 times, most recently from f0135a3 to cd8ad69 on August 21, 2023 12:39
Matan-B (Author) commented Aug 21, 2023

I don't want to add a librados api for this purpose. librados api commands need to be maintained long term and shouldn't be created for one-off fixes. Instead, I'd suggest a mon command.
See OSDMonitor::prepare_command_impl, search for "pool rmsnap". We probably want an optional argument for "pool rmsnap" that forces it to be placed back into the OSDMap. (edit: "pool rmsnap" is specific to pool snaps, so we'll need a new additional command) You should be able to add a command to OSDMonitor::prepare_command/preprocess_command like "pool rmsnap" but for an unmanaged snap.

Moved to a new mon command.

See prepare_pool_op/preprocess_pool_op handlers for POOL_OP_DELETE_SNAP for the monitor side of the librados self-managed snapshot removal logic.

Pool snaps are ok since we have the existing snaps at hand. However, for self-managed ones I couldn't find an elegant way to obtain the existing/purged snaps.
PGMapDigest::purged_snaps is possible to get hold of although it only holds the purged snap ids without the corresponding pool.
pg_info_t also holds complete purged snaps list but I couldn't obtain it from the monitor side. Moreover, traversing through all of the pgs to collect the purged snaps might be too expensive.
I think that the best option to get the existing snap ids is to let the user specify them as a parameter.
This part will be handled by a (future) script that will also remove the snapshots in batches.

Not relevant anymore, see new comment.

As an aside, we do probably want there to be a script which re-removes snaps in smallish batches (probably specified as an argument to the above command) which are confirmed to be trimmed before moving onto the next one. As such, the command should either take a snapshot to remove or a set (comma delimited?). That avoids needing to re-trim all of the previously removed snapshots on a cluster at once, and allows specifically re-removing as little as a single snap.

Added a lower and upper snap id bounds.
Note: We avoid removing snapshots that have not yet been taken, even if the passed upper snap id bound exceeds the latest taken snap id.


Example usage for rbd:

$ rbd snap ls <pool>/<image>
SNAPID  NAME   SIZE   PROTECTED  TIMESTAMP
     5  snp14  4 KiB             Mon Aug 21 11:47:07 2023
     6  snp15  4 KiB             Mon Aug 21 11:47:09 2023

$ ceph osd pool rmsnap_again <pool> 0 10 5,6

removing snap 1 again from pool 2
removing snap 2 again from pool 2
removing snap 3 again from pool 2
removing snap 4 again from pool 2
snap 5 was specified and won't be removed
snap 6 was specified and won't be removed

ceph osd pool rmsnap_again <pool> <lower_bound> <higher_bound> <existing snap ids>
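The selection behavior in the example above can be sketched as follows, assuming the pool's latest taken snap id (snap_seq) is 6. The function and parameter names are illustrative only, not Ceph's actual implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Re-remove every snap id in [lower, upper], skipping ids the user listed
// as existing, and clamping the upper bound to the latest taken snapshot
// (snap_seq) so that not-yet-taken snapshots are never removed.
std::vector<uint64_t> snaps_to_reremove(uint64_t lower, uint64_t upper,
                                        uint64_t snap_seq,
                                        const std::set<uint64_t>& existing) {
  std::vector<uint64_t> out;
  upper = std::min(upper, snap_seq);  // never touch not-yet-taken snapshots
  for (uint64_t id = std::max<uint64_t>(lower, 1); id <= upper; ++id) {
    if (existing.count(id))
      continue;  // "snap N was specified and won't be removed"
    out.push_back(id);
  }
  return out;
}
```

With lower=0, upper=10, snap_seq=6 and existing ids {5, 6}, this selects snaps 1 through 4, matching the command output shown above.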

Matan-B force-pushed the wip-matanb-redelete-snap branch from cd8ad69 to f7d9e23 on August 23, 2023 17:39
Matan-B (Author) commented Aug 23, 2023

@athanatos, I updated the monitor command to simply iterate up to the latest snap_seq (maintained for both snapshot types) and to redelete only the purged snapshots, which are obtained from the monitor db.
The current command requires only the pool name and lower/upper snap id bounds to redelete.

Note: In earlier discussions I assumed that the purged snaps kv entries are not complete. However, that seems to be wrong as they are not being trimmed in any way.

@@ -3093,6 +3093,22 @@ will start to track new ops received afterwards.";
pg_recovery_stats.reset();
}

athanatos (Contributor) commented Aug 24, 2023
I don't understand how this command is supposed to work. Add brief, precise comments to the header declarations of convert_malformed_key and convert_malformed in the format:

/**
 * method_name
 *
 * Explanation of what method_name *does*, not how it works.  A reader should be able
 * to use this method correctly by reading this comment without needing to read the
 * implementation.
 */

If you need to elaborate on something in the implementation, do it in the definition body.

From what I can tell from the implementation, it looks like the goal here is to remove the malformed key and write back (for an ec pool) all 128 possible legacy keys. What's the plan for ultimately removing the extras? The OSD has enough information to map the hobject_t to a PG. From there, if the OSD only has a single shard of the PG (which it almost always will), you can infer the shard. You could pass a lambda in to do that mapping. You could simply skip any objects in PGs that happen to have multiple shards present (lambda returns NOSHARD or something).
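The callback shape suggested here could look roughly like the sketch below. The types (`ObjId`, `PgId`, `shard_t`) are simplified stand-ins for Ceph's `hobject_t`/`spg_t`, and `make_resolver` is a hypothetical helper, not the project's actual API.

```cpp
#include <functional>
#include <map>
#include <set>

using ObjId = int;      // stand-in for hobject_t
using PgId = int;       // stand-in for spg_t
using shard_t = int;
constexpr shard_t NOSHARD = -1;

// Resolver maps an object to the single shard this OSD holds for the
// object's PG, or NOSHARD when the shard cannot be inferred (unknown PG,
// or multiple shards of the PG present locally - which the review
// suggests simply skipping).
using ShardResolver = std::function<shard_t(const ObjId&)>;

ShardResolver make_resolver(std::map<ObjId, PgId> obj_to_pg,
                            std::map<PgId, std::set<shard_t>> local_shards) {
  return [obj_to_pg, local_shards](const ObjId& obj) -> shard_t {
    auto pg = obj_to_pg.find(obj);
    if (pg == obj_to_pg.end())
      return NOSHARD;
    auto sh = local_shards.find(pg->second);
    // Infer the shard only when exactly one shard of the PG is local.
    if (sh == local_shards.end() || sh->second.size() != 1)
      return NOSHARD;
    return *sh->second.begin();
  };
}
```

Such a lambda could be passed into the conversion routine so only one fixed key is written per object, instead of all 128 candidates.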


Matan-B force-pushed the wip-matanb-redelete-snap branch 6 times, most recently from e6b7cd9 to 009fbfe on August 29, 2023 11:57
Matan-B (Author) commented Aug 29, 2023

From what I can tell from the implementation, it looks like the goal here is to remove the malformed key and write back (for an ec pool) all 128 possible legacy keys. What's the plan for ultimately removing the extras?

6172c30

The OSD has enough information to map the hobject_t to a PG. From there, if the OSD only has a single shard of the PG (which it almost always will), you can infer the shard. You could pass a lambda in to do that mapping. You could simply skip any objects in PGs that happen to have multiple shards present (lambda returns NOSHARD or something).

I'm not sure I understand how; convert_malformed is quite limited since it's static. In a normal scenario the SnapMapper instance is constructed with the shard_id from the pg. Do you mean getting all the pgs (_get_pgs) to be used in the conversion?

I'm worried about holding the osd lock during the scan as well. While you have bounded the number of keys to remove, you're going to keep re-reading the SnapMapper up to that point. Is it the case that the malformed keys always sort at the start of the region? If so, that's an invariant you should explain in the implementation definition (not the header).

Malformed keys aren't sorted since they share the same prefix with the correct ones.
I added a commit to release the lock every transaction_size keys scanned 2979e2b


Other comments were addressed.

Matan-B requested a review from athanatos on August 29, 2023 14:32
/**
  * convert_malformed
  *
  * Scans the SnapMapper for malformed keys created by a bug
  * during upgrade from N (and earlier) to O (up to 16.2.11).
  * Each detected key is replaced in place with a valid key.
  * See: https://tracker.ceph.com/issues/62596
  *
  * TODO: use the OSD to get information to map the hobject_t to a PG
  *       instead of adding all possible EC shard prefixes.
  */

Converts each
<MAPPING_PREFIX><poolid>_<snapid>_ ("malformed key")
to
<LEGACY_MAPPING_PREFIX><snapid>_<shard_id>_<hobject_t::to_str()>

by running:
`ceph daemon osd.<id> fix_malformed_snapmapper_keys`
on each (affected) osd in the acting set.

Fixes: Part 1/3 - https://tracker.ceph.com/issues/62596

Signed-off-by: Matan Breizman <mbreizma@redhat.com>
Matan-B force-pushed the wip-matanb-redelete-snap branch from 009fbfe to adf25db on August 30, 2023 11:55
Matan-B changed the title from "[WIP] osd: Remove leaked clone objects (SnapMapper malformed key)" to "osd: Remove leaked clone objects (SnapMapper malformed key)" Aug 30, 2023
Matan-B force-pushed the wip-matanb-redelete-snap branch from adf25db to 2979e2b on August 30, 2023 13:26
/**
  * remove_possible_keys
  *
  * SnapMapper::convert_malformed() inserts all the possible keys.
  * Remove the extra keys inserted once `force_reremove_snap` has been used.
  *
  * TODO: bound to transaction size
  */

Fixes: Part 3/3 - https://tracker.ceph.com/issues/62596

Signed-off-by: Matan Breizman <mbreizma@redhat.com>
Signed-off-by: Matan Breizman <mbreizma@redhat.com>
Signed-off-by: Matan Breizman <mbreizma@redhat.com>
Matan-B force-pushed the wip-matanb-redelete-snap branch from 2979e2b to ba6c967 on August 31, 2023 08:39
Matan-B (Author) commented Aug 31, 2023

Step 2/3 was separated into #53235 since it can stand as a standalone step that is not necessarily tied to the issue addressed in this PR.

Matan-B (Author) commented Aug 31, 2023

@athanatos, the malformed keys are not unique and are therefore useless: the last malformed key overwrites the previous ones. Marking as DNM until verified.

Matan-B added the DNM label Aug 31, 2023
Matan-B (Author) commented Sep 6, 2023

Closing; see the tracker for the resolution of this issue.
https://tracker.ceph.com/issues/62596

Matan-B closed this Sep 6, 2023