mon: Add force-remove-snap mon command #53545
Conversation
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved.
Force-pushed ae85111 to 70c85ab
Force-pushed 70c85ab to 99d7d63
This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
Force-pushed 99d7d63 to b76825b
Force-pushed b76825b to cd6fd40
Force-pushed cd6fd40 to c979234
Force-pushed c979234 to 96cc714
Would be great to have teuthology coverage for the new command.

I added teuthology coverage in the second commit: 57da5c5
Force-pushed 96cc714 to 57da5c5
athanatos left a comment:
One substantive question about the --purged-snaps-only flag implementation along with some clarification requests.
```
if (res == 0) {
  ss << "snapids: " << i << " was already marked as purged";
  // Reremove the already purged_snaps
  if (purged_snaps_only) {
```
I don't understand this condition. In this branch, the snap in question has already been purged. Seems like that means we should re-remove whether or not --purged-snaps-only was passed?
When using the purged_snaps_only variant, only the snaps which were already marked as purged are removed again. I have added an explanatory comment to emphasize this.
The purged-snaps-only variant is considered "experimental" (and not the default) because of the kludge mentioned with this variant later on. Adding the snap id both to new_purged_snaps and to new_removed_snaps is not conventional and may cause issues which are less obvious to identify.
The "default" variant, by contrast, imitates the normal behavior of removing snap ids and is less prone to unexpected results.
```
      force_removed_snapids.insert(i);
    }
  } else {
    if (!purged_snaps_only) {
```
By contrast, this condition looks right. The snap wasn't purged so we only want to remove it if --purged-snaps-only wasn't passed.
In other words -- the set of snaps re-removed with --purged-snaps-only should be a subset of the set purged without it, right?
This is the kind of clarification the comment I requested at the top should address.
> By contrast, this condition looks right. The snap wasn't purged so we only want to remove it if --purged-snaps-only wasn't passed.
Right. However, the default and non-default variants are mutually exclusive. (Mentioned in the comment.)
Force-pushed 57da5c5 to b370cd5
```
/*
* Forces removal of snapshots in the range of
* [lower_snapid_bound, upper_snapid_bound) on pool <pool>
* in order to cause OSDs to re-trim them.
* The command has two mutually exclusive variants:
* * Default: All the snapids in the given range which are not
* marked as purged in the Monitor will be removed. Mostly useful
* for cases in which the snapid is leaked on the client side.
* See: https://tracker.ceph.com/issues/64646
* * (Experimental) purged-snaps-only: Adding this flag will result
* in the reremoval of snapids in the given range.
* Only the snapids which are *already* marked as purged in the
* Monitor will be removed again. This may be useful for cases in
* which we would like to trigger OSD snap trimming again.
*/
```
Signed-off-by: Matan Breizman <mbreizma@redhat.com>
Force-pushed b370cd5 to 0ee333c
Signed-off-by: Matan Breizman <mbreizma@redhat.com>
Force-pushed 0ee333c to cafad77
jenkins test make check
ljflores left a comment:
Hey @Matan-B this failure in teuthology looks related to the snaps-few-objects-redelete change:
description: rados/thrash/{0-size-min-size-overrides/2-size-2-min-size 1-pg-log-overrides/normal_pg_log
2-recovery-overrides/{more-async-partial-recovery} 3-scrub-overrides/{max-simultaneous-scrubs-5}
backoff/normal ceph clusters/{fixed-4 openstack} crc-failures/bad_map_crc_failure
d-balancer/upmap-read mon_election/classic msgr-failures/fastclose msgr/async-v1only
objectstore/bluestore-comp-zstd rados supported-random-distro$/{ubuntu_latest} thrashers/pggrow
thrashosds-health workloads/snaps-few-objects-redelete}
/a/yuriw-2024-04-11_17:03:54-rados-wip-yuri6-testing-2024-04-02-1310-distro-default-smithi/7652478
2024-04-12T01:06:16.699 DEBUG:teuthology.orchestra.run.smithi042:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph osd pool force-remove-snap unique_pool_0 --purged-snaps-only
...
2024-04-12T01:09:09.030 INFO:tasks.rados.rados.0.smithi176.stdout:1798: finishing write tid 1 to smithi17628761-15
2024-04-12T01:09:09.030 INFO:tasks.rados.rados.0.smithi176.stdout:1798: finishing write tid 2 to smithi17628761-15
2024-04-12T01:09:09.030 INFO:tasks.rados.rados.0.smithi176.stdout:1798: finishing write tid 3 to smithi17628761-15
2024-04-12T01:09:09.030 INFO:tasks.rados.rados.0.smithi176.stdout:1798: finishing write tid 4 to smithi17628761-15
2024-04-12T01:09:09.030 INFO:tasks.rados.rados.0.smithi176.stdout:1798: oid 15 updating version 0 to 391
2024-04-12T01:09:09.030 INFO:tasks.rados.rados.0.smithi176.stdout:1798: oid 15 updating version 391 to 392
2024-04-12T01:09:09.031 INFO:tasks.rados.rados.0.smithi176.stdout:1798: oid 15 version 392 is already newer than 390
2024-04-12T01:09:09.031 INFO:tasks.rados.rados.0.smithi176.stdout:update_object_version oid 15 v 392 (ObjNum 642 snap 171 seq_num 642) dirty exists
2024-04-12T01:09:09.031 INFO:tasks.rados.rados.0.smithi176.stdout:1798: left oid 15 (ObjNum 642 snap 171 seq_num 642)
2024-04-12T01:09:09.034 ERROR:teuthology.orchestra.daemon.state:Failed to send signal 1: None
Traceback (most recent call last):
File "/home/teuthworker/src/git.ceph.com_teuthology_6c637841c215537a4502385240412f1966e0faab/teuthology/orchestra/daemon/state.py", line 108, in signal
self.proc.stdin.write(struct.pack('!b', sig))
File "/home/teuthworker/src/git.ceph.com_teuthology_6c637841c215537a4502385240412f1966e0faab/virtualenv/lib/python3.8/site-packages/paramiko/file.py", line 385, in write
raise IOError("File is closed")
OSError: File is closed
```
@@ -0,0 +1,23 @@
overrides:
```
Also with this test:
description: rados/thrash/{0-size-min-size-overrides/3-size-2-min-size 1-pg-log-overrides/short_pg_log
2-recovery-overrides/{more-async-partial-recovery} 3-scrub-overrides/{max-simultaneous-scrubs-5}
backoff/peering ceph clusters/{fixed-4 openstack} crc-failures/default d-balancer/crush-compat
mon_election/connectivity msgr-failures/few msgr/async-v2only objectstore/bluestore-comp-lz4
rados supported-random-distro$/{ubuntu_latest} thrashers/careful thrashosds-health
workloads/pool-snaps-few-objects-redelete}
/a/yuriw-2024-04-11_17:03:54-rados-wip-yuri6-testing-2024-04-02-1310-distro-default-smithi/7652491
2024-04-12T01:38:22.133 INFO:tasks.ceph.osd.7.smithi169.stderr:./src/osd/PG.cc: 1901: FAILED ceph_assert(!bad || !cct->_conf->osd_debug_verify_cached_snaps)
2024-04-12T01:38:22.133 INFO:tasks.ceph.osd.7.smithi169.stderr:
2024-04-12T01:38:22.134 INFO:tasks.ceph.osd.7.smithi169.stderr: ceph version 19.0.0-2838-ga5074d45 (a5074d4516d566e9d8b6aec912f26afd099de101) squid (dev)
2024-04-12T01:38:22.134 INFO:tasks.ceph.osd.7.smithi169.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x118) [0x55691a976362]
2024-04-12T01:38:22.134 INFO:tasks.ceph.osd.7.smithi169.stderr: 2: ceph-osd(+0x3f6519) [0x55691a976519]
2024-04-12T01:38:22.134 INFO:tasks.ceph.osd.7.smithi169.stderr: 3: ceph-osd(+0x38f81c) [0x55691a90f81c]
2024-04-12T01:38:22.134 INFO:tasks.ceph.osd.7.smithi169.stderr: 4: (PeeringState::Active::react(PeeringState::AdvMap const&)+0x19e) [0x55691ad92e5e]
2024-04-12T01:38:22.134 INFO:tasks.ceph.osd.7.smithi169.stderr: 5: ceph-osd(+0x840811) [0x55691adc0811]
2024-04-12T01:38:22.134 INFO:tasks.ceph.osd.7.smithi169.stderr: 6: (PeeringState::advance_map(std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> >&, int, std::vector<int, std::allocator<int> >&, int, PeeringCtx&)+0x266) [0x55691ad596e6]
2024-04-12T01:38:22.134 INFO:tasks.ceph.osd.7.smithi169.stderr: 7: (PG::handle_advance_map(std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> >&, int, std::vector<int, std::allocator<int> >&, int, PeeringCtx&)+0xfb) [0x55691ab7a76b]
2024-04-12T01:38:22.134 INFO:tasks.ceph.osd.7.smithi169.stderr: 8: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PeeringCtx&)+0x39c) [0x55691aaf318c]
2024-04-12T01:38:22.134 INFO:tasks.ceph.osd.7.smithi169.stderr: 9: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x237) [0x55691ab04957]
2024-04-12T01:38:22.134 INFO:tasks.ceph.osd.7.smithi169.stderr: 10: (ceph::osd::scheduler::PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x51) [0x55691ad374f1]
2024-04-12T01:38:22.134 INFO:tasks.ceph.osd.7.smithi169.stderr: 11: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xab3) [0x55691ab0e6e3]
2024-04-12T01:38:22.134 INFO:tasks.ceph.osd.7.smithi169.stderr: 12: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x293) [0x55691b0024e3]
2024-04-12T01:38:22.135 INFO:tasks.ceph.osd.7.smithi169.stderr: 13: ceph-osd(+0xa82a44) [0x55691b002a44]
2024-04-12T01:38:22.135 INFO:tasks.ceph.osd.7.smithi169.stderr: 14: /lib/x86_64-linux-gnu/libc.so.6(+0x94b43) [0x7f9933f96b43]
2024-04-12T01:38:22.135 INFO:tasks.ceph.osd.7.smithi169.stderr: 15: /lib/x86_64-linux-gnu/libc.so.6(+0x126a00) [0x7f9934028a00]
The fix for https://tracker.ceph.com/issues/64347 is already included in the test branch, so this seems like a new crash related to this particular test.
```
@@ -0,0 +1,19 @@
overrides:
```
This failure also looks related to this test:
/a/yuriw-2024-04-11_17:03:54-rados-wip-yuri6-testing-2024-04-02-1310-distro-default-smithi/7652505
2024-04-12T01:51:20.020 INFO:tasks.thrashosds.thrasher:Traceback (most recent call last):
File "/home/teuthworker/src/github.com_ceph_ceph-c_a5074d4516d566e9d8b6aec912f26afd099de101/qa/tasks/ceph_manager.py", line 190, in wrapper
return func(self)
File "/home/teuthworker/src/github.com_ceph_ceph-c_a5074d4516d566e9d8b6aec912f26afd099de101/qa/tasks/ceph_manager.py", line 1483, in _do_thrash
self.choose_action()()
File "/home/teuthworker/src/github.com_ceph_ceph-c_a5074d4516d566e9d8b6aec912f26afd099de101/qa/tasks/ceph_manager.py", line 1321, in <lambda>
self.inject_pause(key,
File "/home/teuthworker/src/github.com_ceph_ceph-c_a5074d4516d566e9d8b6aec912f26afd099de101/qa/tasks/ceph_manager.py", line 1065, in inject_pause
self.ceph_manager.set_config(the_one, **{conf_key: duration})
File "/home/teuthworker/src/github.com_ceph_ceph-c_a5074d4516d566e9d8b6aec912f26afd099de101/qa/tasks/ceph_manager.py", line 2093, in set_config
self.wait_run_admin_socket(
File "/home/teuthworker/src/github.com_ceph_ceph-c_a5074d4516d566e9d8b6aec912f26afd099de101/qa/tasks/ceph_manager.py", line 2050, in wait_run_admin_socket
raise Exception('timed out waiting for admin_socket '
Exception: timed out waiting for admin_socket to appear after osd.4 restart
Signed-off-by: Matan Breizman <mbreizma@redhat.com>
Variant 1 (Default):
Remove all snapids in the range which are not marked as purged.
Useful for leaks such as: https://tracker.ceph.com/issues/64646
Example usage:
Variant 2 (purged-snaps-only):
Reremove snap ids in the range which are already marked as purged.
Notes:
Initial testing looks to be stable:
https://pulpito.ceph.com/matan-2023-09-19_13:58:13-rados:thrash-wip-matanb-reremove-snap-only-distro-default-smithi/7398191/
Original PR : #53235