rbd-mirror: bad state and crashes in snapshot-based mirroring#38517
rbd-mirror: bad state and crashes in snapshot-based mirroring#38517trociny merged 2 commits intoceph:masterfrom
Conversation
| info->mirror_peer_uuids.count(m_mirror_peer_uuid) == 0) { | ||
| if ((info->mirror_peer_uuids.size() > 1 || | ||
| info->mirror_peer_uuids.count(m_mirror_peer_uuid) == 0) && | ||
| (!info->mirror_peer_uuids.empty() || !m_newer_mirror_snapshots)) { |
There was a problem hiding this comment.
Just thinking, could it be made more readable? E.g. it seems to be equivalent of the below:
info->mirror_peer_uuids.erase(m_mirror_peer_uuid);
if (!info->mirror_peer_uuids.empty() || !m_newer_mirror_snapshots) {
// skip
}
There was a problem hiding this comment.
I think the matching simplification would be:
auto removed_count = info->mirror_peer_uuids.erase(m_mirror_peer_uuid);
if (!info->mirror_peer_uuids.empty() ||
(!m_newer_mirror_snapshots && removed_count == 0))) {
but your version should also be correct since even if we don't remove our peer uuid, we probably shouldn't remove the most recent mirror snapshot.
There was a problem hiding this comment.
... but now that I think about it, we don't want to tweak the data structure if we aren't going to remove it, so info would need to be a copy as well.
There was a problem hiding this comment.
I was thinking it would be safe to tweak (remove a peer from) snapshot_namespace, because it could be only used for snap_remove operation and it would not affect this?
Anyway I am fine with the current version.
If the mirror peer set is (incorrectly) empty, it's not currently possible for the unlink peer state machine to properly delete the snapshot. This can result in a recursive loop between the create primary snapshot state machine and the unlink peer state machine until the stack depth grows too large. Fixes: https://tracker.ceph.com/issues/48525 Signed-off-by: Jason Dillaman <dillaman@redhat.com>
The snapshot-based mirroring replayer should only attempt to unlink from any snapshots that are older than the end remote snapshot id to prevent the remote side from incorrectly deleted the snapshot. Fixes: https://tracker.ceph.com/issues/48527 Signed-off-by: Jason Dillaman <dillaman@redhat.com>
Checklist
Show available Jenkins commands
jenkins retest this pleasejenkins test classic perfjenkins test crimson perfjenkins test signedjenkins test make checkjenkins test make check arm64jenkins test submodulesjenkins test dashboardjenkins test apijenkins test docsjenkins render docsjenkins test ceph-volume alljenkins test ceph-volume tox