Skip to content

rbd-mirror: bad state and crashes in snapshot-based mirroring#38517

Merged
trociny merged 2 commits intoceph:masterfrom
dillaman:wip-48525
Dec 11, 2020
Merged

rbd-mirror: bad state and crashes in snapshot-based mirroring#38517
trociny merged 2 commits intoceph:masterfrom
dillaman:wip-48525

Conversation

@dillaman
Copy link

Checklist

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug

Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox

Copy link
Contributor

@trociny trociny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise LGTM

info->mirror_peer_uuids.count(m_mirror_peer_uuid) == 0) {
if ((info->mirror_peer_uuids.size() > 1 ||
info->mirror_peer_uuids.count(m_mirror_peer_uuid) == 0) &&
(!info->mirror_peer_uuids.empty() || !m_newer_mirror_snapshots)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just thinking, could it be made more readable? E.g. it seems to be equivalent of the below:

       info->mirror_peer_uuids.erase(m_mirror_peer_uuid);
       if (!info->mirror_peer_uuids.empty() || !m_newer_mirror_snapshots) {
         // skip
       }

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the matching simplification would be:

auto removed_count = info->mirror_peer_uuids.erase(m_mirror_peer_uuid);
if (!info->mirror_peer_uuids.empty() ||
   (!m_newer_mirror_snapshots && removed_count == 0))) {

but your version should also be correct since even if we don't remove our peer uuid, we probably shouldn't remove the most recent mirror snapshot.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... but now that I think about it, we don't want to tweak the data structure if we aren't going to remove it, so info would need to be a copy as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking it would be safe to tweak (remove a peer from) snapshot_namespace, because it could be only used for snap_remove operation and it would not affect this?

Anyway I am fine with the current version.

If the mirror peer set is (incorrectly) empty, it's not currently
possible for the unlink peer state machine to properly delete the
snapshot. This can result in a recursive loop between the create
primary snapshot state machine and the unlink peer state machine
until the stack depth grows too large.

Fixes: https://tracker.ceph.com/issues/48525
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
The snapshot-based mirroring replayer should only attempt to unlink
from any snapshots that are older than the end remote snapshot id to
prevent the remote side from incorrectly deleted the snapshot.

Fixes: https://tracker.ceph.com/issues/48527
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants