rbd-mirror: bad state and crashes in snapshot-based mirroring by dillaman · Pull Request #38517 · ceph/ceph

dillaman · 2020-12-10T04:28:15Z

Checklist

References tracker ticket
Updates documentation if necessary
Includes tests for new functionality or reproducer for bug

trociny

Otherwise LGTM

trociny · 2020-12-10T12:26:46Z

src/librbd/mirror/snapshot/UnlinkPeerRequest.cc

-      info->mirror_peer_uuids.count(m_mirror_peer_uuid) == 0) {
+  if ((info->mirror_peer_uuids.size() > 1 ||
+       info->mirror_peer_uuids.count(m_mirror_peer_uuid) == 0) &&
+      (!info->mirror_peer_uuids.empty() || !m_newer_mirror_snapshots)) {


Just thinking, could it be made more readable? E.g. it seems to be equivalent to the below:

info->mirror_peer_uuids.erase(m_mirror_peer_uuid);
if (!info->mirror_peer_uuids.empty() || !m_newer_mirror_snapshots) {
  // skip
}

I think the matching simplification would be:

auto removed_count = info->mirror_peer_uuids.erase(m_mirror_peer_uuid);
if (!info->mirror_peer_uuids.empty() ||
    (!m_newer_mirror_snapshots && removed_count == 0)) {

but your version should also be correct since even if we don't remove our peer uuid, we probably shouldn't remove the most recent mirror snapshot.

... but now that I think about it, we don't want to tweak the data structure if we aren't going to remove it, so info would need to be a copy as well.

I was thinking it would be safe to tweak (remove a peer from) snapshot_namespace, because it could only be used for the snap_remove operation, and it would not affect this?

Anyway I am fine with the current version.
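The three formulations discussed in this thread can be compared side by side. The following is a minimal standalone sketch, not code from the PR: `PeerSet`, the function names, and the uuid strings are hypothetical, and plain values stand in for `info->mirror_peer_uuids`, `m_mirror_peer_uuid`, and `m_newer_mirror_snapshots`. The erase-based variants take the set by value, so the caller's data is never modified, which is one way to address the concern about tweaking the data structure.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical stand-in for info->mirror_peer_uuids.
using PeerSet = std::set<std::string>;

// Condition as written in the diff under review.
bool cond_from_diff(const PeerSet& peers, const std::string& self,
                    bool newer_mirror_snapshots) {
  return (peers.size() > 1 || peers.count(self) == 0) &&
         (!peers.empty() || !newer_mirror_snapshots);
}

// First suggested rewrite: erase our uuid (from a copy), then test.
bool cond_erase_only(PeerSet peers, const std::string& self,
                     bool newer_mirror_snapshots) {
  peers.erase(self);
  return !peers.empty() || !newer_mirror_snapshots;
}

// Second suggested rewrite: also track whether erase removed anything.
bool cond_erase_with_count(PeerSet peers, const std::string& self,
                           bool newer_mirror_snapshots) {
  auto removed_count = peers.erase(self);
  return !peers.empty() || (!newer_mirror_snapshots && removed_count == 0);
}

// Compare the diff's condition with the removed_count variant over a
// small set of representative peer sets and both flag values.
bool count_variant_matches_diff() {
  const std::string self = "self";
  const PeerSet cases[] = {PeerSet{}, PeerSet{"self"}, PeerSet{"other"},
                           PeerSet{"self", "other"},
                           PeerSet{"other", "third"}};
  for (const auto& peers : cases) {
    for (bool newer : {false, true}) {
      if (cond_from_diff(peers, self, newer) !=
          cond_erase_with_count(peers, self, newer)) {
        return false;
      }
    }
  }
  return true;
}
```

On these cases the `removed_count` variant agrees with the diff's condition everywhere. The erase-only variant differs in exactly one situation: when the set contains only our own uuid and there are no newer mirror snapshots, it evaluates to true while the diff's condition evaluates to false.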

If the mirror peer set is (incorrectly) empty, it's not currently possible for the unlink peer state machine to properly delete the snapshot. This can result in a recursive loop between the create primary snapshot state machine and the unlink peer state machine until the stack depth grows too large.

Fixes: https://tracker.ceph.com/issues/48525
Signed-off-by: Jason Dillaman <dillaman@redhat.com>

The snapshot-based mirroring replayer should only attempt to unlink from any snapshots that are older than the end remote snapshot id to prevent the remote side from incorrectly deleting the snapshot.

Fixes: https://tracker.ceph.com/issues/48527
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
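The rule in the second commit message can be illustrated with a small filter. This is only a sketch under stated assumptions, not the replayer's actual code: `SnapInfo`, `unlink_candidates`, and `end_snap_id` are hypothetical names, and "older" is modeled as a strictly smaller snapshot id.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical stand-in for a remote snapshot record; only the field
// needed for this illustration is modeled.
struct SnapInfo {
  bool is_mirror_snapshot;
};

// Return the ids of mirror snapshots strictly older than end_snap_id.
// Assumption: a smaller snapshot id means an older snapshot.
std::vector<std::uint64_t> unlink_candidates(
    const std::map<std::uint64_t, SnapInfo>& snaps,
    std::uint64_t end_snap_id) {
  std::vector<std::uint64_t> result;
  for (const auto& entry : snaps) {
    if (entry.first >= end_snap_id) {
      break;  // std::map iterates in ascending id order; the rest are newer
    }
    if (entry.second.is_mirror_snapshot) {
      result.push_back(entry.first);
    }
  }
  return result;
}
```

With mirror snapshots at ids 1, 3, and 5 and an end id of 3, only id 1 is returned; the end snapshot itself and anything newer are left untouched.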

dillaman added bug-fix rbd tests labels Dec 10, 2020

trociny approved these changes Dec 10, 2020

View reviewed changes

dillaman force-pushed the wip-48525 branch from 660daf8 to d698a02 Compare December 10, 2020 13:32

dillaman force-pushed the wip-48525 branch from d698a02 to 78f8abc Compare December 10, 2020 13:34

trociny merged commit 6283c77 into ceph:master Dec 11, 2020

trociny mentioned this pull request Dec 11, 2020

librbd: fix sporadic failures in TestMigration.StressLive #38494

Merged

3 tasks

trociny merged 2 commits into ceph:master from dillaman:wip-48525
