Bug #45072

closed

[rbd-mirror] image replayer stop might race with remove and instance replayer shut down

Added by Jason Dillaman almost 6 years ago. Updated 8 months ago.

Status: Resolved
Priority: Normal
Assignee: Jason Dillaman
Target version: -
% Done: 0%
Source: Q/A
Backport: mimic, nautilus, octopus
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Tags (freeform):
Fixed In: v16.0.0-880-g5e540f1191
Released In: v16.2.0~2813
Upkeep Timestamp: 2025-07-15T01:14:56+00:00

Description

http://qa-proxy.ceph.com/teuthology/jdillaman-2020-04-09_09:42:22-rbd-wip-jd-testing-distro-basic-smithi/4938679/teuthology.log
http://qa-proxy.ceph.com/teuthology/jdillaman-2020-04-09_09:42:22-rbd-wip-jd-testing-distro-basic-smithi/4938684/teuthology.log

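The InstanceReplayer received a notification to remove the peer image, and the resulting ImageReplayer stop request canceled the start and bootstrap that were still in progress: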

  -199> 2020-04-11T01:06:43.872+0000 7f5c9f274700 10 rbd::mirror::InstanceReplayer: 0x5601b046de00 remove_peer_image: global_image_id=cda5e6c3-dc44-445a-b2db-6bcb8717a165, peer_mirror_uuid=5c6a4b9c-fb17-4d0e-97ff-8b6996184ee9
  -198> 2020-04-11T01:06:43.872+0000 7f5c9f274700 10 rbd::mirror::ImageReplayer: 0x5601b4a7db80 [3/cda5e6c3-dc44-445a-b2db-6bcb8717a165] stop: on_finish=0x5601b46a1060, manual=0, desc=
  -197> 2020-04-11T01:06:43.872+0000 7f5c9f274700 10 rbd::mirror::ImageReplayer: 0x5601b4a7db80 [3/cda5e6c3-dc44-445a-b2db-6bcb8717a165] stop: canceling start
  -196> 2020-04-11T01:06:43.872+0000 7f5c9f274700 10 rbd::mirror::ImageReplayer: 0x5601b4a7db80 [3/cda5e6c3-dc44-445a-b2db-6bcb8717a165] stop: canceling bootstrap
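The first stop arrives while the replayer is still starting, so it cancels the in-flight start and bootstrap instead of tearing down a running replayer. The following is a minimal C++ sketch of that pattern, assuming a single-threaded callback model; SketchReplayer, finish_bootstrap() and the State values are hypothetical names for illustration, not the actual rbd-mirror ImageReplayer code.

```cpp
#include <cassert>
#include <cerrno>
#include <functional>
#include <utility>

using Callback = std::function<void(int)>;

// Hypothetical, simplified replayer (not the rbd-mirror ImageReplayer API):
// a stop() that arrives mid-bootstrap only flags the start as canceled, and
// the stop completes when the in-flight bootstrap step reports back.
class SketchReplayer {
public:
  enum class State { Stopped, Starting, Replaying };

  void start(Callback on_start) {
    assert(m_state == State::Stopped);
    m_state = State::Starting;
    m_on_start = std::move(on_start);
  }

  void stop(Callback on_stop) {
    if (m_state == State::Stopped) {
      on_stop(0);  // nothing to do
      return;
    }
    assert(!m_on_stop && "concurrent stop requests are not modeled here");
    m_on_stop = std::move(on_stop);
    if (m_state == State::Starting) {
      m_cancel_requested = true;  // cf. "stop: canceling start"
      return;                     // completion deferred to finish_bootstrap()
    }
    finish_stop();  // Replaying: tear down immediately in this model
  }

  // Invoked when the asynchronous bootstrap step completes.
  void finish_bootstrap(int r) {
    assert(m_state == State::Starting);
    Callback on_start = std::move(m_on_start);
    m_on_start = nullptr;
    if (m_cancel_requested) {
      m_cancel_requested = false;
      on_start(-ECANCELED);  // the start reports that it was canceled
      finish_stop();         // then the pending stop completes
      return;
    }
    m_state = (r < 0) ? State::Stopped : State::Replaying;
    on_start(r);
  }

  State state() const { return m_state; }

private:
  void finish_stop() {
    m_state = State::Stopped;
    Callback on_stop = std::move(m_on_stop);
    m_on_stop = nullptr;
    on_stop(0);
  }

  State m_state = State::Stopped;
  bool m_cancel_requested = false;
  Callback m_on_start;
  Callback m_on_stop;
};
```

In this model the stop callback fires only after the bootstrap step reports back, and the start callback observes -ECANCELED.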

Shortly afterwards, a SIGTERM from the thrasher test triggered a second stop request (manual=1), which failed:

   -17> 2020-04-11T01:06:43.921+0000 7f5caff34680 10 rbd::mirror::ImageReplayer: 0x5601b4a7db80 [3/cda5e6c3-dc44-445a-b2db-6bcb8717a165] stop: on_finish=0x5601b04095a0, manual=1, desc=
   -16> 2020-04-11T01:06:43.921+0000 7f5c9f274700 10 rbd::mirror::NamespaceReplayer: 0x5601b1d981a0 handle_stop_instance_replayer: r=-22
   -15> 2020-04-11T01:06:43.921+0000 7f5c9f274700 -1 rbd::mirror::NamespaceReplayer: 0x5601b1d981a0 handle_stop_instance_replayer: error stopping instance replayer: (22) Invalid argument
   -14> 2020-04-11T01:06:43.921+0000 7f5c9f274700 10 rbd::mirror::NamespaceReplayer: 0x5601b1d981a0 shut_down_instance_watcher:
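Here a second stop (manual=1) reaches a replayer whose first stop is still in progress. One general way to make overlapping stop requests safe is to queue every caller's callback on the stop that is already running and complete them all together, rather than rejecting the later caller. The C++ sketch below illustrates that technique under a single-threaded assumption; StopCoordinator and handle_stop_complete() are hypothetical names, and this is not necessarily how PR 34615 resolved the race.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

using Callback = std::function<void(int)>;

// Hypothetical stop coordinator: a stop that arrives while another stop is
// in flight joins it instead of failing, and every waiter is completed with
// the same result once the teardown finishes.
class StopCoordinator {
public:
  // Returns true if this call initiated the stop, false if it joined one.
  bool stop(Callback on_finish) {
    m_waiters.push_back(std::move(on_finish));
    if (m_stopping) {
      return false;  // piggy-back on the stop already in progress
    }
    m_stopping = true;
    return true;
  }

  // Called once the asynchronous teardown finishes.
  void handle_stop_complete(int r) {
    assert(m_stopping);
    std::vector<Callback> waiters;
    waiters.swap(m_waiters);  // detach first so callbacks may start a new stop
    m_stopping = false;
    for (auto& cb : waiters) {
      cb(r);
    }
  }

  bool stopping() const { return m_stopping; }
  std::size_t waiter_count() const { return m_waiters.size(); }

private:
  bool m_stopping = false;
  std::vector<Callback> m_waiters;
};
```

With this approach the second caller simply waits for the first stop and receives the same result, so no error is returned for the caller to mishandle.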

However, the NamespaceReplayer ignored the error and continued shutting down the InstanceWatcher even though the watcher still had callbacks registered by the ImageReplayer that was being stopped, which tripped the assertion in ~InstanceWatcher():

    -1> 2020-04-11T01:06:43.927+0000 7f5c9f274700 -1 /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.0.0-418-g83b5036/rpm/el8/BUILD/ceph-16.0.0-418-g83b5036/src/tools/rbd_mirror/InstanceWatcher.cc: In function 'rbd::mirror::InstanceWatcher<ImageCtxT>::~InstanceWatcher() [with ImageCtxT = librbd::ImageCtx]' thread 7f5c9f274700 time 2020-04-11T01:06:43.927299+0000
/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.0.0-418-g83b5036/rpm/el8/BUILD/ceph-16.0.0-418-g83b5036/src/tools/rbd_mirror/InstanceWatcher.cc: 340: FAILED ceph_assert(m_requests.empty())


 ceph version 16.0.0-418-g83b5036 (83b50362f2e3cb2eb00db134ab87c51b5452223e) octopus (rc)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x7f5ca69c4030]
 2: (()+0x27b24a) [0x7f5ca69c424a]
 3: (rbd::mirror::InstanceWatcher<librbd::ImageCtx>::~InstanceWatcher()+0x145) [0x5601aef1fe45]
 4: (rbd::mirror::InstanceWatcher<librbd::ImageCtx>::~InstanceWatcher()+0xd) [0x5601aef1fe8d]
 5: (rbd::mirror::NamespaceReplayer<librbd::ImageCtx>::handle_shut_down_instance_watcher(int)+0x85) [0x5601aeef8d55]
 6: (ThreadPool::PointerWQ<Context>::_void_process(void*, ThreadPool::TPHandle&)+0x148) [0x5601aeeb8458]
 7: (ThreadPool::worker(ThreadPool::WorkThread*)+0xe64) [0x7f5ca6ab0ea4]
 8: (ThreadPool::WorkThread::entry()+0x15) [0x7f5ca6ab1705]
 9: (()+0x82de) [0x7f5ca5e002de]
 10: (clone()+0x43) [0x7f5ca436f133]
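The crash comes from the invariant checked in ~InstanceWatcher(): no requests may still be registered when the watcher is destroyed. A minimal C++ sketch of honoring that invariant, assuming a single-threaded model, is to have shut_down() defer its completion callback until the in-flight request set has drained, so the owner cannot delete the watcher early. SketchWatcher, register_request() and complete_request() are hypothetical names, not the real InstanceWatcher interface.

```cpp
#include <cassert>
#include <functional>
#include <set>
#include <utility>

using Callback = std::function<void(int)>;

// Hypothetical watcher modeled on the invariant in the crash above: the
// destructor asserts that no requests are still registered, so shut_down()
// only signals completion once the request set is empty.
class SketchWatcher {
public:
  ~SketchWatcher() {
    assert(m_requests.empty());  // mirrors ceph_assert(m_requests.empty())
  }

  void register_request(int id) {
    assert(!m_shutting_down);  // no new work once shutdown has begun
    m_requests.insert(id);
  }

  void complete_request(int id) {
    m_requests.erase(id);
    maybe_finish_shut_down();
  }

  void shut_down(Callback on_finish) {
    m_shutting_down = true;
    m_on_shut_down = std::move(on_finish);
    maybe_finish_shut_down();
  }

  bool has_pending_requests() const { return !m_requests.empty(); }

private:
  void maybe_finish_shut_down() {
    if (!m_shutting_down || !m_requests.empty() || !m_on_shut_down) {
      return;  // still draining, or shutdown not requested yet
    }
    Callback cb = std::move(m_on_shut_down);
    m_on_shut_down = nullptr;
    cb(0);  // only now may the owner destroy the watcher; no member access after
  }

  bool m_shutting_down = false;
  std::set<int> m_requests;
  Callback m_on_shut_down;
};
```

Deferring the callback keeps the ordering decision inside the watcher, so a caller that ignores an earlier error, as the NamespaceReplayer did here, still cannot destroy it while requests are outstanding.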


Related issues: 4 (0 open, 4 closed)

Copied to rbd - Backport #45275: nautilus: [rbd-mirror] image replayer stop might race with remove and instance replayer shut down (Resolved, Mykola Golub)
Copied to rbd - Backport #45276: mimic: [rbd-mirror] image replayer stop might race with remove and instance replayer shut down (Rejected, Jason Dillaman)
Copied to rbd - Backport #45277: octopus: [rbd-mirror] image replayer stop might race with remove and instance replayer shut down (Resolved, Nathan Cutler)
Copied to rbd - Bug #45716: [rbd-mirror] image replayer stop might race with remove and instance replayer shut down (Resolved, Jason Dillaman)

#1

Updated by Jason Dillaman almost 6 years ago

  • Description updated
#2

Updated by Jason Dillaman almost 6 years ago

  • Status changed from New to In Progress
  • Assignee set to Jason Dillaman
#3

Updated by Jason Dillaman almost 6 years ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 34615
#4

Updated by Mykola Golub almost 6 years ago

  • Status changed from Fix Under Review to Pending Backport
#5

Updated by Nathan Cutler almost 6 years ago

  • Copied to Backport #45275: nautilus: [rbd-mirror] image replayer stop might race with remove and instance replayer shut down added
#6

Updated by Nathan Cutler almost 6 years ago

  • Copied to Backport #45276: mimic: [rbd-mirror] image replayer stop might race with remove and instance replayer shut down added
#7

Updated by Nathan Cutler almost 6 years ago

  • Copied to Backport #45277: octopus: [rbd-mirror] image replayer stop might race with remove and instance replayer shut down added
#8

Updated by Jason Dillaman almost 6 years ago

  • Copied to Bug #45716: [rbd-mirror] image replayer stop might race with remove and instance replayer shut down added
#9

Updated by Jason Dillaman almost 6 years ago

#10

Updated by Loïc Dachary over 4 years ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".

#11

Updated by Upkeep Bot 8 months ago

  • Merge Commit set to 5e540f1191975fac38733e765312295775a52ad4
  • Fixed In set to v16.0.0-880-g5e540f1191
  • Released In set to v16.2.0~2813
  • Upkeep Timestamp set to 2025-07-15T01:14:56+00:00