Actions
Bug #72692
closeddeadlock in MonClient if a version request (e.g. waiting for the latest osdmap) races with a session reset
% Done:
0%
Source:
Backport:
Regression:
Yes
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
MonClient
Pull request ID:
Tags (freeform):
Merge Commit:
Fixed In:
v20.3.0-2528-g4413740e84
Released In:
Upkeep Timestamp:
2025-08-22T19:02:21+00:00
Description
Surfaced in RBD as random nonsensical hangs, mostly in rbd-mirror daemon -- owning to the fact that it calls Rados::wait_for_latest_osdmap() every 30 seconds by default. Two most common failure modes in rbd-mirror.sh workunit are that rbd-mirror daemon doesn't react to SIGTERM signal and that the information about the service isn't displayed in "rbd mirror image status" output:
2025-08-20T22:36:23.679 INFO:tasks.workunit.cluster1.client.mirror.smithi079.stderr:++ cat /var/run/ceph/rbd-mirror.cluster1-client.mirror.2.pid
2025-08-20T22:36:23.680 INFO:tasks.workunit.cluster1.client.mirror.smithi079.stderr:+ pid=45476
2025-08-20T22:36:23.680 INFO:tasks.workunit.cluster1.client.mirror.smithi079.stderr:+ '[' -n 45476 ']'
2025-08-20T22:36:23.680 INFO:tasks.workunit.cluster1.client.mirror.smithi079.stderr:+ kill 45476
2025-08-20T22:36:23.681 INFO:tasks.workunit.cluster1.client.mirror.smithi079.stderr:+ for s in 1 2 4 8 16 32
2025-08-20T22:36:23.681 INFO:tasks.workunit.cluster1.client.mirror.smithi079.stderr:+ sleep 1
2025-08-20T22:36:24.681 INFO:tasks.workunit.cluster1.client.mirror.smithi079.stderr:+ ps auxww
2025-08-20T22:36:24.682 INFO:tasks.workunit.cluster1.client.mirror.smithi079.stderr:+ awk -v pid=45476 '$2 == pid {print; exit 1}'
2025-08-20T22:36:24.690 INFO:tasks.workunit.cluster1.client.mirror.smithi079.stdout:ubuntu 45476 0.1 0.1 734376 63496 ? Ssl 22:20 0:01 rbd-mirror --cluster cluster1 --id mirror.2 --rbd-mirror-delete-retry-interval=5 --rbd-mirror-image-state-check-interval=5 --rbd-mirror-journal-poll-age=1 --rbd-mirror-pool-replayers-refresh-interval=5 --debug-rbd=30 --debug-journaler=30 --debug-rbd_mirror=30 --daemonize=true
2025-08-20T22:36:24.690 INFO:tasks.workunit.cluster1.client.mirror.smithi079.stderr:+ for s in 1 2 4 8 16 32
2025-08-20T22:36:24.690 INFO:tasks.workunit.cluster1.client.mirror.smithi079.stderr:+ sleep 2
2025-08-20T22:36:26.692 INFO:tasks.workunit.cluster1.client.mirror.smithi079.stderr:+ ps auxww
2025-08-20T22:36:26.692 INFO:tasks.workunit.cluster1.client.mirror.smithi079.stderr:+ awk -v pid=45476 '$2 == pid {print; exit 1}'
2025-08-20T22:36:26.700 INFO:tasks.workunit.cluster1.client.mirror.smithi079.stdout:ubuntu 45476 0.1 0.1 734376 63628 ? Ssl 22:20 0:01 rbd-mirror --cluster cluster1 --id mirror.2 --rbd-mirror-delete-retry-interval=5 --rbd-mirror-image-state-check-interval=5 --rbd-mirror-journal-poll-age=1 --rbd-mirror-pool-replayers-refresh-interval=5 --debug-rbd=30 --debug-journaler=30 --debug-rbd_mirror=30 --daemonize=true
2025-08-20T22:36:26.701 INFO:tasks.workunit.cluster1.client.mirror.smithi079.stderr:+ for s in 1 2 4 8 16 32
2025-08-20T22:36:26.701 INFO:tasks.workunit.cluster1.client.mirror.smithi079.stderr:+ sleep 4
2025-08-20T22:36:30.703 INFO:tasks.workunit.cluster1.client.mirror.smithi079.stderr:+ ps auxww
2025-08-20T22:36:30.703 INFO:tasks.workunit.cluster1.client.mirror.smithi079.stderr:+ awk -v pid=45476 '$2 == pid {print; exit 1}'
2025-08-20T22:36:30.711 INFO:tasks.workunit.cluster1.client.mirror.smithi079.stdout:ubuntu 45476 0.1 0.1 734376 63628 ? Ssl 22:20 0:01 rbd-mirror --cluster cluster1 --id mirror.2 --rbd-mirror-delete-retry-interval=5 --rbd-mirror-image-state-check-interval=5 --rbd-mirror-journal-poll-age=1 --rbd-mirror-pool-replayers-refresh-interval=5 --debug-rbd=30 --debug-journaler=30 --debug-rbd_mirror=30 --daemonize=true
2025-08-20T22:36:30.712 INFO:tasks.workunit.cluster1.client.mirror.smithi079.stderr:+ for s in 1 2 4 8 16 32
2025-08-20T22:36:30.712 INFO:tasks.workunit.cluster1.client.mirror.smithi079.stderr:+ sleep 8
2025-08-20T22:36:38.713 INFO:tasks.workunit.cluster1.client.mirror.smithi079.stderr:+ ps auxww
2025-08-20T22:36:38.714 INFO:tasks.workunit.cluster1.client.mirror.smithi079.stderr:+ awk -v pid=45476 '$2 == pid {print; exit 1}'
2025-08-20T22:36:38.722 INFO:tasks.workunit.cluster1.client.mirror.smithi079.stdout:ubuntu 45476 0.1 0.1 734376 63628 ? Ssl 22:20 0:01 rbd-mirror --cluster cluster1 --id mirror.2 --rbd-mirror-delete-retry-interval=5 --rbd-mirror-image-state-check-interval=5 --rbd-mirror-journal-poll-age=1 --rbd-mirror-pool-replayers-refresh-interval=5 --debug-rbd=30 --debug-journaler=30 --debug-rbd_mirror=30 --daemonize=true
2025-08-20T22:36:38.722 INFO:tasks.workunit.cluster1.client.mirror.smithi079.stderr:+ for s in 1 2 4 8 16 32
2025-08-20T22:36:38.722 INFO:tasks.workunit.cluster1.client.mirror.smithi079.stderr:+ sleep 16
2025-08-20T22:36:54.724 INFO:tasks.workunit.cluster1.client.mirror.smithi079.stderr:+ ps auxww
2025-08-20T22:36:54.724 INFO:tasks.workunit.cluster1.client.mirror.smithi079.stderr:+ awk -v pid=45476 '$2 == pid {print; exit 1}'
2025-08-20T22:36:54.732 INFO:tasks.workunit.cluster1.client.mirror.smithi079.stdout:ubuntu 45476 0.1 0.1 734376 63676 ? Ssl 22:20 0:01 rbd-mirror --cluster cluster1 --id mirror.2 --rbd-mirror-delete-retry-interval=5 --rbd-mirror-image-state-check-interval=5 --rbd-mirror-journal-poll-age=1 --rbd-mirror-pool-replayers-refresh-interval=5 --debug-rbd=30 --debug-journaler=30 --debug-rbd_mirror=30 --daemonize=true
2025-08-20T22:36:54.733 INFO:tasks.workunit.cluster1.client.mirror.smithi079.stderr:+ for s in 1 2 4 8 16 32
2025-08-20T22:36:54.733 INFO:tasks.workunit.cluster1.client.mirror.smithi079.stderr:+ sleep 32
2025-08-20T22:37:26.735 INFO:tasks.workunit.cluster1.client.mirror.smithi079.stderr:+ ps auxww
2025-08-20T22:37:26.736 INFO:tasks.workunit.cluster1.client.mirror.smithi079.stderr:+ awk -v pid=45476 '$2 == pid {print; exit 1}'
2025-08-20T22:37:26.743 INFO:tasks.workunit.cluster1.client.mirror.smithi079.stdout:ubuntu 45476 0.1 0.1 734376 63696 ? Ssl 22:20 0:01 rbd-mirror --cluster cluster1 --id mirror.2 --rbd-mirror-delete-retry-interval=5 --rbd-mirror-image-state-check-interval=5 --rbd-mirror-journal-poll-age=1 --rbd-mirror-pool-replayers-refresh-interval=5 --debug-rbd=30 --debug-journaler=30 --debug-rbd_mirror=30 --daemonize=true
2025-08-20T22:37:26.743 INFO:tasks.workunit.cluster1.client.mirror.smithi079.stderr:+ ps auxww
2025-08-20T22:37:26.744 INFO:tasks.workunit.cluster1.client.mirror.smithi079.stderr:+ awk -v pid=45476 '$2 == pid {print; exit 1}'
2025-08-20T22:37:26.751 INFO:tasks.workunit.cluster1.client.mirror.smithi079.stdout:ubuntu 45476 0.1 0.1 734376 63696 ? Ssl 22:20 0:01 rbd-mirror --cluster cluster1 --id mirror.2 --rbd-mirror-delete-retry-interval=5 --rbd-mirror-image-state-check-interval=5 --rbd-mirror-journal-poll-age=1 --rbd-mirror-pool-replayers-refresh-interval=5 --debug-rbd=30 --debug-journaler=30 --debug-rbd_mirror=30 --daemonize=true
2025-08-20T22:37:26.753 INFO:tasks.workunit.cluster1.client.mirror.smithi079.stderr:+ cleanup 1
http://qa-proxy.ceph.com/teuthology/yuriw-2025-08-20_21:51:25-rbd-wip-yuri-main-baseline2-distro-default-smithi/8455442/teuthology.log
2025-08-20T22:22:41.364 INFO:tasks.workunit.cluster1.client.mirror.smithi184.stderr:+ wait_for_status_in_pool_dir cluster2 mirror test up+unknown 2025-08-20T22:22:41.364 INFO:tasks.workunit.cluster1.client.mirror.smithi184.stderr:+ local cluster=cluster2 2025-08-20T22:22:41.364 INFO:tasks.workunit.cluster1.client.mirror.smithi184.stderr:+ local pool=mirror 2025-08-20T22:22:41.364 INFO:tasks.workunit.cluster1.client.mirror.smithi184.stderr:+ local image=test 2025-08-20T22:22:41.364 INFO:tasks.workunit.cluster1.client.mirror.smithi184.stderr:+ local state_pattern=up+unknown 2025-08-20T22:22:41.364 INFO:tasks.workunit.cluster1.client.mirror.smithi184.stderr:+ local description_pattern= 2025-08-20T22:22:41.364 INFO:tasks.workunit.cluster1.client.mirror.smithi184.stderr:+ local service_pattern= 2025-08-20T22:22:41.364 INFO:tasks.workunit.cluster1.client.mirror.smithi184.stderr:+ for s in 1 2 4 8 8 8 8 8 8 8 8 16 16 2025-08-20T22:22:41.365 INFO:tasks.workunit.cluster1.client.mirror.smithi184.stderr:+ sleep 1 2025-08-20T22:22:42.365 INFO:tasks.workunit.cluster1.client.mirror.smithi184.stderr:+ test_status_in_pool_dir cluster2 mirror test up+unknown '' '' 2025-08-20T22:22:42.365 INFO:tasks.workunit.cluster1.client.mirror.smithi184.stderr:+ local cluster=cluster2 2025-08-20T22:22:42.365 INFO:tasks.workunit.cluster1.client.mirror.smithi184.stderr:+ local pool=mirror 2025-08-20T22:22:42.366 INFO:tasks.workunit.cluster1.client.mirror.smithi184.stderr:+ local image=test 2025-08-20T22:22:42.366 INFO:tasks.workunit.cluster1.client.mirror.smithi184.stderr:+ local state_pattern=up+unknown 2025-08-20T22:22:42.366 INFO:tasks.workunit.cluster1.client.mirror.smithi184.stderr:+ local description_pattern= 2025-08-20T22:22:42.366 INFO:tasks.workunit.cluster1.client.mirror.smithi184.stderr:+ local service_pattern= 2025-08-20T22:22:42.366 INFO:tasks.workunit.cluster1.client.mirror.smithi184.stderr:+ local status 2025-08-20T22:22:42.366 INFO:tasks.workunit.cluster1.client.mirror.smithi184.stderr:++ CEPH_ARGS= 2025-08-20T22:22:42.366 INFO:tasks.workunit.cluster1.client.mirror.smithi184.stderr:++ rbd --cluster cluster2 mirror image status mirror/test 2025-08-20T22:22:42.423 INFO:tasks.workunit.cluster1.client.mirror.smithi184.stderr:+ status='test: 2025-08-20T22:22:42.423 INFO:tasks.workunit.cluster1.client.mirror.smithi184.stderr: global_id: 903f6ef0-6770-4603-a918-dc3cdc3a1f96 2025-08-20T22:22:42.423 INFO:tasks.workunit.cluster1.client.mirror.smithi184.stderr: state: up+unknown 2025-08-20T22:22:42.423 INFO:tasks.workunit.cluster1.client.mirror.smithi184.stderr: description: remote image is not primary 2025-08-20T22:22:42.423 INFO:tasks.workunit.cluster1.client.mirror.smithi184.stderr: last_update: 2025-08-20 22:22:41 2025-08-20T22:22:42.423 INFO:tasks.workunit.cluster1.client.mirror.smithi184.stderr: peer_sites: 2025-08-20T22:22:42.423 INFO:tasks.workunit.cluster1.client.mirror.smithi184.stderr: name: cluster1-DNE 2025-08-20T22:22:42.423 INFO:tasks.workunit.cluster1.client.mirror.smithi184.stderr: state: up+unknown 2025-08-20T22:22:42.424 INFO:tasks.workunit.cluster1.client.mirror.smithi184.stderr: description: remote image is not primary 2025-08-20T22:22:42.424 INFO:tasks.workunit.cluster1.client.mirror.smithi184.stderr: last_update: 2025-08-20 22:22:41' 2025-08-20T22:22:42.424 INFO:tasks.workunit.cluster1.client.mirror.smithi184.stderr:+ grep '^ state: .*up+unknown' 2025-08-20T22:22:42.424 INFO:tasks.workunit.cluster1.client.mirror.smithi184.stdout: state: up+unknown 2025-08-20T22:22:42.424 INFO:tasks.workunit.cluster1.client.mirror.smithi184.stderr:+ grep '^ description: .*' 2025-08-20T22:22:42.425 INFO:tasks.workunit.cluster1.client.mirror.smithi184.stdout: description: remote image is not primary 2025-08-20T22:22:42.425 INFO:tasks.workunit.cluster1.client.mirror.smithi184.stderr:+ '[' -n '' ']' 2025-08-20T22:22:42.425 INFO:tasks.workunit.cluster1.client.mirror.smithi184.stderr:+ echo up+unknown 2025-08-20T22:22:42.425 INFO:tasks.workunit.cluster1.client.mirror.smithi184.stderr:+ grep '^up+' 2025-08-20T22:22:42.426 INFO:tasks.workunit.cluster1.client.mirror.smithi184.stdout:up+unknown 2025-08-20T22:22:42.426 INFO:tasks.workunit.cluster1.client.mirror.smithi184.stderr:+ grep 'service: *mirror..* on ' 2025-08-20T22:22:42.428 INFO:tasks.workunit.cluster1.client.mirror.smithi184.stderr:+ return 1
This turned out to be a regression introduced in https://tracker.ceph.com/issues/70882. The body of the initiating function inside of MonClient::get_version() attempts to acquire monc_lock while it's already held by MonClient::ms_handle_reset().
Updated by Ilya Dryomov 7 months ago
- Status changed from In Progress to Fix Under Review
- Pull request ID set to 65177
Updated by Ilya Dryomov 7 months ago
- Related to Bug #70882: multisite: datalog listing asserts on out-of-bounds --shard-id added
Updated by Upkeep Bot 7 months ago
- Status changed from Fix Under Review to Pending Backport
- Merge Commit set to 4413740e8494a72c24cf11a43d753e0cba4dee54
- Fixed In set to v20.3.0-2528-g4413740e84
- Upkeep Timestamp set to 2025-08-22T19:02:21+00:00
Updated by Ilya Dryomov 7 months ago
- Status changed from Pending Backport to Resolved
- Backport deleted (
tentacle)
Backport is to be appended to https://github.com/ceph/ceph/pull/64898.
Actions