Project

General

Profile

Actions

Bug #72692

closed

deadlock in MonClient if a version request (e.g. waiting for the latest osdmap) races with a session reset

Added by Ilya Dryomov 7 months ago. Updated 7 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Correctness/Safety
Target version:
-
% Done:

0%

Source:
Backport:
Regression:
Yes
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
MonClient
Pull request ID:
Tags (freeform):
Fixed In:
v20.3.0-2528-g4413740e84
Released In:
Upkeep Timestamp:
2025-08-22T19:02:21+00:00

Description

Surfaced in RBD as random nonsensical hangs, mostly in rbd-mirror daemon -- owning to the fact that it calls Rados::wait_for_latest_osdmap() every 30 seconds by default. Two most common failure modes in rbd-mirror.sh workunit are that rbd-mirror daemon doesn't react to SIGTERM signal and that the information about the service isn't displayed in "rbd mirror image status" output:

http://qa-proxy.ceph.com/teuthology/yuriw-2025-08-20_21:51:25-rbd-wip-yuri-main-baseline2-distro-default-smithi/8455438/teuthology.log

2025-08-20T22:36:23.679 INFO:tasks.workunit.cluster1.client.mirror.smithi079.stderr:++ cat /var/run/ceph/rbd-mirror.cluster1-client.mirror.2.pid
2025-08-20T22:36:23.680 INFO:tasks.workunit.cluster1.client.mirror.smithi079.stderr:+ pid=45476                                                                                                                                  
2025-08-20T22:36:23.680 INFO:tasks.workunit.cluster1.client.mirror.smithi079.stderr:+ '[' -n 45476 ']'                                                
2025-08-20T22:36:23.680 INFO:tasks.workunit.cluster1.client.mirror.smithi079.stderr:+ kill 45476                                                                                                                                                                            
2025-08-20T22:36:23.681 INFO:tasks.workunit.cluster1.client.mirror.smithi079.stderr:+ for s in 1 2 4 8 16 32                                          
2025-08-20T22:36:23.681 INFO:tasks.workunit.cluster1.client.mirror.smithi079.stderr:+ sleep 1                                                         
2025-08-20T22:36:24.681 INFO:tasks.workunit.cluster1.client.mirror.smithi079.stderr:+ ps auxww                                                        
2025-08-20T22:36:24.682 INFO:tasks.workunit.cluster1.client.mirror.smithi079.stderr:+ awk -v pid=45476 '$2 == pid {print; exit 1}'                          
2025-08-20T22:36:24.690 INFO:tasks.workunit.cluster1.client.mirror.smithi079.stdout:ubuntu     45476  0.1  0.1 734376 63496 ?        Ssl  22:20   0:01 rbd-mirror --cluster cluster1 --id mirror.2 --rbd-mirror-delete-retry-interval=5 --rbd-mirror-image-state-check-interval=5 --rbd-mirror-journal-poll-age=1 --rbd-mirror-pool-replayers-refresh-interval=5 --debug-rbd=30 --debug-journaler=30 --debug-rbd_mirror=30 --daemonize=true                                                                                                                                                                 
2025-08-20T22:36:24.690 INFO:tasks.workunit.cluster1.client.mirror.smithi079.stderr:+ for s in 1 2 4 8 16 32        
2025-08-20T22:36:24.690 INFO:tasks.workunit.cluster1.client.mirror.smithi079.stderr:+ sleep 2
2025-08-20T22:36:26.692 INFO:tasks.workunit.cluster1.client.mirror.smithi079.stderr:+ ps auxww                                                                  
2025-08-20T22:36:26.692 INFO:tasks.workunit.cluster1.client.mirror.smithi079.stderr:+ awk -v pid=45476 '$2 == pid {print; exit 1}'
2025-08-20T22:36:26.700 INFO:tasks.workunit.cluster1.client.mirror.smithi079.stdout:ubuntu     45476  0.1  0.1 734376 63628 ?        Ssl  22:20   0:01 rbd-mirror --cluster cluster1 --id mirror.2 --rbd-mirror-delete-retry-interval=5 --rbd-mirror-image-state-check-interval=5 --rbd-mirror-journal-poll-age=1 --rbd-mirror-pool-replayers-refresh-interval=5 --debug-rbd=30 --debug-journaler=30 --debug-rbd_mirror=30 --daemonize=true                                                                            
2025-08-20T22:36:26.701 INFO:tasks.workunit.cluster1.client.mirror.smithi079.stderr:+ for s in 1 2 4 8 16 32                                                                                                     
2025-08-20T22:36:26.701 INFO:tasks.workunit.cluster1.client.mirror.smithi079.stderr:+ sleep 4
2025-08-20T22:36:30.703 INFO:tasks.workunit.cluster1.client.mirror.smithi079.stderr:+ ps auxww                                                                                                                                                                                                                                                    
2025-08-20T22:36:30.703 INFO:tasks.workunit.cluster1.client.mirror.smithi079.stderr:+ awk -v pid=45476 '$2 == pid {print; exit 1}'                                                                               
2025-08-20T22:36:30.711 INFO:tasks.workunit.cluster1.client.mirror.smithi079.stdout:ubuntu     45476  0.1  0.1 734376 63628 ?        Ssl  22:20   0:01 rbd-mirror --cluster cluster1 --id mirror.2 --rbd-mirror-delete-retry-interval=5 --rbd-mirror-image-state-check-interval=5 --rbd-mirror-journal-poll-age=1 --rbd-mirror-pool-replayers-refresh-interval=5 --debug-rbd=30 --debug-journaler=30 --debug-rbd_mirror=30 --daemonize=true                                                                                                                                                 
2025-08-20T22:36:30.712 INFO:tasks.workunit.cluster1.client.mirror.smithi079.stderr:+ for s in 1 2 4 8 16 32                                                                                     
2025-08-20T22:36:30.712 INFO:tasks.workunit.cluster1.client.mirror.smithi079.stderr:+ sleep 8                                                                
2025-08-20T22:36:38.713 INFO:tasks.workunit.cluster1.client.mirror.smithi079.stderr:+ ps auxww                                                                                                   
2025-08-20T22:36:38.714 INFO:tasks.workunit.cluster1.client.mirror.smithi079.stderr:+ awk -v pid=45476 '$2 == pid {print; exit 1}'
2025-08-20T22:36:38.722 INFO:tasks.workunit.cluster1.client.mirror.smithi079.stdout:ubuntu     45476  0.1  0.1 734376 63628 ?        Ssl  22:20   0:01 rbd-mirror --cluster cluster1 --id mirror.2 --rbd-mirror-delete-retry-interval=5 --rbd-mirror-image-state-check-interval=5 --rbd-mirror-journal-poll-age=1 --rbd-mirror-pool-replayers-refresh-interval=5 --debug-rbd=30 --debug-journaler=30 --debug-rbd_mirror=30 --daemonize=true                                                                                                                                                                                                                                                                        
2025-08-20T22:36:38.722 INFO:tasks.workunit.cluster1.client.mirror.smithi079.stderr:+ for s in 1 2 4 8 16 32                                                                                                                                                                                                                            
2025-08-20T22:36:38.722 INFO:tasks.workunit.cluster1.client.mirror.smithi079.stderr:+ sleep 16                                              
2025-08-20T22:36:54.724 INFO:tasks.workunit.cluster1.client.mirror.smithi079.stderr:+ ps auxww   
2025-08-20T22:36:54.724 INFO:tasks.workunit.cluster1.client.mirror.smithi079.stderr:+ awk -v pid=45476 '$2 == pid {print; exit 1}'
2025-08-20T22:36:54.732 INFO:tasks.workunit.cluster1.client.mirror.smithi079.stdout:ubuntu     45476  0.1  0.1 734376 63676 ?        Ssl  22:20   0:01 rbd-mirror --cluster cluster1 --id mirror.2 --rbd-mirror-delete-retry-interval=5 --rbd-mirror-image-state-check-interval=5 --rbd-mirror-journal-poll-age=1 --rbd-mirror-pool-replayers-refresh-interval=5 --debug-rbd=30 --debug-journaler=30 --debug-rbd_mirror=30 --daemonize=true                                                 
2025-08-20T22:36:54.733 INFO:tasks.workunit.cluster1.client.mirror.smithi079.stderr:+ for s in 1 2 4 8 16 32                             
2025-08-20T22:36:54.733 INFO:tasks.workunit.cluster1.client.mirror.smithi079.stderr:+ sleep 32                                                                  
2025-08-20T22:37:26.735 INFO:tasks.workunit.cluster1.client.mirror.smithi079.stderr:+ ps auxww                                           
2025-08-20T22:37:26.736 INFO:tasks.workunit.cluster1.client.mirror.smithi079.stderr:+ awk -v pid=45476 '$2 == pid {print; exit 1}'                                                                                                                      
2025-08-20T22:37:26.743 INFO:tasks.workunit.cluster1.client.mirror.smithi079.stdout:ubuntu     45476  0.1  0.1 734376 63696 ?        Ssl  22:20   0:01 rbd-mirror --cluster cluster1 --id mirror.2 --rbd-mirror-delete-retry-interval=5 --rbd-mirror-image-state-check-interval=5 --rbd-mirror-journal-poll-age=1 --rbd-mirror-pool-replayers-refresh-interval=5 --debug-rbd=30 --debug-journaler=30 --debug-rbd_mirror=30 --daemonize=true                                                                                                                                                                                        
2025-08-20T22:37:26.743 INFO:tasks.workunit.cluster1.client.mirror.smithi079.stderr:+ ps auxww                                                                                             
2025-08-20T22:37:26.744 INFO:tasks.workunit.cluster1.client.mirror.smithi079.stderr:+ awk -v pid=45476 '$2 == pid {print; exit 1}'                   
2025-08-20T22:37:26.751 INFO:tasks.workunit.cluster1.client.mirror.smithi079.stdout:ubuntu     45476  0.1  0.1 734376 63696 ?        Ssl  22:20   0:01 rbd-mirror --cluster cluster1 --id mirror.2 --rbd-mirror-delete-retry-interval=5 --rbd-mirror-image-state-check-interval=5 --rbd-mirror-journal-poll-age=1 --rbd-mirror-pool-replayers-refresh-interval=5 --debug-rbd=30 --debug-journaler=30 --debug-rbd_mirror=30 --daemonize=true                                                                                               
2025-08-20T22:37:26.753 INFO:tasks.workunit.cluster1.client.mirror.smithi079.stderr:+ cleanup 1  

http://qa-proxy.ceph.com/teuthology/yuriw-2025-08-20_21:51:25-rbd-wip-yuri-main-baseline2-distro-default-smithi/8455442/teuthology.log
2025-08-20T22:22:41.364 INFO:tasks.workunit.cluster1.client.mirror.smithi184.stderr:+ wait_for_status_in_pool_dir cluster2 mirror test up+unknown
2025-08-20T22:22:41.364 INFO:tasks.workunit.cluster1.client.mirror.smithi184.stderr:+ local cluster=cluster2
2025-08-20T22:22:41.364 INFO:tasks.workunit.cluster1.client.mirror.smithi184.stderr:+ local pool=mirror
2025-08-20T22:22:41.364 INFO:tasks.workunit.cluster1.client.mirror.smithi184.stderr:+ local image=test
2025-08-20T22:22:41.364 INFO:tasks.workunit.cluster1.client.mirror.smithi184.stderr:+ local state_pattern=up+unknown
2025-08-20T22:22:41.364 INFO:tasks.workunit.cluster1.client.mirror.smithi184.stderr:+ local description_pattern=
2025-08-20T22:22:41.364 INFO:tasks.workunit.cluster1.client.mirror.smithi184.stderr:+ local service_pattern=
2025-08-20T22:22:41.364 INFO:tasks.workunit.cluster1.client.mirror.smithi184.stderr:+ for s in 1 2 4 8 8 8 8 8 8 8 8 16 16
2025-08-20T22:22:41.365 INFO:tasks.workunit.cluster1.client.mirror.smithi184.stderr:+ sleep 1
2025-08-20T22:22:42.365 INFO:tasks.workunit.cluster1.client.mirror.smithi184.stderr:+ test_status_in_pool_dir cluster2 mirror test up+unknown '' ''
2025-08-20T22:22:42.365 INFO:tasks.workunit.cluster1.client.mirror.smithi184.stderr:+ local cluster=cluster2
2025-08-20T22:22:42.365 INFO:tasks.workunit.cluster1.client.mirror.smithi184.stderr:+ local pool=mirror
2025-08-20T22:22:42.366 INFO:tasks.workunit.cluster1.client.mirror.smithi184.stderr:+ local image=test
2025-08-20T22:22:42.366 INFO:tasks.workunit.cluster1.client.mirror.smithi184.stderr:+ local state_pattern=up+unknown
2025-08-20T22:22:42.366 INFO:tasks.workunit.cluster1.client.mirror.smithi184.stderr:+ local description_pattern=
2025-08-20T22:22:42.366 INFO:tasks.workunit.cluster1.client.mirror.smithi184.stderr:+ local service_pattern=
2025-08-20T22:22:42.366 INFO:tasks.workunit.cluster1.client.mirror.smithi184.stderr:+ local status
2025-08-20T22:22:42.366 INFO:tasks.workunit.cluster1.client.mirror.smithi184.stderr:++ CEPH_ARGS=
2025-08-20T22:22:42.366 INFO:tasks.workunit.cluster1.client.mirror.smithi184.stderr:++ rbd --cluster cluster2 mirror image status mirror/test
2025-08-20T22:22:42.423 INFO:tasks.workunit.cluster1.client.mirror.smithi184.stderr:+ status='test:
2025-08-20T22:22:42.423 INFO:tasks.workunit.cluster1.client.mirror.smithi184.stderr:  global_id:   903f6ef0-6770-4603-a918-dc3cdc3a1f96
2025-08-20T22:22:42.423 INFO:tasks.workunit.cluster1.client.mirror.smithi184.stderr:  state:       up+unknown
2025-08-20T22:22:42.423 INFO:tasks.workunit.cluster1.client.mirror.smithi184.stderr:  description: remote image is not primary
2025-08-20T22:22:42.423 INFO:tasks.workunit.cluster1.client.mirror.smithi184.stderr:  last_update: 2025-08-20 22:22:41
2025-08-20T22:22:42.423 INFO:tasks.workunit.cluster1.client.mirror.smithi184.stderr:  peer_sites:
2025-08-20T22:22:42.423 INFO:tasks.workunit.cluster1.client.mirror.smithi184.stderr:    name: cluster1-DNE
2025-08-20T22:22:42.423 INFO:tasks.workunit.cluster1.client.mirror.smithi184.stderr:    state: up+unknown
2025-08-20T22:22:42.424 INFO:tasks.workunit.cluster1.client.mirror.smithi184.stderr:    description: remote image is not primary
2025-08-20T22:22:42.424 INFO:tasks.workunit.cluster1.client.mirror.smithi184.stderr:    last_update: 2025-08-20 22:22:41'
2025-08-20T22:22:42.424 INFO:tasks.workunit.cluster1.client.mirror.smithi184.stderr:+ grep '^  state: .*up+unknown'
2025-08-20T22:22:42.424 INFO:tasks.workunit.cluster1.client.mirror.smithi184.stdout:  state:       up+unknown
2025-08-20T22:22:42.424 INFO:tasks.workunit.cluster1.client.mirror.smithi184.stderr:+ grep '^  description: .*'
2025-08-20T22:22:42.425 INFO:tasks.workunit.cluster1.client.mirror.smithi184.stdout:  description: remote image is not primary
2025-08-20T22:22:42.425 INFO:tasks.workunit.cluster1.client.mirror.smithi184.stderr:+ '[' -n '' ']'
2025-08-20T22:22:42.425 INFO:tasks.workunit.cluster1.client.mirror.smithi184.stderr:+ echo up+unknown
2025-08-20T22:22:42.425 INFO:tasks.workunit.cluster1.client.mirror.smithi184.stderr:+ grep '^up+'
2025-08-20T22:22:42.426 INFO:tasks.workunit.cluster1.client.mirror.smithi184.stdout:up+unknown
2025-08-20T22:22:42.426 INFO:tasks.workunit.cluster1.client.mirror.smithi184.stderr:+ grep 'service: *mirror..* on '
2025-08-20T22:22:42.428 INFO:tasks.workunit.cluster1.client.mirror.smithi184.stderr:+ return 1

This turned out to be a regression introduced in https://tracker.ceph.com/issues/70882. The body of the initiating function inside of MonClient::get_version() attempts to acquire monc_lock while it's already held by MonClient::ms_handle_reset().


Related issues 1 (0 open1 closed)

Related to rgw - Bug #70882: multisite: datalog listing asserts on out-of-bounds --shard-idResolvedAdam Emerson

Actions
Actions #1

Updated by Ilya Dryomov 7 months ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 65177
Actions #2

Updated by Ilya Dryomov 7 months ago

  • Related to Bug #70882: multisite: datalog listing asserts on out-of-bounds --shard-id added
Actions #3

Updated by Upkeep Bot 7 months ago

  • Status changed from Fix Under Review to Pending Backport
  • Merge Commit set to 4413740e8494a72c24cf11a43d753e0cba4dee54
  • Fixed In set to v20.3.0-2528-g4413740e84
  • Upkeep Timestamp set to 2025-08-22T19:02:21+00:00
Actions #4

Updated by Ilya Dryomov 7 months ago

  • Status changed from Pending Backport to Resolved
  • Backport deleted (tentacle)

Backport is to be appended to https://github.com/ceph/ceph/pull/64898.

Actions

Also available in: Atom PDF