Bug #73857

rbd mirror snapshot hang/failure on rocky10

Added by Samuel Just 4 months ago. Updated about 1 month ago.

Status:
Fix Under Review
Priority:
High
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Backport:
tentacle
Regression:
No
Severity:
3 - minor

Description

From @Ilya Dryomov:

Here is a representative job:

http://qa-proxy.ceph.com/teuthology/dis-2025-11-11_19:58:31-rbd-wip-rocky10-branch-of-the-day-2025-11-10-1762829866-distro-default-smithi/8596558/teuthology.log

The test waits for the mirror snapshot scheduler, which runs inside the rbd_support module, to take a mirror snapshot. It shouldn't take more than a minute, but the snapshot is never taken:

rbd mirror snapshot schedule add -p rbd2/ns1 --image test1 1m
...
for i in `seq 12`; do
    test "$(rbd mirror image status rbd2/ns1/test1 |
        grep -c mirror.primary)" -gt '1' && break
    sleep 10
done

test "$(rbd mirror image status rbd2/ns1/test1 |
    grep -c mirror.primary)" -gt '1'
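The loop above is a plain poll-until-deadline: up to 12 probes, 10 seconds apart, so roughly a two-minute budget for a schedule that should fire within one minute. The same pattern as a small generic Python helper (a hypothetical sketch, not part of the test suite; `wait_for` and its parameters are made up for illustration):

```python
import time

def wait_for(predicate, timeout=120.0, interval=10.0):
    """Poll predicate() until it returns True or the timeout expires.

    Returns True if the predicate succeeded, False on timeout --
    mirroring the shell loop's `break` plus the final `test`.
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        # never sleep past the deadline
        time.sleep(min(interval, max(0.0, deadline - time.monotonic())))
```

Here the predicate would shell out to `rbd mirror image status` and count the `mirror.primary` lines, returning True once more than one shows up.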

This is test_mirror_snapshot_schedule() in qa/workunits/rbd/cli_generic.sh, but most of it is irrelevant and the reproducer can be reduced to:

$ bin/ceph osd pool create rbd2 8
pool 'rbd2' created
$ bin/rbd pool init rbd2
$ bin/rbd mirror pool enable rbd2 image
$ bin/rbd mirror pool peer add rbd2 cluster1
8fefc0e0-f0b7-464d-8aba-97beb9d1cce0
$ bin/rbd create -s 1 rbd2/test1
$ bin/rbd mirror image enable rbd2/test1 snapshot
Mirroring enabled
$ bin/rbd mirror snapshot schedule add -p rbd2 --image test1 1m

Here is the relevant output right after adding the schedule:

$ bin/rbd mirror image status rbd2/test1 | grep mirror.primary
    3 .mirror.primary.638bb7dd-9353-42b0-b432-116ddafb5dc2.4cf8b088-fdee-4769-a1c9-7b25836f0d51 (peer_uuids:[8fefc0e0-f0b7-464d-8aba-97beb9d1cce0])
$ bin/rbd snap ls --all rbd2/test1
SNAPID  NAME                                                                                       SIZE   PROTECTED  TIMESTAMP                 NAMESPACE                                                         
     3  .mirror.primary.638bb7dd-9353-42b0-b432-116ddafb5dc2.4cf8b088-fdee-4769-a1c9-7b25836f0d51  1 MiB             Fri Nov 14 16:09:29 2025  mirror (primary peer_uuids:[8fefc0e0-f0b7-464d-8aba-97beb9d1cce0])

$ sleep 90

Here is the relevant output after the mirror schedule “kicks in” – expecting an additional snapshot taken by rbd_support module to show up:

$ bin/rbd mirror image status rbd2/test1 | grep mirror.primary
    3 .mirror.primary.638bb7dd-9353-42b0-b432-116ddafb5dc2.4cf8b088-fdee-4769-a1c9-7b25836f0d51 (peer_uuids:[8fefc0e0-f0b7-464d-8aba-97beb9d1cce0])
    4 .mirror.primary.638bb7dd-9353-42b0-b432-116ddafb5dc2.ccc30565-4f13-4591-b087-2e43ff2a7c63 (peer_uuids:[8fefc0e0-f0b7-464d-8aba-97beb9d1cce0])
$ bin/rbd snap ls --all rbd2/test1
SNAPID  NAME                                                                                       SIZE   PROTECTED  TIMESTAMP                 NAMESPACE                                                         
     3  .mirror.primary.638bb7dd-9353-42b0-b432-116ddafb5dc2.4cf8b088-fdee-4769-a1c9-7b25836f0d51  1 MiB             Fri Nov 14 16:09:29 2025  mirror (primary peer_uuids:[8fefc0e0-f0b7-464d-8aba-97beb9d1cce0])
     4  .mirror.primary.638bb7dd-9353-42b0-b432-116ddafb5dc2.ccc30565-4f13-4591-b087-2e43ff2a7c63  1 MiB             Fri Nov 14 16:11:01 2025  mirror (primary peer_uuids:[8fefc0e0-f0b7-464d-8aba-97beb9d1cce0])

The above output (where the additional snapshot did show up) was captured on a vstart cluster on c9. On r10, the output wouldn't change because the additional snapshot never gets created, as described below.

Looking at the manager log, there is an apparent hang in src/pybind/mgr/rbd_support/mirror_snapshot_schedule.py:

2025-11-11T20:23:59.999+0000 7fac14dfa6c0  0 [rbd_support DEBUG root] CreateSnapshotRequests.add: 9/ns1/2d7be3870ca9
...
2025-11-11T20:23:59.999+0000 7fac14dfa6c0  0 [rbd_support DEBUG root] CreateSnapshotRequests.open_image: 9/ns1/2d7be3870ca9
2025-11-11T20:24:00.009+0000 7fac1b6076c0  0 [rbd_support DEBUG root] CreateSnapshotRequests.handle_open_image 9/ns1/2d7be3870ca9: r=0
2025-11-11T20:24:00.009+0000 7fac1b6076c0  0 [rbd_support DEBUG root] CreateSnapshotRequests.get_mirror_info: 9/ns1/2d7be3870ca9

(9 is the pool ID for rbd2, 2d7be3870ca9 is the image ID for rbd2/ns1/test1)

The image is opened successfully, but something happens to the image.aio_mirror_image_get_info(cb) call there: the cb completion/callback is never invoked, and I have confirmed separately that the corresponding C API call (this is a binding for rbd_aio_mirror_image_get_info()) doesn't make it to the second frame and is possibly not called at all:

rbd_aio_mirror_image_get_info()
  librbd::api::Mirror<>::image_get_info()

template <typename I>
void Mirror<I>::image_get_info(I *ictx, mirror_image_info_t *mirror_image_info,
                               Context *on_finish) {
  CephContext *cct = ictx->cct;
  ldout(cct, 20) << "ictx=" << ictx << dendl;       <------ not reached

For the body of the binding search for aio_mirror_image_get_info() in src/pybind/rbd/rbd.pyx.
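For readers unfamiliar with the structure of CreateSnapshotRequests: it is an async chain in which each completion callback, fired from a librbd finisher thread, issues the next async call. A stand-alone simulation of that chain (FakeImageAPI and the method names on it are hypothetical stand-ins for the librbd bindings; a worker thread models the finisher thread):

```python
import threading

class FakeImageAPI:
    """Stand-in for the async librbd calls used by CreateSnapshotRequests.

    Each aio_* call returns immediately and later invokes its callback
    from a separate thread, the way librbd's finisher thread does.
    """
    def _complete(self, cb, r):
        threading.Thread(target=cb, args=(r,)).start()

    def aio_open_image(self, cb):
        self._complete(cb, 0)          # r=0: open succeeded

    def aio_mirror_image_get_info(self, cb):
        self._complete(cb, 0)          # this is the completion that never
                                       # fires in the rocky10 hang

class CreateSnapshotRequest:
    def __init__(self, api):
        self.api = api
        self.done = threading.Event()
        self.steps = []

    def open_image(self):
        self.steps.append("open_image")
        self.api.aio_open_image(self.handle_open_image)

    def handle_open_image(self, r):
        self.steps.append("handle_open_image r=%d" % r)
        # the next async call is issued from inside the previous callback
        self.api.aio_mirror_image_get_info(self.handle_get_mirror_info)

    def handle_get_mirror_info(self, r):
        # the step corresponding to the log line that never appears on r10
        self.steps.append("handle_get_mirror_info r=%d" % r)
        self.done.set()

req = CreateSnapshotRequest(FakeImageAPI())
req.open_image()
req.done.wait(5)
```

In this simulation the chain always completes; the observed bug is that the real chain stalls between issuing aio_mirror_image_get_info() and its completion, which is why the log stops at "CreateSnapshotRequests.get_mirror_info".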

Restarting the ceph-mgr daemon doesn't help – it runs into exactly the same hang after a minute or so, per the configured mirror snapshot schedule. The thread making the call does NOT get stuck – it's seen doing other work later.

I have also confirmed that the binding itself works on Rocky (including calling image.aio_mirror_image_get_info() from the callback of rbd.RBD.aio_open_image(), the way it's done in CreateSnapshotRequests). The thing I haven't tried is replicating the sub-interpreter setup. This definitely looks like a Python 3.12-specific issue to me, likely something around global state or the handover from the Python to the C/C++ world.


Related issues 2 (2 open, 0 closed)

Related to mgr - Bug #72713: SIGSEGV when ceph-csi-rbd driver schedules RBD volume deletion (Fix Under Review)

Blocks mgr - Bug #73930: ceph-mgr modules rely on deprecated python subinterpreters (New)

Actions #1

Updated by Samuel Just 4 months ago

  • Description updated (diff)
Actions #2

Updated by Samuel Just 4 months ago

Ok, I see the snapshots showing up on c9.

Actions #3

Updated by Samuel Just 4 months ago · Edited

Shoving rbd_support into the main interpreter actually seems to fix it --

https://github.com/ceph/ceph/pull/66244
mgr main interpreter modules = cephadm,rook,k8sevents,rbd_support

Likely a general problem with the librados/librbd python bindings, python 3.12, and sub-interpreters.

Actions #4

Updated by Samuel Just 4 months ago

For now, I'm going to update https://github.com/ceph/ceph/pull/66244 to load everything in the main interpreter by default. We can sort out the many sub-interpreter issues on a per-module basis later.

Actions #5

Updated by Ilya Dryomov 4 months ago

Sounds good to me. With cephadm -- very likely the most important and most widely used module -- needing to run in the main interpreter, it doesn't make much sense to attempt that for other modules.

This goes back years, but what was the actual problem that sub-interpreters were supposed to solve?

Actions #6

Updated by Samuel Just 4 months ago

  • Backport set to tentacle
  • Pull request ID set to 66244
Actions #7

Updated by Casey Bodley 4 months ago

  • Blocks Bug #73930: ceph-mgr modules rely on deprecated python subinterpreters added
Actions #8

Updated by Kefu Chai 4 months ago

  • Related to Bug #72713: SIGSEGV when ceph-csi-rbd driver schedules RBD volume deletion added
Actions #9

Updated by Samuel Just about 2 months ago

  • Status changed from New to Fix Under Review
Actions #10

Updated by Yaarit Hatuka about 1 month ago

  • Tags set to rocky10
Actions #11

Updated by Tim Serong about 1 month ago

Ilya Dryomov wrote in #note-5:

This goes back years ago, but what was the actual problem that sub-interpreters were supposed to solve?

It was impossible to have multiple mgr modules using cherrypy at the same time. Here's the original email thread on this topic from back in April 2017:

https://www.spinics.net/lists/ceph-devel/msg36137.html

(I know you already saw this comment on github, but I figured I should add it here too for completeness)
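The cherrypy problem described in that thread was essentially a process-global-state problem: cherrypy configures its HTTP server through module-level singletons, so two mgr modules importing the same cherrypy module in one interpreter would clobber each other's settings, whereas a sub-interpreter gives each module its own copy of every imported module's globals. A minimal simulation of the collision (a plain dict standing in for cherrypy's global server config, with made-up module names; not actual cherrypy code):

```python
# Stand-in for cherrypy's module-level server configuration singleton.
server_config = {"port": None}

def dashboard_serve():
    # hypothetical mgr module A configures "its" server
    server_config["port"] = 8080

def restful_serve():
    # hypothetical mgr module B configures what it believes is its own
    # server, but in one shared interpreter it is the same global object
    server_config["port"] = 8443

dashboard_serve()
restful_serve()
# last writer wins: module A's port setting has been silently overwritten
```

Sub-interpreters sidestep this because each interpreter re-imports the module and gets independent globals; cherrypy has since grown support for running multiple server instances in one process, which is part of why that isolation is no longer essential.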
