Bug #68187
cephfs_mirror: test_mirroring_init_failure_with_recovery failed to get state: "failed"
Description
This happens only in fs:valgrind tests:
2024-09-18T17:06:50.057 INFO:tasks.cephfs_test_runner:======================================================================
2024-09-18T17:06:50.057 INFO:tasks.cephfs_test_runner:ERROR: test_mirroring_init_failure_with_recovery (tasks.cephfs.test_mirroring.TestMirroring)
2024-09-18T17:06:50.057 INFO:tasks.cephfs_test_runner:Test if the mirror daemon can recover from a init failure
2024-09-18T17:06:50.057 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2024-09-18T17:06:50.057 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2024-09-18T17:06:50.057 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_ceph-c_66579315f64c180ea2f8cffbe69b7e2cb8c46860/qa/tasks/cephfs/test_mirroring.py", line 905, in test_mirroring_init_failure_with_recovery
2024-09-18T17:06:50.057 INFO:tasks.cephfs_test_runner:    while proceed():
2024-09-18T17:06:50.057 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_teuthology_3752d3834a7b6cd13dc17dfaa6c2fd3658f3439a/teuthology/contextutil.py", line 134, in __call__
2024-09-18T17:06:50.057 INFO:tasks.cephfs_test_runner:    raise MaxWhileTries(error_msg)
2024-09-18T17:06:50.057 INFO:tasks.cephfs_test_runner:teuthology.exceptions.MaxWhileTries: 'wait for failed state' reached maximum tries (20) after waiting for 100 seconds
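For context, the failure above comes out of a bounded polling loop. The following is a hypothetical sketch (not the actual teuthology `safe_while`/`proceed` code) of the pattern: poll mirror status until `state == 'failed'`, and raise after a fixed number of tries; `MaxWhileTries`, `tries`, and `sleep` mirror the values in the traceback.

```python
# Hypothetical sketch of the qa-test polling pattern behind the
# MaxWhileTries above; names and defaults are assumptions, not the
# real teuthology implementation.
import time


class MaxWhileTries(Exception):
    pass


def wait_for_failed_state(get_mirror_status, tries=20, sleep=5):
    """Poll mirror status until 'state' == 'failed', or give up."""
    for _ in range(tries):
        status = get_mirror_status()
        if status.get("state") == "failed":
            return status
        time.sleep(sleep)
    raise MaxWhileTries(
        f"'wait for failed state' reached maximum tries ({tries}) "
        f"after waiting for {tries * sleep} seconds")
```

With `tries=20, sleep=5` this gives exactly the "maximum tries (20) after waiting for 100 seconds" message seen in the log.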
Updated by Jos Collin over 1 year ago
- Status changed from In Progress to Fix Under Review
- Pull request ID set to 60052
Updated by Jos Collin over 1 year ago
- Status changed from Fix Under Review to In Progress
Updated by Jos Collin 12 months ago
- Related to Bug #70515: squid: qa: test_mirroring_init_failure_with_recovery: `teuthology.exceptions.MaxWhileTries: 'wait for failed state' reached maximum tries (20) after waiting for 100 seconds` added
Updated by Jos Collin 8 months ago
The mirror daemon seems to have identified the failure (missing cephfs_mirror objects):
2025-07-30T18:06:44.303+0000 13579640 20 cephfs::mirror::Mirror update_fs_mirrors
2025-07-30T18:06:44.303+0000 13579640 5 cephfs::mirror::Mirror update_fs_mirrors: filesystem={fscid=62, fs_name=cephfs} failed mirroring (failed: 1, blocklisted: 0)
2025-07-30T18:06:44.303+0000 13579640 10 cephfs::mirror::Mirror disable_mirroring: init failed for filesystem={fscid=62, fs_name=cephfs}
2025-07-30T18:06:44.304+0000 13579640 5 cephfs::mirror::Mirror update_fs_mirrors: filesystem={fscid=64, fs_name=cephfs} failed mirroring (failed: 1, blocklisted: 0)
2025-07-30T18:06:44.304+0000 13579640 10 cephfs::mirror::Mirror disable_mirroring: init failed for filesystem={fscid=64, fs_name=cephfs}
However, the asok command fails with:
2025-07-30T18:06:40.355 DEBUG:teuthology.orchestra.run.smithi191:mirror status for fs: cephfs> ceph --admin-daemon /var/run/ceph/cephfs-mirror.asok fs mirror status cephfs@64
2025-07-30T18:06:40.460 INFO:teuthology.orchestra.run.smithi191.stderr:no valid command found; 10 closest matches:
2025-07-30T18:06:40.461 INFO:teuthology.orchestra.run.smithi191.stderr:0
2025-07-30T18:06:40.461 INFO:teuthology.orchestra.run.smithi191.stderr:1
2025-07-30T18:06:40.461 INFO:teuthology.orchestra.run.smithi191.stderr:2
2025-07-30T18:06:40.461 INFO:teuthology.orchestra.run.smithi191.stderr:abort
2025-07-30T18:06:40.461 INFO:teuthology.orchestra.run.smithi191.stderr:assert
2025-07-30T18:06:40.461 INFO:teuthology.orchestra.run.smithi191.stderr:config diff
2025-07-30T18:06:40.461 INFO:teuthology.orchestra.run.smithi191.stderr:config diff get <var>
2025-07-30T18:06:40.461 INFO:teuthology.orchestra.run.smithi191.stderr:config get <var>
2025-07-30T18:06:40.461 INFO:teuthology.orchestra.run.smithi191.stderr:config help [<var>]
2025-07-30T18:06:40.461 INFO:teuthology.orchestra.run.smithi191.stderr:config set <var> <val>...
2025-07-30T18:06:40.461 INFO:teuthology.orchestra.run.smithi191.stderr:admin_socket: invalid command
2025-07-30T18:06:40.462 DEBUG:teuthology.orchestra.run:got remote process result: 22
2025-07-30T18:06:40.462 WARNING:tasks.cephfs.test_mirroring:mirror daemon command with label "mirror status for fs: cephfs" failed: Command failed (mirror status for fs: cephfs) on smithi191 with status 22: 'ceph --admin-daemon /var/run/ceph/cephfs-mirror.asok fs mirror status cephfs@64'
Updated by Jos Collin 8 months ago
From the logs, the mirror daemon appears to be restarting.
So we shouldn't expect this command to run successfully and print the failed status ('state' == 'failed') while the daemon is restarting.
mirror status for fs: cephfs> ceph --admin-daemon /var/run/ceph/cephfs-mirror.asok fs mirror status cephfs@64
Instead, we should catch "invalid command" and continue checking whether it shows the failed status once the restart finishes.
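A hypothetical sketch (not the actual qa-test code) of that tolerant check: treat the asok "invalid command" error as a transient condition while the daemon is re-registering its commands, and let the caller keep polling. The `run` hook exists only to make the sketch testable; the socket path matches the one in the logs.

```python
# Hedged sketch: tolerate "invalid command" from the admin socket while
# the mirror daemon is (re)starting, instead of failing the test run.
import json
import subprocess

ASOK = "/var/run/ceph/cephfs-mirror.asok"   # socket path from the logs


def mirror_status_or_none(fs_spec, run=None):
    """Run `fs mirror status` over the admin socket; return the parsed
    JSON status, or None while the daemon has not registered the command
    yet ("invalid command", exit status 22 as seen above)."""
    run = run or (lambda cmd: subprocess.run(
        cmd, capture_output=True, text=True))
    proc = run(["ceph", "--admin-daemon", ASOK,
                "fs", "mirror", "status", fs_spec])
    if proc.returncode != 0:
        if "invalid command" in proc.stderr:
            return None          # transient: caller keeps polling
        raise RuntimeError(proc.stderr)
    return json.loads(proc.stdout)
```

A polling loop would then skip `None` results and only inspect `status.get('state')` once real output appears.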
Updated by Jos Collin 8 months ago · Edited
It seems the mirror daemon didn't restart.
The client log is incomplete. At the end of ceph-client.mirror.48688.log:
2025-07-30T18:11:04.680+0000 13579640 5 cephfs::mirror::Mirror update_fs_mirrors: filesystem={fscid=62, fs_name=cephfs} failed mirroring (failed: 1, blocklisted: 0)
2025-07-30T18:11:04.681+0000 13579640 10 cephfs::mirror::Mirror disable_mirroring: init failed for filesystem={fscid=62, fs_name=cephfs}
2025-07-30T18:11:04.681+0000 13579640 5 cephfs::mirror::Mirror update_fs_mirrors: filesystem={fscid=64, fs_name=cephfs} failed mirroring (failed: 1, blocklisted: 0)
2025-07-30T18:11:04.681+0000 13579640 10 cephfs::mirror::Mirror disable_mirroring: init failed for filesystem={fscid=64, fs_name=cephfs}
2025-07-30T18:11:04.681+0000 13579640 20 cephfs::mirror::Mirror schedule_mirror_update_task: scheduling fs mirror update (0xe1c1430) after 2 seconds
2025-07-30T18:11:04.681+0000 12d78640 20 cephfs::mirror::FSMirror ~FSMirror
2025-07-30T18:11:04.681+0000 12d78640 10 cephfs::mirror::Mirror enable_mirroring: starting FSMirror: filesystem={fscid=62, fs_name=cephfs}
2025-07-30T18:11:04.681+0000 12d78640 10 cephfs::mirror::ServiceDaemon: 0x14e0e840 add_or_update_fs_attribute: fscid=62
2025-07-30T18:11:04.682+0000 12d78640 20 cephfs::mirror::FSMirror init
2025-07-30T18:11:04.682+0000 12d78640 20 cephfs::mirror::Utils connect: connecting to cluster=ceph, client=client.mirror, mon_host=
2025-07-30T18:11:05.139+0000 12d78640 10 cephfs::mirror::Utils connect: using mon addr=172.21.15.191
2025-07-30T18:11:06.681+0000 13579640 20 cephfs::mirror::Mirror update_fs_mirrors
After the mirror daemon failure, it shows:
2025-07-30T18:06:01.674+0000 12d78640 -1 asok(0x8ebbe10) AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to bind the UNIX domain socket to '/var/run/ceph/cephfs-mirror.asok': (17) File exists
2025-07-30T18:06:01.826+0000 12d78640 10 cephfs::mirror::Utils connect: connected to cluster=ceph using client=client.mirror
2025-07-30T18:06:01.896+0000 12d78640 -1 cephfs::mirror::FSMirror init: error accessing local pool (id=124): (2) No such file or directory
2025-07-30T18:06:01.898+0000 12d78640 20 cephfs::mirror::Mirror handle_enable_mirroring: filesystem={fscid=62, fs_name=cephfs}, peers=, r=-2
2025-07-30T18:06:01.898+0000 12d78640 -1 cephfs::mirror::Mirror handle_enable_mirroring: failed to initialize FSMirror for filesystem={fscid=62, fs_name=cephfs}: (2) No such file or directory
2025-07-30T18:06:01.898+0000 12d78640 10 cephfs::mirror::ServiceDaemon: 0x14e0e840 add_or_update_fs_attribute: fscid=62
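The AdminSocket bind failure above can be reproduced in isolation: binding a UNIX domain socket to a path that already exists fails, so a restarted daemon cannot re-register its admin socket until the stale file is removed. This is a minimal sketch of that mechanism only; note the errno typically surfaces as EADDRINUSE on Linux, while ceph's log reports the condition as "(17) File exists".

```python
# Minimal reproduction of the asok bind failure: a second bind to an
# existing socket path fails. Standalone illustration, not ceph code.
import errno
import os
import socket
import tempfile

path = os.path.join(tempfile.mkdtemp(), "cephfs-mirror.asok")

s1 = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
s1.bind(path)                     # first daemon: bind succeeds

s2 = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
try:
    s2.bind(path)                 # second bind to the same path fails
except OSError as e:
    print(e.errno in (errno.EADDRINUSE, errno.EEXIST))  # → True
finally:
    s1.close()
    s2.close()
```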
Mirror daemon restart failure:
ceph-client.mirror.48688.log:2025-07-30T18:05:37.851+0000 12d78640 -1 cephfs::mirror::Mirror handle_enable_mirroring: failed to initialize FSMirror for filesystem={fscid=62, fs_name=cephfs}: (2) No such file or directory
Updated by Jos Collin 8 months ago
Today I tried this in a vstart cluster where mirroring was enabled:
$ ./bin/ceph mgr module disable mirroring
$ ./bin/ceph fs mirror enable a
After waiting for some time, the following command output does not display "state:" at all, which explains why the qa test failed.
jcollin@telecast:~/workspace/ceph/build$ ./bin/ceph --admin-daemon asok/client.admin.33771.asok fs mirror status a@1
{
"rados_inst": "127.0.0.1:0/102182576",
"peers": {
"6557125c-fa18-4e33-aab4-be16d80321bb": {
"remote": {
"client_name": "client.mirror_remote",
"cluster_name": "ceph",
"fs_name": "remotefs"
}
}
},
"snap_dirs": {
"dir_count": 1
}
}
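Parsing that exact JSON output confirms the point: there is no top-level "state" key, so a loop waiting for `state == 'failed'` against this output can never succeed.

```python
# The status JSON printed above, parsed to show the missing key.
import json

status = json.loads("""
{
  "rados_inst": "127.0.0.1:0/102182576",
  "peers": {
    "6557125c-fa18-4e33-aab4-be16d80321bb": {
      "remote": {
        "client_name": "client.mirror_remote",
        "cluster_name": "ceph",
        "fs_name": "remotefs"
      }
    }
  },
  "snap_dirs": {"dir_count": 1}
}
""")

print("state" in status)   # → False
```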
Updated by Jos Collin 8 months ago
- Status changed from In Progress to Fix Under Review
- Pull request ID changed from 60052 to 64846