Bug #51964
Status: Closed
qa: test_cephfs_mirror_restart_sync_on_blocklist failure
Description
2021-07-29T07:47:34.804 INFO:tasks.cephfs_test_runner:======================================================================
2021-07-29T07:47:34.804 INFO:tasks.cephfs_test_runner:ERROR: test_cephfs_mirror_restart_sync_on_blocklist (tasks.cephfs.test_mirroring.TestMirroring)
2021-07-29T07:47:34.805 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2021-07-29T07:47:34.805 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2021-07-29T07:47:34.805 INFO:tasks.cephfs_test_runner: File "/home/teuthworker/src/git.ceph.com_ceph-c_c098ff00fff39aa861a531e01a5abb127a724622/qa/tasks/cephfs/test_mirroring.py", line 627, in test_cephfs_mirror_restart_sync_on_blocklist
2021-07-29T07:47:34.806 INFO:tasks.cephfs_test_runner: "client.mirror_remote@ceph", '/d0', 'snap0', expected_snap_count=1)
2021-07-29T07:47:34.806 INFO:tasks.cephfs_test_runner: File "/home/teuthworker/src/git.ceph.com_ceph-c_c098ff00fff39aa861a531e01a5abb127a724622/qa/tasks/cephfs/test_mirroring.py", line 142, in check_peer_status
2021-07-29T07:47:34.806 INFO:tasks.cephfs_test_runner: f'{fs_name}@{fs_id}', peer_uuid)
2021-07-29T07:47:34.807 INFO:tasks.cephfs_test_runner: File "/home/teuthworker/src/git.ceph.com_ceph-c_c098ff00fff39aa861a531e01a5abb127a724622/qa/tasks/cephfs/test_mirroring.py", line 231, in mirror_daemon_command
2021-07-29T07:47:34.807 INFO:tasks.cephfs_test_runner: check_status=True, label=cmd_label)
2021-07-29T07:47:34.808 INFO:tasks.cephfs_test_runner: File "/home/teuthworker/src/git.ceph.com_git_teuthology_73aa7e3b960c7ffac669297b6aa86606265edd1b/teuthology/orchestra/remote.py", line 509, in run
2021-07-29T07:47:34.808 INFO:tasks.cephfs_test_runner: r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
2021-07-29T07:47:34.809 INFO:tasks.cephfs_test_runner: File "/home/teuthworker/src/git.ceph.com_git_teuthology_73aa7e3b960c7ffac669297b6aa86606265edd1b/teuthology/orchestra/run.py", line 455, in run
2021-07-29T07:47:34.809 INFO:tasks.cephfs_test_runner: r.wait()
2021-07-29T07:47:34.810 INFO:tasks.cephfs_test_runner: File "/home/teuthworker/src/git.ceph.com_git_teuthology_73aa7e3b960c7ffac669297b6aa86606265edd1b/teuthology/orchestra/run.py", line 161, in wait
2021-07-29T07:47:34.811 INFO:tasks.cephfs_test_runner: self._raise_for_status()
2021-07-29T07:47:34.811 INFO:tasks.cephfs_test_runner: File "/home/teuthworker/src/git.ceph.com_git_teuthology_73aa7e3b960c7ffac669297b6aa86606265edd1b/teuthology/orchestra/run.py", line 183, in _raise_for_status
2021-07-29T07:47:34.812 INFO:tasks.cephfs_test_runner: node=self.hostname, label=self.label
2021-07-29T07:47:34.812 INFO:tasks.cephfs_test_runner:teuthology.exceptions.CommandFailedError: Command failed (peer status for fs: cephfs) on smithi042 with status 22: 'ceph --admin-daemon /var/run/ceph/cephfs-mirror.asok fs mirror peer status cephfs@30 96c863b2-6936-4726-a484-146c9382a70b'
From: /ceph/teuthology-archive/pdonnell-2021-07-29_05:54:43-fs-wip-pdonnell-testing-20210729.025313-distro-basic-smithi/6300469/teuthology.log
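The admin-socket command exits with status 22 (EINVAL), which the daemon returns while it does not (yet) know the filesystem being queried. A retry wrapper of roughly this shape could make such a check tolerant of that window; the helper names here are illustrative, not the actual qa helpers:

```python
import time

def wait_for(check, attempts=10, delay=1.0):
    """Retry `check` until it returns a non-None result or attempts run out.

    `check` should raise (or return None) while the daemon has not yet
    registered the filesystem, and return the parsed status once it has.
    """
    last_exc = None
    for _ in range(attempts):
        try:
            result = check()
            if result is not None:
                return result
        except Exception as exc:  # e.g. CommandFailedError with status 22
            last_exc = exc
        time.sleep(delay)
    raise RuntimeError(f"daemon never became ready: {last_exc}")

# Demo with a stand-in for the admin-socket query: fails twice, then succeeds.
calls = {"n": 0}
def fake_peer_status():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("status 22")  # daemon not ready yet
    return {"state": "idle"}

status = wait_for(fake_peer_status, attempts=5, delay=0.01)
print(status["state"])  # idle
```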
Updated by Patrick Donnelly over 4 years ago
/ceph/teuthology-archive/pdonnell-2021-11-04_15:43:53-fs-wip-pdonnell-testing-20211103.023355-distro-basic-smithi/6485100/teuthology.log
Updated by Kotresh Hiremath Ravishankar over 4 years ago
Also seen in this run
/ceph/teuthology-archive/yuriw-2021-11-08_15:19:37-fs-wip-yuri2-testing-2021-11-06-1322-pacific-distro-basic-smithi/6491186/teuthology.log
Updated by Venky Shankar about 4 years ago
The mirror daemon is stuck mounting the file system:
2021-12-22T16:35:57.556+0000 10697700 20 cephfs::mirror::Utils connect: connecting to cluster=ceph, client=client.mirror, mon_host=
2021-12-22T16:35:57.894+0000 10697700 10 cephfs::mirror::Utils connect: using mon addr=172.21.15.164
2021-12-22T16:35:58.029+0000 10697700 -1 asok(0xa0c6050) AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to bind the UNIX domain socket to '/var/run/ceph/cephfs-mirror.asok': (17) File exists
2021-12-22T16:35:58.202+0000 10697700 10 cephfs::mirror::Utils connect: connected to cluster=ceph using client=client.mirror
2021-12-22T16:35:58.224+0000 10697700 20 cephfs::mirror::Utils mount: filesystem={fscid=30, fs_name=cephfs}
2021-12-22T16:35:58.558+0000 14ea0700 20 cephfs::mirror::ServiceDaemon: 0xfc08d10 update_status: 1 filesystem(s)
2021-12-22T16:35:58.610+0000 10e98700 20 cephfs::mirror::Mirror update_fs_mirrors
... ceph_mount() never returned - will rerun with "debug client = 20".
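A call that never returns, like the ceph_mount() above, can at least be detected rather than silently blocking the caller. This is a generic watchdog pattern in Python, not the daemon's actual (C++) code; the function names are illustrative:

```python
import threading
import time

def call_with_timeout(fn, timeout):
    """Run fn in a daemon thread and give up after `timeout` seconds,
    instead of blocking forever on a call that never returns."""
    result = {}
    t = threading.Thread(target=lambda: result.setdefault("value", fn()),
                         daemon=True)
    t.start()
    t.join(timeout)
    if t.is_alive():
        return ("hung", None)  # fn is still stuck, e.g. inside a mount
    return ("ok", result.get("value"))

# A mount that never returns vs. one that completes promptly.
outcome, _ = call_with_timeout(lambda: time.sleep(60), timeout=0.2)
quick, value = call_with_timeout(lambda: "mounted", timeout=0.2)
print(outcome, quick, value)  # hung ok mounted
```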
Updated by Venky Shankar about 4 years ago
Unable to hit this consistently: https://pulpito.ceph.com/vshankar-2022-01-05_05:43:45-fs-wip-vshankar-testing-20220103-142738-testing-default-smithi/
We should probably include "debug client: 20" for cephfs-mirror tests so that we have debug logs to look at when we hit this again.
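One way to do that is a conf override in the suite yaml. This fragment is a sketch only; the exact override file and whether "debug cephfs_mirror" is also wanted should be checked against the qa suite:

```yaml
# Illustrative teuthology override for cephfs-mirror debug logging.
overrides:
  ceph:
    conf:
      client:
        debug client: 20
        debug cephfs_mirror: 20
```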
Updated by Kotresh Hiremath Ravishankar over 3 years ago
Updated by Rishabh Dave over 3 years ago
Updated by Kotresh Hiremath Ravishankar over 3 years ago
Updated by Venky Shankar over 3 years ago
- Category set to Testing
- Priority changed from High to Low
- Target version set to v18.0.0
- Backport changed from pacific to pacific,quincy
- Labels (FS) qa added
Lowering priority -- this is an issue with the test case rather than a bug in cephfs-mirror.
Updated by Laura Flores about 3 years ago
- Has duplicate Bug #58656: qa: Test failure: test_cephfs_mirror_restart_sync_on_blocklist (tasks.cephfs.test_mirroring.TestMirroring) added
Updated by Laura Flores almost 3 years ago
/a/yuriw-2023-03-27_23:05:54-rados-wip-yuri4-testing-2023-03-25-0714-distro-default-smithi/7221904
Updated by Laura Flores almost 3 years ago
/a/yuriw-2023-03-14_20:10:47-rados-wip-yuri-testing-2023-03-14-0714-reef-distro-default-smithi/7206976
Updated by Laura Flores almost 3 years ago
/a/yuriw-2023-03-30_21:29:24-rados-wip-yuri2-testing-2023-03-30-0826-distro-default-smithi/7227398
Updated by Laura Flores almost 3 years ago
- Backport changed from pacific,quincy to pacific,quincy,reef
Updated by Venky Shankar over 2 years ago
- Assignee changed from Venky Shankar to Jos Collin
Jos, please take this one.
Updated by Jos Collin over 2 years ago
@Venky:
This bug couldn't be reproduced on main with consecutive runs of test_cephfs_mirror_restart_sync_on_blocklist.
10 consecutive runs passed:
http://pulpito.front.sepia.ceph.com/jcollin-2023-07-31_05:17:33-fs:mirror-wip-cephfs-mirror-fix_main-distro-default-smithi/
100 consecutive runs passed:
http://pulpito.front.sepia.ceph.com/jcollin-2023-08-07_09:13:25-fs:mirror-wip-cephfs-mirror-fix_main-distro-default-smithi/
Couldn't reproduce it on a full tasks.cephfs.test_mirroring.TestMirroring run (25 consecutive runs):
http://pulpito.front.sepia.ceph.com/jcollin-2023-08-08_02:10:15-fs:mirror-main-distro-default-smithi/
Updated by Venky Shankar over 2 years ago
Jos Collin wrote:
@Venky:
This bug couldn't be reproduced on main with consecutive runs of test_cephfs_mirror_restart_sync_on_blocklist.
10 consecutive runs passed:
http://pulpito.front.sepia.ceph.com/jcollin-2023-07-31_05:17:33-fs:mirror-wip-cephfs-mirror-fix_main-distro-default-smithi/
100 consecutive runs passed:
http://pulpito.front.sepia.ceph.com/jcollin-2023-08-07_09:13:25-fs:mirror-wip-cephfs-mirror-fix_main-distro-default-smithi/
Couldn't reproduce it on a full tasks.cephfs.test_mirroring.TestMirroring run (25 consecutive runs):
http://pulpito.front.sepia.ceph.com/jcollin-2023-08-08_02:10:15-fs:mirror-main-distro-default-smithi/
The failures are intermittent. The latest failed test run is https://pulpito.ceph.com/vshankar-2023-07-26_04:54:56-fs-wip-vshankar-testing-20230725.053049-testing-default-smithi/7352537/
We should be able to get some hints from this run.
Updated by Jos Collin over 2 years ago
In the run https://pulpito.ceph.com/vshankar-2023-07-26_04:54:56-fs-wip-vshankar-testing-20230725.053049-testing-default-smithi/7352537/, as discussed over gchat, please check the ceph-client.mirror.95060.log file. After the client was blocklisted, the filesystem (fscid=12, fs_name=cephfs) was removed, showing "update_status: 0 filesystem(s)".
1538 2023-07-26T05:53:49.469+0000 8cd4640 5 cephfs::mirror::ClusterWatcher handle_fsmap: mirroring enabled=[], mirroring_disabled=[{fscid=12, fs_name=cephfs}]
1539 2023-07-26T05:53:49.469+0000 8cd4640 10 cephfs::mirror::ServiceDaemon: 0x8fe9d50 remove_filesystem: fscid=12
1540 2023-07-26T05:53:49.473+0000 8cd4640 10 cephfs::mirror::ServiceDaemon: 0x8fe9d50 schedule_update_status
1541 2023-07-26T05:53:49.475+0000 8cd4640 10 cephfs::mirror::Mirror mirroring_disabled: filesystem={fscid=12, fs_name=cephfs}
1542 2023-07-26T05:53:49.475+0000 8cd4640 5 cephfs::mirror::ClusterWatcher handle_fsmap: peers added={}, peers removed={}
1543 2023-07-26T05:53:49.696+0000 189b1640 20 cephfs::mirror::Watcher handle_notify: notify_id=652835029106, handle=326732576, notifier_id=6609
1544 2023-07-26T05:53:49.696+0000 189b1640 20 cephfs::mirror::MirrorWatcher handle_notify
1545 2023-07-26T05:53:50.473+0000 110e4640 20 cephfs::mirror::ServiceDaemon: 0x8fe9d50 update_status: 0 filesystem(s)
This is the odd part. Then, after "scheduling fs mirror update", the filesystem `cephfs` comes back and mirroring is enabled with a new fscid:
1817 2023-07-26T05:57:02.570+0000 8cd4640 5 cephfs::mirror::ClusterWatcher handle_fsmap: mirroring enabled=[{fscid=14, fs_name=cephfs}], mirroring_disabled=[]
1818 2023-07-26T05:57:02.570+0000 8cd4640 10 cephfs::mirror::ServiceDaemon: 0x8fe9d50 add_filesystem: fscid=14, fs_name=cephfs
1819 2023-07-26T05:57:02.570+0000 8cd4640 10 cephfs::mirror::ServiceDaemon: 0x8fe9d50 schedule_update_status
1820 2023-07-26T05:57:02.570+0000 8cd4640 10 cephfs::mirror::Mirror mirroring_enabled: filesystem={fscid=14, fs_name=cephfs}, pool_id=29
1821 2023-07-26T05:57:02.570+0000 8cd4640 5 cephfs::mirror::ClusterWatcher handle_fsmap: peers added={}, peers removed={}
1822 2023-07-26T05:57:02.925+0000 d0dc640 20 cephfs::mirror::Mirror update_fs_mirrors
1823 2023-07-26T05:57:02.925+0000 d0dc640 10 cephfs::mirror::Mirror enable_mirroring: starting FSMirror: filesystem={fscid=14, fs_name=cephfs}
1824 2023-07-26T05:57:02.925+0000 d0dc640 10 cephfs::mirror::ServiceDaemon: 0x8fe9d50 add_or_update_fs_attribute: fscid=14
1825 2023-07-26T05:57:02.925+0000 d0dc640 10 cephfs::mirror::ServiceDaemon: 0x8fe9d50 schedule_update_status
1826 2023-07-26T05:57:02.925+0000 d0dc640 20 cephfs::mirror::FSMirror init
The test doesn't seem to pass `check_peer_status()`.
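The fscid change above matters because the daemon tracks filesystems by fscid, and the asok probe embeds it ('fs mirror peer status <fs_name>@<fscid>'). A toy model of that registry (illustrative names only, not the daemon's actual data structures) shows why a probe keyed on the old fscid fails across a disable/enable cycle:

```python
# Minimal model of the daemon's per-filesystem registry, keyed by fscid.
registry = {12: {"fs_name": "cephfs", "peer_uuid": "96c863b2"}}

# handle_fsmap: mirroring_disabled=[{fscid=12, fs_name=cephfs}]
registry.pop(12)            # "update_status: 0 filesystem(s)"

# handle_fsmap: mirroring_enabled=[{fscid=14, fs_name=cephfs}]
registry[14] = {"fs_name": "cephfs", "peer_uuid": None}

def peer_status(fscid):
    # Stand-in for 'fs mirror peer status cephfs@<fscid>': the daemon only
    # knows filesystems currently registered, so a stale fscid fails the
    # way the asok command fails with EINVAL (status 22).
    if fscid not in registry:
        raise KeyError(fscid)
    return registry[fscid]

stale = None
try:
    peer_status(12)          # probe with the cached, pre-blocklist fscid
except KeyError:
    stale = "EINVAL"
current = peer_status(14)["fs_name"]
print(stale, current)  # EINVAL cephfs
```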
Updated by Venky Shankar over 2 years ago
Hi Jos,
Jos Collin wrote:
In the run https://pulpito.ceph.com/vshankar-2023-07-26_04:54:56-fs-wip-vshankar-testing-20230725.053049-testing-default-smithi/7352537/, as discussed over gchat, please check the ceph-client.mirror.95060.log file. After the client was blocklisted, the filesystem (fscid=12, fs_name=cephfs) was removed, showing "update_status: 0 filesystem(s)".
I don't think this is the issue although I suspected it.
[...]
This is the weird thing noticed.
Then after "scheduling fs mirror update", the filesystem `cephfs` comes back and mirroring enabled with a new fscid:
[...]
The test doesn't seem to pass `check_peer_status()`.
The actual issue is this:
2023-07-26T06:46:58.510+0000 d0dc640 20 cephfs::mirror::FSMirror shutdown_mirror_watcher
2023-07-26T06:46:58.513+0000 d0dc640 20 cephfs::mirror::MirrorWatcher shutdown
2023-07-26T06:46:58.513+0000 d0dc640 20 cephfs::mirror::MirrorWatcher unregister_watcher
2023-07-26T06:46:58.513+0000 d0dc640 20 cephfs::mirror::Watcher unregister_watch
2023-07-26T06:46:58.513+0000 d0dc640 20 cephfs::mirror::Mirror schedule_mirror_update_task: scheduling fs mirror update (0x72287f0) after 2 seconds
2023-07-26T06:46:58.514+0000 c8db640 20 cephfs::mirror::MirrorWatcher handle_unregister_watcher: r=0
2023-07-26T06:46:58.514+0000 c8db640 20 cephfs::mirror::FSMirror handle_shutdown_mirror_watcher: r=0
2023-07-26T06:46:58.514+0000 c8db640 20 cephfs::mirror::FSMirror shutdown_instance_watcher
2023-07-26T06:46:58.514+0000 c8db640 20 cephfs::mirror::InstanceWatcher shutdown
2023-07-26T06:46:58.514+0000 c8db640 20 cephfs::mirror::InstanceWatcher unregister_watcher
2023-07-26T06:46:58.514+0000 c8db640 20 cephfs::mirror::Watcher unregister_watch
2023-07-26T06:46:58.514+0000 c8db640 20 cephfs::mirror::InstanceWatcher handle_unregister_watcher: r=0
2023-07-26T06:46:58.514+0000 c8db640 20 cephfs::mirror::InstanceWatcher remove_instance
2023-07-26T06:46:58.516+0000 189b1640 20 cephfs::mirror::InstanceWatcher handle_remove_instance: r=-108
2023-07-26T06:46:58.516+0000 c8db640 20 cephfs::mirror::FSMirror handle_shutdown_instance_watcher: r=-108
2023-07-26T06:46:58.516+0000 c8db640 20 cephfs::mirror::FSMirror cleanup
2023-07-26T06:46:59.595+0000 c8db640 20 cephfs::mirror::FSMirror ~FSMirror
2023-07-26T06:46:59.595+0000 c8db640 10 cephfs::mirror::Mirror enable_mirroring: starting FSMirror: filesystem={fscid=30, fs_name=cephfs}
2023-07-26T06:46:59.596+0000 c8db640 10 cephfs::mirror::ServiceDaemon: 0x8fe9d50 add_or_update_fs_attribute: fscid=30
2023-07-26T06:46:59.596+0000 c8db640 10 cephfs::mirror::ServiceDaemon: 0x8fe9d50 schedule_update_status
2023-07-26T06:46:59.596+0000 c8db640 20 cephfs::mirror::FSMirror init
2023-07-26T06:46:59.596+0000 c8db640 20 cephfs::mirror::Utils connect: connecting to cluster=ceph, client=client.mirror, mon_host=
2023-07-26T06:46:59.955+0000 c8db640 10 cephfs::mirror::Utils connect: using mon addr=172.21.15.162
2023-07-26T06:47:00.080+0000 c8db640 -1 asok(0x6b581f0) AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to bind the UNIX domain socket to '/var/run/ceph/cephfs-mirror.asok': (17) File exists
2023-07-26T06:47:00.259+0000 c8db640 10 cephfs::mirror::Utils connect: connected to cluster=ceph using client=client.mirror
2023-07-26T06:47:00.317+0000 c8db640 20 cephfs::mirror::Utils mount: filesystem={fscid=30, fs_name=cephfs}
2023-07-26T06:47:00.553+0000 d0dc640 20 cephfs::mirror::Mirror update_fs_mirrors
2023-07-26T06:47:00.604+0000 110e4640 20 cephfs::mirror::ServiceDaemon: 0x8fe9d50 update_status: 1 filesystem(s)
Post blocklisting, the mirror daemon tried to reinitialize and got stuck mounting the (local) file system. Could you check why that happened?
Updated by Jos Collin almost 2 years ago
- Status changed from In Progress to Fix Under Review
- Priority changed from Low to Normal
- Pull request ID set to 56193
Updated by Venky Shankar over 1 year ago
- Category changed from Testing to Correctness/Safety
- Status changed from Fix Under Review to Pending Backport
- Target version set to v20.0.0
- Backport changed from pacific,quincy,reef to quincy,reef,squid
Updated by Jos Collin over 1 year ago
- Copied to Backport #66975: squid: qa: test_cephfs_mirror_restart_sync_on_blocklist failure added
Updated by Jos Collin over 1 year ago
- Copied to Backport #66976: reef: qa: test_cephfs_mirror_restart_sync_on_blocklist failure added
Updated by Jos Collin over 1 year ago
- Copied to Backport #66978: quincy: qa: test_cephfs_mirror_restart_sync_on_blocklist failure added
Updated by Upkeep Bot over 1 year ago
- Tags (freeform) set to backport_processed
Updated by Jos Collin over 1 year ago
- Status changed from Pending Backport to Resolved
Updated by Upkeep Bot 8 months ago
- Merge Commit set to 62eb72731aca5d403ed6239946c6ea66f3be36e7
- Fixed In set to v19.3.0-3451-g62eb72731a
- Upkeep Timestamp set to 2025-07-14T17:43:02+00:00
Updated by Upkeep Bot 5 months ago
- Released In set to v20.2.0~2488
- Upkeep Timestamp changed from 2025-07-14T17:43:02+00:00 to 2025-11-01T01:28:36+00:00