cephfs-mirror: register mirror daemon as service daemon #39408
batrick merged 4 commits into ceph:master
Conversation
leseb left a comment
Can you share an output example of a ceph -s? Also, are we using RADOS's connection ID as a unique identifier?
Right. Sample from a vstart cluster: BTW, I still need to refine the daemon status metadata that is periodically sent to ceph-mgr.
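The status metadata mentioned above is essentially a small key/value map that the daemon pushes to ceph-mgr on a timer. A minimal Python sketch of that pattern (the keys, `build_status`, and the `send_status` callback are hypothetical illustrations, not the actual cephfs-mirror schema or the librados API):

```python
import threading

# Hypothetical status payload; the real cephfs-mirror daemon defines its own
# key/value schema and sends it via librados' service daemon status update.
def build_status(num_filesystems, peers):
    return {
        "num_filesystems": str(num_filesystems),
        "peers": ",".join(peers),
    }

class StatusUpdater:
    """Periodically pushes a status map using a user-supplied sender."""

    def __init__(self, get_status, send_status, interval_secs=10):
        self.get_status = get_status      # callable returning the status map
        self.send_status = send_status    # callable that delivers it to the mgr
        self.interval_secs = interval_secs
        self._timer = None

    def _tick(self):
        # Send the current status, then re-arm the timer.
        self.send_status(self.get_status())
        self._schedule()

    def _schedule(self):
        self._timer = threading.Timer(self.interval_secs, self._tick)
        self._timer.daemon = True
        self._timer.start()

    def start(self):
        self._schedule()

    def stop(self):
        if self._timer:
            self._timer.cancel()
```

A timer-based loop like this is why the reported status can lag slightly behind the daemon's real state between ticks.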
(removed WIP) Tests to follow.
leseb left a comment
With this patch, will the cephfs-mirror daemon be reported by the ceph versions command?
Good question. I think it will. Service daemons are dumped here: https://github.com/ceph/ceph/blob/master/src/mon/Monitor.cc#L3918
    "client.mirror_remote@ceph", '/d0', 'snap0', 1)
    self.disable_mirroring(self.primary_fs_name, self.primary_fs_id)

    def test_cephfs_mirror_service_daemon_status(self):
2021-03-04T11:44:34.995 INFO:tasks.cephfs_test_runner:======================================================================
2021-03-04T11:44:34.996 INFO:tasks.cephfs_test_runner:FAIL: test_cephfs_mirror_service_daemon_status (tasks.cephfs.test_mirroring.TestMirroring)
2021-03-04T11:44:34.996 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2021-03-04T11:44:34.996 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2021-03-04T11:44:34.997 INFO:tasks.cephfs_test_runner: File "/home/teuthworker/src/git.ceph.com_ceph-c_d5875db5f1fdb7ac11e6c9da991269513ffdac1a/qa/tasks/cephfs/test_mirroring.py", line 678, in test_cephfs_mirror_service_daemon_status
2021-03-04T11:44:34.997 INFO:tasks.cephfs_test_runner: self.assertEquals(peer_stats['failure_count'], 1)
2021-03-04T11:44:34.997 INFO:tasks.cephfs_test_runner:AssertionError: 0 != 1
From: /ceph/teuthology-archive/pdonnell-2021-03-04_03:51:01-fs-wip-pdonnell-testing-20210303.195715-distro-basic-smithi/5932070/teuthology.log
It's a timing issue. The mirror daemon marks a directory as "failed" only after it has seen "N" consecutive synchronization failures.
From logs:
2021-03-04T11:43:04.557+0000 7f3c0c298700 -1 cephfs::mirror::PeerReplayer(efca7e05-a062-4eb6-99e5-d555726d6052) sync_snaps: failed to sync snapshots for dir_path=/d0
2021-03-04T11:43:14.558+0000 7f3c0c298700 -1 cephfs::mirror::PeerReplayer(efca7e05-a062-4eb6-99e5-d555726d6052) sync_snaps: failed to sync snapshots for dir_path=/d0
2021-03-04T11:43:24.560+0000 7f3c0ba97700 -1 cephfs::mirror::PeerReplayer(efca7e05-a062-4eb6-99e5-d555726d6052) sync_snaps: failed to sync snapshots for dir_path=/d0
2021-03-04T11:43:34.562+0000 7f3c0ba97700 -1 cephfs::mirror::PeerReplayer(efca7e05-a062-4eb6-99e5-d555726d6052) sync_snaps: failed to sync snapshots for dir_path=/d0
2021-03-04T11:43:44.565+0000 7f3c0b296700 -1 cephfs::mirror::PeerReplayer(efca7e05-a062-4eb6-99e5-d555726d6052) sync_snaps: failed to sync snapshots for dir_path=/d0
2021-03-04T11:43:54.566+0000 7f3c0c298700 -1 cephfs::mirror::PeerReplayer(efca7e05-a062-4eb6-99e5-d555726d6052) sync_snaps: failed to sync snapshots for dir_path=/d0
2021-03-04T11:44:04.568+0000 7f3c0c298700 -1 cephfs::mirror::PeerReplayer(efca7e05-a062-4eb6-99e5-d555726d6052) sync_snaps: failed to sync snapshots for dir_path=/d0
2021-03-04T11:44:14.569+0000 7f3c0b296700 -1 cephfs::mirror::PeerReplayer(efca7e05-a062-4eb6-99e5-d555726d6052) sync_snaps: failed to sync snapshots for dir_path=/d0
For peer uuid: efca7e05-a062-4eb6-99e5-d555726d6052, the mirror daemon has not yet hit the consecutive failure count (default: 10).
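The behaviour described above amounts to a simple counter: a directory is marked failed only after N consecutive failed sync attempts, and a successful sync resets the streak. A minimal Python sketch (the class and method names are hypothetical; only the default of 10 comes from the discussion, and this is not the actual C++ implementation):

```python
class DirFailureTracker:
    """Marks a directory 'failed' after N consecutive sync failures."""

    def __init__(self, max_consecutive_failures=10):  # default per the discussion
        self.max_consecutive_failures = max_consecutive_failures
        self.consecutive = {}   # dir_path -> current failure streak
        self.failed = set()     # dir_paths marked as failed

    def record_sync(self, dir_path, ok):
        if ok:
            # A successful sync resets the streak.
            self.consecutive[dir_path] = 0
            return
        self.consecutive[dir_path] = self.consecutive.get(dir_path, 0) + 1
        if self.consecutive[dir_path] >= self.max_consecutive_failures:
            self.failed.add(dir_path)
```

In the failed run above, only 8 sync_snaps failures had been logged for /d0 by the time the test asserted on failure_count, so the threshold had not yet been reached.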
I avoided that since it repeatedly runs a command (although the delay is configurable), and we already know the expected wait time before a directory is marked as failed.
I can convert all the wait calls to use safe_while in another PR, since that requires an audit of the tests.
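For context, teuthology's safe_while is a bounded retry helper that fails a test instead of letting it block forever. A self-contained sketch of the same idea (this is an illustration of the pattern, not the teuthology implementation; `wait_until` and `MaxTriesExceeded` are hypothetical names):

```python
import time

class MaxTriesExceeded(Exception):
    """Raised when the condition is still false after the last attempt."""

def wait_until(condition, sleep=1, tries=10):
    """Poll condition() up to `tries` times, sleeping between attempts.

    Returns as soon as condition() is truthy; raises MaxTriesExceeded
    otherwise, so a stuck test fails fast with a clear error instead of
    hanging on a fixed, possibly too-short sleep.
    """
    for attempt in range(tries):
        if condition():
            return
        if attempt < tries - 1:
            time.sleep(sleep)
    raise MaxTriesExceeded(f"condition not met after {tries} tries")
```

Applied to this test, the fixed wait could become a poll on the peer's failure_count reaching the expected value.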
Signed-off-by: Venky Shankar <vshankar@redhat.com>
Fixes: http://tracker.ceph.com/issues/48943
Signed-off-by: Venky Shankar <vshankar@redhat.com>
@vshankar ready for another round of QA?
Yes, please.