
cephfs-mirror: register mirror daemon as service daemon #39408

Merged
batrick merged 4 commits into ceph:master from vshankar:wip-cephfs-mirror-service-register
Mar 9, 2021
Conversation

@vshankar
Contributor

Checklist

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug

Available Jenkins commands:
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox

@vshankar added the "cephfs" (Ceph File System) label on Feb 11, 2021
@vshankar requested a review from a team on February 11, 2021 08:35
Member

@leseb left a comment


Can you share an output example of a ceph -s? Also, are we using RADOS's connection ID as a unique identifier?

@vshankar
Contributor Author

Can you share an output example of a ceph -s? Also, are we using RADOS's connection ID as a unique identifier?

Right.

Sample from a vstart cluster:

  cluster:
    id:     09cd5c43-36cd-4a3f-92e8-1ebece7a8985
    health: HEALTH_OK

  services:
    mon:           1 daemons, quorum a (age 11m)
    mgr:           x(active, since 11m)
    mds:           cephfs:1 backup_fs:1 {backup_fs:0=b=up:active,cephfs:0=c=up:active} 1 up:standby
    osd:           3 osds: 3 up (since 11m), 3 in (since 2w)
    cephfs-mirror: 1 daemon active (214149)

BTW, I still need to refine the daemon update status meta that is periodically sent to ceph-mgr.

@vshankar force-pushed the wip-cephfs-mirror-service-register branch from 80be4ff to 3eab93e on February 15, 2021 09:36
@vshankar force-pushed the wip-cephfs-mirror-service-register branch from 3eab93e to 24fd330 on February 15, 2021 09:55
@vshankar changed the title from "[WIP] cephfs-mirror: register mirror daemon as service daemon" to "cephfs-mirror: register mirror daemon as service daemon" on Feb 15, 2021
@vshankar
Copy link
Contributor Author

(removed WIP)

tests to follow.

@vshankar force-pushed the wip-cephfs-mirror-service-register branch from 24fd330 to 2f4a9b7 on February 16, 2021 10:05
@github-actions github-actions bot added the tests label Feb 16, 2021
Member

@batrick left a comment


LGTM

Member

@leseb left a comment


With this patch, will the cephfs-mirror daemon be reported by the ceph versions command?

@vshankar
Contributor Author

With this patch, will the cephfs-mirror daemon be reported by the ceph versions command?

Good question. I think it does. Service daemons are dumped here: https://github.com/ceph/ceph/blob/master/src/mon/Monitor.cc#L3918
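Registered service daemons also show up in the JSON emitted by ceph service dump, so a test can check for the mirror daemon by parsing that output. A minimal sketch (the exact JSON shape in SAMPLE below is an assumption for illustration, not verbatim monitor output):

```python
import json

def service_is_registered(dump_json: str, service: str) -> bool:
    """Return True if `service` (e.g. "cephfs-mirror") appears in the
    output of `ceph service dump -f json`."""
    dump = json.loads(dump_json)
    return service in dump.get("services", {})

# Hypothetical sample of `ceph service dump -f json` output, trimmed
# to the fields this check cares about.
SAMPLE = '''
{
  "epoch": 5,
  "services": {
    "cephfs-mirror": {
      "daemons": {
        "summary": "",
        "214149": {"start_epoch": 2, "gid": 214149}
      }
    }
  }
}
'''

print(service_is_registered(SAMPLE, "cephfs-mirror"))  # True
```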

@vshankar force-pushed the wip-cephfs-mirror-service-register branch from 2f4a9b7 to e14e8cd on February 24, 2021 04:13
@vshankar mentioned this pull request Mar 3, 2021
@leseb
Member

leseb commented Mar 3, 2021

@batrick @vshankar anything missing here? Rook would like to try it out :). Thanks.

"client.mirror_remote@ceph", '/d0', 'snap0', 1)
self.disable_mirroring(self.primary_fs_name, self.primary_fs_id)

def test_cephfs_mirror_service_daemon_status(self):
Member


2021-03-04T11:44:34.995 INFO:tasks.cephfs_test_runner:======================================================================
2021-03-04T11:44:34.996 INFO:tasks.cephfs_test_runner:FAIL: test_cephfs_mirror_service_daemon_status (tasks.cephfs.test_mirroring.TestMirroring)
2021-03-04T11:44:34.996 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2021-03-04T11:44:34.996 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2021-03-04T11:44:34.997 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_ceph-c_d5875db5f1fdb7ac11e6c9da991269513ffdac1a/qa/tasks/cephfs/test_mirroring.py", line 678, in test_cephfs_mirror_service_daemon_status
2021-03-04T11:44:34.997 INFO:tasks.cephfs_test_runner:    self.assertEquals(peer_stats['failure_count'], 1)
2021-03-04T11:44:34.997 INFO:tasks.cephfs_test_runner:AssertionError: 0 != 1

From: /ceph/teuthology-archive/pdonnell-2021-03-04_03:51:01-fs-wip-pdonnell-testing-20210303.195715-distro-basic-smithi/5932070/teuthology.log

Contributor Author


It's a timing thing. The mirror daemon marks a directory as "failed" only after it has seen "N" consecutive synchronization failures.

Contributor Author


From logs:

2021-03-04T11:43:04.557+0000 7f3c0c298700 -1 cephfs::mirror::PeerReplayer(efca7e05-a062-4eb6-99e5-d555726d6052) sync_snaps: failed to sync snapshots for dir_path=/d0
2021-03-04T11:43:14.558+0000 7f3c0c298700 -1 cephfs::mirror::PeerReplayer(efca7e05-a062-4eb6-99e5-d555726d6052) sync_snaps: failed to sync snapshots for dir_path=/d0
2021-03-04T11:43:24.560+0000 7f3c0ba97700 -1 cephfs::mirror::PeerReplayer(efca7e05-a062-4eb6-99e5-d555726d6052) sync_snaps: failed to sync snapshots for dir_path=/d0
2021-03-04T11:43:34.562+0000 7f3c0ba97700 -1 cephfs::mirror::PeerReplayer(efca7e05-a062-4eb6-99e5-d555726d6052) sync_snaps: failed to sync snapshots for dir_path=/d0
2021-03-04T11:43:44.565+0000 7f3c0b296700 -1 cephfs::mirror::PeerReplayer(efca7e05-a062-4eb6-99e5-d555726d6052) sync_snaps: failed to sync snapshots for dir_path=/d0
2021-03-04T11:43:54.566+0000 7f3c0c298700 -1 cephfs::mirror::PeerReplayer(efca7e05-a062-4eb6-99e5-d555726d6052) sync_snaps: failed to sync snapshots for dir_path=/d0
2021-03-04T11:44:04.568+0000 7f3c0c298700 -1 cephfs::mirror::PeerReplayer(efca7e05-a062-4eb6-99e5-d555726d6052) sync_snaps: failed to sync snapshots for dir_path=/d0
2021-03-04T11:44:14.569+0000 7f3c0b296700 -1 cephfs::mirror::PeerReplayer(efca7e05-a062-4eb6-99e5-d555726d6052) sync_snaps: failed to sync snapshots for dir_path=/d0

For peer uuid: efca7e05-a062-4eb6-99e5-d555726d6052, the mirror daemon has not yet hit the consecutive failure count (default: 10).
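The accounting described above can be sketched as follows. This is an illustrative model, not the actual PeerReplayer code; the class and method names are invented for this sketch:

```python
class DirSyncTracker:
    """Model of the consecutive-failure accounting: a directory is
    marked "failed" only after N consecutive sync failures (default
    N=10); a successful sync resets the count."""

    def __init__(self, max_consecutive_failures=10):
        self.max_consecutive_failures = max_consecutive_failures
        self.failures = {}   # dir_path -> consecutive failure count
        self.failed = set()  # dir_paths currently marked as failed

    def record_sync(self, dir_path, ok):
        if ok:
            # Success resets the streak and clears any failed mark.
            self.failures[dir_path] = 0
            self.failed.discard(dir_path)
            return
        self.failures[dir_path] = self.failures.get(dir_path, 0) + 1
        if self.failures[dir_path] >= self.max_consecutive_failures:
            self.failed.add(dir_path)

tracker = DirSyncTracker()
for _ in range(9):                    # nine consecutive failures
    tracker.record_sync("/d0", ok=False)
print("/d0" in tracker.failed)        # False: threshold not yet hit
tracker.record_sync("/d0", ok=False)  # tenth consecutive failure
print("/d0" in tracker.failed)        # True
```

This is why the test saw failure_count of 0: the eight failures in the log above fall short of the default threshold of 10.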

Member


Add a safe_while to test?

Contributor Author


I avoided that since safe_while repeatedly runs a command (although the delay is configurable), and we already know the expected wait time here before a directory is marked as failed.

I can convert all the wait calls to use safe_while in another PR since it requires an audit of the tests.
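For reference, the pattern safe_while (from teuthology.contextutil) provides is a bounded retry loop. A self-contained sketch of that pattern, written without the teuthology dependency:

```python
import time

def wait_until(check, sleep=10, tries=12):
    """Poll `check` up to `tries` times, sleeping `sleep` seconds
    between attempts. Returns True on success, raises on exhaustion.
    This mimics the shape of teuthology's safe_while helper; it is a
    sketch, not the real implementation."""
    for attempt in range(tries):
        if check():
            return True
        if attempt < tries - 1:
            time.sleep(sleep)
    raise RuntimeError(f"condition not met after {tries} tries")

# Usage in a test would look something like this (get_peer_stats is a
# hypothetical accessor, not a real test helper):
#   wait_until(lambda: get_peer_stats()['failure_count'] >= 1)
```

The trade-off discussed above: this re-runs the check command on every attempt, whereas a single sleep for the known failure-threshold interval runs it once.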

vshankar added 4 commits March 5, 2021 01:00
Signed-off-by: Venky Shankar <vshankar@redhat.com>
Signed-off-by: Venky Shankar <vshankar@redhat.com>
Signed-off-by: Venky Shankar <vshankar@redhat.com>
@vshankar force-pushed the wip-cephfs-mirror-service-register branch from e14e8cd to c0632f6 on March 5, 2021 06:00
@batrick
Copy link
Member

batrick commented Mar 6, 2021

@vshankar ready for another round of QA?

@vshankar
Copy link
Contributor Author

vshankar commented Mar 7, 2021

@vshankar ready for another round of QA?

yes please.


3 participants